
Chicago crime dataset analysis


This notebook is a Spark and Python learner's attempt to perform data analysis on a real-world data set, using Apache Spark SQL, Spark ML, and Pandas to analyse and predict with the Chicago crime data.

In this notebook, I use Spark, Pandas, Matplotlib, and Seaborn rather capriciously, without any strict division of labour. The point is to:

  • Perform data reading, transformation, and querying using Apache Spark
  • Visualize using existing Python libraries. Matplotlib will remain once I learn how to do with it everything I currently rely on Seaborn for.
  • Where interoperation between Spark and Matplotlib is a hindrance, use Pandas and NumPy as a bridge (a minimal sketch of this workflow follows the list).
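
As a minimal sketch of that workflow, assuming the data has been saved locally as crimes.csv (the file name and the Primary Type column are assumptions based on the published Chicago crime schema, not taken from this notebook):

    from pyspark.sql import SparkSession
    import seaborn as sns
    import matplotlib.pyplot as plt

    spark = SparkSession.builder.getOrCreate()

    # Read and query with Spark (file name and column name are assumptions)
    crimes = spark.read.csv("crimes.csv", header=True, inferSchema=True)
    top10 = (crimes.groupBy("Primary Type")
                   .count()
                   .orderBy("count", ascending=False)
                   .limit(10))

    # Hand the small aggregated result to Pandas, then plot with Seaborn
    pdf = top10.toPandas()
    sns.barplot(x="count", y="Primary Type", data=pdf)
    plt.show()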

This is a work in progress, and I hope that a few weeks from now it will look much better.


Where to find the data?

This dataset can be found either at the original dataset website, https://data.cityofchicago.org/Public-Safety/Chicago-Police-Department-Illinois-Uniform-Crime-R/c7ck-438e, or on Kaggle: https://www.kaggle.com/djonafegnem/chicago-crime-data-analysis (I used the latter, as it comes as a compressed archive).
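
Once the archive is extracted, its CSV files can be read into a single Spark DataFrame. This is a hedged sketch; the file-name pattern below is an assumption, so adjust it to the actual names in the archive:

    # Read all extracted CSV parts into one DataFrame
    # (the glob pattern is an assumption; adjust as needed)
    crimes = (spark.read
                   .option("header", True)
                   .option("inferSchema", True)
                   .csv("Chicago_Crimes_*.csv"))
    crimes.printSchema()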


How to run this and what to run it on?


I wrote this on Apache Spark 2.3.0. The entire notebook was executed on a single machine using the pyspark shell without problems.

Spark can be downloaded from https://spark.apache.org/downloads.html. To run this notebook, you will also need Pandas, NumPy, Matplotlib, and a few other libraries installed; for me, all of that was taken care of by Anaconda: https://www.anaconda.com/download/

Here are some important parameters:

  • Executor count: 4
  • Executor Memory: 4G
  • Driver Memory: 8G

These settings may not be strictly necessary, but some data frames are cached, and performance degrades noticeably when the fraction of cached RDDs that fits in memory drops (see the caching sketch after the launch command below).

So, the command with which the notebook was launched is:

    pyspark --driver-memory 8g --executor-memory 4g --master local[4]
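
For reference, here is a sketch of roughly equivalent settings expressed in code, together with the caching mentioned above. This is an assumption-laden illustration, not what the notebook itself does; in particular, spark.driver.memory only takes effect if set before the JVM starts, which is why the command-line flag is the reliable route:

    from pyspark.sql import SparkSession

    # Roughly equivalent resource settings, built into the session
    spark = (SparkSession.builder
             .master("local[4]")
             .config("spark.driver.memory", "8g")    # only honoured before the JVM starts
             .config("spark.executor.memory", "4g")
             .getOrCreate())

    # Caching a frequently reused DataFrame ("crimes.csv" is an assumption)
    crimes = spark.read.csv("crimes.csv", header=True, inferSchema=True)
    crimes.cache()   # lazily marks the DataFrame for in-memory storage
    crimes.count()   # first action materializes the cache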
