I’m new to exploring datasets, but I’ve already found some great public data online. First, can I say that Google Dataset Search is great? I have been enjoying many of the datasets I’ve found there. Organizations with public data, like the Chicago Data Portal, are also worth checking out; I’m happy to live in a city that publishes so much of its data online for free. Do you know if your city has public data available? I discovered some of this data via sites like data.world as well. With playful curiosity and the tools in this article, you will be well equipped to mine insights from almost any dataset.
Below, I will highlight three datasets that I recently stumbled into.
- Chicago Red Light Camera Violation Data
- USDA Nutritional Data (2015)
- Job Computerisation Forecast Data
Installation and Startup
New to Jupyter, pandas or pip?
Welcome to the club. If you need a run-down on using pip, I wrote this post to help sort it out. If you want a superb, gentle and funny introduction to pandas, Jupyter, and data analysis, this PyCon session is the best tutorial I have found, by far.
To run the scripts shown in this post, you can either: (1) install the three libraries below and run the code in a Jupyter notebook (recommended), or (2) generate the plots from the command line and view them as saved images, which only requires pandas and matplotlib.
First, install libraries with pip. Enter in command prompt or terminal:
pip install pandas
pip install matplotlib
python3 -m pip install jupyter
Then, launch Jupyter Notebook from the command prompt or terminal:
jupyter notebook
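If you go the command-line route instead of Jupyter, the workflow is just a script that saves the figure to disk rather than displaying it inline. Here is a minimal sketch; the filename and the numbers are made up for illustration:

```python
# Minimal command-line plotting sketch: save the chart as an image
# instead of rendering it inline in a notebook. The data is made up.
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, safe outside notebooks
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'year': [2015, 2016, 2017], 'tickets': [520, 780, 610]})
df.plot.bar(x='year', y='tickets', legend=False)
plt.ylabel('Tickets')
plt.savefig('tickets_by_year.png', bbox_inches='tight')  # open the PNG to view
```

Run it with `python3 your_script.py` and open the resulting PNG in any image viewer.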
Chicago’s Shady Red Light Camera Program
2,573,339 violations occurred from July 2014 – March 2019. Personally, I was issued 3 red light camera tickets over the last 5 years. I charted out the tickets by year, month and day. Note the spikes in tickets per day in 2016 and 2017, and the number of tickets issued in 2016. Fewer tickets were issued in 2017 and 2018, after more allegations surfaced about this troublesome red light camera program. A Jupyter Notebook with the code for both of these charts can be found here. The code below was used for the left image above, in which I plotted each day of red light camera violations in Chicago over the past 5 years.
```python
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.patches as mpatches

"""plotting red light tickets with matplotlib - Python Marketer"""
```
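The full notebook isn’t reproduced here, but the heart of a tickets-per-day chart is a date groupby. A toy sketch, assuming columns like the `VIOLATION DATE` and `VIOLATIONS` fields in Chicago’s export (names may differ in your download, and the numbers here are made up):

```python
import matplotlib
matplotlib.use('Agg')  # safe outside notebooks
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the Chicago red light camera export;
# column names are assumptions based on the city's data portal.
df = pd.DataFrame({
    'VIOLATION DATE': ['2016-07-01', '2016-07-01', '2016-07-02', '2016-07-03'],
    'VIOLATIONS': [12, 5, 9, 20],
})
df['VIOLATION DATE'] = pd.to_datetime(df['VIOLATION DATE'])

# Sum tickets across all cameras for each day, then plot the series.
per_day = df.groupby('VIOLATION DATE')['VIOLATIONS'].sum()
per_day.plot(title='Red Light Violations per Day')
plt.ylabel('Violations')
```

Swap the toy DataFrame for `pd.read_csv()` on the real export and the same two lines produce the daily chart.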
Analysis: We can see that 2016 leads the past 5 years in the number of violations that occurred. The city of Chicago has drawn criticism for unethical practices in its camera program to reap more revenue. In 2017, the city approved a settlement of “nearly $40 million after a previous class-action lawsuit alleged that the program had violated 1.2 million motorists’ due process rights.” (Source)
“A Tribune analysis of more than 4 million tickets issued since 2007 and a deeper probe of individual cases reveal clear evidence that a series of sudden, unexplainable spikes in Chicago’s network of 380 cameras were caused by faulty equipment, human tinkering or both. Chicago transportation officials say they had no knowledge of the wild swings in ticketing until they were told by the Tribune — even though City Hall legally required the camera vendor to watch for the slightest anomaly in ticketing patterns every day. Many of the spikes lasted weeks.” (Source)
This USDA data was last updated in 2015. It lists nutrient levels, such as carbohydrates and iron, for breads, fruits, vegetables, beverages and most foods you can find in a grocery store. I decided to see which foods contained the most sugar, carbohydrates and iron. The pandas nlargest method made this a breeze.
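If nlargest is new to you, it returns the top n rows ranked by a column, already sorted in descending order. A quick toy example (made-up foods, but the same column names as the USDA file):

```python
import pandas as pd

# Toy nutrient table using the USDA file's column names; values are invented.
foods = pd.DataFrame({
    'Shrt_Desc': ['CEREAL A', 'CEREAL B', 'BREAD', 'APPLE'],
    'Sugar_Tot_(g)': [40.0, 35.0, 5.0, 10.0],
})

# Top 2 foods by total sugar, highest first.
top2 = foods.nlargest(n=2, columns='Sugar_Tot_(g)')
print(top2['Shrt_Desc'].tolist())  # → ['CEREAL A', 'CEREAL B']
```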
```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_excel('ABBREV.xlsx')
df = df.sort_values(by=['Sugar_Tot_(g)'], ascending=False)
df = df[df.Shrt_Desc.str.contains(pat='CEREALS,', na=False, case=False)]
df = df.nlargest(n=15, columns='Sugar_Tot_(g)')
df.plot.barh(x='Shrt_Desc', y='Sugar_Tot_(g)', color='DarkBlue', legend=False)
plt.xlabel(xlabel='Grams of Sugar')
plt.ylabel(ylabel='Foods')
plt.title('Most Sugar, Cereals - USDA', fontsize=14)
plt.gca().invert_yaxis()
```
“Research suggests that as many as 80 percent of people in the world don’t have enough iron in their bodies.” -World Health Organization
Analysis: Instant oatmeal and cereals are the highest in sugar content. Stay away from them if possible. If you need to boost your iron levels, beef and pork are good options. If you don’t eat meat, certain spices and wheat bran/flakes contain high levels of iron. Seaweed is also a good, iron-rich snack. Ultimately, there are plenty of ways to get iron in your diet besides meat.
This data originates from a 2013 University of Oxford study and the United States Bureau of Labor Statistics. It provides two key values: the probability (%) that a job will be computerised within the next 1–2 decades, and the number of jobs that currently exist for that profession in each state.
```python
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

"""
plotting occupational automation risk - Python Marketer
Find most common Illinois jobs affected and at highest risk of automation.
Source Data: https://data.world/quanticdata/occupation-and-salary-by-state-and-likelihood-of-automation/workspace/query?queryid=51754b08-7414-4f76-8a31-8b5c75166146
plt.text matplotlib docs: https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.text.html
"""

file_name = 'high-risk-in-your-st-occupation-and-salary-by-state-and-likelihood-of-automation-QueryResult.csv'
df = pd.read_csv(file_name)
df = df.sort_values(by=['probability'], ascending=False)
df = df.nlargest(30, 'Illinois')
df = df[df['probability'] >= 0.95]  # gets "highest risk" subset
df.plot.barh(x='occupation', y='Illinois', color='DarkBlue', legend=False, figsize=(15, 10))

# add corresponding percent over each bar
probs = list(df['probability'])
for i, v in enumerate(probs):
    percent = str(v).replace('0.', '') + '%'
    plt.text(x=v + 3, y=i + .25, s=percent, color='white', fontweight='bold', fontsize=26)

white_patch = mpatches.Patch(color='white', label='Est. % Probability of Computerisation')
plt.legend(handles=[white_patch], fontsize='xx-large')
plt.xticks(fontsize=18, rotation=0)
plt.yticks(fontsize=18)
plt.gca().invert_yaxis()
plt.xlabel('Est. Number of Jobs At Risk in Illinois', fontsize=18)
plt.ylabel('Occupation', fontsize=18)
plt.title('Most Common Illinois Jobs at "Highest Risk" of Being Computerised Within 15 Years', fontsize=20)
plt.annotate('Source: Carl Benedikt Frey and Michael A. Osborne (2013 University of Oxford Study)',
             (0, 0), (0, -60), xycoords='axes fraction', textcoords='offset points', va='top')
plt.annotate('Source: data.world/quanticdata & Bureau of Labor Statistics',
             (0, 0), (0, -80), xycoords='axes fraction', textcoords='offset points', va='top')
plt.savefig('Most Common Jobs at "High Risk" of Being Computerised.png', bbox_inches='tight')
```
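One caveat on the percent labels in that script: building them with `str(v).replace('0.', '')` only works when the probability happens to print with exactly two decimal places. Python’s percent formatting is a more robust alternative:

```python
# The replace() trick breaks on values like 0.9; percent formatting doesn't.
probs = [0.95, 0.9, 0.87]

labels_replace = [str(v).replace('0.', '') + '%' for v in probs]
labels_format = [f'{v:.0%}' for v in probs]

print(labels_replace)  # ['95%', '9%', '87%']  <- 0.9 becomes '9%'
print(labels_format)   # ['95%', '90%', '87%']
```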
Analysis: It’s strange to think that these common jobs, which currently make up a large portion of our economy, will likely be computerised some day. They all seem to be entry-level manual tasks. The promise of automation is that it frees humans up to focus on more meaningful or complex problems. How will the 125,000 cashiers in Illinois adapt when that is no longer a way to earn money?
The Plot Thickens…
I’ve barely scratched the surface in my experience with datasets, but I’ve already found some very interesting data to inspect and play with. These data visualization skills are adaptable to many situations. All you need is good data to bring to life. Stay curious, my friends.