This is a sample script showing how open data can be analysed and demonstrated using Jupyter Notebooks and Declarative Widgets. We take the dog statistics data in Vienna as a sample use case to demonstrate common approaches to analyse open data. The final dashboard and an interactive development environment (IDE) with all the tutorial notebooks are available from our temporal Jupyter Notebook Server.
Open Data Story¶
It is useful to define a set of possible research questions that define the goal of the data study and refine them along the way since the availablity of data suggests possible ways to combine and explore it.
Research Questions¶
- Which Vienna districts are most fond of Wiener dogs?
- How many Wiener dogs are there in my district?
This time we do not only find answers to our questions, but also create a web dashboard with interactive visualization to share our findings with others.
Get the Data¶
We described how to load and preprocess the dataset in the previous post. It is often not that trivial as it may seem and involves a lot of data wrangling and debugging in order to find and eliminate possible errors or inconsistences in the dataset.
This step should not be underestimated since it defines the final result of our data analysis.
Remember: „Garbage in, garbage out!“
# Load libraries
import pandas as pd # CSV file processing
import numpy as np # vector and matrix manipulation
# Load the csv file from the open data portal
# dataset description: https://www.data.gv.at/katalog/dataset/stadt-wien_anzahlderhundeprobezirkderstadtwien/resource/b8d97349-c993-486d-b273-362e0524f98c
data_path = 'https://www.wien.gv.at/finanzen/ogd/hunde-wien.csv'
# Look up the row file and specify the dataset format, e.g. delimiters
data = pd.read_csv(data_path, delimiter=';', skiprows=1, thousands=',', encoding='latin-1')
# Correct individual values in the dataset
data.loc[1914, 'Anzahl'] = 1510
data.loc[5347, 'Anzahl'] = 2460
# Carefully select the string separator, including spaces!
separate_breeds = data['Dog Breed'].str.split(' / ', expand=True)
separate_breeds.columns = ["Breed_1", "Breed_2"]
data = pd.concat([data, separate_breeds], axis=1)
# Correct encoding for special characters in german alphabet
def to_utf(x):
return x.encode('latin-1').decode('utf8') if isinstance(x, str) else x
data = data.applymap(to_utf)
# Aggregate
data = data.groupby(['DISTRICT_CODE', 'Breed_1'])['Anzahl'].aggregate(np.sum).reset_index()
data.columns = ["District", "Dog_Breed", "Dog_Count"]
# Check the top of the table to make sure the dataset is loaded correctly
data.head()
# Load library for the interactive visualizations
import declarativewidgets
declarativewidgets.init()
Import widgets
%%html
<link rel="import" href="urth_components/urth-viz-table/urth-viz-table.html" is='urth-core-import'>
<link rel="import" href="urth_components/paper-input/paper-input.html" is='urth-core-import' package='PolymerElements/paper-input'>
Write functions to load and process data in the table
# Match pattern
def filter_by_pattern(df, pattern):
"""Filter a DataFrame so that it only includes rows where the Dog Breed
column contains pattern, case-insensitive.
"""
return df[df['Dog_Breed'].str.contains(pattern, case=False)]
# Load data
def dogs_table(pattern=''):
"""Build a DataFrame.
"""
# Use match pattern
df = data.pipe(filter_by_pattern, pattern)
return df
%%html
<template is="urth-core-bind">
<paper-input value="{{pattern}}" label="Filter by dog breed" ></paper-input>
</template>
<template is="urth-core-bind">
<urth-core-function ref="dogs_table"
arg-pattern="{{pattern}}"
result="{{dogs_table}}"
limit="1600 "
delay="500"
auto>
</urth-core-function>
<urth-viz-table datarows="{{ dogs_table.data }}"
rows-visible="5"
selection="{{dog_selection}}"
columns="{{ dogs_table.columns }}"
selection-as-object>
</urth-viz-table>
</template>
Interactive Bar Chart¶
# Create Multi-index
district_stats = data.set_index(['District', 'Dog_Breed'])
# Calculate percentages
breed_percents = (district_stats.div(district_stats.sum(axis=0, level=0), level=0) * 100).round(1).reset_index()
# Rename column
breed_percents = breed_percents.rename(columns = {'Dog_Count':'Dog_Percent'})
# Preview
breed_percents.head()
breed = 'Dackel'
# Filter
breed_districts = breed_percents[(breed_percents['Dog_Breed'] == breed)]
# Remove column
breed_districts = breed_districts.drop('Dog_Breed', axis=1)
# Sort
breed_districts = breed_districts.sort_values(ascending=False, by='Dog_Percent')
# Rename column
breed_districts = breed_districts.rename(columns = {'Dog_Percent':'Percent_of_' + breed})
breed_districts.head()
Create function to load percents per district given the breed
# Filter data
def dogs_bar_chart(breed='Dackel'):
"""Build a DataFrame.
"""
# Filter
df = breed_percents[(breed_percents['Dog_Breed'] == breed)]
# Use match pattern
# df = breed_percents.pipe(filter_by_pattern, breed)
# Remove column
df = df.drop('Dog_Breed', axis=1)
# Sort
df = df.sort_values(ascending=False, by='Dog_Percent')
# Rename column
df = df.rename(columns = {'Dog_Percent':'Percent_of_' + breed})
return df
Import bar chart widget
%%html
<link rel="import" href="urth_components/urth-viz-bar/urth-viz-bar.html" is='urth-core-import'>
%%html
<template is="urth-core-bind">
<urth-core-function ref="dogs_bar_chart"
arg-breed="{{dog_selection.Dog_Breed}}"
result="{{df}}"
limit="1600 "
delay="500"
auto>
</urth-core-function>
<urth-viz-bar xlabel="Districts" ylabel="% to the total number of dogs in the district" datarows="{{df.data}}" columns="{{df.columns}}"></urth-viz-bar>
</template>
Lessons Learned¶
Dogs in Vienna¶
Based on the data available we were able to provide comprehensive answers to the set of research questions proposed in the introduction.
- The true fans of Wiener dogs live in the 4th district of Vienna.
- Wiener dogs are underreprestented in Leoplodstadt (2nd district). They constitute only 2% of the dog population.
Steps¶
1. Find datasets, e.g. CSV files from open data portals
2. Refine: identify column separator, thousands separator, rows to skip, string encoding, etc.
3. Aggregate: group by different attributes, e.g. district or type, and sum up the counts.
+ 4. Show the row data table for the user to be able to interact with the data.
5. Calculate proportions to the total sum in the group.
6. Slice: filter out rows, e.g. by district or type.
+ 7. Show sorted stats as a bar chart.
Prerequisites¶
To run this script on a local machine you need:
- Python 3.4.
- pandas
- numpy
- jupyter_declarativewidgets
Inspired by¶
- Health Inspections Dashboard
- tmpnb: deploy temporal Jupyter Notebook server
- Wheelan, Charles J. Naked Statistics. 2013