This is a sample script showing how open data can be analysed and demonstrated using Jupyter Notebooks and Declarative Widgets. We take the dog statistics data in Vienna as a sample use case to demonstrate common approaches to analyse open data. The final dashboard and an interactive development environment (IDE) with all the tutorial notebooks are available from our temporal Jupyter Notebook Server.

Open Data Story¶

It is useful to define a set of possible research questions that define the goal of the data study and refine them along the way since the availablity of data suggests possible ways to combine and explore it.

Research Questions¶

Which Vienna districts are most fond of Wiener dogs?
How many Wiener dogs are there in my district?

This time we do not only find answers to our questions, but also create a web dashboard with interactive visualization to share our findings with others.

Get the Data¶

We described how to load and preprocess the dataset in the previous post. It is often not that trivial as it may seem and involves a lot of data wrangling and debugging in order to find and eliminate possible errors or inconsistences in the dataset.

This step should not be underestimated since it defines the final result of our data analysis.
Remember: „Garbage in, garbage out!“

In [198]:

# Load libraries
import pandas as pd # CSV file processing
import numpy as np # vector and matrix manipulation

# Load the csv file from the open data portal
# dataset description: https://www.data.gv.at/katalog/dataset/stadt-wien_anzahlderhundeprobezirkderstadtwien/resource/b8d97349-c993-486d-b273-362e0524f98c
data_path = 'https://www.wien.gv.at/finanzen/ogd/hunde-wien.csv'
# Look up the row file and specify the dataset format, e.g. delimiters
data = pd.read_csv(data_path, delimiter=';', skiprows=1, thousands=',', encoding='latin-1')

# Correct individual values in the dataset
data.loc[1914, 'Anzahl'] = 1510
data.loc[5347, 'Anzahl'] = 2460

# Carefully select the string separator, including spaces!
separate_breeds = data['Dog Breed'].str.split(' / ', expand=True)
separate_breeds.columns = ["Breed_1", "Breed_2"]
data = pd.concat([data, separate_breeds], axis=1)

# Correct encoding for special characters in german alphabet
def to_utf(x):
    return x.encode('latin-1').decode('utf8') if isinstance(x, str) else x   
data = data.applymap(to_utf)

# Aggregate
data = data.groupby(['DISTRICT_CODE', 'Breed_1'])['Anzahl'].aggregate(np.sum).reset_index()
data.columns = ["District", "Dog_Breed", "Dog_Count"]

# Check the top of the table to make sure the dataset is loaded correctly 
data.head()

Out[198]:

	District	Dog_Breed	Dog_Count
0	90100	Afghanischer Windhund	1
1	90100	Amerikanischer Pit-Bullterrier	1
2	90100	Amerikanischer Staffordshire-Terrier	5
3	90100	Australian Shepherd Dog	3
4	90100	Australian Terrier	1

Show the Data¶

Interactive Table¶

In [199]:

# Load library for the interactive visualizations
import declarativewidgets
declarativewidgets.init()

Import widgets

In [200]:

%%html
<link rel="import" href="urth_components/urth-viz-table/urth-viz-table.html" is='urth-core-import'>
<link rel="import" href="urth_components/paper-input/paper-input.html" is='urth-core-import' package='PolymerElements/paper-input'>

Write functions to load and process data in the table

In [202]:

# Match pattern
def filter_by_pattern(df, pattern):
    """Filter a DataFrame so that it only includes rows where the Dog Breed
    column contains pattern, case-insensitive.
    """
    return df[df['Dog_Breed'].str.contains(pattern, case=False)]

# Load data
def dogs_table(pattern=''):
    """Build a DataFrame.   
    """
    # Use match pattern
    df = data.pipe(filter_by_pattern, pattern)     
    return df

In [218]:

%%html
<template is="urth-core-bind">
    <paper-input value="{{pattern}}" label="Filter by dog breed" ></paper-input>
</template>

<template is="urth-core-bind">

    <urth-core-function ref="dogs_table"  
                        arg-pattern="{{pattern}}" 
                        result="{{dogs_table}}" 
                        limit="1600 "
                        delay="500" 
                        auto>
    </urth-core-function>
    
    <urth-viz-table datarows="{{ dogs_table.data }}" 
                    rows-visible="5" 
                    selection="{{dog_selection}}" 
                    columns="{{ dogs_table.columns }}" 
                    selection-as-object>
    </urth-viz-table>
    
</template>

Interactive Bar Chart¶

In [204]:

# Create Multi-index
district_stats = data.set_index(['District', 'Dog_Breed'])
# Calculate percentages
breed_percents = (district_stats.div(district_stats.sum(axis=0, level=0), level=0) * 100).round(1).reset_index()
# Rename column
breed_percents = breed_percents.rename(columns = {'Dog_Count':'Dog_Percent'})
# Preview
breed_percents.head()

Out[204]:

	District	Dog_Breed	Dog_Percent
0	90100	Afghanischer Windhund	0.2
1	90100	Amerikanischer Pit-Bullterrier	0.2
2	90100	Amerikanischer Staffordshire-Terrier	1.1
3	90100	Australian Shepherd Dog	0.6
4	90100	Australian Terrier	0.2

In [206]:

breed = 'Dackel'
# Filter
breed_districts = breed_percents[(breed_percents['Dog_Breed'] == breed)]
# Remove column
breed_districts = breed_districts.drop('Dog_Breed', axis=1)
# Sort
breed_districts = breed_districts.sort_values(ascending=False, by='Dog_Percent')
# Rename column
breed_districts = breed_districts.rename(columns = {'Dog_Percent':'Percent_of_' + breed})
breed_districts.head()

Out[206]:

	District	Percent_of_Dackel
454	90400	3.6
1971	91500	3.3
777	90700	3.2
2266	91700	3.2
573	90500	3.1

Create function to load percents per district given the breed

In [215]:

# Filter data
def dogs_bar_chart(breed='Dackel'):
    """Build a DataFrame.   
    """
    # Filter
    df = breed_percents[(breed_percents['Dog_Breed'] == breed)]
    # Use match pattern
#     df = breed_percents.pipe(filter_by_pattern, breed)
    # Remove column
    df = df.drop('Dog_Breed', axis=1)
    # Sort
    df = df.sort_values(ascending=False, by='Dog_Percent')
    # Rename column
    df = df.rename(columns = {'Dog_Percent':'Percent_of_' + breed})  
    return df

Import bar chart widget

In [216]:

%%html
<link rel="import" href="urth_components/urth-viz-bar/urth-viz-bar.html" is='urth-core-import'>

In [217]:

%%html
<template is="urth-core-bind">
    <urth-core-function ref="dogs_bar_chart"  
                        arg-breed="{{dog_selection.Dog_Breed}}" 
                        result="{{df}}" 
                        limit="1600 "
                        delay="500" 
                        auto>
    </urth-core-function>
    <urth-viz-bar xlabel="Districts" ylabel="% to the total number of dogs in the district" datarows="{{df.data}}" columns="{{df.columns}}"></urth-viz-bar>
</template>

Lessons Learned¶

Dogs in Vienna¶

Based on the data available we were able to provide comprehensive answers to the set of research questions proposed in the introduction.

The true fans of Wiener dogs live in the 4th district of Vienna.
Wiener dogs are underreprestented in Leoplodstadt (2nd district). They constitute only 2% of the dog population.

Steps¶

1. Find datasets, e.g. CSV files from open data portals
2. Refine: identify column separator, thousands separator, rows to skip, string encoding, etc.
3. Aggregate: group by different attributes, e.g. district or type, and sum up the counts.
+ 4. Show the row data table for the user to be able to interact with the data.
5. Calculate proportions to the total sum in the group.
6. Slice: filter out rows, e.g. by district or type.
+ 7. Show sorted stats as a bar chart.

Prerequisites¶

To run this script on a local machine you need:

Python 3.4.
pandas
numpy
jupyter_declarativewidgets

Inspired by¶

Health Inspections Dashboard
tmpnb: deploy temporal Jupyter Notebook server
Wheelan, Charles J. Naked Statistics. 2013