In this part we continue working with the data from Vienna Open Data portal using Jupyter Notebooks. Having loaded and pre-processed the data in the previous preliminary part of our example, we continue with extracting information from a single open dataset on dogs registered in Vienna. Ever wondered which Vienna districts are most dog friendly?
Dogs in Vienna. Part 1: Open Data Analysis Tutorial
Open Data Story¶
It is useful to define a set of possible research questions that define the goal of the data study and refine them along the way since the availablity of data suggests possible ways to combine and explore it.
Questions¶
- How many dogs are there in Vienna?
- Which dogs do Viennese people like the most?
- Where should I live if I do not like dogs?
- What is the dog situation in my district?
Get the Data¶
We described how to load and preprocess the dataset in the previous post. It is often not that trivial as it may seem and involves a lot of data wrangling and debugging in order to find and eliminate possible errors or inconsistences in the dataset.
This step should not be underestimated since it defines the final result of our data analysis.
Remember: „Garbage in, garbage out!“
# Load libraries
import pandas as pd # CSV file processing
import numpy as np # vector and matrix manipulation
# Load the csv file from the open data portal
# dataset description: https://www.data.gv.at/katalog/dataset/stadt-wien_anzahlderhundeprobezirkderstadtwien/resource/b8d97349-c993-486d-b273-362e0524f98c
data_path = 'https://www.wien.gv.at/finanzen/ogd/hunde-wien.csv'
# Look up the row file and specify the dataset format, e.g. delimiters
data = pd.read_csv(data_path, delimiter=';', skiprows=1, thousands=',', encoding='latin-1')
# Correct individual values in the dataset
data.loc[1914, 'Anzahl'] = 1510
data.loc[5347, 'Anzahl'] = 2460
# Carefully select the string separator, including spaces!
separate_breeds = data['Dog Breed'].str.split(' / ', expand=True)
separate_breeds.columns = ["Breed_1", "Breed_2"]
data = pd.concat([data, separate_breeds], axis=1)
# Correct encoding for special characters in german alphabet
def to_utf(x):
return x.encode('latin-1').decode('utf8') if isinstance(x, str) else x
data = data.applymap(to_utf)
# Aggregate
data = data.groupby(['DISTRICT_CODE', 'Breed_1'])['Anzahl'].aggregate(np.sum).reset_index()
data.columns = ["District", "Dog Breed", "Dog Count"]
# Check the top of the table to make sure the dataset is loaded correctly
data.head()
data.shape
Our dataset contains 3463 rows and 3 columns.
The district codes range from 90100 to 92300 indicating the 23 districts of Vienna (Wiener Bezirke).
Now the data is ready to be analysed!
sum(data['Dog Count'])
Almost 56 thousand dogs are registered in Vienna.
2. Which dog breeds are the most popular in Vienna?¶
Lets have a look at our dataset closer.
# Check the top counts
sorted_count = data.sort_values(['Dog Count'], ascending=False)
sorted_count.head()
Most of the dog breeds is unfortunately unknown (Unbekannt). It is not clear though, whether we do not know the breed of a dog or they do not know their breed themselves.
Let’s exclude the unknown and check how many different breeds we have in total.
# Filter out specific rows
filtered_dogs = sorted_count.loc[data['Dog Breed']!='Unbekannt']
# Check unique values in one of the columns
unique_breeds = filtered_dogs['Dog Breed'].unique()
len(unique_breeds)
We have 320 different dog breeds in Vienna. How cool is that!
By aggregating over districts we obtain a rating for the most popular dog breeds in whole Vienna:
breeds = filtered_dogs.groupby('Dog Breed')['Dog Count'].aggregate(np.sum)
# Filter, sort and trim pandas Series
top_dogs = breeds.sort_values(ascending=False)[:10]
# Load library for visualisation
import seaborn as sns
# Command to show plots in notebook
%matplotlib inline
# Plot pandas Series
ax = sns.barplot(palette='colorblind', x=top_dogs.values, y=list(top_dogs.index))
ax.set(ylabel='Breed', xlabel='Number of dogs')
Labradors and Chihuahuas are the top dog-pets among Viennese people.
Wiener dog (Dackel) landed at position #8 in our Viennese dogs ranking. Sad… 🙁
3. Where should I live if I do not like dogs?¶
Now let’s do the same trick but on the level of districts instead of breeds. Instead we want to aggregate dogs of all breeds for each of the districts of Vienna separately to obtain dogs per district counts.
district_dogs = data.groupby('District')['Dog Count'].aggregate(np.sum)
# Filter, sort and trim pandas Series
top_dogs = district_dogs.sort_values(ascending=False)
# Plot pandas Series
ax = sns.barplot(palette='colorblind', y=list(top_dogs.index), x=top_dogs.values, orient='h', order=list(top_dogs.index))
ax.set(ylabel='District', xlabel='Number of dogs')
OK. 22nd and 21st district of Vienna are the most rich in dogs (8,5 and 7 thousands respectively). 10th, 23rd and 11th districts follow in the dog ranking.
There are fewer dogs in 1st and 8th districts of Vienna (also, 4th, 6th and 7th).
What could be the possible reasons for such variation in the number of dogs in different districts of Vienna (500-8,500 dogs per district)? We shall need more data to understand this…
4. Open Data for Local Communities¶
Let us analyse district statistics on the example of the 2nd district (Leopoldstadt).
You can replace the district code in the variable below to see the stats for your own district.
district = 90200
district_dogs[district]
2545 dogs live in Leopoldstadt.
Is it many or not? Let us compare it with the average number of dogs across all districts of Vienna.
district_dogs.describe()
It is indeed very close to the average number of dogs per Vienna district. Let us see what are the most common breeds in Leopoldstadt.
dogs_district = filtered_dogs.loc[(data['District']==district)][['Dog Breed', 'Dog Count']]
top_dogs_district = dogs_district.sort_values(ascending=False, by='Dog Count')[:10]
ax = sns.barplot(data=top_dogs_district,palette='colorblind', x='Dog Count', y='Dog Breed')
ax.set(ylabel='Breed', xlabel='Number of dogs')
Lessons Learned¶
Dogs in Vienna¶
Based on the data available we were able to provide comprehensive answers to the set of research questions proposed in the introduction.
- There are almost 56 thousand dogs in Vienna of 320 different breeds.
- Labradors and Chihuahuas are the top choices of Viennese people.
- If you do not like dogs you might want to live in the 1st or 8th districts, where there are fewer dogs (everyone wants to live in the 1st or 8th districts anyway!)
- Leopoldstadt has an average number of dogs (2,545).
Steps¶
1. Find datasets, e.g. CSV files from open data portals
2. Refine: identify column separator, thousands separator, rows to skip, string encoding, etc.
3. Slice: filter out rows, e.g. by district or type.
4. Sort data by count (visualize ranking as a barplot).
5. Aggregate: group by different attributes, e.g. district or type, and sum up the counts.
Prerequisites¶
To run this script on a local machine you need:
- Python 3.4.
- pandas
- numpy
- seaborn (pip install)
Inspired by¶
- Detail Analysis of various Hospital factors
- Wheelan, Charles J. Naked Statistics. 2013
This code is on GitHub!