This is the third part of our example showing how open data datasets from the Open Data portal of Vienna can be used to obtain insights about the city. For the analysis we work the open and freely available Jupyter Notebooks software. In this part, we link pet statistics with further open datasets concerning population and income, and analyze correlations within this data.
Open Data Story¶
In the previous tutorial we set out to explore the dog statistics in Vienna and discovered that there is a significant variation in the number of dogs across different districts. In this tutorial we explore additional data sets trying to explain the differences between Vienna districts that may influence the dog statistics. Here is a set of questions we have in mind:
Research Questions¶
- Why some districts have more/less dogs than others?
- Where should I live if I want to have a dog in Vienna?
- Which dogs are in trend in the rich neighborhoods of Vienna?
- What is the dog situation in my district (revisited)?
Get the Data¶
Let’s repeat the same step we did in the previous tutorials to load the dog stats dataset.
# Load libraries
import pandas as pd # CSV file processing
import numpy as np # vector and matrix manipulation
# Load the csv file from the open data portal
# dataset description: https://www.data.gv.at/katalog/dataset/stadt-wien_anzahlderhundeprobezirkderstadtwien/resource/b8d97349-c993-486d-b273-362e0524f98c
data_path = 'https://www.wien.gv.at/finanzen/ogd/hunde-wien.csv'
# Look up the row file and specify the dataset format, e.g. delimiters
data = pd.read_csv(data_path, delimiter=';', skiprows=1, thousands=',', encoding='latin-1')
# Correct individual values in the dataset
data.loc[1914, 'Anzahl'] = 1510
data.loc[5347, 'Anzahl'] = 2460
# Carefully select the string separator, including spaces!
separate_breeds = data['Dog Breed'].str.split(' / ', expand=True)
separate_breeds.columns = ["Breed_1", "Breed_2"]
data = pd.concat([data, separate_breeds], axis=1)
# Correct encoding for special characters in german alphabet
def to_utf(x):
return x.encode('latin-1').decode('utf8') if isinstance(x, str) else x
data = data.applymap(to_utf)
# Aggregate
data = data.groupby(['DISTRICT_CODE', 'Breed_1'])['Anzahl'].aggregate(np.sum).reset_index()
data.columns = ["DISTRICT_CODE", "Dog Breed", "Dog Count"]
# Check the top of the table to make sure the dataset is loaded correctly
data.head()
Get More Data¶
Districts of any city (town) are not homogeneous in terms of area and population. Also in Vienna some districts are bigger in terms of territory, more densely populated or wealthier than other district. These factors shall also have an effect on the dog population within the specified regions.
Now we need to fetch this kind of data (population, area and average income per district in Vienna) in order to check its correlation with our dog stats data. https://www.data.gv.at is the biggest Austrian open data portal. Let’s have a look there!
Population Statistics per District¶
Population (Bevölkerung) seems just right for the general category, filtering on Vienna, CSV data format and community districts (Gemeindebezirke) boils down the search results to a single dataset!
Next, we just repeat the same steps we used previously to obtain the dog stats dataset:
# Load the csv file from the open data portal
# Dataset description: https://www.data.gv.at/katalog/dataset/stadt-wien_viebevlkerungseit1869wien/resource/c9e9d0cc-8f71-4521-b598-17d71fab2f59
data_path = 'https://www.wien.gv.at/statistik/ogd/vie_101.csv'
# Look up the row file and specify the dataset format, e.g. delimiters
data_population = pd.read_csv(data_path, delimiter=';', skiprows=1)
# Check the top of the table to make sure the dataset is loaded correctly
data_population.head()
The dataset title informs us that the dataset contains data dating from 1869 year. We would prefer more fresh population statistics to be able to compare with the dog stats that we have dating from 2016 12 01.
Unfortunately it is not clear from the dataset description which time periods it contains:
„VIE-Bevölkerung in den 23 Wiener Gemeindebezirken nach Geschlecht seit 1869
Zuletzt aktualisiert 31. März 2016
Erstellt 31. März 2016“
Let us examine the REF_DATE column, which seems to contain the reference date here.
# Check unique column values
data_population['REF_DATE'].unique()
The last measurement is from 2015 (same as on Wikipedia: https://de.wikipedia.org/wiki/Wiener_Gemeindebezirke#cite_note-11). Let’s take it.
# Slice dataset: filter specific rows and columns
population_2015 = data_population[(data_population['REF_DATE'] == 20150101)][['DISTRICT_CODE', 'POP_TOTAL']]
# Sort column
population_2015.sort_values(ascending=False, by='POP_TOTAL').head()
# Aggregate counts
district_dogs = data.groupby('DISTRICT_CODE')['Dog Count'].aggregate(np.sum)
# Join Data Frame and Series objects on the key column
dog_vs_people = population_2015.join(district_dogs, on='DISTRICT_CODE')
dog_vs_people.sort_values(ascending=False, by='POP_TOTAL').head()
dog_vs_people[['POP_TOTAL', 'Dog Count']].corr()
As expected, dog population is positively correlated with human population (dogs usually live together with people in Vienna).
Therefore, we need to consider the district population when comparing the number of dogs across districts.
2. Where should I live if I want to have a dog in Vienna?¶
dog_vs_people['Dogs_per_person'] = dog_vs_people['Dog Count'] / dog_vs_people['POP_TOTAL']
# Sort by column
dog_vs_people.sort_values(ascending=True, by='Dogs_per_person').round(3).head()
5th and 15th districts have the smallest proportion of dogs with respect to the number of inhabitants (18 dogs per 1000 people). We do not yet know why but it is not likely that dog-haters would agree to live closer to each other. There must be some other factors, such as necessary conditions required to have a dog, e.g. dog playgrounds, that influence the decision to have or not to have a dog.
# Sort by column
dog_vs_people.sort_values(ascending=False, by='Dogs_per_person').round(3).head()
# Load library for visualisation
import seaborn as sns
# Command to show plots in notebook
%matplotlib inline
ax = sns.barplot(palette='colorblind', data=dog_vs_people, y=dog_vs_people.DISTRICT_CODE, x=dog_vs_people.Dogs_per_person, orient='h')
ax.set(ylabel='District', xlabel='Dogs_per_person')
22nd and 21st districts have the biggest dog populations relative to the number of inhabitants in the district: 49 and 47 dogs per 1000 people respectively. Meaning that if you live in one of this districts you are more likely to have a dog.
Do these districts provide more favorable conditions for the dog owners? Or is it the kind of people that prefer to live in these areas also like to have dogs?
Let us explore the average income per district and whether it has an influence on the dog stats.
Get More Data¶
Income Statistics per District¶
We integrate another dataset for our dog statistics analysis, which luckily follows the same logic (data format) as the previous two datasets, since it is from the same publisher (wien.gv.at).
This dataset was relatively easy to find when searching with the keyword ‚income‘ (Einkommen) on the Austrian State Open Data portal https://www.data.gv.at.
# Load the csv file from the open data portal
# Dataset description: https://www.data.gv.at/katalog/dataset/stadt-wien_viewirtschaftwienergemeindebezirke1/resource/c1ad52a3-ca5c-44fc-a945-b6459f73ed98
data_path = 'https://www.wien.gv.at/statistik/ogd/vie_502.csv'
# The data format seems to be consistent for all the datasets from this publisher
data_income = pd.read_csv(data_path, delimiter=';', skiprows=1)
# Check the top of the table to make sure the dataset is loaded correctly
data_income.head()
# Check unique column values
data_income['REF_DATE'].unique()
The last measurement is from 2014. Let’s take it.
# Slice dataset: filter specific rows and columns
income_2014 = data_income[(data_income['REF_DATE'] == 20141231)][['DISTRICT_CODE', 'AVERAGE_INCOME_TOTAL']]
# Sort column
income_2014.sort_values(ascending=False, by='AVERAGE_INCOME_TOTAL').head()
All right, this is common knowledge 😉 The wealthiest are 1st and 13th districts of Vienna.
income_2014.sort_values(ascending=True, by='AVERAGE_INCOME_TOTAL').head()
The poorest are the 15th and 20th districts.
# Plot pandas data frame
ax = sns.barplot(data=income_2014,palette='colorblind', y='DISTRICT_CODE', x='AVERAGE_INCOME_TOTAL', orient='h')
ax.set(ylabel='District', xlabel='Average Income (thousands)')
Now let’s join all our datasets with district stats.
# Merge data frames on the key column
dog_vs_people_vs_income = dog_vs_people.merge(income_2014, on='DISTRICT_CODE')
dog_vs_people_vs_income.sort_values(ascending=False, by='AVERAGE_INCOME_TOTAL').head()
dog_vs_people_vs_income.corr()
There is no strong correlation between the average income and the number of dogs. Everyone likes to have a dog. Maybe a different kind of dog though…
Is there a correlation between a dog breed and the income of its owners?
3. Which dogs are in trend in the rich neighborhoods of Vienna?¶
We need to go back to the breed-level district statistics to answer this question.
# Filter out specific rows
filtered_dogs = data.loc[data['Dog Breed']!='Unbekannt']
filtered_dogs.head()
Next, we need the relative counts instead of the absolute counts for the breeds to be able to compare across districts.
district_stats = filtered_dogs.merge(dog_vs_people, on='DISTRICT_CODE')
district_stats['Breed_Percent'] = ((district_stats['Dog Count_x'] / district_stats['Dog Count_y']) * 100).astype(int)
# Drop 0 values
district_stats = district_stats[(district_stats['Breed_Percent']>0)]
district_stats.sort_values(ascending=False, by='Breed_Percent').head()
# Create a “pivot” table based on column values unfolding the original table stats
unfolded_breeds = district_stats.pivot_table(index='DISTRICT_CODE', columns='Dog Breed', values='Breed_Percent').fillna(0).astype(int)
unfolded_breeds.reset_index(level=0, inplace=True)
unfolded_breeds.head()
breeds_vs_income = unfolded_breeds.merge(income_2014, on='DISTRICT_CODE')
c = breeds_vs_income.corr()
s = c.unstack()
s.sort_values(ascending=False)['AVERAGE_INCOME_TOTAL']
That is it! Labradors and Toy-Pudels are positively correlated with higher incomes. Staff, Pit-Bull and Chihuahuas with lower.
4. Open Data for Local Communities¶
Let us analyse district statistics on the example of the 2nd district (Leopoldstadt).
You can replace the district code in the variable below to see the stats for your own district.
district = 90200
# Get the slice of the joint dataset according to the search query
dog_vs_people_vs_income.head()
district_dogs = dog_vs_people_vs_income.loc[dog_vs_people_vs_income.DISTRICT_CODE==district]
district_dogs
Let us compare this stats with the averages across all Vienna districts.
dog_vs_people_vs_income.describe()
2nd district is among the largest districts in Vienna in terms of population (in the 4th quantile: 95,986 < 101,702 < 189,713).
It is below average with respect to the official income statistics (in the 2nd quantile: 19,236 < 19,887 < 22,505).
There is a single dog per 40 inhabitants (25 dogs/1,000 people) which is the average in Vienna.
Lessons Learned¶
Dogs in Vienna¶
Based on the data available we were able to provide comprehensive answers to the set of research questions proposed in the introduction.
- The dogs are distributed propotional to the number of people across the districts.
- However, some districts seem to have fewer dogs than expected (5th and 15th) and some more (21th and 22th), which points to other factors that may influence the decision to acquire a dog pet, such as availability of dog playgrounds in the area or ability to sustain the dog, i.e. income of the owner.
- Labradors, Toy-Pudels, Schnauzer, Parson-Russell Terriers, Weimaraner, Flat Coated Retrievers, Magyar Vizsla and West Highland White Terriers are the favorites of the upscale districts in Vienna. This may correlate with the dog prices as well.
- Leopoldstadt is a big relatively poor district with an adequate proportion of dogs. Statistics on population and average income provides us additional insights into the situation in our neighborhouds, which may be useful for other types of analysis as well.
Open Data Quality¶
1. Descriptions do not provide enough information on what kind of data the dataset contains, e.g. the date span in our example.
2. Formatting is consistent for the same publisher, which is the good news. Especially if it is a big publisher with many datasets, such as governmental agency.
Process¶
1. Find datasets, e.g. CSV files from open data portals
2. Refine: identify column separator, thousands separator, rows to skip, string encoding, etc.
3. Aggregate: group by different attributes, e.g. district or type, and sum up the counts.
4. Join datasets on the common key, e.g. district code identifier.
5. Check correlation between the columns.
Inspired by¶
- Detail Analysis of various Hospital factors on Kaggle.
- Wheelan, Charles J. Naked Statistics. 2013