Get the Data¶
Let’s load a dataset and make sure it is loaded correctly, so that we can use it for our analysis. This is often not as trivial as it may seem and involves a lot of data wrangling and debugging to find and eliminate possible errors or inconsistencies in the dataset.
This step should not be underestimated since it defines the final result of our data analysis.
Remember: “Garbage in, garbage out!”
1. Delimiter¶
# Load the library for CSV file processing
import pandas as pd
# Load the csv file from the open data portal
# dataset description: https://www.data.gv.at/katalog/dataset/stadt-wien_anzahlderhundeprobezirkderstadtwien/resource/b8d97349-c993-486d-b273-362e0524f98c
data_path = 'https://www.wien.gv.at/finanzen/ogd/hunde-wien.csv'
# Look at the raw file and specify the dataset format, e.g. delimiters
data = pd.read_csv(data_path, delimiter=';', encoding='latin-1')
data.head()
Oops, the first row of our CSV file contains the name of the table, which we should skip.
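A quick way to spot such stray title lines is to print the first few raw lines before calling read_csv. A minimal sketch on invented data (for the real file, read and print the first lines of the downloaded CSV the same way):

```python
# Toy stand-in for the raw CSV: a title line precedes the real header,
# which is why a row must be skipped when loading (content invented)
raw_text = ("Anzahl der Hunde pro Bezirk der Stadt Wien\n"
            "DISTRICT_CODE;Dog Breed;Anzahl\n"
            "90100;Mops;12\n")

# Peek at the first lines before choosing the read_csv parameters
for line in raw_text.splitlines()[:2]:
    print(repr(line))
```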
2. Skip rows¶
# Reload dataset
data = pd.read_csv(data_path, delimiter=';', encoding='latin-1', skiprows=1)
# Check the top of the table to make sure the dataset is loaded correctly
data.head()
The top rows look just fine. But what about the rest?
3. Dataset statistics¶
data.shape
Our dataset contains 5740 rows and 9 columns.
# Check the column types to make sure the dataset is loaded correctly
data.dtypes
Count (Anzahl) is not recognized as numeric data. We shall fix this!
4. Numeric data¶
# Check the rows that are not recognized as numeric
data[data.Anzahl.apply(lambda x: not x.isnumeric())]
Hmmm, that looks interesting: 1,51 of a dog. What could that possibly mean?
Maybe the trailing zeros of the thousands got accidentally dropped?
This does indeed look like a plausible explanation for these numbers. Let’s try to fix this.
# Reload dataset
data = pd.read_csv(data_path, delimiter=';', encoding='latin-1', skiprows=1, thousands=',')
# Check the column types to make sure the dataset is loaded correctly
data.info()
Looks good: count is recognized as integer (Anzahl int64).
But I want to make sure the numbers were translated correctly. Let’s find these rows.
# Filter rows based on multiple columns
data[(data['Dog Breed']=='Unbekannt') & (data['Postal_CODE']==1100)]
Oh no, that did not work like magic. We need to fix these cases manually.
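The likely reason: thousands=',' simply strips the comma, so a value like '1,51' is parsed as 151 rather than the intended 1510. A toy sketch of that behavior (all values invented):

```python
import io
import pandas as pd

# Toy CSV with the same quirk: a thousands comma in the count column
csv_text = "Breed;Anzahl\nUnbekannt;1,51\nMischling;2,46\n"

# thousands=',' removes the comma during parsing, so '1,51' -> 151,
# which is why the affected counts still need a manual correction
df = pd.read_csv(io.StringIO(csv_text), delimiter=';', thousands=',')
print(df['Anzahl'].tolist())  # [151, 246]
```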
# Correct individual values in the dataset
data.loc[1914, 'Anzahl'] = 1510
data.loc[5347, 'Anzahl'] = 2460
# Filter rows based on multiple columns
data[(data['Dog Breed']=='Unbekannt') & (data['Postal_CODE']==1100)]
5. Strings¶
Strings (textual data) can be also tricky to handle. Let’s have a closer look at the column containing dog breeds.
data['Dog Breed'].unique()
It seems that the mixed breeds (different breeds of the parents) are entered using the sign '/'. See row 3: Amerikanischer Staffordshire-Terrier / Mischling. Let’s separate such strings into several columns to be able to aggregate this kind of data easily.
# Carefully select the string separator, including spaces!
separate_breeds = data['Dog Breed'].str.split(' / ', expand=True)
separate_breeds.head()
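The spaces around '/' matter: splitting on a bare '/' would leave stray whitespace in the breed names. A toy comparison:

```python
import pandas as pd

breeds = pd.Series(['Amerikanischer Staffordshire-Terrier / Mischling'])

# Splitting on bare '/' leaves trailing/leading spaces in the parts
sloppy = breeds.str.split('/', expand=True)
print(repr(sloppy.iloc[0, 0]))  # 'Amerikanischer Staffordshire-Terrier '

# Including the surrounding spaces in the separator yields clean names
clean = breeds.str.split(' / ', expand=True)
print(repr(clean.iloc[0, 0]))   # 'Amerikanischer Staffordshire-Terrier'
```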
OK. Let’s give names to the new columns and add them to our dataset.
separate_breeds.columns = ["Breed_1", "Breed_2"]
data_breeds = pd.concat([data, separate_breeds], axis=1)
# Check the bottom of the table for a change
data_breeds.tail()
Oops, some of the characters are not recognized correctly; we shall fix that!
# Correct encoding for special characters in the German alphabet
def to_utf(x):
    return x.encode('latin-1').decode('utf8') if isinstance(x, str) else x
data_breeds_corrected = data_breeds.applymap(to_utf)
data_breeds_corrected.tail()
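Why this works: the file’s UTF-8 bytes were decoded as Latin-1, producing the garbled pairs; encoding back to Latin-1 recovers the original bytes, which then decode correctly as UTF-8. A minimal round trip (the breed name is just an example):

```python
# 'ä' is the UTF-8 byte pair 0xC3 0xA4; decoded as Latin-1 it shows up
# as the two characters 'Ã¤' -- the round trip below undoes that
mojibake = 'SchÃ¤ferhund'
repaired = mojibake.encode('latin-1').decode('utf8')
print(repaired)  # Schäferhund
```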
6. Understand your data¶
# Check descriptive statistics
data_breeds_corrected.describe()
The district and postal codes range from 90100 (1010) to 92300 (1230) indicating the 23 districts of Vienna (Wiener Bezirke).
data_breeds_corrected['Ref_Date'].unique()
All rows describe the data for a single reference date: 2016 12 01. Since the date format is not explicitly specified, it is not clear though whether it is the 1st of December or the 12th of January.
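If the raw value is the digit string '20161201' (an assumption about this dataset), parsing it with an explicit format avoids any day/month guessing:

```python
import pandas as pd

# Parse with an explicit format instead of letting pandas guess;
# '%Y%m%d' assumes the order year-month-day, i.e. 1 December 2016
ref_date = pd.to_datetime('20161201', format='%Y%m%d')
print(ref_date.date())  # 2016-12-01
```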
Essentially the dataset boils down to the information: District | Dog Breed | Dog Count
To simplify further analysis we drop information about the mixed breeds and take the first breed as the main dog breed.
# Load the library for vector/matrix manipulations
import numpy as np
data_breed_1 = data_breeds_corrected.groupby(['DISTRICT_CODE', 'Breed_1'])['Anzahl'].aggregate(np.sum).reset_index()
data_breed_1
Now the data is ready for analysis!
Lessons Learned¶
Open Data Quality¶
1. Descriptions are usually absent or incomplete.
You need to look inside the file and spend some time to understand what the dataset contains.
2. Column names in the header are cryptic, e.g., NUTS3.
3. Correct formatting is not granted either.
You need to figure out the correct way to load the dataset including delimiters, encoding etc.
Steps¶
1. Find a dataset, e.g. a CSV file from an open data portal.
2. Refine:
2.1. Process CSV: identify column separator, thousands separator, rows to skip, string encoding, etc.
2.2. Identify numeric data, e.g. sums or counts, that can be compared and aggregated.
2.3. Split strings into several columns, when necessary.
2.4. Correct some of the data values manually, when necessary.
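The refinement steps 2.1–2.3 can be sketched end-to-end on a toy file that reproduces the quirks found above (all values invented; the real dataset additionally needs the manual count fixes of step 2.4 and the encoding repair):

```python
import io
import pandas as pd

# Toy file with the quirks from the walkthrough: a title row, ';' as
# delimiter, a thousands comma, and a ' / '-separated mixed breed
raw = ("Hunde in Wien\n"
       "DISTRICT_CODE;Dog Breed;Anzahl\n"
       "90100;Mops;1,234\n"
       "90200;Terrier / Mischling;7\n")

# 2.1: column separator, skipped title row, thousands separator
df = pd.read_csv(io.StringIO(raw), delimiter=';', skiprows=1, thousands=',')

# 2.3: split mixed breeds into separate columns
parts = df['Dog Breed'].str.split(' / ', expand=True)
df['Breed_1'] = parts[0]
df['Breed_2'] = parts[1]
print(df[['DISTRICT_CODE', 'Breed_1', 'Anzahl']])
```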
Prerequisites¶
To run this script (Download) on a local machine you need:
- Python 3.4.
- pandas
- numpy
Inspired by¶
- Kaggle project: Detail Analysis of various Hospital factors
- Wheelan, Charles J. Naked Statistics. 2013
This code is on GitHub!