Get the Data¶
Let’s load a dataset and make sure it is loaded correctly, so that we can use it for our analysis. This is often not as trivial as it may seem and involves a lot of data wrangling and debugging to find and eliminate possible errors or inconsistencies in the dataset.
This step should not be underestimated since it defines the final result of our data analysis.
Remember: “Garbage in, garbage out!”
1. Delimiter¶
# Load the library for CSV file processing
import pandas as pd
# Load the csv file from the open data portal
# dataset description: https://www.data.gv.at/katalog/dataset/stadt-wien_anzahlderhundeprobezirkderstadtwien/resource/b8d97349-c993-486d-b273-362e0524f98c
data_path = 'https://www.wien.gv.at/finanzen/ogd/hunde-wien.csv'
# Look at the raw file and specify the dataset format, e.g. delimiters
data = pd.read_csv(data_path, delimiter=';', encoding='latin-1')
data.head()
Oops, the first row of our CSV file contains the name of the table, which we should skip.
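A quick way to spot such stray title lines is to print the first few raw lines before calling read_csv. A minimal sketch on invented data (for the real file, read and print the first lines of the downloaded CSV the same way):

```python
# Toy stand-in for the raw CSV: a title line precedes the real header,
# which is why a row must be skipped when loading (content invented)
raw_text = ("Anzahl der Hunde pro Bezirk der Stadt Wien\n"
            "DISTRICT_CODE;Dog Breed;Anzahl\n"
            "90100;Mops;12\n")

# Peek at the first lines before choosing the read_csv parameters
for line in raw_text.splitlines()[:2]:
    print(repr(line))
```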
2. Skip rows¶
# Reload dataset
data = pd.read_csv(data_path, delimiter=';', encoding='latin-1', skiprows=1)
# Check the top of the table to make sure the dataset is loaded correctly
data.head()
The top rows look just fine. But what about the rest?
3. Dataset statistics¶
data.shape
Our dataset contains 5740 rows and 9 columns.
# Check the column types to make sure the dataset is loaded correctly
data.dtypes
Count (Anzahl) is not recognized as numeric data. We shall fix this!
4. Numeric data¶
# Check the rows that are not recognized as numeric
data[data.Anzahl.apply(lambda x: not x.isnumeric())]
Hmmm, that looks interesting: 1,51 of a dog. What could that possibly mean?
Maybe the trailing zeros of the thousands got accidentally dropped?
This does indeed look like a plausible explanation for these numbers. Let’s try to fix this.
# Reload dataset
data = pd.read_csv(data_path, delimiter=';', encoding='latin-1', skiprows=1, thousands=',')
# Check the column types to make sure the dataset is loaded correctly
data.info()
Looks good: count is recognized as integer (Anzahl int64).
But I want to make sure the numbers were translated correctly. Let’s find these rows.
# Filter rows based on multiple columns
data[(data['Dog Breed']=='Unbekannt') & (data['Postal_CODE']==1100)]
Oh no, that did not work like magic. We need to fix these cases manually.
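The likely reason: thousands=',' simply strips the comma, so a value like '1,51' is parsed as 151 rather than the intended 1510. A toy sketch of that behavior (all values invented):

```python
import io
import pandas as pd

# Toy CSV with the same quirk: a thousands comma in the count column
csv_text = "Breed;Anzahl\nUnbekannt;1,51\nMischling;2,46\n"

# thousands=',' removes the comma during parsing, so '1,51' -> 151,
# which is why the affected counts still need a manual correction
df = pd.read_csv(io.StringIO(csv_text), delimiter=';', thousands=',')
print(df['Anzahl'].tolist())  # [151, 246]
```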
# Correct individual values in the dataset
data.loc[1914, 'Anzahl'] = 1510
data.loc[5347, 'Anzahl'] = 2460
# Filter rows based on multiple columns
data[(data['Dog Breed']=='Unbekannt') & (data['Postal_CODE']==1100)]
5. Strings¶
Strings (textual data) can be also tricky to handle. Let’s have a closer look at the column containing dog breeds.
data['Dog Breed'].unique()
It seems that the mixed breeds (different breeds of the parents) are entered using the sign '/'. See row 3: Amerikanischer Staffordshire-Terrier / Mischling. Let’s separate such strings into several columns to be able to aggregate this kind of data easily.
# Carefully select the string separator, including spaces!
separate_breeds = data['Dog Breed'].str.split(' / ', expand=True)
separate_breeds.head()
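The spaces around '/' matter: splitting on a bare '/' would leave stray whitespace in the breed names. A toy comparison:

```python
import pandas as pd

breeds = pd.Series(['Amerikanischer Staffordshire-Terrier / Mischling'])

# Splitting on bare '/' leaves trailing/leading spaces in the parts
sloppy = breeds.str.split('/', expand=True)
print(repr(sloppy.iloc[0, 0]))  # 'Amerikanischer Staffordshire-Terrier '

# Including the surrounding spaces in the separator yields clean names
clean = breeds.str.split(' / ', expand=True)
print(repr(clean.iloc[0, 0]))   # 'Amerikanischer Staffordshire-Terrier'
```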
OK. Let’s give names to the new columns and add them to our dataset.
separate_breeds.columns = ["Breed_1", "Breed_2"]
data_breeds = pd.concat([data, separate_breeds], axis=1)
# Check the bottom of the table for a change
data_breeds.tail()
Oops, some of the characters are not recognized correctly; we shall fix that!
# Correct encoding for special characters in the German alphabet
def to_utf(x):
    return x.encode('latin-1').decode('utf8') if isinstance(x, str) else x
data_breeds_corrected = data_breeds.applymap(to_utf)
data_breeds_corrected.tail()
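Why this works: the file’s UTF-8 bytes were decoded as Latin-1, producing the garbled pairs; encoding back to Latin-1 recovers the original bytes, which then decode correctly as UTF-8. A minimal round trip (the breed name is just an example):

```python
# 'ä' is the UTF-8 byte pair 0xC3 0xA4; decoded as Latin-1 it shows up
# as the two characters 'Ã¤' -- the round trip below undoes that
mojibake = 'SchÃ¤ferhund'
repaired = mojibake.encode('latin-1').decode('utf8')
print(repaired)  # Schäferhund
```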
6. Understand your data¶
# Check descriptive statistics
data_breeds_corrected.describe()
The district and postal codes range from 90100 (1010) to 92300 (1230) indicating the 23 districts of Vienna (Wiener Bezirke).
data_breeds_corrected['Ref_Date'].unique()
All rows describe the data for a single reference date: 2016 12 01. Since the date format is not explicitly specified, it is not clear though whether it is the 1st of December or the 12th of January.
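If the raw value is the digit string '20161201' (an assumption about this dataset), parsing it with an explicit format avoids any day/month guessing:

```python
import pandas as pd

# Parse with an explicit format instead of letting pandas guess;
# '%Y%m%d' assumes the order year-month-day, i.e. 1 December 2016
ref_date = pd.to_datetime('20161201', format='%Y%m%d')
print(ref_date.date())  # 2016-12-01
```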
Essentially the dataset boils down to the information: District | Dog Breed | Dog Count
To simplify further analysis we drop information about the mixed breeds and take the first breed as the main dog breed.
# Load the library for vector/matrix manipulations
import numpy as np
data_breed_1 = data_breeds_corrected.groupby(['DISTRICT_CODE', 'Breed_1'])['Anzahl'].aggregate(np.sum).reset_index()
data_breed_1
Now the data is ready for analysis!
Lessons Learned¶
Open Data Quality¶
1. Descriptions are usually absent or incomplete.
You need to look inside the file and spend some time to understand what the dataset contains.
2. Column names in the header are cryptic, e.g., NUTS3.
3. Correct formatting is not granted either.
You need to figure out the correct way to load the dataset including delimiters, encoding etc.
Steps¶
1. Find a dataset, e.g. a CSV file from an open data portal.
2. Refine:
2.1. Process CSV: identify column separator, thousands separator, rows to skip, string encoding, etc.
2.2. Identify numeric data, e.g. sums or counts, that can be compared and aggregated.
2.3. Split strings into several columns, when necessary.
2.4. Correct some of the data values manually, when necessary.
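The refinement steps 2.1–2.3 can be sketched end-to-end on a toy file that reproduces the quirks found above (all values invented; the real dataset additionally needs the manual count fixes of step 2.4 and the encoding repair):

```python
import io
import pandas as pd

# Toy file with the quirks from the walkthrough: a title row, ';' as
# delimiter, a thousands comma, and a ' / '-separated mixed breed
raw = ("Hunde in Wien\n"
       "DISTRICT_CODE;Dog Breed;Anzahl\n"
       "90100;Mops;1,234\n"
       "90200;Terrier / Mischling;7\n")

# 2.1: column separator, skipped title row, thousands separator
df = pd.read_csv(io.StringIO(raw), delimiter=';', skiprows=1, thousands=',')

# 2.3: split mixed breeds into separate columns
parts = df['Dog Breed'].str.split(' / ', expand=True)
df['Breed_1'] = parts[0]
df['Breed_2'] = parts[1]
print(df[['DISTRICT_CODE', 'Breed_1', 'Anzahl']])
```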
Prerequisites¶
To run this script (Download) on a local machine you need:
- Python 3.4.
- pandas
- numpy
Inspired by¶
- Kaggle project: Detail Analysis of various Hospital factors
- Wheelan, Charles J. Naked Statistics. 2013
This code is on GitHub!