Data Analysis Tutorial

Overview

The goal of this tutorial is to walk through the important first steps of a data analysis.

The typical analysis pipeline goes through the following stages:

  1. Think about the data you would like

  2. Either find a way to collect that data, or find data that already exists
    • Sometimes you have to compromise on the data you wanted, because it is easier to use a dataset that already exists
    • I have provided links to datasets above.
    • For this tutorial, there is a Titanic dataset
  3. Write code that takes the data from a file or database and loads it into a data structure
    • We will be using Pandas, a data management library
    • Pandas makes manipulating data really easy
  4. Write code that puts the data into different forms that match the task you want to do.
    • For instance, if you want to view interesting properties of your data as a scatter plot, you need to get two lists: one for the x positions and one for the y positions
    • You should be thinking about what kinds of things the data can tell you
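As a minimal sketch of step 4, here is one way to turn a table into the two lists a scatter plot needs. The tiny DataFrame below is made up for illustration (it is not the real Titanic data); only the column names 'Age' and 'Survived' are taken from this tutorial.

```python
import pandas as pd

# Hypothetical mini-table standing in for a loaded dataset (values are invented)
toy = pd.DataFrame({
    'Age': [22.0, 38.0, 26.0, 35.0],
    'Survived': [0, 1, 1, 1],
})

# A scatter plot needs one list of x positions and one list of y positions
x_positions = toy['Age'].tolist()
y_positions = toy['Survived'].tolist()

print(x_positions)
print(y_positions)
```

The two lists have the same length, one entry per row, so they could be handed directly to a plotting call such as plt.scatter.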

I will be writing this tutorial while looking at the Titanic dataset. The Titanic dataset is a list of passengers, information about them, and whether or not they survived.

Getting the Data

I have made the data easy to get:

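from urllib import request
import pandas as pd

# URL of the raw Titanic CSV file
filepath = 'https://gist.githubusercontent.com/braingineer/5d15057ac482ee0130b6d0e6f9cc9311/raw/d4eefaecc98b342ec578cf3512184556e8856750/titanic.csv'
# Download the file and parse it into a DataFrame
response = request.urlopen(filepath)
df = pd.read_csv(response)
# Replace missing values with 0 (so a missing age shows up as Age == 0)
df = df.fillna(0)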

Using Pandas and Matplotlib

Some example tutorials

  1. Simple Graphics
  2. Beautiful Plots

Some simple operations

Selecting a column

age_column = df['Age']

Selecting a subset

df2 = df[age_column > 0]
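
To see why this works: comparing a column to a number produces a column of True/False values, and indexing the DataFrame with it keeps only the True rows. A small sketch on made-up data (the values are invented; the 0 ages stand in for missing ages that fillna(0) replaced):

```python
import pandas as pd

# Hypothetical mini-table; Age == 0 marks a row whose age was missing
toy = pd.DataFrame({
    'Age': [22.0, 0.0, 38.0, 0.0],
    'Survived': [0, 1, 1, 0],
})

mask = toy['Age'] > 0   # one boolean per row
subset = toy[mask]      # keeps only the rows where the mask is True

print(mask.tolist())
print(len(subset))
```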

View the columns

print(df2.columns)

Visualize a scatter plot

import matplotlib.pyplot as plt
plt.scatter(df2['Survived'], df2['Age']);
# or with the columns pulled out first
surv_col = df2['Survived']
age_col = df2['Age']
plt.scatter(surv_col, age_col);

Seaborn

If you don’t already have seaborn, install it by running the following in a single cell of your Jupyter Notebook:

!pip install seaborn

Then, you can do the following:

import seaborn as sns
sns.barplot(data=df, x='Pclass', y='Survived')

You can see more examples of seaborn plots at the seaborn website.

Some examples to get you started:

sns.countplot(data=df, x='Sex', hue='Survived')

### do these in different cells, otherwise they will try to plot on top of each other
### catplot is the current name for factorplot (renamed in seaborn 0.9)
sns.catplot(data=df, x='Pclass', y='Age', col='Sex', kind='swarm', hue='Survived', order=[1, 2, 3])

Science

To use data for science, you want to summarize what happened. In other words, you want to tell a story with the data. To do this, you have to look at its different properties: counts, means, proportions, etc.

A good way to formulate a scientific question is to think about different groups. If the rate at which something happens differs between two groups, then there is an effect of group.
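
A rough sketch of comparing a rate between two groups, assuming a 0/1 outcome column. The data below is invented for illustration and is not the real Titanic data; the column names 'Sex' and 'Survived' follow the ones used in this tutorial.

```python
import pandas as pd

# Hypothetical data: four passengers per group, values invented
toy = pd.DataFrame({
    'Sex': ['male', 'male', 'male', 'male',
            'female', 'female', 'female', 'female'],
    'Survived': [0, 0, 0, 1, 1, 1, 1, 0],
})

# The mean of a 0/1 column is the rate at which the 1s occur
rates = toy.groupby('Sex')['Survived'].mean()

# A nonzero difference suggests the group matters in this sample
difference = rates['female'] - rates['male']

print(rates)
print(difference)
```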

Some terminology

  1. Proportion: A proportion is a number between 0 and 1 that signifies the part-to-whole relationship.
    • If you eat half of a cake, the proportion you ate is 0.5
  2. Percentage: A percentage is a number between 0 and 100 that signifies the part-to-whole relationship.
    • If you eat half of a cake, the percentage you ate is 50%
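
The two terms differ only by a factor of 100. A minimal sketch in plain Python, using an invented list of 0/1 outcomes:

```python
# Hypothetical outcomes: 1 = survived, 0 = did not survive (values invented)
survived = [1, 0, 1, 0]

# Part divided by whole gives a proportion between 0 and 1
proportion = sum(survived) / len(survived)

# Multiplying by 100 turns the proportion into a percentage
percentage = proportion * 100

print(proportion)
print(percentage)
```

With a pandas column of 0/1 values, calling .mean() on the column gives the same proportion.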

Questions you can ask

  1. How many people were on the Titanic?
  2. What percentage of the passengers did not survive?
  3. How many of the passengers were male? How many were female?
  4. How many male passengers survived? How many female? Is there an interesting relationship?
  5. What is the proportion of 3rd class passengers who survived?
  6. Does class affect the survival rate differently for each gender?
  7. What is the mean age per class?
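
One possible way to approach these questions with pandas, sketched on a tiny invented table rather than the real dataset. The column names match the ones used in this tutorial; every value is made up, so the numbers say nothing about the actual Titanic.

```python
import pandas as pd

# Hypothetical six-passenger table (values invented for illustration)
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'male', 'female'],
    'Pclass': [3, 1, 3, 1, 2, 2],
    'Age': [20.0, 40.0, 30.0, 60.0, 25.0, 35.0],
    'Survived': [0, 1, 1, 0, 0, 1],
})

total_passengers = len(toy)                                  # question 1
percent_not_survived = (toy['Survived'] == 0).mean() * 100   # question 2
count_by_sex = toy['Sex'].value_counts()                     # question 3
survivors_by_sex = toy.groupby('Sex')['Survived'].sum()      # question 4

third_class = toy[toy['Pclass'] == 3]
third_class_survival = third_class['Survived'].mean()        # question 5

# Survival rate for every class/gender combination
rate_by_class_and_sex = toy.groupby(['Pclass', 'Sex'])['Survived'].mean()  # question 6

mean_age_by_class = toy.groupby('Pclass')['Age'].mean()      # question 7

print(total_passengers, percent_not_survived, third_class_survival)
print(mean_age_by_class)
```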

Additional setup

A version I was working with that renames a column and cleans the dataset:

from urllib import request
import pandas as pd
import seaborn as sns
%matplotlib inline
filepath = 'https://gist.githubusercontent.com/braingineer/5d15057ac482ee0130b6d0e6f9cc9311/raw/d4eefaecc98b342ec578cf3512184556e8856750/titanic.csv'
response = request.urlopen(filepath)
df = pd.read_csv(response)
df = df.fillna(0)
# Rename the 'Pclass' column to 'Class'
df = df.rename(columns={'Pclass': 'Class'})
# Keep only rows with a known age (missing ages were filled with 0 above)
df_clean = df[df['Age'] > 0]

And a couple of extra plots I was looking at:

### super fancy
### catplot is the current name for factorplot (renamed in seaborn 0.9)
sns.catplot(data=df_clean, kind='violin', split=True, inner='stick', scale='count', x='Class', y='Age', hue='Survived', col='Sex')

### really sad
sns.catplot(data=df_clean, kind='bar', col='Class', x='SibSp', y='Age', hue='Survived', row='Sex')