Lab 9#
The goal here is to practice joining datasets through pandas. Hint: The instructions here are intentionally incomplete.
Data Cleaning#
Download the dataset at Link. The original data had the following columns:
Id - Unique identifier for a person
Name - Person Name
Age - Age
Gender - M for Male and F for Female
Salary - Annual salary in USD
State - Two letter state identifier denoting one of the US States
Join-Date - Date of joining in the form yyyy-mm-dd
But now this data is chaotic with some of your suggestions. How do we clean such a dataset?
Guide:#
Inspect the Data
Check missing values
Fix formatting issues and spaces
Standardize categoricals
Handle special characters
Handle bad [illogical] data
Step 1#
Find an NYC dataset with a borough column.
Use Scout to filter by column name.
Don’t spend too long on this step.
Keep the dataset small (under 500,000-ish rows) to make it easier to work with.
What’s the URL of your dataset?
YOUR RESPONSE HERE
Step 2#
Save and load the dataset.
# your code here
Step 3#
Download and load the Population by Borough dataset.
# your code here
Step 4#
Use merge()
to combine the two, and output the resulting table.
# your code here
Step 5#
Using the two datasets above, use pandas to produce an aggregate per-capita statistic by borough.
The dataset you chose before may not work for this. That’s fine, pick another.
Hint#
You’re creating a “number of [thing] per capita by borough” table.
Do a
groupby()
on the original dataset.Join with the populations by borough.
Compute the per-capita values as a new column.
# your code here