Data Cleaning¶
Make messy data.
Open this shared 311 data.
Make a couple of edits to mess it up.
Download the data.
Click File.
Click Download.
Click Comma Separated Values (.csv).
The data is now messy because of everyone's edits. How do we clean a dataset like this?
Guide¶
Inspect the Data
Check missing values
Fix formatting issues and spaces
Standardize categoricals
Handle special characters
Handle bad [illogical] data
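The steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the one right way to do it: the sample rows, the column names, and the `response_hours` column are all made up to stand in for the messed-up 311 CSV, and your downloaded file will need its own rules.

```python
import io
import pandas as pd

# Hypothetical messy sample standing in for the downloaded CSV.
raw = io.StringIO(
    "Complaint Type,Borough,Created Date,Response Hours\n"
    "Noise , brooklyn,2024-01-05,4\n"
    "noise,BROOKLYN ,2024-01-06,-3\n"
    "Heat/Hot Water!!,Queens,not a date,12\n"
    "Heat/Hot Water,,2024-01-08,7\n"
    "Noise,Queens,2024-01-09,2\n"
)
df = pd.read_csv(raw)

# 1. Inspect the data: shape, dtypes, and a peek at the rows.
df.info()
print(df.head())

# 2. Check missing values per column.
print(df.isna().sum())

# 3. Fix formatting issues and spaces (column names first, then text values).
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
for col in ["complaint_type", "borough"]:
    df[col] = df[col].str.strip()

# 4. Handle special characters: keep letters, digits, whitespace, and "/".
df["complaint_type"] = df["complaint_type"].str.replace(
    r"[^\w\s/]", "", regex=True
)

# 5. Standardize categoricals so "noise" and "Noise" become one value.
df["complaint_type"] = df["complaint_type"].str.title()
df["borough"] = df["borough"].str.title()

# 6. Handle bad (illogical) data: unparseable dates become NaT,
#    and negative durations are dropped along with incomplete rows.
df["created_date"] = pd.to_datetime(df["created_date"], errors="coerce")
clean = df[df["response_hours"] >= 0].dropna(subset=["borough", "created_date"])

print(clean)
```

Whether to drop, fill, or flag a bad row is a judgment call that depends on the analysis. Dropping is used here only because it is the shortest thing to show.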
Merging data¶
This is totally separate from the Data Cleaning above.
The instructions here are intentionally incomplete.
Step 0¶
Find an NYC dataset with a borough column.
Use Scout to filter by column name.
Don’t spend too long on this step.
Keep the dataset small (under 500,000-ish rows) to make it easier to work with.
What’s the URL of your dataset?
YOUR RESPONSE HERE
Step 1¶
Save and load the dataset.
# your code here
Step 2¶
Download and load the Population by Borough dataset.
# your code here
# your code here
Step 4¶
Using the two datasets above, use pandas to produce an aggregate per-capita statistic by borough.
The dataset you chose before may not work for this. That’s fine; pick another.
Hint¶
You’re creating a “number of [thing] per capita by borough” table.
Do a groupby() on the original dataset.
Join with the populations by borough.
Compute the per-capita values as a new column.
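The shape of those three steps, on toy data, might look like the sketch below. The `events` and `population` frames and their column names are hypothetical, and the population numbers are placeholders rather than real figures. Your own datasets will have different columns.

```python
import pandas as pd

# Hypothetical dataset: one row per event, with a borough column.
events = pd.DataFrame(
    {"borough": ["Bronx", "Bronx", "Queens", "Queens", "Queens", "Manhattan"]}
)

# Placeholder populations (NOT real figures), one row per borough.
population = pd.DataFrame(
    {"borough": ["Bronx", "Queens", "Manhattan"], "population": [1000, 2000, 500]}
)

# 1. groupby(): count events per borough.
counts = events.groupby("borough").size().reset_index(name="count")

# 2. Join with the populations by borough.
merged = counts.merge(population, on="borough", how="left")

# 3. Compute the per-capita value as a new column.
merged["per_capita"] = merged["count"] / merged["population"]

print(merged)
```

The join only matches when the borough values are spelled identically in both tables, so it is worth comparing the unique values (case, spacing, abbreviations) before merging. With `how="left"`, a borough that fails to match shows up with a missing population instead of silently disappearing.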
# your code here