Data Cleaning

  1. Make messy data.

    1. Open this shared 311 data.

    2. Make a couple of edits to mess it up.

  2. Download the data.

    1. Click File.

    2. Click Download.

    3. Click Comma Separated Values (.csv).

The data is now messy, thanks to everyone's edits. How do we clean a dataset like this?

Guide

  • Inspect the Data

  • Check missing values

  • Fix formatting issues and spaces

  • Standardize categoricals

  • Handle special characters

  • Handle bad (illogical) data
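As a rough illustration of the checklist above, here is a minimal pandas sketch on a tiny made-up table. The column names (`complaint_type`, `borough`, `created_date`, `response_hours`) and all values are assumptions for demonstration only, not the real 311 schema, and the right fix for each problem will depend on how your copy of the data was messed up.

```python
import pandas as pd

# Hypothetical messy sample; column names and values are illustrative only.
df = pd.DataFrame(
    {
        "complaint_type": [" Noise ", "noise", "HEAT/HOT WATER", None, "Heat/Hot Water"],
        "borough": ["BROOKLYN", "brooklyn ", "Bronx", "QUEENS", "Br\u00f8nx"],
        "created_date": ["2024-01-05", "2024-01-06", "not a date", "2024-01-08", "2024-01-09"],
        "response_hours": [4.0, -2.0, 12.0, None, 7.5],
    }
)

# 1. Inspect the data: column types, non-null counts, and the first rows.
df.info()
print(df.head())

# 2. Check missing values per column.
missing_counts = df.isna().sum()
print(missing_counts)

# 3. Fix formatting issues and spaces: trim whitespace, parse dates.
for col in ["complaint_type", "borough"]:
    df[col] = df[col].str.strip()
# Unparseable dates become NaT instead of raising an error.
df["created_date"] = pd.to_datetime(df["created_date"], errors="coerce")

# 4. Standardize categoricals: use one casing convention per column.
df["complaint_type"] = df["complaint_type"].str.title()
df["borough"] = df["borough"].str.upper()

# 5. Handle special characters: replace a stray non-ASCII letter.
df["borough"] = df["borough"].str.replace("\u00d8", "O", regex=False)

# 6. Handle bad (illogical) data: a response time cannot be negative.
is_plausible = df["response_hours"].isna() | (df["response_hours"] >= 0)
cleaned = df[is_plausible].copy()

print(cleaned)
```

Dropping the illogical row is only one option; depending on the analysis, you might instead set the bad value to missing or go back to the source to correct it.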

Merging data

  • This is totally separate from the Data Cleaning above.

  • The instructions here are intentionally incomplete.

Step 0

Find an NYC dataset with a borough column.

  • Use Scout to filter by column name.

  • Don’t spend too long on this step.

  • Keep the dataset small (under roughly 500,000 rows) to make it easier to work with.

What’s the URL of your dataset?

YOUR RESPONSE HERE

Step 1

Save and load the dataset.

# your code here

Step 2

Download and load the Population by Borough dataset.

# your code here

Step 3

Use merge() to combine the two, and output the resulting table.
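For orientation, here is a minimal sketch of `merge()` on two toy tables. Both tables, their column names (`borough`, `Borough`, `Population`), and the population figures are hypothetical placeholders, so expect to adapt the key columns to whatever your datasets actually use.

```python
import pandas as pd

# Hypothetical stand-in for the dataset chosen in Step 0.
trees = pd.DataFrame(
    {
        "tree_id": [1, 2, 3, 4],
        "borough": ["Bronx", "Brooklyn", "Brooklyn", "Queens"],
    }
)

# Hypothetical stand-in for the population table; the numbers are placeholders.
population = pd.DataFrame(
    {
        "Borough": ["Bronx", "Brooklyn", "Manhattan"],
        "Population": [1000, 2000, 1500],
    }
)

# Left merge keeps every row of `trees`; the key columns have different names,
# so they are given explicitly. `indicator=True` adds a `_merge` column that
# shows which rows found a match.
merged = trees.merge(
    population,
    how="left",
    left_on="borough",
    right_on="Borough",
    indicator=True,
)

print(merged)
print(merged["_merge"].value_counts())
```

Rows marked `left_only` in `_merge` found no matching borough, which usually points to a spelling or casing mismatch in the key column (or, as here, a borough missing from the second table).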

# your code here

Step 4

Using the two datasets above, produce an aggregate per-capita statistic by borough with pandas.

The dataset you chose before may not work for this. That’s fine; pick another.

Hint

You’re creating a “number of [thing] per capita by borough” table.

  1. Do a groupby() on the original dataset.

  2. Join with the populations by borough.

  3. Compute the per-capita values as a new column.
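The three steps in the hint can be sketched on toy data as follows. The tables, column names, and population figures are made up for illustration; your own datasets will need their own key columns and a statistic that makes sense for them.

```python
import pandas as pd

# Hypothetical event-level dataset: one row per complaint.
complaints = pd.DataFrame(
    {
        "complaint_id": list(range(6)),
        "borough": ["Bronx", "Bronx", "Brooklyn", "Queens", "Queens", "Queens"],
    }
)

# Hypothetical population table; the numbers are placeholders, not real counts.
population = pd.DataFrame(
    {
        "borough": ["Bronx", "Brooklyn", "Queens"],
        "population": [1000, 2000, 1500],
    }
)

# 1. groupby(): count rows per borough.
counts = complaints.groupby("borough").size().reset_index(name="complaints")

# 2. Join the counts with the populations by borough.
per_capita = counts.merge(population, on="borough", how="left")

# 3. Compute the per-capita value as a new column.
per_capita["complaints_per_capita"] = (
    per_capita["complaints"] / per_capita["population"]
)

print(per_capita)
```

Aggregating before the join keeps the merged table at one row per borough, which makes the division straightforward.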

# your code here

Step 5

Submit.