Lab 9#

General notebook information

The goal here is to practice joining datasets through pandas. Hint: The instructions here are intentionally incomplete.

Data Cleaning#

Download the dataset at Link. The original data had the following columns:

Id - Unique identifier for a person
Name - Person Name
Age - Age
Gender - M for Male and F for Female
Salary - Annual salary in USD
State - Two letter state identifier denoting one of the US States
Join-Date - Date of joining in the form yyyy-mm-dd

But now this data is chaotic with some of your suggestions. How do we clean such a dataset?

Guide:#

  • Inspect the Data

  • Check missing values

  • Fix formatting issues and spaces

  • Standardize categoricals

  • Handle special characters

  • Handle bad [illogical] data

Step 1#

Find an NYC dataset with a borough column.

  • Use Scout to filter by column name.

  • Don’t spend too long on this step.

  • Keep the dataset small (under 500,000-ish rows) to make it easier to work with.

What’s the URL of your dataset?

YOUR RESPONSE HERE

Step 2#

Save and load the dataset.

# your code here

Step 3#

Download and load the Population by Borough dataset.

# your code here

Step 4#

Use merge() to combine the two, and output the resulting table.

# your code here

Step 5#

Using the two datasets above, use pandas to produce an aggregate per-capita statistic by borough.

The dataset you chose before may not work for this. That’s fine, pick another.

Hint#

You’re creating a “number of [thing] per capita by borough” table.

  1. Do a groupby() on the original dataset.

  2. Join with the populations by borough.

  3. Compute the per-capita values as a new column.

# your code here

Step 6#

Submit.