Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

  • Grab candy 🍫

  • Close your devices

311

Pandas

  • A Python package (bundled up code that you can reuse)

  • Very common for data science in Python

  • A lot like R

    • Both organize around “data frames”

Challenge

Complete the demos and exercise today with generative AI only.

  • Allowed

    • Prompts

    • Copy-pasting

  • Not allowed

    • Googling

    • Editing

I’ll be demoing with CUIT Chat, which is free to use.

Load data

Pull data from:

https://storage.googleapis.com/python-public-policy2/data/311_requests_2018-19_sample.csv.zip

We’re using a sample to make it easier/faster to work with. This will take a while (~30 seconds).

# our code here

If you see a DtypeWarning, ignore it for now. We’ll come back to it.

Preview the data

# our code here

Pandas data structures

Diagram showing a DataFrame, Series, labels, and indexes

DataFrame information

# our code here

Analysis

Which complaints are most common?

# code goes here

What’s the most frequent request per agency?

# code goes here

Data cleaning

Minion character vacuuming

Data Cleansing is a process of removing or fixing incorrect, malformed, incomplete, duplicate, or corrupted data

https://hevodata.com/learn/data-cleansing-a-simplified-guide/

Where have you had to do data cleaning?

Things to check for

From my workshop on data cleaning:

  • Missing data

    • Empty values

  • Bad (junk) values

    • Duplicates

    • Mismatched types/formatting

  • Categorical data

    • Unique values (cardinality)

    • Value counts

  • Continuous values

    • Ranges

    • Spread (distribution)

Data cleaning mnemonic

  • Empty

  • Bad

  • Unique

  • Spread

Demo: Exclude bad records from the DataFrame

Let’s look at the complaint types.

# code goes here

How should we go about cleaning those up?

# code goes here

Dealing with dtypes

DtypeWarning: Columns (8,20,31,34) have mixed types.
requests.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 499958 entries, 0 to 499957
Data columns (total 41 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   Unique Key                      499958 non-null  int64  
 1   Created Date                    499958 non-null  object 
 2   Closed Date                     476140 non-null  object 
 3   Agency                          499958 non-null  object 
 4   Agency Name                     499958 non-null  object 
 5   Complaint Type                  499958 non-null  object 
 6   Descriptor                      492496 non-null  object 
 7   Location Type                   392573 non-null  object 
 8   Incident Zip                    480394 non-null  object 
 9   Incident Address                434529 non-null  object 
 10  Street Name                     434504 non-null  object 
 11  Cross Street 1                  300825 non-null  object 
 12  Cross Street 2                  299624 non-null  object 
 13  Intersection Street 1           107377 non-null  object 
 14  Intersection Street 2           107042 non-null  object 
 15  Address Type                    451006 non-null  object 
 16  City                            476632 non-null  object 
 17  Landmark                        32516 non-null   object 
 18  Facility Type                   134918 non-null  object 
 19  Status                          499958 non-null  object 
 20  Due Date                        171534 non-null  object 
 21  Resolution Description          457354 non-null  object 
 22  Resolution Action Updated Date  488788 non-null  object 
 23  Community Board                 499958 non-null  object 
 24  BBL                             407338 non-null  float64
 25  Borough                         499958 non-null  object 
 26  X Coordinate (State Plane)      470815 non-null  float64
 27  Y Coordinate (State Plane)      470815 non-null  float64
 28  Open Data Channel Type          499958 non-null  object 
 29  Park Facility Name              499931 non-null  object 
 30  Park Borough                    499958 non-null  object 
 31  Vehicle Type                    37 non-null      object 
 32  Taxi Company Borough            403 non-null     object 
 33  Taxi Pick Up Location           4474 non-null    object 
 34  Bridge Highway Name             696 non-null     object 
 35  Bridge Highway Direction        765 non-null     object 
 36  Road Ramp                       759 non-null     object 
 37  Bridge Highway Segment          1007 non-null    object 
 38  Latitude                        470815 non-null  float64
 39  Longitude                       470815 non-null  float64
 40  Location                        470815 non-null  object 
dtypes: float64(5), int64(1), object(35)
memory usage: 156.4+ MB
list(requests["Incident Zip"].unique())
['11235', '11221', '11693', '11216', '10465', '11367', '10459', '11101', '11362', '10014', '11234', '11436', '10305', '10467', '11208', '10451', '11419', '11237', '11220', '10469', '11385', '10470', '11694', '10036', nan, '10473', '11435', '10040', '10472', '11225', '10019', '11434', '11226', '10010', '11211', '11421', '10026', '10013', '11423', '10002', '10453', '11213', '11104', '11249', '11361', '11233', '11224', '11374', '10025', '10022', '11214', '11209', '11366', '10304', '10027', '11378', '11206', '10021', '11364', '10065', '10456', '10314', '10312', '11212', '11379', '10462', '11231', '10460', '11416', '10001', '11357', '11413', '11210', '11217', '11223', '11417', '11418', '11218', '11230', '11207', '11691', '10468', '10007', '10310', '10306', '11103', '11105', '11433', '11203', '10307', '11229', '11372', '10032', '11420', '10017', '10301', '11368', '11201', '11365', '11422', '10452', '11377', '10029', '10003', '10075', '11222', '10128', '11415', '11204', '10030', '11432', '10308', '11102', '10016', '10463', '11412', '10011', '11106', '10457', '11375', '11356', '11228', '11369', '10458', '11411', '11215', '10309', '11358', '11355', '11219', '10031', '10303', '11232', '10302', '11238', '10005', '11429', '11205', '10023', '11373', '10033', '10039', '10028', '11363', '11354', '11692', '10455', '11370', '10035', '10012', '10024', '10009', '10034', '11001', '10466', '11427', '11004', '10454', '11414', '11360', '10461', '10018', '10006', '11236', '11428', '10474', '10471', '10037', '10475', '11430', '10119', '10038', '11426', '11239', '10151', '10120', '10112', '10168', '10004', '10464', '18773-9640', '10282', '11109', '10280', '07114', '11040', '10000', '07090', '10020', '00083', '10044', '10069', '10118', '07102', '11359', '11005', '10169', '11697', '10115', 11226.0, 11236.0, 10466.0, 11210.0, 10452.0, 10457.0, 10304.0, 11416.0, 11205.0, 11214.0, 11692.0, 11101.0, 11201.0, 10470.0, 11375.0, 10305.0, 11420.0, 11421.0, 11419.0, 11412.0, 10309.0, 11385.0, 11379.0, 11356.0, 10021.0, 11417.0, 11377.0, 11423.0, 10001.0, 10469.0, 10463.0, 11372.0, 11422.0, 10017.0, 11368.0, 11213.0, 11220.0, 10312.0, 10025.0, 11357.0, 11230.0, 10456.0, 11104.0, 11232.0, 11208.0, 11206.0, 11229.0, 10465.0, 10472.0, 10039.0, 10016.0, 10009.0, 11355.0, 11211.0, 10040.0, 11366.0, 11234.0, 10013.0, 11207.0, 10032.0, 10036.0, 10029.0, 11204.0, 11435.0, 11373.0, 10007.0, 11216.0, 10306.0, 11378.0, 11414.0, 10314.0, 10128.0, 10027.0, 11221.0, 10024.0, 11364.0, 10468.0, 11233.0, 10014.0, 11238.0, 11374.0, 11105.0, 11219.0, 11106.0, 10003.0, 10026.0, 11411.0, 11223.0, 11693.0, 11365.0, 11209.0, 11103.0, 11361.0, 11427.0, 10467.0, 11203.0, 11102.0, 10455.0, 10454.0, 11225.0, 11212.0, 11434.0, 10461.0, 11369.0, 11432.0, 10031.0, 11217.0, 11235.0, 10458.0, 10453.0, 11215.0, 11367.0, 11231.0, 11413.0, 11358.0, 11694.0, 10462.0, 11218.0, 11430.0, 10473.0, 10011.0, 10460.0, 10019.0, 10310.0, 10037.0, 10451.0, 10301.0, 10010.0, 11354.0, 11222.0, 11691.0, 10034.0, 10308.0, 10012.0, 11237.0, 10801.0, 10030.0, 11428.0, 11418.0, 10033.0, 10459.0, 11429.0, 10065.0, 10022.0, 10005.0, 10002.0, 10035.0, 11370.0, 10004.0, 11415.0, 11426.0, 11362.0, 10006.0, 11360.0, 10038.0, 11249.0, 11001.0, 11224.0, 10075.0, 10018.0, 10302.0, 11436.0, 11433.0, 10474.0, 10023.0, 11004.0, 10028.0, 11228.0, 10475.0, 10069.0, 11363.0, 10303.0, 10282.0, 10307.0, 10103.0, 10280.0, 10471.0, 10271.0, 10000.0, 10464.0, 10107.0, 7078.0, 10044.0, 10020.0, 56303.0, 10119.0, 11040.0, 11386.0, 10952.0, 6811.0, 11239.0, 10121.0, 37214.0, 10538.0, 10112.0, 10279.0, 11109.0, 11371.0, 18017.0, 7115.0, 77036.0, 7114.0, '10123', '11030', '10801', '92626-1902', '07302', '18773', '12222', '10710', '10103', '07057', '10162', '11582', '10281', '10271', '10107', 'HARRISBURG', '11735', '07305', 'N5X3A6', '11746', '11371', '23450', 11580.0, 10591.0, 11005.0, 19034.0, 11596.0, 11779.0, 7621.0, 11021.0, 11241.0, 10169.0, 100000.0, 11242.0, 10151.0, 11697.0, 7086.0, 10177.0, 10118.0, 10105.0, 10152.0, 10168.0, 7093.0, 10917.0, 10110.0, 10153.0, 10178.0, 11570.0, 10601.0, 10704.0, 7424.0, 10281.0, 10158.0, '10158', '10172', '10179', 'IDK', '11801', '10601', '11590', '10155', '11202', '1801', '11581', 11359.0, 11758.0, 10278.0, 43017.0, 10154.0, 11553.0, 10162.0, 11695.0, 10041.0, 11741.0, 98335.0, 14814.0, 10111.0, '12345', '11572', '11520', '14614-195', '10121', '10105', '10701', 0.0, 17106.0, 979113.0, 10120.0, 12345.0, 11030.0, 11797.0, 100.0, 11710.0, 33624.0, 8682.0, 11747.0, 1757.0, 11561.0, 7304.0, 6851.0, 11590.0, 94267.0, 10167.0, 11749.0, 11756.0, 10174.0, 10550.0, 89119.0, 14068.0, 11722.0, 11520.0, 6460.0, 32255.0, 10173.0, 10165.0, 11946.0, '29616-0759', '07032', '10278', '11575', 11963.0, 10106.0, 83.0, 11566.0, 6870.0, 7001.0, 10710.0, 11735.0, 11572.0, '10165', '10279', '11251', 'NJ 07114', '10106', '07666', '11516', '10177', '10170', '43215-1441', '00000', '08081', '10803', 11507.0, 11701.0, 11563.0, 7047.0, 3108.0, '11021', '07003', '10152', '07029', '10041', '31093', '11735-0230', 89118.0, 11803.0, 11559.0, 11565.0, 7080.0, 12601.0, 10155.0, 10171.0, 7208.0, 11757.0, 11042.0, '1101', '10111', '10173', '10096', '07087', 'DID N', '10956', 10048.0, 10123.0, 10122.0, 11251.0]

ZIP codes look numeric, but aren’t really.

requests = pd.read_csv(url, dtype={"Incident Zip": "string"})
/var/folders/kr/nx0m1j811kz5vy8c87ffchzr0000gn/T/ipykernel_7524/3312645178.py:1: DtypeWarning: Columns (20,31,34) have mixed types. Specify dtype option on import or set low_memory=False.
  requests = pd.read_csv(url, dtype={"Incident Zip": "string"})

We fixed the dtype warning for column 8 (Incident Zip).

list(requests["Incident Zip"].unique())
['11235', '11221', '11693', '11216', '10465', '11367', '10459', '11101', '11362', '10014', '11234', '11436', '10305', '10467', '11208', '10451', '11419', '11237', '11220', '10469', '11385', '10470', '11694', '10036', <NA>, '10473', '11435', '10040', '10472', '11225', '10019', '11434', '11226', '10010', '11211', '11421', '10026', '10013', '11423', '10002', '10453', '11213', '11104', '11249', '11361', '11233', '11224', '11374', '10025', '10022', '11214', '11209', '11366', '10304', '10027', '11378', '11206', '10021', '11364', '10065', '10456', '10314', '10312', '11212', '11379', '10462', '11231', '10460', '11416', '10001', '11357', '11413', '11210', '11217', '11223', '11417', '11418', '11218', '11230', '11207', '11691', '10468', '10007', '10310', '10306', '11103', '11105', '11433', '11203', '10307', '11229', '11372', '10032', '11420', '10017', '10301', '11368', '11201', '11365', '11422', '10452', '11377', '10029', '10003', '10075', '11222', '10128', '11415', '11204', '10030', '11432', '10308', '11102', '10016', '10463', '11412', '10011', '11106', '10457', '11375', '11356', '11228', '11369', '10458', '11411', '11215', '10309', '11358', '11355', '11219', '10031', '10303', '11232', '10302', '11238', '10005', '11429', '11205', '10023', '11373', '10033', '10039', '10028', '11363', '11354', '11692', '10455', '11370', '10035', '10012', '10024', '10009', '10034', '11001', '10466', '11427', '11004', '10454', '11414', '11360', '10461', '10018', '10006', '11236', '11428', '10474', '10471', '10037', '10475', '11430', '10119', '10038', '11426', '11239', '10151', '10120', '10112', '10168', '10004', '10464', '18773-9640', '10282', '11109', '10280', '07114', '11040', '10000', '07090', '10020', '00083', '10044', '10069', '10118', '07102', '11359', '11005', '10169', '11697', '10115', '10801', '10103', '10271', '10107', '07078', '56303', '11386', '10952', '06811', '10121', '37214', '10538', '10279', '11371', '18017', '07115', '77036', '10123', '11030', '92626-1902', '07302', '18773', '12222', '10710', '07057', '10162', '11582', '10281', 'HARRISBURG', '11735', '07305', 'N5X3A6', '11746', '23450', '11580', '10591', '19034', '11596', '11779', '07621', '11021', '11241', '100000', '11242', '07086', '10177', '10105', '10152', '07093', '10917', '10110', '10153', '10178', '11570', '10601', '10704', '07424', '10158', '10172', '10179', 'IDK', '11801', '11590', '10155', '11202', '1801', '11581', '11758', '10278', '43017', '10154', '11553', '11695', '10041', '11741', '98335', '14814', '10111', '12345', '11572', '11520', '14614-195', '10701', '00000', '17106', '979113', '11797', '100', '11710', '33624', '8682', '11747', '01757', '11561', '07304', '000000', '06851', '94267', '10167', '11749', '11756', '10174', '10550', '89119', '14068', '11722', '06460', '32255', '10173', '10165', '11946', '29616-0759', '07032', '11575', '11963', '10106', '11566', '06870', '07001', '11251', 'NJ 07114', '07666', '11516', '10170', '43215-1441', '08081', '10803', '11507', '11701', '11563', '07047', '03108', '07003', '07029', '31093', '11735-0230', '89118', '11803', '11559', '11565', '07080', '12601', '10171', '07208', '11757', '11042', '1101', '10096', '07087', 'DID N', '10956', '10048', '10122']

Find invalid ZIP codes

Use a regular expression (regex) to find strings that match a pattern:

^\d{5}(?:-\d{4})?$
│ │ │  │        │└─ end of string
│ │ │  │        └─ optional
│ │ │  └─ capture group
│ │ └─ count
│ └─ numeric/digit character
└─ start of string

regex101 is useful for testing them.

# find valid ZIP codes
valid_zips = requests["Incident Zip"].str.contains(r"^\d{5}(?:-\d{4})?$")

# filter the DataFrame to only invalid ZIP codes
invalid_zips = ~valid_zips
requests_with_invalid_zips = requests[invalid_zips]
requests_with_invalid_zips["Incident Zip"]
55017 HARRISBURG 58100 N5X3A6 80798 100000 120304 IDK 123304 1801 173518 14614-195 192034 979113 201463 100 207158 8682 216745 000000 325071 NJ 07114 425985 1101 441166 DID N Name: Incident Zip, dtype: string

Clear any invalid ZIP codes:

requests.loc[invalid_zips, "Incident Zip"] = None

Additonal data cleaning tips:

  • Hard part is finding what needs to be done

  • Will be specific to your use case

  • Document what you did, since it will affect your results

Projects

In real/ideal world, start with specific question and find data to answer it:

project flow

Source: Big Data and Social Science

Data needed often doesn’t exist or is hard (or impossible) to find/access

project flow

This will apply to all our Projects.