Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

# a bit of setup just for the demo; ignore me
import numpy as np

np.random.seed(5)

Load data

import pandas as pd

requests = pd.read_csv("https://storage.googleapis.com/python-public-policy2/data/311_requests_2018-19_sample.csv.zip")
/var/folders/kr/nx0m1j811kz5vy8c87ffchzr0000gn/T/ipykernel_17069/3437185872.py:3: DtypeWarning: Columns (8,20,31,34) have mixed types. Specify dtype option on import or set low_memory=False.
  requests = pd.read_csv(

Preview the data

requests
Loading...

.head() defaults to providing the first 5 if you don’t specify a number. Can also use .tail().

requests.head(10)
Loading...
requests.sample(5)
Loading...

DataFrame information

requests.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 41 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   Unique Key                      500000 non-null  int64  
 1   Created Date                    500000 non-null  object 
 2   Closed Date                     476156 non-null  object 
 3   Agency                          500000 non-null  object 
 4   Agency Name                     500000 non-null  object 
 5   Complaint Type                  500000 non-null  object 
 6   Descriptor                      492534 non-null  object 
 7   Location Type                   392590 non-null  object 
 8   Incident Zip                    480411 non-null  object 
 9   Incident Address                434544 non-null  object 
 10  Street Name                     434519 non-null  object 
 11  Cross Street 1                  300838 non-null  object 
 12  Cross Street 2                  299635 non-null  object 
 13  Intersection Street 1           107383 non-null  object 
 14  Intersection Street 2           107048 non-null  object 
 15  Address Type                    451018 non-null  object 
 16  City                            476649 non-null  object 
 17  Landmark                        32521 non-null   object 
 18  Facility Type                   134923 non-null  object 
 19  Status                          500000 non-null  object 
 20  Due Date                        171559 non-null  object 
 21  Resolution Description          457389 non-null  object 
 22  Resolution Action Updated Date  488801 non-null  object 
 23  Community Board                 500000 non-null  object 
 24  BBL                             407355 non-null  float64
 25  Borough                         500000 non-null  object 
 26  X Coordinate (State Plane)      470830 non-null  float64
 27  Y Coordinate (State Plane)      470830 non-null  float64
 28  Open Data Channel Type          500000 non-null  object 
 29  Park Facility Name              499973 non-null  object 
 30  Park Borough                    500000 non-null  object 
 31  Vehicle Type                    37 non-null      object 
 32  Taxi Company Borough            403 non-null     object 
 33  Taxi Pick Up Location           4475 non-null    object 
 34  Bridge Highway Name             697 non-null     object 
 35  Bridge Highway Direction        766 non-null     object 
 36  Road Ramp                       759 non-null     object 
 37  Bridge Highway Segment          1027 non-null    object 
 38  Latitude                        470830 non-null  float64
 39  Longitude                       470830 non-null  float64
 40  Location                        470830 non-null  object 
dtypes: float64(5), int64(1), object(35)
memory usage: 156.4+ MB

Analysis

Which complaints are most common?

requests["Complaint Type"].value_counts().head()
Complaint Type Noise - Residential 41311 HEAT/HOT WATER 39095 Illegal Parking 34297 Request Large Bulky Item Collection 30939 Blocked Driveway 25530 Name: count, dtype: int64

Equivalent to:

requests.groupby("Complaint Type").size().nlargest()
Complaint Type Noise - Residential 41311 HEAT/HOT WATER 39095 Illegal Parking 34297 Request Large Bulky Item Collection 30939 Blocked Driveway 25530 dtype: int64

What’s the most frequent request per agency?

requests.groupby(["Agency", "Complaint Type"]).size().to_frame(name="count")
Loading...

Exclude bad records from the DataFrame

Let’s look at the complaint types.

requests["Complaint Type"].unique()
array(['Street Condition', 'HEAT/HOT WATER', 'Noise - Residential', 'Illegal Parking', 'Request Large Bulky Item Collection', 'Noise', 'Noise - Street/Sidewalk', 'Electronics Waste Appointment', 'Blocked Driveway', 'Dirty Conditions', 'Curb Condition', 'Noise - Commercial', 'General Construction/Plumbing', 'Traffic Signal Condition', 'Street Light Condition', 'Lead', 'Street Sign - Damaged', 'Noise - Vehicle', 'New Tree Request', 'Sanitation Condition', 'Mosquitoes', 'WATER LEAK', 'UNSANITARY CONDITION', 'Root/Sewer/Sidewalk Condition', 'Dead/Dying Tree', 'Derelict Vehicles', 'Collection Truck Noise', 'Sewer', 'GENERAL', 'Overflowing Litter Baskets', 'Vacant Lot', 'Sidewalk Condition', 'PAINT/PLASTER', 'Building/Use', 'Street Sign - Dangling', 'Construction Safety Enforcement', 'PLUMBING', 'Derelict Vehicle', 'Homeless Person Assistance', 'ELECTRIC', 'Water System', 'Damaged Tree', 'Missed Collection (All Materials)', 'Animal Abuse', 'Borough Office', 'Elevator', 'Food Establishment', 'Maintenance or Facility', 'FLOORING/STAIRS', 'Consumer Complaint', 'Overgrown Tree/Branches', 'Graffiti', 'APPLIANCE', 'Other Enforcement', 'DOF Property - Payment Issue', 'Hazardous Materials', 'Water Conservation', 'Water Quality', 'Plumbing', 'DOF Property - Update Account', 'DOOR/WINDOW', 'Housing - Low Income Senior', 'BEST/Site Safety', 'SCRIE', 'Animal in a Park', 'DOF Parking - Payment Issue', 'For Hire Vehicle Complaint', 'Benefit Card Replacement', 'DOF Property - Reduction Issue', 'Special Projects Inspection Team (SPIT)', 'SAFETY', 'Taxi Complaint', 'Beach/Pool/Sauna Complaint', 'Homeless Encampment', 'DOF Property - RPIE Issue', 'Recycling Enforcement', 'DPR Internal', 'DRIE', 'Street Sign - Missing', 'Taxi Report', 'DOF Parking - Request Status', 'Air Quality', 'Broken Parking Meter', 'Mold', 'HPD Literature Request', 'Poison Ivy', 'Non-Emergency Police Matter', 'DOF Property - Request Copy', 'Housing Options', 'Illegal Tree Damage', 'Traffic', 'Noise - Park', 'Asbestos', 'DCA / DOH New License Application Request', 'Cooling Tower', 'Drinking', 'Indoor Sewage', 'Vending', 'Emergency Response Team (ERT)', 'Electrical', 'DOF Property - Owner Issue', 'Parking Card', 'Indoor Air Quality', 'Violation of Park Rules', 'Mobile Food Vendor', 'Highway Condition', 'ELEVATOR', 'Noise - Helicopter', 'Unsanitary Animal Pvt Property', 'Rodent', 'Standing Water', 'Food Poisoning', 'Litter Basket / Request', 'Day Care', 'Executive Inspections', 'DOF Parking - Tax Exemption', 'OUTSIDE BUILDING', 'DOF Parking - Request Copy', 'DOF Property - Property Value', 'School Maintenance', 'Question', 'Industrial Waste', 'Senior Center Complaint', 'Smoking', 'Quality of Life', 'Derelict Bicycle', 'Ferry Complaint', 'Sweeping/Missed', 'Ferry Inquiry', 'Sweeping/Inadequate', 'Found Property', 'LinkNYC', 'Sustainability Enforcement', 'Taxpayer Advocate Inquiry', 'Investigations and Discipline (IAD)', 'Noise - House of Worship', 'Scaffold Safety', 'Pet Shop', 'Bus Stop Shelter Complaint', 'Forms', 'Elder Abuse', 'Drug Activity', 'Unsanitary Pigeon Condition', 'Comments', 'Boilers', 'Building Marshals office', 'Home Delivered Meal - Missed Delivery', 'Bike/Roller/Skate Chronic', 'Posting Advertisement', 'Research Questions', 'Urinating in Public', 'Public Payphone Complaint', 'Unleashed Dog', 'Facades', 'Illegal Animal Sold', 'Cranes and Derricks', 'Bus Stop Shelter Placement', 'Panhandling', 'Bridge Condition', 'Illegal Animal Kept as Pet', 'OEM Literature Request', 'Advocate-Personal Exemptions', 'Abandoned Vehicle', "Alzheimer's Care", 'Taxi Compliment', 'Disorderly Youth', 'Public Assembly', 'Advocate-Prop Refunds/Credits', 'Unsanitary Condition', 'DOF Parking - DMV Clearance', 'Home Delivered Meal Complaint', 'Transportation Provider Complaint', 'Ferry Permit', 'For Hire Vehicle Report', 'Harboring Bees/Wasps', 'Advocate - Other', 'Home Care Provider Complaint', 'Miscellaneous Categories', 'Municipal Parking Facility', 'FATF', 'Special Natural Area District (SNAD)', 'Illegal Fireworks', 'Highway Sign - Damaged', 'Drinking Water', 'Calorie Labeling', 'Case Management Agency Complaint', 'Bike Rack Condition', 'Unsanitary Animal Facility', 'Plant', 'Public Toilet', 'Tattooing', 'Animal Facility - No Permit', 'Electronics Waste', 'General', 'Snow', 'Unlicensed Dog', 'Taxi Licensee Complaint', 'FHV Licensee Complaint', 'Overflowing Recycling Baskets', 'Highway Sign - Missing', 'Non-Residential Heat', 'Special Operations', 'Safety', 'Radioactive Material', 'Dispatched Taxi Complaint', 'Lifeguard', 'Window Guard', 'Select Message Type...', 'Bereavement Support Group', 'Sweeping/Missed-Inadequate', 'Squeegee', 'Tunnel Condition', 'Foam Ban Enforcement', 'Request Xmas Tree Collection', 'Advocate-Business Tax', 'Advocate - RPIE', 'Highway Sign - Dangling', 'X-Ray Machine/Equipment', 'Advocate-Co-opCondo Abatement', 'Forensic Engineering', 'Outside Building', 'Lost Property', '(select extractvalue(xmltyp...', "Misc. Comments' and 7702=77...", 'qfix4${695*589}lixaf', "eval(compile('for x in rang...", 'Misc. Comments"|echo 2...', 'Misc. Comments"|ping -...', "Misc. Comments'|echo y9v6fc...", 'file:///c:/windows/win.ini', "Misc. Comments'|ping -c 21 ...", '../../../../../../../../../...', '\\windows\\win.ini', '..././..././..././..././......', 'idexf3mrb7)(!(objectClass=*)', 'Misc. Comments BCC:a5glxzf...', '$(sleep 11)', '@(9313*3464)', "{!xmlparser v='<!DOCTYPE...", '../WEB-INF/web.xml', 'Animal-Abuse', 'Missed Collection', 'DEP Street Condition', 'Green Taxi Report', 'Green Taxi Complaint', 'Bottled Water', 'Single Occupancy Bathroom', 'Appliance'], dtype=object)

Let’s make that a bit easier to read:

complaints = requests["Complaint Type"].unique()
complaints.sort()
list(complaints)
['$(sleep 11)', '(select extractvalue(xmltyp...', '..././..././..././..././......', '../../../../../../../../../...', '../WEB-INF/web.xml', '@(9313*3464)', 'APPLIANCE', 'Abandoned Vehicle', 'Advocate - Other', 'Advocate - RPIE', 'Advocate-Business Tax', 'Advocate-Co-opCondo Abatement', 'Advocate-Personal Exemptions', 'Advocate-Prop Refunds/Credits', 'Air Quality', "Alzheimer's Care", 'Animal Abuse', 'Animal Facility - No Permit', 'Animal in a Park', 'Animal-Abuse', 'Appliance', 'Asbestos', 'BEST/Site Safety', 'Beach/Pool/Sauna Complaint', 'Benefit Card Replacement', 'Bereavement Support Group', 'Bike Rack Condition', 'Bike/Roller/Skate Chronic', 'Blocked Driveway', 'Boilers', 'Borough Office', 'Bottled Water', 'Bridge Condition', 'Broken Parking Meter', 'Building Marshals office', 'Building/Use', 'Bus Stop Shelter Complaint', 'Bus Stop Shelter Placement', 'Calorie Labeling', 'Case Management Agency Complaint', 'Collection Truck Noise', 'Comments', 'Construction Safety Enforcement', 'Consumer Complaint', 'Cooling Tower', 'Cranes and Derricks', 'Curb Condition', 'DCA / DOH New License Application Request', 'DEP Street Condition', 'DOF Parking - DMV Clearance', 'DOF Parking - Payment Issue', 'DOF Parking - Request Copy', 'DOF Parking - Request Status', 'DOF Parking - Tax Exemption', 'DOF Property - Owner Issue', 'DOF Property - Payment Issue', 'DOF Property - Property Value', 'DOF Property - RPIE Issue', 'DOF Property - Reduction Issue', 'DOF Property - Request Copy', 'DOF Property - Update Account', 'DOOR/WINDOW', 'DPR Internal', 'DRIE', 'Damaged Tree', 'Day Care', 'Dead/Dying Tree', 'Derelict Bicycle', 'Derelict Vehicle', 'Derelict Vehicles', 'Dirty Conditions', 'Disorderly Youth', 'Dispatched Taxi Complaint', 'Drinking', 'Drinking Water', 'Drug Activity', 'ELECTRIC', 'ELEVATOR', 'Elder Abuse', 'Electrical', 'Electronics Waste', 'Electronics Waste Appointment', 'Elevator', 'Emergency Response Team (ERT)', 'Executive Inspections', 'FATF', 'FHV Licensee Complaint', 'FLOORING/STAIRS', 'Facades', 'Ferry Complaint', 'Ferry Inquiry', 'Ferry Permit', 'Foam Ban Enforcement', 'Food Establishment', 'Food Poisoning', 'For Hire Vehicle Complaint', 'For Hire Vehicle Report', 'Forensic Engineering', 'Forms', 'Found Property', 'GENERAL', 'General', 'General Construction/Plumbing', 'Graffiti', 'Green Taxi Complaint', 'Green Taxi Report', 'HEAT/HOT WATER', 'HPD Literature Request', 'Harboring Bees/Wasps', 'Hazardous Materials', 'Highway Condition', 'Highway Sign - Damaged', 'Highway Sign - Dangling', 'Highway Sign - Missing', 'Home Care Provider Complaint', 'Home Delivered Meal - Missed Delivery', 'Home Delivered Meal Complaint', 'Homeless Encampment', 'Homeless Person Assistance', 'Housing - Low Income Senior', 'Housing Options', 'Illegal Animal Kept as Pet', 'Illegal Animal Sold', 'Illegal Fireworks', 'Illegal Parking', 'Illegal Tree Damage', 'Indoor Air Quality', 'Indoor Sewage', 'Industrial Waste', 'Investigations and Discipline (IAD)', 'Lead', 'Lifeguard', 'LinkNYC', 'Litter Basket / Request', 'Lost Property', 'Maintenance or Facility', 'Misc. Comments BCC:a5glxzf...', 'Misc. Comments"|echo 2...', 'Misc. Comments"|ping -...', "Misc. Comments' and 7702=77...", "Misc. Comments'|echo y9v6fc...", "Misc. Comments'|ping -c 21 ...", 'Miscellaneous Categories', 'Missed Collection', 'Missed Collection (All Materials)', 'Mobile Food Vendor', 'Mold', 'Mosquitoes', 'Municipal Parking Facility', 'New Tree Request', 'Noise', 'Noise - Commercial', 'Noise - Helicopter', 'Noise - House of Worship', 'Noise - Park', 'Noise - Residential', 'Noise - Street/Sidewalk', 'Noise - Vehicle', 'Non-Emergency Police Matter', 'Non-Residential Heat', 'OEM Literature Request', 'OUTSIDE BUILDING', 'Other Enforcement', 'Outside Building', 'Overflowing Litter Baskets', 'Overflowing Recycling Baskets', 'Overgrown Tree/Branches', 'PAINT/PLASTER', 'PLUMBING', 'Panhandling', 'Parking Card', 'Pet Shop', 'Plant', 'Plumbing', 'Poison Ivy', 'Posting Advertisement', 'Public Assembly', 'Public Payphone Complaint', 'Public Toilet', 'Quality of Life', 'Question', 'Radioactive Material', 'Recycling Enforcement', 'Request Large Bulky Item Collection', 'Request Xmas Tree Collection', 'Research Questions', 'Rodent', 'Root/Sewer/Sidewalk Condition', 'SAFETY', 'SCRIE', 'Safety', 'Sanitation Condition', 'Scaffold Safety', 'School Maintenance', 'Select Message Type...', 'Senior Center Complaint', 'Sewer', 'Sidewalk Condition', 'Single Occupancy Bathroom', 'Smoking', 'Snow', 'Special Natural Area District (SNAD)', 'Special Operations', 'Special Projects Inspection Team (SPIT)', 'Squeegee', 'Standing Water', 'Street Condition', 'Street Light Condition', 'Street Sign - Damaged', 'Street Sign - Dangling', 'Street Sign - Missing', 'Sustainability Enforcement', 'Sweeping/Inadequate', 'Sweeping/Missed', 'Sweeping/Missed-Inadequate', 'Tattooing', 'Taxi Complaint', 'Taxi Compliment', 'Taxi Licensee Complaint', 'Taxi Report', 'Taxpayer Advocate Inquiry', 'Traffic', 'Traffic Signal Condition', 'Transportation Provider Complaint', 'Tunnel Condition', 'UNSANITARY CONDITION', 'Unleashed Dog', 'Unlicensed Dog', 'Unsanitary Animal Facility', 'Unsanitary Animal Pvt Property', 'Unsanitary Condition', 'Unsanitary Pigeon Condition', 'Urinating in Public', 'Vacant Lot', 'Vending', 'Violation of Park Rules', 'WATER LEAK', 'Water Conservation', 'Water Quality', 'Water System', 'Window Guard', 'X-Ray Machine/Equipment', '\\windows\\win.ini', "eval(compile('for x in rang...", 'file:///c:/windows/win.ini', 'idexf3mrb7)(!(objectClass=*)', 'qfix4${695*589}lixaf', "{!xmlparser v='<!DOCTYPE..."]

Let’s see how frequently these invalid Complaint Type values appear in the data.

Use .groupby().size() to get the count of 311 requests per complaint type value. This is very similar to pivot tables in spreadsheets. See also: value_counts().

with pd.option_context("display.max_rows", 1000):
    display(requests.groupby("Complaint Type").size())
Complaint Type $(sleep 11) 1 (select extractvalue(xmltyp... 1 ..././..././..././..././...... 1 ../../../../../../../../../... 2 ../WEB-INF/web.xml 1 @(9313*3464) 1 APPLIANCE 2539 Abandoned Vehicle 1655 Advocate - Other 26 Advocate - RPIE 1 Advocate-Business Tax 2 Advocate-Co-opCondo Abatement 4 Advocate-Personal Exemptions 24 Advocate-Prop Refunds/Credits 39 Air Quality 1457 Alzheimer's Care 25 Animal Abuse 1282 Animal Facility - No Permit 10 Animal in a Park 456 Animal-Abuse 261 Appliance 1 Asbestos 426 BEST/Site Safety 359 Beach/Pool/Sauna Complaint 42 Benefit Card Replacement 1649 Bereavement Support Group 1 Bike Rack Condition 20 Bike/Roller/Skate Chronic 97 Blocked Driveway 25530 Boilers 449 Borough Office 900 Bottled Water 2 Bridge Condition 64 Broken Parking Meter 2499 Building Marshals office 311 Building/Use 4869 Bus Stop Shelter Complaint 172 Bus Stop Shelter Placement 43 Calorie Labeling 4 Case Management Agency Complaint 4 Collection Truck Noise 36 Comments 16 Construction Safety Enforcement 1740 Consumer Complaint 2892 Cooling Tower 7 Cranes and Derricks 48 Curb Condition 1476 DCA / DOH New License Application Request 186 DEP Street Condition 46 DOF Parking - DMV Clearance 22 DOF Parking - Payment Issue 532 DOF Parking - Request Copy 97 DOF Parking - Request Status 94 DOF Parking - Tax Exemption 157 DOF Property - Owner Issue 280 DOF Property - Payment Issue 1092 DOF Property - Property Value 106 DOF Property - RPIE Issue 80 DOF Property - Reduction Issue 1846 DOF Property - Request Copy 1000 DOF Property - Update Account 802 DOOR/WINDOW 6479 DPR Internal 245 DRIE 233 Damaged Tree 6065 Day Care 98 Dead/Dying Tree 1995 Derelict Bicycle 457 Derelict Vehicle 6056 Derelict Vehicles 6759 Dirty Conditions 6776 Disorderly Youth 10 Dispatched Taxi Complaint 5 Drinking 311 Drinking Water 16 Drug Activity 352 ELECTRIC 4670 ELEVATOR 250 Elder Abuse 43 Electrical 551 Electronics Waste 1 Electronics Waste Appointment 5535 Elevator 3004 Emergency Response Team (ERT) 872 Executive Inspections 73 FATF 23 FHV Licensee Complaint 4 FLOORING/STAIRS 4177 Facades 72 Ferry Complaint 38 Ferry Inquiry 96 Ferry Permit 8 Foam Ban Enforcement 5 Food Establishment 2061 Food Poisoning 549 For Hire Vehicle Complaint 2789 For Hire Vehicle Report 112 Forensic Engineering 1 Forms 57 Found Property 25 GENERAL 4565 General 188 General Construction/Plumbing 9537 Graffiti 4286 Green Taxi Complaint 7 Green Taxi Report 2 HEAT/HOT WATER 39095 HPD Literature Request 405 Harboring Bees/Wasps 31 Hazardous Materials 549 Highway Condition 628 Highway Sign - Damaged 20 Highway Sign - Dangling 3 Highway Sign - Missing 12 Home Care Provider Complaint 4 Home Delivered Meal - Missed Delivery 15 Home Delivered Meal Complaint 13 Homeless Encampment 977 Homeless Person Assistance 3670 Housing - Low Income Senior 768 Housing Options 251 Illegal Animal Kept as Pet 37 Illegal Animal Sold 7 Illegal Fireworks 138 Illegal Parking 34297 Illegal Tree Damage 642 Indoor Air Quality 848 Indoor Sewage 214 Industrial Waste 263 Investigations and Discipline (IAD) 152 Lead 1825 Lifeguard 7 LinkNYC 40 Litter Basket / Request 463 Lost Property 472 Maintenance or Facility 3055 Misc. Comments BCC:a5glxzf... 1 Misc. Comments"|echo 2... 1 Misc. Comments"|ping -... 1 Misc. Comments' and 7702=77... 1 Misc. Comments'|echo y9v6fc... 1 Misc. Comments'|ping -c 21 ... 1 Miscellaneous Categories 34 Missed Collection 55 Missed Collection (All Materials) 8340 Mobile Food Vendor 383 Mold 81 Mosquitoes 139 Municipal Parking Facility 15 New Tree Request 3715 Noise 10937 Noise - Commercial 7894 Noise - Helicopter 303 Noise - House of Worship 148 Noise - Park 792 Noise - Residential 41311 Noise - Street/Sidewalk 16876 Noise - Vehicle 7244 Non-Emergency Police Matter 1628 Non-Residential Heat 169 OEM Literature Request 25 OUTSIDE BUILDING 150 Other Enforcement 1349 Outside Building 2 Overflowing Litter Baskets 140 Overflowing Recycling Baskets 1 Overgrown Tree/Branches 2935 PAINT/PLASTER 9877 PLUMBING 9007 Panhandling 96 Parking Card 2 Pet Shop 28 Plant 22 Plumbing 1482 Poison Ivy 47 Posting Advertisement 23 Public Assembly 1 Public Payphone Complaint 75 Public Toilet 9 Quality of Life 65 Question 171 Radioactive Material 3 Recycling Enforcement 134 Request Large Bulky Item Collection 30939 Request Xmas Tree Collection 2 Research Questions 4 Rodent 6253 Root/Sewer/Sidewalk Condition 1485 SAFETY 1668 SCRIE 770 Safety 65 Sanitation Condition 7151 Scaffold Safety 38 School Maintenance 442 Select Message Type... 1 Senior Center Complaint 52 Sewer 6757 Sidewalk Condition 6488 Single Occupancy Bathroom 4 Smoking 456 Snow 297 Special Natural Area District (SNAD) 14 Special Operations 39 Special Projects Inspection Team (SPIT) 1168 Squeegee 1 Standing Water 484 Street Condition 16895 Street Light Condition 13425 Street Sign - Damaged 1176 Street Sign - Dangling 482 Street Sign - Missing 1080 Sustainability Enforcement 247 Sweeping/Inadequate 55 Sweeping/Missed 174 Sweeping/Missed-Inadequate 11 Tattooing 21 Taxi Complaint 2416 Taxi Compliment 41 Taxi Licensee Complaint 5 Taxi Report 318 Taxpayer Advocate Inquiry 19 Traffic 868 Traffic Signal Condition 7306 Transportation Provider Complaint 7 Tunnel Condition 1 UNSANITARY CONDITION 14913 Unleashed Dog 128 Unlicensed Dog 6 Unsanitary Animal Facility 15 Unsanitary Animal Pvt Property 327 Unsanitary Condition 942 Unsanitary Pigeon Condition 103 Urinating in Public 112 Vacant Lot 318 Vending 873 Violation of Park Rules 492 WATER LEAK 6641 Water Conservation 853 Water Quality 332 Water System 12949 Window Guard 6 X-Ray Machine/Equipment 1 \windows\win.ini 1 eval(compile('for x in rang... 1 file:///c:/windows/win.ini 1 idexf3mrb7)(!(objectClass=*) 1 qfix4${695*589}lixaf 1 {!xmlparser v='<!DOCTYPE... 1 dtype: int64
with pd.option_context("display.max_rows", 1000):
    display(...)

What this code is doing: showing all cells in a DataFrame with rich output.

requests.groupby('Complaint Type').size()

What this code is doing:

  1. Group the records in the dataset based on their Complaint Type value

  2. Count the records that have been grouped together by their shared Complaint Type value

How should we find junk records to delete?

It looks like most invalid complaint types only have a few records. Try excluding all complaint type categories with < 3 records, assuming that all complaint type categories with < 3 instances in the data are bad data entries.

Why 3? It’s arbitrary. We’re looking for trends in the data, and in this case we don’t care about low frequency entries.

Create a DataFrame that captures the count of records per Complaint Type value.

counts = requests.groupby("Complaint Type").size().reset_index(name="count")
counts
# .reset_index(name='count') allows us to name the new column that contains the count of rows
Loading...

Let’s create a Series that only lists the Complaint Type values that have record counts >= 3.

Remember: A single column from a pandas DataFrame is called a Series. It’s essentially a list containing all the values in the column.

valid_complaint_counts = counts[counts["count"] >= 3]
valid_complaint_counts
Loading...

Filter our requests DataFrame to only keep the rows where the Complaint Type value is in the valid_complaint_types Series we created in the previous step. Save the result in a new DataFrame.

valid_complaint_types = valid_complaint_counts["Complaint Type"]
is_valid_complaint = requests["Complaint Type"].isin(valid_complaint_types)
requests_clean = requests[is_valid_complaint]

How can we make sure this worked? Let’s check how many records there were originally in requests vs how many are in requests_clean.

Before:

requests["Unique Key"].size
500000

After:

requests_clean["Unique Key"].size
499958

Great, now those invalid records will be excluded from our analysis!

Another approach to excluding those invalid records would be to use regular expressions (“RegExes”) to find records with weird characters.