Lecture 17: Manipulating data#

  • Questions?

  • How was the lab? 👍/👎

Today’s goal#

  • Which 311 complaints are most common?

  • Which agencies are responsible for handling them?

Read our cleaned 311 Service Requests dataset#

import pandas as pd

url = "https://storage.googleapis.com/python-public-policy2/data/311_requests_2018-19_sample_clean.csv.zip"
requests = pd.read_csv(url)
/var/folders/kr/nx0m1j811kz5vy8c87ffchzr0000gn/T/ipykernel_7524/1515339211.py:4: DtypeWarning: Columns (8,20,31,34) have mixed types. Specify dtype option on import or set low_memory=False.
  requests = pd.read_csv(url)

Data cleaning#

Minion character vacuuming

Data Cleansing is a process of removing or fixing incorrect, malformed, incomplete, duplicate, or corrupted data

https://hevodata.com/learn/data-cleansing-a-simplified-guide/

Where have you had to do data cleaning?

Things to check for#

From my workshop on data cleaning:

  • Missing data

    • Empty values

  • Bad (junk) values

    • Duplicates

    • Mismatched types/formatting

  • Categorical data

    • Unique values (cardinality)

    • Value counts

  • Continuous values

    • Ranges

    • Spread (distribution)

Notes:

  • “Values” in this case can be a single cell (in the spreadsheet sense) or a whole row

  • “Missing” or “duplicates” can be columns (Series), tables (DataFrames), rows, or cells

  • “Categorical data” have a fixed set of values

  • This isn’t everything you can check for, but should cover most things

Data cleaning mnemonic#

  • Empty

  • Bad

  • Unique

  • Spread

Dealing with dtypes#

DtypeWarning: Columns (8,20,31,34) have mixed types.
requests.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 499958 entries, 0 to 499957
Data columns (total 41 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   Unique Key                      499958 non-null  int64  
 1   Created Date                    499958 non-null  object 
 2   Closed Date                     476140 non-null  object 
 3   Agency                          499958 non-null  object 
 4   Agency Name                     499958 non-null  object 
 5   Complaint Type                  499958 non-null  object 
 6   Descriptor                      492496 non-null  object 
 7   Location Type                   392573 non-null  object 
 8   Incident Zip                    480394 non-null  object 
 9   Incident Address                434529 non-null  object 
 10  Street Name                     434504 non-null  object 
 11  Cross Street 1                  300825 non-null  object 
 12  Cross Street 2                  299624 non-null  object 
 13  Intersection Street 1           107377 non-null  object 
 14  Intersection Street 2           107042 non-null  object 
 15  Address Type                    451006 non-null  object 
 16  City                            476632 non-null  object 
 17  Landmark                        32516 non-null   object 
 18  Facility Type                   134918 non-null  object 
 19  Status                          499958 non-null  object 
 20  Due Date                        171534 non-null  object 
 21  Resolution Description          457354 non-null  object 
 22  Resolution Action Updated Date  488788 non-null  object 
 23  Community Board                 499958 non-null  object 
 24  BBL                             407338 non-null  float64
 25  Borough                         499958 non-null  object 
 26  X Coordinate (State Plane)      470815 non-null  float64
 27  Y Coordinate (State Plane)      470815 non-null  float64
 28  Open Data Channel Type          499958 non-null  object 
 29  Park Facility Name              499931 non-null  object 
 30  Park Borough                    499958 non-null  object 
 31  Vehicle Type                    37 non-null      object 
 32  Taxi Company Borough            403 non-null     object 
 33  Taxi Pick Up Location           4474 non-null    object 
 34  Bridge Highway Name             696 non-null     object 
 35  Bridge Highway Direction        765 non-null     object 
 36  Road Ramp                       759 non-null     object 
 37  Bridge Highway Segment          1007 non-null    object 
 38  Latitude                        470815 non-null  float64
 39  Longitude                       470815 non-null  float64
 40  Location                        470815 non-null  object 
dtypes: float64(5), int64(1), object(35)
memory usage: 156.4+ MB
list(requests["Incident Zip"].unique())
['11235',
 '11221',
 '11693',
 '11216',
 '10465',
 '11367',
 '10459',
 '11101',
 '11362',
 '10014',
 '11234',
 '11436',
 '10305',
 '10467',
 '11208',
 '10451',
 '11419',
 '11237',
 '11220',
 '10469',
 '11385',
 '10470',
 '11694',
 '10036',
 nan,
 '10473',
 '11435',
 '10040',
 '10472',
 '11225',
 '10019',
 '11434',
 '11226',
 '10010',
 '11211',
 '11421',
 '10026',
 '10013',
 '11423',
 '10002',
 '10453',
 '11213',
 '11104',
 '11249',
 '11361',
 '11233',
 '11224',
 '11374',
 '10025',
 '10022',
 '11214',
 '11209',
 '11366',
 '10304',
 '10027',
 '11378',
 '11206',
 '10021',
 '11364',
 '10065',
 '10456',
 '10314',
 '10312',
 '11212',
 '11379',
 '10462',
 '11231',
 '10460',
 '11416',
 '10001',
 '11357',
 '11413',
 '11210',
 '11217',
 '11223',
 '11417',
 '11418',
 '11218',
 '11230',
 '11207',
 '11691',
 '10468',
 '10007',
 '10310',
 '10306',
 '11103',
 '11105',
 '11433',
 '11203',
 '10307',
 '11229',
 '11372',
 '10032',
 '11420',
 '10017',
 '10301',
 '11368',
 '11201',
 '11365',
 '11422',
 '10452',
 '11377',
 '10029',
 '10003',
 '10075',
 '11222',
 '10128',
 '11415',
 '11204',
 '10030',
 '11432',
 '10308',
 '11102',
 '10016',
 '10463',
 '11412',
 '10011',
 '11106',
 '10457',
 '11375',
 '11356',
 '11228',
 '11369',
 '10458',
 '11411',
 '11215',
 '10309',
 '11358',
 '11355',
 '11219',
 '10031',
 '10303',
 '11232',
 '10302',
 '11238',
 '10005',
 '11429',
 '11205',
 '10023',
 '11373',
 '10033',
 '10039',
 '10028',
 '11363',
 '11354',
 '11692',
 '10455',
 '11370',
 '10035',
 '10012',
 '10024',
 '10009',
 '10034',
 '11001',
 '10466',
 '11427',
 '11004',
 '10454',
 '11414',
 '11360',
 '10461',
 '10018',
 '10006',
 '11236',
 '11428',
 '10474',
 '10471',
 '10037',
 '10475',
 '11430',
 '10119',
 '10038',
 '11426',
 '11239',
 '10151',
 '10120',
 '10112',
 '10168',
 '10004',
 '10464',
 '18773-9640',
 '10282',
 '11109',
 '10280',
 '07114',
 '11040',
 '10000',
 '07090',
 '10020',
 '00083',
 '10044',
 '10069',
 '10118',
 '07102',
 '11359',
 '11005',
 '10169',
 '11697',
 '10115',
 11226.0,
 11236.0,
 10466.0,
 11210.0,
 10452.0,
 10457.0,
 10304.0,
 11416.0,
 11205.0,
 11214.0,
 11692.0,
 11101.0,
 11201.0,
 10470.0,
 11375.0,
 10305.0,
 11420.0,
 11421.0,
 11419.0,
 11412.0,
 10309.0,
 11385.0,
 11379.0,
 11356.0,
 10021.0,
 11417.0,
 11377.0,
 11423.0,
 10001.0,
 10469.0,
 10463.0,
 11372.0,
 11422.0,
 10017.0,
 11368.0,
 11213.0,
 11220.0,
 10312.0,
 10025.0,
 11357.0,
 11230.0,
 10456.0,
 11104.0,
 11232.0,
 11208.0,
 11206.0,
 11229.0,
 10465.0,
 10472.0,
 10039.0,
 10016.0,
 10009.0,
 11355.0,
 11211.0,
 10040.0,
 11366.0,
 11234.0,
 10013.0,
 11207.0,
 10032.0,
 10036.0,
 10029.0,
 11204.0,
 11435.0,
 11373.0,
 10007.0,
 11216.0,
 10306.0,
 11378.0,
 11414.0,
 10314.0,
 10128.0,
 10027.0,
 11221.0,
 10024.0,
 11364.0,
 10468.0,
 11233.0,
 10014.0,
 11238.0,
 11374.0,
 11105.0,
 11219.0,
 11106.0,
 10003.0,
 10026.0,
 11411.0,
 11223.0,
 11693.0,
 11365.0,
 11209.0,
 11103.0,
 11361.0,
 11427.0,
 10467.0,
 11203.0,
 11102.0,
 10455.0,
 10454.0,
 11225.0,
 11212.0,
 11434.0,
 10461.0,
 11369.0,
 11432.0,
 10031.0,
 11217.0,
 11235.0,
 10458.0,
 10453.0,
 11215.0,
 11367.0,
 11231.0,
 11413.0,
 11358.0,
 11694.0,
 10462.0,
 11218.0,
 11430.0,
 10473.0,
 10011.0,
 10460.0,
 10019.0,
 10310.0,
 10037.0,
 10451.0,
 10301.0,
 10010.0,
 11354.0,
 11222.0,
 11691.0,
 10034.0,
 10308.0,
 10012.0,
 11237.0,
 10801.0,
 10030.0,
 11428.0,
 11418.0,
 10033.0,
 10459.0,
 11429.0,
 10065.0,
 10022.0,
 10005.0,
 10002.0,
 10035.0,
 11370.0,
 10004.0,
 11415.0,
 11426.0,
 11362.0,
 10006.0,
 11360.0,
 10038.0,
 11249.0,
 11001.0,
 11224.0,
 10075.0,
 10018.0,
 10302.0,
 11436.0,
 11433.0,
 10474.0,
 10023.0,
 11004.0,
 10028.0,
 11228.0,
 10475.0,
 10069.0,
 11363.0,
 10303.0,
 10282.0,
 10307.0,
 10103.0,
 10280.0,
 10471.0,
 10271.0,
 10000.0,
 10464.0,
 10107.0,
 7078.0,
 10044.0,
 10020.0,
 56303.0,
 10119.0,
 11040.0,
 11386.0,
 10952.0,
 6811.0,
 11239.0,
 10121.0,
 37214.0,
 10538.0,
 10112.0,
 10279.0,
 11109.0,
 11371.0,
 18017.0,
 7115.0,
 77036.0,
 7114.0,
 '10123',
 '11030',
 '10801',
 '92626-1902',
 '07302',
 '18773',
 '12222',
 '10710',
 '10103',
 '07057',
 '10162',
 '11582',
 '10281',
 '10271',
 '10107',
 'HARRISBURG',
 '11735',
 '07305',
 'N5X3A6',
 '11746',
 '11371',
 '23450',
 11580.0,
 10591.0,
 11005.0,
 19034.0,
 11596.0,
 11779.0,
 7621.0,
 11021.0,
 11241.0,
 10169.0,
 100000.0,
 11242.0,
 10151.0,
 11697.0,
 7086.0,
 10177.0,
 10118.0,
 10105.0,
 10152.0,
 10168.0,
 7093.0,
 10917.0,
 10110.0,
 10153.0,
 10178.0,
 11570.0,
 10601.0,
 10704.0,
 7424.0,
 10281.0,
 10158.0,
 '10158',
 '10172',
 '10179',
 'IDK',
 '11801',
 '10601',
 '11590',
 '10155',
 '11202',
 '1801',
 '11581',
 11359.0,
 11758.0,
 10278.0,
 43017.0,
 10154.0,
 11553.0,
 10162.0,
 11695.0,
 10041.0,
 11741.0,
 98335.0,
 14814.0,
 10111.0,
 '12345',
 '11572',
 '11520',
 '14614-195',
 '10121',
 '10105',
 '10701',
 0.0,
 17106.0,
 979113.0,
 10120.0,
 12345.0,
 11030.0,
 11797.0,
 100.0,
 11710.0,
 33624.0,
 8682.0,
 11747.0,
 1757.0,
 11561.0,
 7304.0,
 6851.0,
 11590.0,
 94267.0,
 10167.0,
 11749.0,
 11756.0,
 10174.0,
 10550.0,
 89119.0,
 14068.0,
 11722.0,
 11520.0,
 6460.0,
 32255.0,
 10173.0,
 10165.0,
 11946.0,
 '29616-0759',
 '07032',
 '10278',
 '11575',
 11963.0,
 10106.0,
 83.0,
 11566.0,
 6870.0,
 7001.0,
 10710.0,
 11735.0,
 11572.0,
 '10165',
 '10279',
 '11251',
 'NJ 07114',
 '10106',
 '07666',
 '11516',
 '10177',
 '10170',
 '43215-1441',
 '00000',
 '08081',
 '10803',
 11507.0,
 11701.0,
 11563.0,
 7047.0,
 3108.0,
 '11021',
 '07003',
 '10152',
 '07029',
 '10041',
 '31093',
 '11735-0230',
 89118.0,
 11803.0,
 11559.0,
 11565.0,
 7080.0,
 12601.0,
 10155.0,
 10171.0,
 7208.0,
 11757.0,
 11042.0,
 '1101',
 '10111',
 '10173',
 '10096',
 '07087',
 'DID N',
 '10956',
 10048.0,
 10123.0,
 10122.0,
 11251.0]

ZIP codes look numeric, but aren’t really.

Read the ZIP codes in as strings.

requests = pd.read_csv(url, dtype={"Incident Zip": "string"})
/var/folders/kr/nx0m1j811kz5vy8c87ffchzr0000gn/T/ipykernel_7524/3312645178.py:1: DtypeWarning: Columns (20,31,34) have mixed types. Specify dtype option on import or set low_memory=False.
  requests = pd.read_csv(url, dtype={"Incident Zip": "string"})

We fixed the dtype warning for column 8 (Incident Zip).

list(requests["Incident Zip"].unique())
['11235',
 '11221',
 '11693',
 '11216',
 '10465',
 '11367',
 '10459',
 '11101',
 '11362',
 '10014',
 '11234',
 '11436',
 '10305',
 '10467',
 '11208',
 '10451',
 '11419',
 '11237',
 '11220',
 '10469',
 '11385',
 '10470',
 '11694',
 '10036',
 <NA>,
 '10473',
 '11435',
 '10040',
 '10472',
 '11225',
 '10019',
 '11434',
 '11226',
 '10010',
 '11211',
 '11421',
 '10026',
 '10013',
 '11423',
 '10002',
 '10453',
 '11213',
 '11104',
 '11249',
 '11361',
 '11233',
 '11224',
 '11374',
 '10025',
 '10022',
 '11214',
 '11209',
 '11366',
 '10304',
 '10027',
 '11378',
 '11206',
 '10021',
 '11364',
 '10065',
 '10456',
 '10314',
 '10312',
 '11212',
 '11379',
 '10462',
 '11231',
 '10460',
 '11416',
 '10001',
 '11357',
 '11413',
 '11210',
 '11217',
 '11223',
 '11417',
 '11418',
 '11218',
 '11230',
 '11207',
 '11691',
 '10468',
 '10007',
 '10310',
 '10306',
 '11103',
 '11105',
 '11433',
 '11203',
 '10307',
 '11229',
 '11372',
 '10032',
 '11420',
 '10017',
 '10301',
 '11368',
 '11201',
 '11365',
 '11422',
 '10452',
 '11377',
 '10029',
 '10003',
 '10075',
 '11222',
 '10128',
 '11415',
 '11204',
 '10030',
 '11432',
 '10308',
 '11102',
 '10016',
 '10463',
 '11412',
 '10011',
 '11106',
 '10457',
 '11375',
 '11356',
 '11228',
 '11369',
 '10458',
 '11411',
 '11215',
 '10309',
 '11358',
 '11355',
 '11219',
 '10031',
 '10303',
 '11232',
 '10302',
 '11238',
 '10005',
 '11429',
 '11205',
 '10023',
 '11373',
 '10033',
 '10039',
 '10028',
 '11363',
 '11354',
 '11692',
 '10455',
 '11370',
 '10035',
 '10012',
 '10024',
 '10009',
 '10034',
 '11001',
 '10466',
 '11427',
 '11004',
 '10454',
 '11414',
 '11360',
 '10461',
 '10018',
 '10006',
 '11236',
 '11428',
 '10474',
 '10471',
 '10037',
 '10475',
 '11430',
 '10119',
 '10038',
 '11426',
 '11239',
 '10151',
 '10120',
 '10112',
 '10168',
 '10004',
 '10464',
 '18773-9640',
 '10282',
 '11109',
 '10280',
 '07114',
 '11040',
 '10000',
 '07090',
 '10020',
 '00083',
 '10044',
 '10069',
 '10118',
 '07102',
 '11359',
 '11005',
 '10169',
 '11697',
 '10115',
 '10801',
 '10103',
 '10271',
 '10107',
 '07078',
 '56303',
 '11386',
 '10952',
 '06811',
 '10121',
 '37214',
 '10538',
 '10279',
 '11371',
 '18017',
 '07115',
 '77036',
 '10123',
 '11030',
 '92626-1902',
 '07302',
 '18773',
 '12222',
 '10710',
 '07057',
 '10162',
 '11582',
 '10281',
 'HARRISBURG',
 '11735',
 '07305',
 'N5X3A6',
 '11746',
 '23450',
 '11580',
 '10591',
 '19034',
 '11596',
 '11779',
 '07621',
 '11021',
 '11241',
 '100000',
 '11242',
 '07086',
 '10177',
 '10105',
 '10152',
 '07093',
 '10917',
 '10110',
 '10153',
 '10178',
 '11570',
 '10601',
 '10704',
 '07424',
 '10158',
 '10172',
 '10179',
 'IDK',
 '11801',
 '11590',
 '10155',
 '11202',
 '1801',
 '11581',
 '11758',
 '10278',
 '43017',
 '10154',
 '11553',
 '11695',
 '10041',
 '11741',
 '98335',
 '14814',
 '10111',
 '12345',
 '11572',
 '11520',
 '14614-195',
 '10701',
 '00000',
 '17106',
 '979113',
 '11797',
 '100',
 '11710',
 '33624',
 '8682',
 '11747',
 '01757',
 '11561',
 '07304',
 '000000',
 '06851',
 '94267',
 '10167',
 '11749',
 '11756',
 '10174',
 '10550',
 '89119',
 '14068',
 '11722',
 '06460',
 '32255',
 '10173',
 '10165',
 '11946',
 '29616-0759',
 '07032',
 '11575',
 '11963',
 '10106',
 '11566',
 '06870',
 '07001',
 '11251',
 'NJ 07114',
 '07666',
 '11516',
 '10170',
 '43215-1441',
 '08081',
 '10803',
 '11507',
 '11701',
 '11563',
 '07047',
 '03108',
 '07003',
 '07029',
 '31093',
 '11735-0230',
 '89118',
 '11803',
 '11559',
 '11565',
 '07080',
 '12601',
 '10171',
 '07208',
 '11757',
 '11042',
 '1101',
 '10096',
 '07087',
 'DID N',
 '10956',
 '10048',
 '10122']

Find invalid ZIP codes#

Use a regular expression (regex) to find strings that match a pattern:

^\d{5}(?:-\d{4})?$
│ │ │  │        │└─ end of string
│ │ │  │        └─ optional
│ │ │  └─ capture group
│ │ └─ count
│ └─ numeric/digit character
└─ start of string

regex101 is useful for testing them.

# find valid ZIP codes
valid_zips = requests["Incident Zip"].str.contains(r"^\d{5}(?:-\d{4})?$")

# filter the DataFrame to only invalid ZIP codes
invalid_zips = ~valid_zips
requests_with_invalid_zips = requests[invalid_zips]
requests_with_invalid_zips["Incident Zip"]
55017     HARRISBURG
58100         N5X3A6
80798         100000
120304           IDK
123304          1801
173518     14614-195
192034        979113
201463           100
207158          8682
216745        000000
325071      NJ 07114
425985          1101
441166         DID N
Name: Incident Zip, dtype: string

Clear any invalid ZIP codes:

requests.loc[invalid_zips, "Incident Zip"] = None

Additonal data cleaning tips:

  • Hard part is finding what needs to be done

  • Will be specific to your use case

  • Document what you did, since it will affect your results

Which 311 complaints are most common and which agencies are responsible for handling them?#

Which complaints are the most common?#

requests.groupby("Complaint Type").size().to_frame(name="count")
count
Complaint Type
APPLIANCE 2539
Abandoned Vehicle 1655
Advocate - Other 26
Advocate-Co-opCondo Abatement 4
Advocate-Personal Exemptions 24
... ...
WATER LEAK 6641
Water Conservation 853
Water Quality 332
Water System 12949
Window Guard 6

213 rows × 1 columns

.to_frame(name='count') isn’t necessary, but it’s helpful to include because it allows us to name the new column that contains the count of rows.

Which agencies are responsible for handling these complaint categories?#

requests.groupby(["Agency", "Complaint Type"]).size().to_frame(name="count")
count
Agency Complaint Type
ACS Forms 56
COIB Forms 1
DCA Consumer Complaint 2892
DCA / DOH New License Application Request 186
DCAS Comments 13
... ... ...
TLC Lost Property 472
Taxi Complaint 2416
Taxi Compliment 41
Taxi Licensee Complaint 5
Taxi Report 318

221 rows × 1 columns

Which agencies receive the most total 311 requests?#

requests.groupby("Agency").size().to_frame(name="count")
count
Agency
ACS 56
COIB 1
DCA 3078
DCAS 149
DCP 4
DEP 36283
DFTA 1182
DHS 3670
DOB 26024
DOE 442
DOF 7223
DOHMH 12690
DOITT 115
DOT 52001
DPR 21086
DSNY 73227
EDC 303
HPD 105634
HRA 1649
NYCEM 25
NYPD 148884
TAT 1
TAX 37
TLC 6194

What is the most frequent request per agency?#

First, create a dataframe that contains the count of complaints per Agency per Complaint Type.

agency_counts = requests.groupby(["Agency", "Complaint Type"]).size().reset_index(name="count")
agency_counts.head(20)
Agency Complaint Type count
0 ACS Forms 56
1 COIB Forms 1
2 DCA Consumer Complaint 2892
3 DCA DCA / DOH New License Application Request 186
4 DCAS Comments 13
5 DCAS Question 136
6 DCP Research Questions 4
7 DEP Air Quality 1457
8 DEP Asbestos 316
9 DEP FATF 23
10 DEP Hazardous Materials 549
11 DEP Industrial Waste 263
12 DEP Lead 1825
13 DEP Noise 10937
14 DEP Plant 22
15 DEP Sewer 6757
16 DEP Water Conservation 853
17 DEP Water Quality 332
18 DEP Water System 12949
19 DFTA Alzheimer's Care 25

Use drop_duplicates() to keep the row with the highest value per Agency.

sorted_agency_counts = agency_counts.sort_values("count", ascending=False)
sorted_agency_counts
Agency Complaint Type count
198 NYPD Noise - Residential 41311
169 HPD HEAT/HOT WATER 39095
194 NYPD Illegal Parking 34297
154 DSNY Request Large Bulky Item Collection 30939
186 NYPD Blocked Driveway 25530
... ... ... ...
119 DOT Highway Sign - Dangling 3
94 DOHMH Radioactive Material 3
139 DPR Unsanitary Condition 1
207 TAT Question 1
1 COIB Forms 1

221 rows × 3 columns

top_agency_counts = sorted_agency_counts.drop_duplicates("Agency")
top_agency_counts
Agency Complaint Type count
198 NYPD Noise - Residential 41311
169 HPD HEAT/HOT WATER 39095
154 DSNY Request Large Bulky Item Collection 30939
124 DOT Street Condition 16895
18 DEP Water System 12949
42 DOB General Construction/Plumbing 9537
95 DOHMH Rodent 6253
132 DPR Damaged Tree 6065
29 DHS Homeless Person Assistance 3670
2 DCA Consumer Complaint 2892
212 TLC For Hire Vehicle Complaint 2789
67 DOF DOF Property - Reduction Issue 1846
180 HRA Benefit Card Replacement 1649
25 DFTA Housing - Low Income Senior 768
53 DOE School Maintenance 442
161 EDC Noise - Helicopter 303
5 DCAS Question 136
106 DOITT Public Payphone Complaint 75
0 ACS Forms 56
209 TAX Question 34
181 NYCEM OEM Literature Request 25
6 DCP Research Questions 4
207 TAT Question 1
1 COIB Forms 1

For nicer display, order alphabetically:

top_agency_counts.sort_values("Agency")
Agency Complaint Type count
0 ACS Forms 56
1 COIB Forms 1
2 DCA Consumer Complaint 2892
5 DCAS Question 136
6 DCP Research Questions 4
18 DEP Water System 12949
25 DFTA Housing - Low Income Senior 768
29 DHS Homeless Person Assistance 3670
42 DOB General Construction/Plumbing 9537
53 DOE School Maintenance 442
67 DOF DOF Property - Reduction Issue 1846
95 DOHMH Rodent 6253
106 DOITT Public Payphone Complaint 75
124 DOT Street Condition 16895
132 DPR Damaged Tree 6065
154 DSNY Request Large Bulky Item Collection 30939
161 EDC Noise - Helicopter 303
169 HPD HEAT/HOT WATER 39095
180 HRA Benefit Card Replacement 1649
181 NYCEM OEM Literature Request 25
198 NYPD Noise - Residential 41311
207 TAT Question 1
209 TAX Question 34
212 TLC For Hire Vehicle Complaint 2789

Another way, only sorting it once:

agency_counts.sort_values(["Agency", "count"], ascending=False)
Agency Complaint Type count
212 TLC For Hire Vehicle Complaint 2789
217 TLC Taxi Complaint 2416
216 TLC Lost Property 472
220 TLC Taxi Report 318
213 TLC For Hire Vehicle Report 112
... ... ... ...
4 DCAS Comments 13
2 DCA Consumer Complaint 2892
3 DCA DCA / DOH New License Application Request 186
1 COIB Forms 1
0 ACS Forms 56

221 rows × 3 columns

Projects#

In real/ideal world, start with specific question and find data to answer it:

project flow

Source: Big Data and Social Science

Data needed often doesn’t exist or is hard (or impossible) to find/access

project flow

This will apply to all our Projects.

Project 1#