Lecture 22: Dates and time series analysis, APIs#
Dates and time series analysis#
From Wikipedia:
A time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data.
What are time series you’ve seen / worked with?
Importing necessary packages#
import pandas as pd
import plotly.express as px
Data preparation#
Load 311 data.
requests_311 = pd.read_csv("https://storage.googleapis.com/python-public-policy2/data/311_requests_2018-19_sample_clean.csv.zip")
/var/folders/kr/nx0m1j811kz5vy8c87ffchzr0000gn/T/ipykernel_36401/99667367.py:1: DtypeWarning: Columns (8,20,31,34) have mixed types. Specify dtype option on import or set low_memory=False.
requests_311 = pd.read_csv("https://storage.googleapis.com/python-public-policy2/data/311_requests_2018-19_sample_clean.csv.zip")
requests_311[["Created Date", "Closed Date"]]
Created Date | Closed Date | |
---|---|---|
0 | 08/01/2018 12:05:13 AM | 08/01/2018 12:05:13 AM |
1 | 08/01/2018 12:06:05 AM | 08/01/2018 12:06:05 AM |
2 | 08/01/2018 12:06:16 AM | 08/03/2018 02:03:55 PM |
3 | 08/01/2018 12:06:29 AM | 08/01/2018 02:54:24 AM |
4 | 08/01/2018 12:06:51 AM | 08/01/2018 04:54:26 AM |
... | ... | ... |
499953 | 08/24/2019 01:46:09 AM | NaN |
499954 | 08/24/2019 01:49:49 AM | NaN |
499955 | 08/24/2019 01:56:35 AM | NaN |
499956 | 08/24/2019 01:56:40 AM | NaN |
499957 | 08/24/2019 01:57:58 AM | NaN |
499958 rows × 2 columns
The dates are strings:
requests_311.dtypes
Unique Key int64
Created Date object
Closed Date object
Agency object
Agency Name object
Complaint Type object
Descriptor object
Location Type object
Incident Zip object
Incident Address object
Street Name object
Cross Street 1 object
Cross Street 2 object
Intersection Street 1 object
Intersection Street 2 object
Address Type object
City object
Landmark object
Facility Type object
Status object
Due Date object
Resolution Description object
Resolution Action Updated Date object
Community Board object
BBL float64
Borough object
X Coordinate (State Plane) float64
Y Coordinate (State Plane) float64
Open Data Channel Type object
Park Facility Name object
Park Borough object
Vehicle Type object
Taxi Company Borough object
Taxi Pick Up Location object
Bridge Highway Name object
Bridge Highway Direction object
Road Ramp object
Bridge Highway Segment object
Latitude float64
Longitude float64
Location object
dtype: object
Convert columns to timestamps using pandas’ to_datetime()#
requests_311["Created Date"] = pd.to_datetime(requests_311["Created Date"], format="%m/%d/%Y %I:%M:%S %p")
requests_311["Closed Date"] = pd.to_datetime(requests_311["Closed Date"], format="%m/%d/%Y %I:%M:%S %p")
requests_311[["Created Date", "Closed Date"]]
Created Date | Closed Date | |
---|---|---|
0 | 2018-08-01 00:05:13 | 2018-08-01 00:05:13 |
1 | 2018-08-01 00:06:05 | 2018-08-01 00:06:05 |
2 | 2018-08-01 00:06:16 | 2018-08-03 14:03:55 |
3 | 2018-08-01 00:06:29 | 2018-08-01 02:54:24 |
4 | 2018-08-01 00:06:51 | 2018-08-01 04:54:26 |
... | ... | ... |
499953 | 2019-08-24 01:46:09 | NaT |
499954 | 2019-08-24 01:49:49 | NaT |
499955 | 2019-08-24 01:56:35 | NaT |
499956 | 2019-08-24 01:56:40 | NaT |
499957 | 2019-08-24 01:57:58 | NaT |
499958 rows × 2 columns
More about the format string. If you don’t provide one, pandas has to infer the format from the data, which takes much longer.
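Each `%` directive in the format string maps to one piece of the timestamp. A minimal sketch using one of the 311 values:

```python
import pandas as pd

# %m = month, %d = day, %Y = 4-digit year,
# %I = hour on a 12-hour clock, %M = minutes, %S = seconds, %p = AM/PM
ts = pd.to_datetime("08/01/2018 12:05:13 AM", format="%m/%d/%Y %I:%M:%S %p")
print(ts)  # 2018-08-01 00:05:13
```

Note that `12:05:13 AM` becomes `00:05:13`: `%I` plus `%p` handles the 12-hour-clock conversion.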
Check data types and confirm they are now datetime:
requests_311.dtypes
Unique Key int64
Created Date datetime64[ns]
Closed Date datetime64[ns]
Agency object
Agency Name object
Complaint Type object
Descriptor object
Location Type object
Incident Zip object
Incident Address object
Street Name object
Cross Street 1 object
Cross Street 2 object
Intersection Street 1 object
Intersection Street 2 object
Address Type object
City object
Landmark object
Facility Type object
Status object
Due Date object
Resolution Description object
Resolution Action Updated Date object
Community Board object
BBL float64
Borough object
X Coordinate (State Plane) float64
Y Coordinate (State Plane) float64
Open Data Channel Type object
Park Facility Name object
Park Borough object
Vehicle Type object
Taxi Company Borough object
Taxi Pick Up Location object
Bridge Highway Name object
Bridge Highway Direction object
Road Ramp object
Bridge Highway Segment object
Latitude float64
Longitude float64
Location object
dtype: object
Noise complaints per day#
noise = requests_311[requests_311["Complaint Type"] == "Noise - Residential"]
noise_per_day = noise.resample("D", on="Created Date").size().reset_index(name="count_requests")
noise_per_day
Created Date | count_requests | |
---|---|---|
0 | 2018-08-01 | 50 |
1 | 2018-08-02 | 49 |
2 | 2018-08-03 | 65 |
3 | 2018-08-04 | 162 |
4 | 2018-08-05 | 191 |
... | ... | ... |
384 | 2019-08-20 | 60 |
385 | 2019-08-21 | 46 |
386 | 2019-08-22 | 62 |
387 | 2019-08-23 | 109 |
388 | 2019-08-24 | 33 |
389 rows × 2 columns
Resampling#
Once you have a column of datetime objects, pandas can manipulate them directly. From the User Guide: resample() is a time-based groupby.

In .resample('D', on='Created Date'), the 'D' is the period alias, i.e. the desired frequency (here, daily).
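The pattern is easier to see on a toy DataFrame (hypothetical timestamps):

```python
import pandas as pd

# made-up events: two on Aug 1, one each on Aug 2 and Aug 3
events = pd.DataFrame({
    "created": pd.to_datetime([
        "2018-08-01 09:00", "2018-08-01 21:00",
        "2018-08-02 12:00", "2018-08-03 08:00",
    ])
})

# 'D' = calendar day; other aliases include 'W' (week) and 'M' (month)
per_day = events.resample("D", on="created").size()
print(per_day.tolist())  # [2, 1, 1]
```

Changing only the alias (e.g. `"W"`) changes the bucket size without touching the rest of the code.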
fig = px.line(
noise_per_day,
x="Created Date",
y="count_requests",
title="Noise complaints per day",
)
fig.show()
Let’s try weekly:
noise_per_week = noise.resample("W", on="Created Date").size().reset_index(name="count_requests")
fig = px.line(
noise_per_week,
x="Created Date",
y="count_requests",
title="Noise complaints per week",
)
fig.show()
What’s the average duration?#
# calculate the amount of time that passed between Created Date and Closed Date
requests_311["resolution_duration"] = requests_311["Closed Date"] - requests_311["Created Date"]
# print head to check results
requests_311[["Closed Date", "Created Date", "resolution_duration"]].head()
Closed Date | Created Date | resolution_duration | |
---|---|---|---|
0 | 2018-08-01 00:05:13 | 2018-08-01 00:05:13 | 0 days 00:00:00 |
1 | 2018-08-01 00:06:05 | 2018-08-01 00:06:05 | 0 days 00:00:00 |
2 | 2018-08-03 14:03:55 | 2018-08-01 00:06:16 | 2 days 13:57:39 |
3 | 2018-08-01 02:54:24 | 2018-08-01 00:06:29 | 0 days 02:47:55 |
4 | 2018-08-01 04:54:26 | 2018-08-01 00:06:51 | 0 days 04:47:35 |
requests_311.dtypes
Unique Key int64
Created Date datetime64[ns]
Closed Date datetime64[ns]
Agency object
Agency Name object
Complaint Type object
Descriptor object
Location Type object
Incident Zip object
Incident Address object
Street Name object
Cross Street 1 object
Cross Street 2 object
Intersection Street 1 object
Intersection Street 2 object
Address Type object
City object
Landmark object
Facility Type object
Status object
Due Date object
Resolution Description object
Resolution Action Updated Date object
Community Board object
BBL float64
Borough object
X Coordinate (State Plane) float64
Y Coordinate (State Plane) float64
Open Data Channel Type object
Park Facility Name object
Park Borough object
Vehicle Type object
Taxi Company Borough object
Taxi Pick Up Location object
Bridge Highway Name object
Bridge Highway Direction object
Road Ramp object
Bridge Highway Segment object
Latitude float64
Longitude float64
Location object
resolution_duration timedelta64[ns]
dtype: object
requests_311["resolution_duration"].mean()
Timedelta('7 days 22:46:26.962985676')
requests_311["resolution_duration"].median()
Timedelta('1 days 05:19:58')
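The large gap between the mean and median suggests a small number of very slow requests pull the average up. For plotting or numeric summaries, a Timedelta column can be converted to a plain number via total_seconds(); a minimal sketch with made-up values:

```python
import pandas as pd

# hypothetical durations, matching the format seen above
durations = pd.Series(pd.to_timedelta(["0 days 02:47:55", "2 days 13:57:39"]))

# total_seconds() turns each Timedelta into a float number of seconds
hours = durations.dt.total_seconds() / 3600
print(hours.round(2).tolist())  # [2.8, 61.96]
```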
Filtering timestamps#
Noise complaints over New Year’s.
requests_311
Unique Key | Created Date | Closed Date | Agency | Agency Name | Complaint Type | Descriptor | Location Type | Incident Zip | Incident Address | ... | Taxi Company Borough | Taxi Pick Up Location | Bridge Highway Name | Bridge Highway Direction | Road Ramp | Bridge Highway Segment | Latitude | Longitude | Location | resolution_duration | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39885889 | 2018-08-01 00:05:13 | 2018-08-01 00:05:13 | DOT | Department of Transportation | Street Condition | Pothole | NaN | 11235 | 3143 SHORE PARKWAY | ... | NaN | NaN | NaN | NaN | NaN | NaN | 40.585156 | -73.959119 | (40.585155533520144, -73.95911915841708) | 0 days 00:00:00 |
1 | 39886470 | 2018-08-01 00:06:05 | 2018-08-01 00:06:05 | DOT | Department of Transportation | Street Condition | Pothole | NaN | 11235 | 3153 SHORE PARKWAY | ... | NaN | NaN | NaN | NaN | NaN | NaN | 40.585218 | -73.958608 | (40.58521848090658, -73.95860788382927) | 0 days 00:00:00 |
2 | 39893543 | 2018-08-01 00:06:16 | 2018-08-03 14:03:55 | HPD | Department of Housing Preservation and Develop... | HEAT/HOT WATER | ENTIRE BUILDING | RESIDENTIAL BUILDING | 11221 | 729 LAFAYETTE AVENUE | ... | NaN | NaN | NaN | NaN | NaN | NaN | 40.690733 | -73.943964 | (40.69073285353906, -73.943963521266) | 2 days 13:57:39 |
3 | 39886233 | 2018-08-01 00:06:29 | 2018-08-01 02:54:24 | NYPD | New York City Police Department | Noise - Residential | Banging/Pounding | Residential Building/House | 11693 | 82-01 BEACH CHANNEL DRIVE | ... | NaN | NaN | NaN | NaN | NaN | NaN | 40.589931 | -73.808896 | (40.58993080750793, -73.80889570815852) | 0 days 02:47:55 |
4 | 39880309 | 2018-08-01 00:06:51 | 2018-08-01 04:54:26 | NYPD | New York City Police Department | Noise - Residential | Loud Music/Party | Residential Building/House | 11216 | 64 HERKIMER STREET | ... | NaN | NaN | NaN | NaN | NaN | NaN | 40.679716 | -73.951234 | (40.67971590505359, -73.95123396494363) | 0 days 04:47:35 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
499953 | 43622302 | 2019-08-24 01:46:09 | NaT | NYPD | New York City Police Department | Noise - Residential | Loud Music/Party | Residential Building/House | 10009.0 | 431 EAST 9 STREET | ... | NaN | NaN | NaN | NaN | NaN | NaN | 40.727536 | -73.983295 | (40.72753608835362, -73.98329522742081) | NaT |
499954 | 43619709 | 2019-08-24 01:49:49 | NaT | NYPD | New York City Police Department | Noise - Residential | Loud Music/Party | Residential Building/House | 10304.0 | 191 BROAD STREET | ... | NaN | NaN | NaN | NaN | NaN | NaN | 40.624157 | -74.081006 | (40.62415703282506, -74.08100614362155) | NaT |
499955 | 43623124 | 2019-08-24 01:56:35 | NaT | NYPD | New York City Police Department | Noise - Street/Sidewalk | Loud Music/Party | Street/Sidewalk | 10031.0 | 534 WEST 153 STREET | ... | NaN | NaN | NaN | NaN | NaN | NaN | 40.830718 | -73.945006 | (40.83071800761314, -73.94500557250639) | NaT |
499956 | 43625595 | 2019-08-24 01:56:40 | NaT | NYPD | New York City Police Department | Noise - Commercial | Loud Music/Party | Club/Bar/Restaurant | 10452.0 | EAST 170 STREET | ... | NaN | NaN | NaN | NaN | NaN | NaN | 40.839882 | -73.916783 | (40.839882158779105, -73.91678321635897) | NaT |
499957 | 43622817 | 2019-08-24 01:57:58 | NaT | NYPD | New York City Police Department | Noise - Commercial | Loud Music/Party | Store/Commercial | 10033.0 | 247 AUDUBON AVENUE | ... | NaN | NaN | NaN | NaN | NaN | NaN | 40.846376 | -73.934048 | (40.84637632367179, -73.93404825809533) | NaT |
499958 rows × 42 columns
after_dec_31 = requests_311["Created Date"] >= pd.Timestamp(2018, 12, 31)
before_jan_2 = requests_311["Created Date"] < pd.Timestamp(2019, 1, 2)
residential_only = requests_311["Complaint Type"] == "Noise - Residential"
condition = after_dec_31 & before_jan_2 & residential_only
new_years_noise = requests_311[condition]
display_columns = ["Created Date", "Complaint Type"]
new_years_noise[display_columns]
Created Date | Complaint Type | |
---|---|---|
201471 | 2018-12-31 00:03:37 | Noise - Residential |
201473 | 2018-12-31 00:05:12 | Noise - Residential |
201475 | 2018-12-31 00:07:00 | Noise - Residential |
201478 | 2018-12-31 00:11:18 | Noise - Residential |
201481 | 2018-12-31 00:19:44 | Noise - Residential |
... | ... | ... |
203413 | 2019-01-01 22:59:22 | Noise - Residential |
203415 | 2019-01-01 23:03:03 | Noise - Residential |
203416 | 2019-01-01 23:04:27 | Noise - Residential |
203420 | 2019-01-01 23:10:46 | Noise - Residential |
203435 | 2019-01-01 23:56:22 | Noise - Residential |
321 rows × 2 columns
You can also filter dates using strings, but using the Timestamp class is safer.
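The two approaches are equivalent when the string parses cleanly; a small sketch with made-up dates:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2018-12-30", "2018-12-31", "2019-01-01"]))

# comparing against a string works because pandas parses it on the fly...
by_string = dates >= "2018-12-31"
# ...but an explicit Timestamp makes the comparison unambiguous
by_timestamp = dates >= pd.Timestamp(2018, 12, 31)

print(by_string.equals(by_timestamp))  # True
```

The Timestamp version fails loudly on a typo (e.g. a month of 13), while an ambiguous string may silently parse as the wrong date.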
APIs#
They are very powerful
Can be used from any programming language
APIs, conceptually#
How does booking a flight work?
interactions between systems ↔️
Ways to get data#
Method |
How it happens |
Pros |
Cons |
---|---|---|---|
Bulk |
Download, someone hands you a flash drive, etc. |
Fast, one-time transfer |
Can be large; data gets out of date easily |
APIs |
If organization makes one available |
Usually allows some filtering; can always pull latest-and-greatest |
Requires network connection for every call; higher barrier to entry (reading documentation, access); subject to availability and performance of API |
Scraping |
Data only available through a web site, PDF, or doc |
You can turn anything into data |
Tedious; fragile |
Data is only available if it’s available#
Please pray to the Demo Gods that these all work and there’s no profanity
API calls in the wild#
1. Go to the Candidates page on fec.gov.
2. Right-click and Inspect.
3. Go to the Network tab and reload.
4. Filter to XHR.
5. Click the API call.
We only see this because the tables on fec.gov are rendered client-side using their JSON API. That won’t be the case for all tables on all sites.
Parts of a URL#
For APIs:
Often split into “base URL” + “endpoint”
Endpoints are like function names: they represent the information you are retrieving or thing you are trying to do
Parameters are like function arguments:
They allow options to be specified
Some are required, some are optional
They will differ from one endpoint/function to another
Anchors won’t be used
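Putting those parts together, a URL can be assembled the same way a function call is: base URL + endpoint + encoded parameters. A sketch using Python's standard library (the `borough` filter parameter is hypothetical):

```python
from urllib.parse import urlencode

base_url = "https://data.cityofnewyork.us"  # scheme + domain
endpoint = "/resource/xywu-7bv9.json"       # the "function name"
params = {"borough": "Queens"}              # hypothetical filter "argument"

# the query string is just the params dict, URL-encoded and joined with "?"
url = f"{base_url}{endpoint}?{urlencode(params)}"
print(url)  # https://data.cityofnewyork.us/resource/xywu-7bv9.json?borough=Queens
```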
API documentation#
Go to Population By Borough dataset and get the API endpoint.
API calls from Python#
Usually one of two ways:

- A software development kit (SDK) like sodapy
  - Abstracts the details away
  - Not available for all APIs
  - May have limitations
- Making the HTTP calls yourself with the `requests` package
  - Nothing to do with 311 requests
import requests
response = requests.get("https://data.cityofnewyork.us/resource/xywu-7bv9.json")
data = response.json()
data
[{'age_group': 'Total Population',
'borough': 'NYC Total',
'_1950': '7891957',
'_1950_boro_share_of_nyc_total': '100.00',
'_1960': '7781984',
'_1960_boro_share_of_nyc_total': '100.00',
'_1970': '7894862',
'_1970_boro_share_of_nyc_total': '100.00',
'_1980': '7071639',
'_1980_boro_share_of_nyc_total': '100.00',
'_1990': '7322564',
'_1990_boro_share_of_nyc_total': '100.00',
'_2000': '8008278',
'_2000_boro_share_of_nyc_total': '100.00',
'_2010': '8242624',
'_2010_boro_share_of_nyc_total': '100.00',
'_2020': '8550971',
'_2020_boro_share_of_nyc_total': '100.00',
'_2030': '8821027',
'_2030_boro_share_of_nyc_total': '100.00',
'_2040': '9025145',
'_2040_boro_share_of_nyc_total': '100.00'},
{'age_group': 'Total Population',
'borough': ' Bronx',
'_1950': '1451277',
'_1950_boro_share_of_nyc_total': '18.39',
'_1960': '1424815',
'_1960_boro_share_of_nyc_total': '18.31',
'_1970': '1471701',
'_1970_boro_share_of_nyc_total': '18.64',
'_1980': '1168972',
'_1980_boro_share_of_nyc_total': '16.53',
'_1990': '1203789',
'_1990_boro_share_of_nyc_total': '16.44',
'_2000': '1332650',
'_2000_boro_share_of_nyc_total': '16.64',
'_2010': '1385108',
'_2010_boro_share_of_nyc_total': '16.80',
'_2020': '1446788',
'_2020_boro_share_of_nyc_total': '16.92',
'_2030': '1518998',
'_2030_boro_share_of_nyc_total': '17.22',
'_2040': '1579245',
'_2040_boro_share_of_nyc_total': '17.50'},
{'age_group': 'Total Population',
'borough': ' Brooklyn',
'_1950': '2738175',
'_1950_boro_share_of_nyc_total': '34.70',
'_1960': '2627319',
'_1960_boro_share_of_nyc_total': '33.76',
'_1970': '2602012',
'_1970_boro_share_of_nyc_total': '32.96',
'_1980': '2230936',
'_1980_boro_share_of_nyc_total': '31.55',
'_1990': '2300664',
'_1990_boro_share_of_nyc_total': '31.42',
'_2000': '2465326',
'_2000_boro_share_of_nyc_total': '30.78',
'_2010': '2552911',
'_2010_boro_share_of_nyc_total': '30.97',
'_2020': '2648452',
'_2020_boro_share_of_nyc_total': '30.97',
'_2030': '2754009',
'_2030_boro_share_of_nyc_total': '31.22',
'_2040': '2840525',
'_2040_boro_share_of_nyc_total': '31.47'},
{'age_group': 'Total Population',
'borough': ' Manhattan',
'_1950': '1960101',
'_1950_boro_share_of_nyc_total': '24.84',
'_1960': '1698281',
'_1960_boro_share_of_nyc_total': '21.82',
'_1970': '1539233',
'_1970_boro_share_of_nyc_total': '19.50',
'_1980': '1428285',
'_1980_boro_share_of_nyc_total': '20.20',
'_1990': '1487536',
'_1990_boro_share_of_nyc_total': '20.31',
'_2000': '1537195',
'_2000_boro_share_of_nyc_total': '19.20',
'_2010': '1585873',
'_2010_boro_share_of_nyc_total': '19.24',
'_2020': '1638281',
'_2020_boro_share_of_nyc_total': '19.16',
'_2030': '1676720',
'_2030_boro_share_of_nyc_total': '19.01',
'_2040': '1691617',
'_2040_boro_share_of_nyc_total': '18.74'},
{'age_group': 'Total Population',
'borough': ' Queens',
'_1950': '1550849',
'_1950_boro_share_of_nyc_total': '19.65',
'_1960': '1809578',
'_1960_boro_share_of_nyc_total': '23.25',
'_1970': '1986473',
'_1970_boro_share_of_nyc_total': '25.16',
'_1980': '1891325',
'_1980_boro_share_of_nyc_total': '26.75',
'_1990': '1951598',
'_1990_boro_share_of_nyc_total': '26.65',
'_2000': '2229379',
'_2000_boro_share_of_nyc_total': '27.84',
'_2010': '2250002',
'_2010_boro_share_of_nyc_total': '27.30',
'_2020': '2330295',
'_2020_boro_share_of_nyc_total': '27.25',
'_2030': '2373551',
'_2030_boro_share_of_nyc_total': '26.91',
'_2040': '2412649',
'_2040_boro_share_of_nyc_total': '26.73'},
{'age_group': 'Total Population',
'borough': ' Staten Island',
'_1950': '191555',
'_1950_boro_share_of_nyc_total': '2.43',
'_1960': '221991',
'_1960_boro_share_of_nyc_total': '2.85',
'_1970': '295443',
'_1970_boro_share_of_nyc_total': '3.74',
'_1980': '352121',
'_1980_boro_share_of_nyc_total': '4.98',
'_1990': '378977',
'_1990_boro_share_of_nyc_total': '5.18',
'_2000': '443728',
'_2000_boro_share_of_nyc_total': '5.54',
'_2010': '468730',
'_2010_boro_share_of_nyc_total': '5.69',
'_2020': '487155',
'_2020_boro_share_of_nyc_total': '5.70',
'_2030': '497749',
'_2030_boro_share_of_nyc_total': '5.64',
'_2040': '501109',
'_2040_boro_share_of_nyc_total': '5.55'}]
Reading into a DataFrame#
pd.DataFrame(data)
age_group | borough | _1950 | _1950_boro_share_of_nyc_total | _1960 | _1960_boro_share_of_nyc_total | _1970 | _1970_boro_share_of_nyc_total | _1980 | _1980_boro_share_of_nyc_total | ... | _2000 | _2000_boro_share_of_nyc_total | _2010 | _2010_boro_share_of_nyc_total | _2020 | _2020_boro_share_of_nyc_total | _2030 | _2030_boro_share_of_nyc_total | _2040 | _2040_boro_share_of_nyc_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Total Population | NYC Total | 7891957 | 100.00 | 7781984 | 100.00 | 7894862 | 100.00 | 7071639 | 100.00 | ... | 8008278 | 100.00 | 8242624 | 100.00 | 8550971 | 100.00 | 8821027 | 100.00 | 9025145 | 100.00 |
1 | Total Population | Bronx | 1451277 | 18.39 | 1424815 | 18.31 | 1471701 | 18.64 | 1168972 | 16.53 | ... | 1332650 | 16.64 | 1385108 | 16.80 | 1446788 | 16.92 | 1518998 | 17.22 | 1579245 | 17.50 |
2 | Total Population | Brooklyn | 2738175 | 34.70 | 2627319 | 33.76 | 2602012 | 32.96 | 2230936 | 31.55 | ... | 2465326 | 30.78 | 2552911 | 30.97 | 2648452 | 30.97 | 2754009 | 31.22 | 2840525 | 31.47 |
3 | Total Population | Manhattan | 1960101 | 24.84 | 1698281 | 21.82 | 1539233 | 19.50 | 1428285 | 20.20 | ... | 1537195 | 19.20 | 1585873 | 19.24 | 1638281 | 19.16 | 1676720 | 19.01 | 1691617 | 18.74 |
4 | Total Population | Queens | 1550849 | 19.65 | 1809578 | 23.25 | 1986473 | 25.16 | 1891325 | 26.75 | ... | 2229379 | 27.84 | 2250002 | 27.30 | 2330295 | 27.25 | 2373551 | 26.91 | 2412649 | 26.73 |
5 | Total Population | Staten Island | 191555 | 2.43 | 221991 | 2.85 | 295443 | 3.74 | 352121 | 4.98 | ... | 443728 | 5.54 | 468730 | 5.69 | 487155 | 5.70 | 497749 | 5.64 | 501109 | 5.55 |
6 rows × 22 columns
Most open data sites have APIs#
Often built on platforms that provide them, e.g.
NYC Open Data Portal built on ~~Socrata~~ Tyler Data & Insights
Their API is still called the Socrata Open Data API (SODA)
Pagination#
Most APIs limit the number of results returned.
You’ll work with this in Lab 12.
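The general pattern is to request one page at a time until a short or empty page comes back. Socrata-style APIs, for example, accept `$limit` and `$offset` query parameters. A sketch using a stand-in fetch function instead of real network calls:

```python
# Generic pagination loop: fetch_page stands in for a real API call, e.g.
# requests.get(url, params={"$limit": limit, "$offset": offset}).json()
def fetch_all(fetch_page, limit=2):
    results = []
    offset = 0
    while True:
        page = fetch_page(limit=limit, offset=offset)
        results.extend(page)
        if len(page) < limit:  # a short page means we've reached the end
            break
        offset += limit
    return results

# hypothetical stand-in "API" holding 5 records, served in pages of up to 2
fake_data = list(range(5))

def fake_fetch(limit, offset):
    return fake_data[offset : offset + limit]

print(fetch_all(fake_fetch))  # [0, 1, 2, 3, 4]
```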
Things are going to differ by API#
Endpoints
Supported parameters
Response structure
json_normalize() can help
Quality of documentation
Helpfulness of errors
Size/helpfulness of community
Gotta read and experiment.
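For nested response structures in particular, pandas' json_normalize() flattens inner dicts into dotted column names. A sketch with a hypothetical nested response:

```python
import pandas as pd

# hypothetical API response where each record nests a location dict
records = [
    {"id": 1, "location": {"lat": 40.71, "lon": -74.01}},
    {"id": 2, "location": {"lat": 40.83, "lon": -73.94}},
]

# nested keys become dotted column names: location.lat, location.lon
df = pd.json_normalize(records)
print(df.columns.tolist())  # ['id', 'location.lat', 'location.lon']
```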