Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

import plotly.io as pio

pio.renderers.default = "notebook_connected+plotly_mimetype"

Dates and time series analysis

From Wikipedia:

A time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data.

What are time series you’ve seen / worked with?

Importing necessary packages

import pandas as pd
import plotly.express as px
<frozen importlib._bootstrap>:491: RuntimeWarning:

The global interpreter lock (GIL) has been enabled to load module 'pandas._libs.pandas_parser', which has not declared that it can run safely without the GIL. To override this behavior and keep the GIL disabled (at your own risk), run with PYTHON_GIL=0 or -Xgil=0.

Data preparation

Load 311 data.

requests_311 = pd.read_csv("https://storage.googleapis.com/python-public-policy2/data/311_requests_2018-19_sample_clean.csv.zip")
/var/folders/kr/nx0m1j811kz5vy8c87ffchzr0000gn/T/ipykernel_93083/99667367.py:1: DtypeWarning:

Columns (8,20,31,34) have mixed types. Specify dtype option on import or set low_memory=False.

requests_311[["Created Date", "Closed Date"]]
Loading...

The dates are strings:

requests_311.dtypes
Unique Key int64 Created Date object Closed Date object Agency object Agency Name object Complaint Type object Descriptor object Location Type object Incident Zip object Incident Address object Street Name object Cross Street 1 object Cross Street 2 object Intersection Street 1 object Intersection Street 2 object Address Type object City object Landmark object Facility Type object Status object Due Date object Resolution Description object Resolution Action Updated Date object Community Board object BBL float64 Borough object X Coordinate (State Plane) float64 Y Coordinate (State Plane) float64 Open Data Channel Type object Park Facility Name object Park Borough object Vehicle Type object Taxi Company Borough object Taxi Pick Up Location object Bridge Highway Name object Bridge Highway Direction object Road Ramp object Bridge Highway Segment object Latitude float64 Longitude float64 Location object dtype: object

Convert columns to timestamps using pandas’ to_datetime()

requests_311["Created Date"] = pd.to_datetime(requests_311["Created Date"], format="%m/%d/%Y %I:%M:%S %p")
requests_311["Closed Date"] = pd.to_datetime(requests_311["Closed Date"], format="%m/%d/%Y %I:%M:%S %p")
requests_311[["Created Date", "Closed Date"]]
Loading...

More about the format string. If you don’t provide one, it will take much longer to convert.

Check data types and confirm they are now datetime:

requests_311.dtypes
Unique Key int64 Created Date datetime64[ns] Closed Date datetime64[ns] Agency object Agency Name object Complaint Type object Descriptor object Location Type object Incident Zip object Incident Address object Street Name object Cross Street 1 object Cross Street 2 object Intersection Street 1 object Intersection Street 2 object Address Type object City object Landmark object Facility Type object Status object Due Date object Resolution Description object Resolution Action Updated Date object Community Board object BBL float64 Borough object X Coordinate (State Plane) float64 Y Coordinate (State Plane) float64 Open Data Channel Type object Park Facility Name object Park Borough object Vehicle Type object Taxi Company Borough object Taxi Pick Up Location object Bridge Highway Name object Bridge Highway Direction object Road Ramp object Bridge Highway Segment object Latitude float64 Longitude float64 Location object dtype: object

Noise complaints per day

noise = requests_311[requests_311["Complaint Type"] == "Noise - Residential"]
noise_per_day = noise.resample("D", on="Created Date").size().reset_index(name="count_requests")

noise_per_day
Loading...

Resampling

Once you have a column with datetime objects, pandas can manipulate them directly. From the User Guide:

resample() is a time-based groupby

.resample('D', on='Created Date')

The 'D' is the period alias, i.e. the desired frequency.

fig = px.line(
    noise_per_day,
    x="Created Date",
    y="count_requests",
    title="Noise complaints per day",
)
fig.show()
Loading...
Loading...

Let’s try weekly:

noise_per_week = noise.resample("W", on="Created Date").size().reset_index(name="count_requests")

fig = px.line(
    noise_per_week,
    x="Created Date",
    y="count_requests",
    title="Noise complaints per week",
)
fig.show()
Loading...

What’s the average duration?

# calculate the amount of time that passed between Created Date and Closed Date

requests_311["resolution_duration"] = requests_311["Closed Date"] - requests_311["Created Date"]

# print head to check results

requests_311[["Closed Date", "Created Date", "resolution_duration"]].head()
Loading...
requests_311.dtypes
Unique Key int64 Created Date datetime64[ns] Closed Date datetime64[ns] Agency object Agency Name object Complaint Type object Descriptor object Location Type object Incident Zip object Incident Address object Street Name object Cross Street 1 object Cross Street 2 object Intersection Street 1 object Intersection Street 2 object Address Type object City object Landmark object Facility Type object Status object Due Date object Resolution Description object Resolution Action Updated Date object Community Board object BBL float64 Borough object X Coordinate (State Plane) float64 Y Coordinate (State Plane) float64 Open Data Channel Type object Park Facility Name object Park Borough object Vehicle Type object Taxi Company Borough object Taxi Pick Up Location object Bridge Highway Name object Bridge Highway Direction object Road Ramp object Bridge Highway Segment object Latitude float64 Longitude float64 Location object resolution_duration timedelta64[ns] dtype: object
requests_311["resolution_duration"].mean()
Timedelta('7 days 22:46:26.962985676')
requests_311["resolution_duration"].median()
Timedelta('1 days 05:19:58')

Filtering timestamps

Noise complaints over New Year’s.

requests_311
Loading...
after_dec_31 = requests_311["Created Date"] >= pd.Timestamp(2018, 12, 31)
before_jan_2 = requests_311["Created Date"] < pd.Timestamp(2019, 1, 2)
residential_only = requests_311["Complaint Type"] == "Noise - Residential"

condition = after_dec_31 & before_jan_2 & residential_only
new_years_noise = requests_311[condition]

display_columns = ["Created Date", "Complaint Type"]
new_years_noise[display_columns]
Loading...

You can also filter dates using strings, but using the Timestamp class is safer.

APIs

  • They are very powerful

  • Can be used from any programming language

APIs, conceptually

How does booking a flight work?

Diagram showing how online payments work: Expedia talks to Delta, Delta talks to Stripe, Stripe talks to Visa, and Visa talks to Chase
Diagram showing how notifications flow through systems
Diagram showing relationship between human languages, programming languages, and APIs

interactions between systems ↔️

Ways to get data

MethodHow it happensProsCons
BulkDownload, someone hands you a flash drive, etc.Fast, one-time transferCan be large; data gets out of date easily
APIsIf organization makes one availableUsually allows some filtering; can always pull latest-and-greatestRequires network connection for every call; higher barrier to entry (reading documentation, access); subject to availability and performance of API
ScrapingData only available through a web site, PDF, or docYou can turn anything into dataTedious; fragile

Data is only available if it’s available

Please pray to the Demo Gods that these all work and there’s no profanity

API calls in the wild

  1. Go to Candidates page on fec.gov.

  2. Right click and Inspect.

  3. Go to the Network tab and reload.

  4. Filter to XHR.

  5. Click the API call.

We only see this because the tables on fec.gov are rendered client-side using their JSON API. That won’t be the case for all tables on all sites.

Parts of a URL

URL structure

source

For APIs:

  • Often split into “base URL” + “endpoint”

  • Endpoints are like function names: they represent the information you are retrieving or thing you are trying to do

  • Parameters are like function arguments:

    • They allow options to be specified

    • Some are required, some are optional

    • They will differ from one endpoint/function to another

  • Anchors won’t be used

API documentation

FEC API

Go to Population By Borough dataset and get the API endpoint.

API calls from Python

Usually one of two ways:

  • A software development kit (SDK) like sodapy

    • Abstracts the details away

    • Not available for all APIs

    • May have limitations

  • The requests package

    • Nothing to do with 311 requests

import requests

response = requests.get("https://data.cityofnewyork.us/resource/xywu-7bv9.json")
data = response.json()

data
[{'age_group': 'Total Population', 'borough': 'NYC Total', '_1950': '7891957', '_1950_boro_share_of_nyc_total': '100.00', '_1960': '7781984', '_1960_boro_share_of_nyc_total': '100.00', '_1970': '7894862', '_1970_boro_share_of_nyc_total': '100.00', '_1980': '7071639', '_1980_boro_share_of_nyc_total': '100.00', '_1990': '7322564', '_1990_boro_share_of_nyc_total': '100.00', '_2000': '8008278', '_2000_boro_share_of_nyc_total': '100.00', '_2010': '8242624', '_2010_boro_share_of_nyc_total': '100.00', '_2020': '8550971', '_2020_boro_share_of_nyc_total': '100.00', '_2030': '8821027', '_2030_boro_share_of_nyc_total': '100.00', '_2040': '9025145', '_2040_boro_share_of_nyc_total': '100.00'}, {'age_group': 'Total Population', 'borough': ' Bronx', '_1950': '1451277', '_1950_boro_share_of_nyc_total': '18.39', '_1960': '1424815', '_1960_boro_share_of_nyc_total': '18.31', '_1970': '1471701', '_1970_boro_share_of_nyc_total': '18.64', '_1980': '1168972', '_1980_boro_share_of_nyc_total': '16.53', '_1990': '1203789', '_1990_boro_share_of_nyc_total': '16.44', '_2000': '1332650', '_2000_boro_share_of_nyc_total': '16.64', '_2010': '1385108', '_2010_boro_share_of_nyc_total': '16.80', '_2020': '1446788', '_2020_boro_share_of_nyc_total': '16.92', '_2030': '1518998', '_2030_boro_share_of_nyc_total': '17.22', '_2040': '1579245', '_2040_boro_share_of_nyc_total': '17.50'}, {'age_group': 'Total Population', 'borough': ' Brooklyn', '_1950': '2738175', '_1950_boro_share_of_nyc_total': '34.70', '_1960': '2627319', '_1960_boro_share_of_nyc_total': '33.76', '_1970': '2602012', '_1970_boro_share_of_nyc_total': '32.96', '_1980': '2230936', '_1980_boro_share_of_nyc_total': '31.55', '_1990': '2300664', '_1990_boro_share_of_nyc_total': '31.42', '_2000': '2465326', '_2000_boro_share_of_nyc_total': '30.78', '_2010': '2552911', '_2010_boro_share_of_nyc_total': '30.97', '_2020': '2648452', '_2020_boro_share_of_nyc_total': '30.97', '_2030': '2754009', '_2030_boro_share_of_nyc_total': '31.22', '_2040': '2840525', '_2040_boro_share_of_nyc_total': '31.47'}, {'age_group': 'Total Population', 'borough': ' Manhattan', '_1950': '1960101', '_1950_boro_share_of_nyc_total': '24.84', '_1960': '1698281', '_1960_boro_share_of_nyc_total': '21.82', '_1970': '1539233', '_1970_boro_share_of_nyc_total': '19.50', '_1980': '1428285', '_1980_boro_share_of_nyc_total': '20.20', '_1990': '1487536', '_1990_boro_share_of_nyc_total': '20.31', '_2000': '1537195', '_2000_boro_share_of_nyc_total': '19.20', '_2010': '1585873', '_2010_boro_share_of_nyc_total': '19.24', '_2020': '1638281', '_2020_boro_share_of_nyc_total': '19.16', '_2030': '1676720', '_2030_boro_share_of_nyc_total': '19.01', '_2040': '1691617', '_2040_boro_share_of_nyc_total': '18.74'}, {'age_group': 'Total Population', 'borough': ' Queens', '_1950': '1550849', '_1950_boro_share_of_nyc_total': '19.65', '_1960': '1809578', '_1960_boro_share_of_nyc_total': '23.25', '_1970': '1986473', '_1970_boro_share_of_nyc_total': '25.16', '_1980': '1891325', '_1980_boro_share_of_nyc_total': '26.75', '_1990': '1951598', '_1990_boro_share_of_nyc_total': '26.65', '_2000': '2229379', '_2000_boro_share_of_nyc_total': '27.84', '_2010': '2250002', '_2010_boro_share_of_nyc_total': '27.30', '_2020': '2330295', '_2020_boro_share_of_nyc_total': '27.25', '_2030': '2373551', '_2030_boro_share_of_nyc_total': '26.91', '_2040': '2412649', '_2040_boro_share_of_nyc_total': '26.73'}, {'age_group': 'Total Population', 'borough': ' Staten Island', '_1950': '191555', '_1950_boro_share_of_nyc_total': '2.43', '_1960': '221991', '_1960_boro_share_of_nyc_total': '2.85', '_1970': '295443', '_1970_boro_share_of_nyc_total': '3.74', '_1980': '352121', '_1980_boro_share_of_nyc_total': '4.98', '_1990': '378977', '_1990_boro_share_of_nyc_total': '5.18', '_2000': '443728', '_2000_boro_share_of_nyc_total': '5.54', '_2010': '468730', '_2010_boro_share_of_nyc_total': '5.69', '_2020': '487155', '_2020_boro_share_of_nyc_total': '5.70', '_2030': '497749', '_2030_boro_share_of_nyc_total': '5.64', '_2040': '501109', '_2040_boro_share_of_nyc_total': '5.55'}]

Reading into a DataFrame

pd.DataFrame(data)
Loading...

Most open data sites have APIs

Often built on platforms that provide them, e.g.

Pagination

Things are going to differ by API

  • Endpoints

  • Supported parameters

  • Response structure

  • Quality of documentation

  • Helpfulness of errors

  • Size/helpfulness of community

Gotta read and experiment.