Lecture 20: Data visualization - Computing in Context (SIPA)

“Data visualization”, “chart”, and “graph” will be used interchangeably.

import plotly.io as pio

pio.renderers.default = "notebook_connected+plotly_mimetype"

Start by importing necessary packages

import pandas as pd
import plotly.express as px

Populations

population = pd.read_csv("https://data.cityofnewyork.us/api/views/xi7c-iiu2/rows.csv")
population.head()

How can we display this graphically?

Are there other options?

Which is best?

Histograms

What’s a histogram?

What makes it different from a bar chart?

fig = px.histogram(
    population,
    x="Borough",
    title="Number of community districts in each borough",
    height=300,
)
fig.show()

Why isn’t there a y?

fig = px.histogram(
    population,
    x="2010 Population",
    title="Distribution of Community District populations, 2010",
    height=300,
    nbins=20,
)
fig.show()

What is the nbins argument doing?
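To get a feel for what a bin count does, here is a rough sketch using numpy rather than Plotly. The population values are made up for illustration, and the analogy is approximate: as far as I know, Plotly treats nbins as an upper target and may choose a nearby "nice" bin width, while numpy uses exactly the number of bins you ask for.

```python
import numpy as np

# Made-up district populations, NOT the real NYC data
populations = np.array(
    [52_000, 61_000, 75_000, 98_000, 110_000, 120_000,
     125_000, 140_000, 155_000, 190_000, 210_000, 240_000]
)

# Same data, split into few wide bins vs. more narrow bins
coarse_counts, coarse_edges = np.histogram(populations, bins=3)
fine_counts, fine_edges = np.histogram(populations, bins=6)

print("3 bins:", coarse_counts, coarse_edges)
print("6 bins:", fine_counts, fine_edges)
```

Every value lands in exactly one bin either way, so the counts always add up to the number of districts. Only the level of detail changes.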

In-class exercise

How would we make a table of the number of community districts per borough?

# code here
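One possible approach, sketched on a small made-up table rather than the real dataset. The rows in toy are hypothetical; only the Borough column name comes from the data above. value_counts() tallies how many rows share each value, which is the same count the first histogram draws.

```python
import pandas as pd

# Hypothetical rows standing in for the real community district table
toy = pd.DataFrame(
    {
        "Borough": ["Bronx", "Bronx", "Brooklyn", "Queens", "Queens", "Queens"],
        "CD Name": ["A", "B", "C", "D", "E", "F"],
    }
)

# Count how many rows (districts) fall under each borough
districts_per_borough = toy["Borough"].value_counts()
print(districts_per_borough)
```

On the real data, the equivalent call would be population["Borough"].value_counts().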

How would we calculate the average community district population by borough?

# code here
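Again just one possible approach, on made-up numbers (the populations in toy are hypothetical; the column names match the dataset above). Grouping by borough and taking the mean of the population column gives one average per borough.

```python
import pandas as pd

# Hypothetical districts with made-up populations
toy = pd.DataFrame(
    {
        "Borough": ["Bronx", "Bronx", "Brooklyn", "Queens", "Queens", "Queens"],
        "2010 Population": [100_000, 140_000, 90_000, 120_000, 150_000, 180_000],
    }
)

# Split rows by borough, then average the population within each group
avg_population = toy.groupby("Borough")["2010 Population"].mean()
print(avg_population)
```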

Data from where we left off last class

A derived dataset containing the count of 311 complaints and the population of each community district.

districts = pd.read_csv("https://storage.googleapis.com/python-public-policy2/data/community_district_311.csv.zip")
districts.head()

Looking at raw volume is probably less useful than density.

Calculate 311 requests per capita

Divide the request count by the 2010 population to get requests per capita.

districts["requests_per_capita"] = districts["num_311_requests"] / districts["2010 Population"]

districts.head()

Let’s create a simplified new dataframe that includes only the columns we care about, in a better order.

columns = [
    "boro_cd",
    "Borough",
    "CD Name",
    "2010 Population",
    "num_311_requests",
    "requests_per_capita",
]
cd_data = districts[columns]

cd_data

Let’s check out which Community Districts have the highest complaints per capita:

cd_data.sort_values("requests_per_capita", ascending=False).head(10)

While Inwood (112) had the highest number of complaints, it ranks further down on the list for requests per capita. Midtown may also be an outlier, based on its low residential population.
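To see how a district can top one list but not the other, here is a minimal sketch on three made-up districts. The numbers and the rank_by_count and rank_per_capita columns are hypothetical; the other column names follow the dataframe above.

```python
import pandas as pd

# Three hypothetical districts with made-up figures
toy = pd.DataFrame(
    {
        "CD Name": ["District A", "District B", "District C"],
        "2010 Population": [200_000, 50_000, 100_000],
        "num_311_requests": [40_000, 25_000, 30_000],
    }
)
toy["requests_per_capita"] = toy["num_311_requests"] / toy["2010 Population"]

# Rank 1 = highest value in each column
toy["rank_by_count"] = toy["num_311_requests"].rank(ascending=False).astype(int)
toy["rank_per_capita"] = toy["requests_per_capita"].rank(ascending=False).astype(int)
print(toy[["CD Name", "rank_by_count", "rank_per_capita"]])
```

District A has the most requests but also the most residents, so it drops to last place once requests are divided by population.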

# cd_data.to_csv("data/311_community_districts.csv", index=False)

How does the per-capita distribution compare to that of the raw counts?

fig = px.histogram(districts, x="requests_per_capita", height=200)
fig.show()

fig = px.histogram(districts, x="num_311_requests", height=200)
fig.show()

Let’s improve the formatting (based on the .histogram() documentation):

fig = px.histogram(
    districts,
    x="requests_per_capita",
    title="Volume of 311 requests, 2018-2019",
    labels={"requests_per_capita": "311 requests per capita"},
)

# y-axis needs to be done separately, since it's derived
fig.update_layout(yaxis_title_text="Number of community districts")
fig.show()

Scatterplot

fig = px.scatter(
    districts,
    x="2010 Population",
    y="num_311_requests",
    title="Number of 311 requests per Community District by population",
)

fig.show()

Add a trendline:

fig = px.scatter(
    districts,
    x="2010 Population",
    y="num_311_requests",
    title="Number of 311 requests per Community District by population",
    trendline="ols",
)

fig.show()

Let’s take a look at the statistical summary, via the statsmodels package, following Plotly’s example:

trend_results = px.get_trendline_results(fig).iloc[0, 0]
trend_results.summary()

“In general, the higher the R-squared, the better the model fits your data.”
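To make R-squared less of a black box, here is a sketch that fits a straight line with numpy and computes R-squared by hand. The data points are made up, and this uses np.polyfit instead of the statsmodels fit Plotly runs, so treat it as an illustration of the formula rather than a reproduction of the summary above.

```python
import numpy as np

# Made-up (population, request count) pairs
x = np.array([50_000, 100_000, 150_000, 200_000, 250_000], dtype=float)
y = np.array([11_000, 19_500, 31_000, 39_000, 51_500], dtype=float)

# Ordinary least squares fit of a straight line: y ≈ slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)
predicted = slope * x + intercept

# R-squared = 1 - (unexplained variation / total variation)
ss_res = np.sum((y - predicted) ** 2)   # leftover error around the line
ss_tot = np.sum((y - y.mean()) ** 2)    # variation around the plain average
r_squared = 1 - ss_res / ss_tot
print(r_squared)
```

A value near 1 means the line accounts for most of the variation in y; a value near 0 means it does little better than guessing the average. In the notebook, the statsmodels results object should expose the same quantity as trend_results.rsquared.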

Lots of types of charts

The important things to know:

- What common chart types are called, so you can search for how to make them
- Why you’d pick one versus another

Chart hygiene

- Always include a title.
- Make sure you label the independent and dependent variables (X and Y axes).
- Consider whether you’re working with continuous vs. discrete values.
- Less is more.
  - If you’re trying to show more than three variables at once (e.g. X axis, Y axis, and color), try simplifying.