“Data visualization”, “chart”, and “graph” will be used interchangeably.

import plotly.io as pio

pio.renderers.default = "notebook_connected+plotly_mimetype"

Start by importing the necessary packages:

import pandas as pd
import plotly.express as px

Populations

population = pd.read_csv("https://data.cityofnewyork.us/api/views/xi7c-iiu2/rows.csv")
population.head()

How can we display this graphically?

Are there other options?

Which is best?

Histograms

What’s a histogram?

What makes it different from a bar chart?

fig = px.histogram(
    population,
    x="Borough",
    title="Number of community districts in each borough",
    height=300,
)
fig.show()

Why isn’t there a y?

fig = px.histogram(
    population,
    x="2010 Population",
    title="Distribution of Community District populations, 2010",
    height=300,
    nbins=20,
)
fig.show()

What is nbins doing?
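
One way to build intuition for binning is to do it by hand. The sketch below uses `pd.cut` on made-up population figures (not the real dataset) to split a numeric column into equal-width bins and count the rows in each. It illustrates the idea only: Plotly picks its own bin edges and may treat the requested number as a maximum, so its bins will not necessarily match these.

```python
import pandas as pd

# Made-up populations for illustration only (not the real dataset)
toy = pd.DataFrame(
    {"2010 Population": [52_000, 61_000, 98_000, 120_000, 122_000, 180_000, 205_000, 250_000]}
)


def bin_counts(values: pd.Series, n_bins: int) -> pd.Series:
    """Split values into n_bins equal-width intervals and count the rows in each."""
    binned = pd.cut(values, bins=n_bins)
    # sort_index() orders the result by interval rather than by count
    return binned.value_counts().sort_index()


coarse = bin_counts(toy["2010 Population"], 2)
fine = bin_counts(toy["2010 Population"], 4)

print(coarse)
print(fine)
```

Fewer bins give a smoother but blurrier picture; more bins show more detail but can leave some bins nearly empty.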

In-class exercise

How would we make a table of number of community districts per borough?

# code here

How would we calculate the average community district population by borough?

# code here
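
One possible approach to both questions, sketched on a tiny made-up table rather than the real `population` dataframe: `value_counts()` tallies rows per borough, and `groupby(...).mean()` averages a column within each borough. The column names mirror the ones used above; the numbers are invented.

```python
import pandas as pd

# Made-up rows for illustration; the real data has one row per community district
toy = pd.DataFrame(
    {
        "Borough": ["Bronx", "Bronx", "Queens", "Queens", "Queens"],
        "2010 Population": [100_000, 140_000, 60_000, 120_000, 150_000],
    }
)

# Number of community districts (rows) per borough
districts_per_borough = toy["Borough"].value_counts()

# Average community district population per borough
avg_population = toy.groupby("Borough")["2010 Population"].mean()

print(districts_per_borough)
print(avg_population)
```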

Data from where we left off last class

This is a derived dataset containing the complaint count and population of each community district.

districts = pd.read_csv("https://storage.googleapis.com/python-public-policy2/data/community_district_311.csv.zip")
districts.head()

Looking at raw request volume is probably less useful than looking at volume relative to population.

Calculate 311 requests per capita

Divide the request count by the 2010 population to get requests per capita:

districts["requests_per_capita"] = districts["num_311_requests"] / districts["2010 Population"]

districts.head()

Let’s create a simplified new dataframe that only includes the columns we care about, in a better order.

columns = [
    "boro_cd",
    "Borough",
    "CD Name",
    "2010 Population",
    "num_311_requests",
    "requests_per_capita",
]
cd_data = districts[columns]

cd_data

Let’s check which Community Districts have the most complaints per capita:

cd_data.sort_values("requests_per_capita", ascending=False).head(10)

While Inwood (112) had the highest number of complaints, it ranks further down on the list for requests per capita. Midtown may also be an outlier, based on its low residential population.
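
The gap between the two rankings is easy to reproduce on a toy table. The district names and numbers below are invented; only the column names follow the dataframe above. A district with many requests can fall in the per-capita ranking if it also has many residents, and a district with few residents can rise to the top.

```python
import pandas as pd

# Hypothetical districts; all figures are made up for illustration
toy = pd.DataFrame(
    {
        "CD Name": ["Big District", "Mid District", "Small District"],
        "2010 Population": [200_000, 100_000, 20_000],
        "num_311_requests": [50_000, 30_000, 10_000],
    }
)
toy["requests_per_capita"] = toy["num_311_requests"] / toy["2010 Population"]

# Rank the same districts by raw volume and by per-capita rate
by_volume = toy.sort_values("num_311_requests", ascending=False)["CD Name"].tolist()
by_rate = toy.sort_values("requests_per_capita", ascending=False)["CD Name"].tolist()

print(by_volume)
print(by_rate)
```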

# cd_data.to_csv("data/311_community_districts.csv", index=False)

How does the per-capita distribution compare to that of the raw counts?

fig = px.histogram(districts, x="requests_per_capita", height=200)
fig.show()

fig = px.histogram(districts, x="num_311_requests", height=200)
fig.show()

fig = px.histogram(
    districts,
    x="requests_per_capita",
    title="Volume of 311 requests, 2018-2019",
    labels={"requests_per_capita": "311 requests per capita"},
)

# y-axis needs to be done separately, since it's derived
fig.update_layout(yaxis_title_text="Number of community districts")
fig.show()

Scatterplot

fig = px.scatter(
    districts,
    x="2010 Population",
    y="num_311_requests",
    title="Number of 311 requests per Community District by population",
)

fig.show()

fig = px.scatter(
    districts,
    x="2010 Population",
    y="num_311_requests",
    title="Number of 311 requests per Community District by population",
    trendline="ols",
)

fig.show()

Let’s take a look at the statistical summary, via the statsmodels package, following Plotly’s example:

trend_results = px.get_trendline_results(fig).iloc[0, 0]
trend_results.summary()
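
For intuition about what `trendline="ols"` fits, here is a minimal sketch of ordinary least squares on made-up points, using `numpy.polyfit` as a stand-in. Plotly itself delegates to statsmodels, whose summary reports much more (standard errors, R-squared, and so on); this sketch only recovers the slope and intercept of the best-fit line.

```python
import numpy as np

# Made-up points that lie exactly on y = 0.5 * x + 1000
x = np.array([10_000, 20_000, 30_000, 40_000], dtype=float)
y = 0.5 * x + 1000

# A degree-1 polynomial fit is an ordinary least squares line
slope, intercept = np.polyfit(x, y, deg=1)

# Predicted values along the fitted line
predicted = slope * x + intercept

print(slope, intercept)
```

With real data the points will not sit exactly on the line, and the fitted slope is the one that minimizes the sum of squared vertical distances to the points.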

Lots of types of charts

The important things to know are:

  • What common chart types are called, so you can search how to make them

  • Why you’d pick one versus another

Chart hygiene

  • Always include a title.

  • Make sure you label the independent and dependent variables (X and Y axes).

  • Consider whether you’re working with continuous vs. discrete values.

  • Less is more.

    • If you’re trying to show more than three variables at once (e.g. X axis, Y axis, and color), try simplifying.