Site analytics
Load data
The data was downloaded from Read the Docs, where this site is hosted. See the Read the Docs documentation for details.
import pandas as pd
analytics = pd.read_csv("assets/data/readthedocs_traffic_analytics_computing-in-context_2024-09-15_2024-12-14.csv")
analytics
| | Date | Version | Path | Views |
|---|---|---|---|---|
| 0 | 2024-12-14 00:00:00 | latest | /lecture_15.html | 1 |
| 1 | 2024-12-14 00:00:00 | latest | /index.html | 3 |
| 2 | 2024-12-14 00:00:00 | latest | /lab_12.html | 1 |
| 3 | 2024-12-13 00:00:00 | latest | /notebooks.html | 1 |
| 4 | 2024-12-13 00:00:00 | latest | /lecture_20.html | 1 |
| ... | ... | ... | ... | ... |
| 1112 | 2024-10-21 00:00:00 | latest | /index.html | 28 |
| 1113 | 2024-10-20 00:00:00 | latest | /project_1.html | 1 |
| 1114 | 2024-10-20 00:00:00 | latest | /index.html | 9 |
| 1115 | 2024-10-20 00:00:00 | latest | /lecture_15.html | 1 |
| 1116 | 2024-10-20 00:00:00 | latest | /lecture_16.html | 1 |
1117 rows × 4 columns
Cleaning
analytics.dtypes
Date object
Version object
Path object
Views int64
dtype: object
analytics["Date"] = pd.to_datetime(analytics["Date"])
analytics
| | Date | Version | Path | Views |
|---|---|---|---|---|
| 0 | 2024-12-14 | latest | /lecture_15.html | 1 |
| 1 | 2024-12-14 | latest | /index.html | 3 |
| 2 | 2024-12-14 | latest | /lab_12.html | 1 |
| 3 | 2024-12-13 | latest | /notebooks.html | 1 |
| 4 | 2024-12-13 | latest | /lecture_20.html | 1 |
| ... | ... | ... | ... | ... |
| 1112 | 2024-10-21 | latest | /index.html | 28 |
| 1113 | 2024-10-20 | latest | /project_1.html | 1 |
| 1114 | 2024-10-20 | latest | /index.html | 9 |
| 1115 | 2024-10-20 | latest | /lecture_15.html | 1 |
| 1116 | 2024-10-20 | latest | /lecture_16.html | 1 |
1117 rows × 4 columns
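As an alternative to converting the column after loading, `read_csv` can parse it directly through its `parse_dates` parameter. The sketch below uses a made-up one-row CSV held in memory (not the real export) with the same column names as the table above.

```python
import io

import pandas as pd

# Hypothetical one-row stand-in for the Read the Docs export.
csv_text = "Date,Version,Path,Views\n2024-12-14 00:00:00,latest,/index.html,3\n"

# parse_dates converts the named column to datetimes while reading.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])
print(sample.dtypes)
```

Either approach leaves `Date` as a datetime column, which the resampling later on this page relies on.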
Sort by path and date so that the legend and x-axis are in order.
analytics = analytics.sort_values(["Path", "Date"])
Only include real pages: HTML files that are not redirects.
html_only = analytics["Path"].str.endswith(".html")
no_redirects = ~analytics["Path"].str.startswith("/redirects/")
analytics = analytics[html_only & no_redirects]
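To see what the two masks keep and drop, here is a minimal sketch on a small frame. The paths are hypothetical and chosen only to exercise each condition; the filtering logic is the same as in the cell above.

```python
import pandas as pd

# Hypothetical paths for illustration, not rows from the real export.
sample = pd.DataFrame(
    {
        "Path": ["/index.html", "/redirects/old.html", "/_static/logo.png", "/lab_12.html"],
        "Views": [3, 1, 2, 1],
    }
)

html_only = sample["Path"].str.endswith(".html")
no_redirects = ~sample["Path"].str.startswith("/redirects/")

# A row survives only if both conditions hold.
kept = sample[html_only & no_redirects]
print(kept["Path"].tolist())
```

The redirect is dropped even though it ends in `.html`, and the image is dropped because it is not an HTML page.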
Pageviews per day by page
from datetime import datetime

import plotly.express as px


def add_important_dates(fig):
    # Annotated vertical lines on a date axis take x as epoch milliseconds (workaround):
    # https://github.com/plotly/plotly.py/issues/3065#issuecomment-778652215
    # https://bulletin.columbia.edu/sipa/registration/
    fig.add_vline(
        x=datetime(2024, 11, 18).timestamp() * 1000,
        line_dash="dash",
        annotation_text="Registration start",
    )
    fig.add_vline(
        x=datetime(2024, 12, 3).timestamp() * 1000,
        line_dash="dash",
        annotation_text="Test",
    )
fig = px.line(
    analytics,
    x="Date",
    y="Views",
    color="Path",
    title="Pageviews per day",
)
add_important_dates(fig)
fig.show()
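The vertical lines above pass `x` as milliseconds since the Unix epoch. One caveat: `timestamp()` on a naive `datetime` uses the machine's local timezone, so the line may shift by the local UTC offset. The sketch below assumes UTC midnight is the intended moment, which the original code does not state. The helper name `to_epoch_ms` is hypothetical.

```python
from datetime import datetime, timezone


def to_epoch_ms(moment: datetime) -> float:
    """Convert a datetime to milliseconds since the Unix epoch."""
    return moment.timestamp() * 1000


# Assumption: pin the date to UTC so the result does not depend on the local timezone.
registration_start = datetime(2024, 11, 18, tzinfo=timezone.utc)
print(to_epoch_ms(registration_start))
```

The result could be passed as `x` to `fig.add_vline` in place of the naive-datetime expression, if a timezone-independent position is wanted.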
Site-wide pageviews per day
def aggregate_by(df, offset):
    # only sum the relevant column, grouped into periods of the given offset
    aggregated = df.resample(offset, on="Date")[["Views"]].sum()
    # don't include the last period of the dataset, as it's incomplete and thus misleadingly low
    aggregated = aggregated.iloc[:-1]
    return aggregated
views_by_day = aggregate_by(analytics, "D")
views_by_day
Date | Views |
---|---|
2024-10-20 | 12 |
2024-10-21 | 39 |
2024-10-22 | 445 |
2024-10-23 | 59 |
2024-10-24 | 350 |
2024-10-25 | 484 |
2024-10-26 | 21 |
2024-10-27 | 42 |
2024-10-28 | 80 |
2024-10-29 | 513 |
2024-10-30 | 231 |
2024-10-31 | 420 |
2024-11-01 | 545 |
2024-11-02 | 124 |
2024-11-03 | 132 |
2024-11-04 | 157 |
2024-11-05 | 113 |
2024-11-06 | 244 |
2024-11-07 | 529 |
2024-11-08 | 727 |
2024-11-09 | 148 |
2024-11-10 | 79 |
2024-11-11 | 300 |
2024-11-12 | 317 |
2024-11-13 | 128 |
2024-11-14 | 411 |
2024-11-15 | 473 |
2024-11-16 | 103 |
2024-11-17 | 115 |
2024-11-18 | 99 |
2024-11-19 | 418 |
2024-11-20 | 247 |
2024-11-21 | 326 |
2024-11-22 | 615 |
2024-11-23 | 289 |
2024-11-24 | 80 |
2024-11-25 | 177 |
2024-11-26 | 353 |
2024-11-27 | 194 |
2024-11-28 | 221 |
2024-11-29 | 240 |
2024-11-30 | 185 |
2024-12-01 | 371 |
2024-12-02 | 841 |
2024-12-03 | 1248 |
2024-12-04 | 257 |
2024-12-05 | 472 |
2024-12-06 | 701 |
2024-12-07 | 373 |
2024-12-08 | 160 |
2024-12-09 | 167 |
2024-12-10 | 143 |
2024-12-11 | 22 |
2024-12-12 | 9 |
2024-12-13 | 12 |
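A small sketch of the aggregation on made-up rows shows two behaviors worth knowing: multiple rows on the same day are summed, and a day with no rows appears as 0 rather than being skipped. The function here is a compact variant of `aggregate_by` that selects `Views` before summing, and the sample values are hypothetical.

```python
import pandas as pd


def aggregate_by(df, offset):
    # Sum views per period, then drop the final (possibly incomplete) period.
    aggregated = df.resample(offset, on="Date")[["Views"]].sum()
    return aggregated.iloc[:-1]


# Hypothetical rows: two pages viewed on 12-12, nothing on 12-13, one page on 12-14.
sample = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2024-12-12", "2024-12-12", "2024-12-14"]),
        "Views": [4, 5, 2],
    }
)

daily = aggregate_by(sample, "D")
print(daily)
```

Here 2024-12-12 totals 9, 2024-12-13 is filled in with 0, and 2024-12-14 is dropped as the last period.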
fig = px.line(
    views_by_day,
    y="Views",
    title="Pageviews per day",
)
add_important_dates(fig)
fig.show()
Site-wide pageviews per week (total)
views_by_week = aggregate_by(analytics, "W")
# "W" bins end on Sunday and are labeled with that end date, so name the index accordingly
# https://stackoverflow.com/a/19851521/358804
views_by_week.index.names = ["Week ending"]
views_by_week
Week ending | Views |
---|---|
2024-10-20 | 12 |
2024-10-27 | 1440 |
2024-11-03 | 2045 |
2024-11-10 | 1997 |
2024-11-17 | 1847 |
2024-11-24 | 2074 |
2024-12-01 | 1741 |
2024-12-08 | 4052 |
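With the `"W"` offset, pandas bins weeks that end on Sunday and, by default, labels each bin with that closing Sunday. This is consistent with the table above, where the 2024-10-20 row (12 views) matches the daily total for that single date. A minimal sketch with one hypothetical view per day makes the labeling visible:

```python
import pandas as pd

# Hypothetical data: exactly one view per day, starting on Sunday 2024-10-20.
days = pd.date_range("2024-10-20", "2024-11-03", freq="D")
sample = pd.DataFrame({"Date": days, "Views": [1] * len(days)})

weekly = sample.resample("W", on="Date")[["Views"]].sum()
print(weekly)
```

The first bin holds only the single Sunday, and each later label is the last day of a full Monday-to-Sunday week.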
px.line(
    views_by_week,
    y="Views",
    title="Pageviews per week",
)