Site analytics#

Load data#

The data was downloaded from Read the Docs, where this site is hosted. See the Read the Docs traffic analytics documentation for details.

import pandas as pd

analytics = pd.read_csv("assets/data/readthedocs_traffic_analytics_computing-in-context_2024-09-15_2024-12-14.csv")
analytics
      Date                 Version  Path              Views
0     2024-12-14 00:00:00  latest   /lecture_15.html      1
1     2024-12-14 00:00:00  latest   /index.html           3
2     2024-12-14 00:00:00  latest   /lab_12.html          1
3     2024-12-13 00:00:00  latest   /notebooks.html       1
4     2024-12-13 00:00:00  latest   /lecture_20.html      1
...   ...                  ...      ...                 ...
1112  2024-10-21 00:00:00  latest   /index.html          28
1113  2024-10-20 00:00:00  latest   /project_1.html       1
1114  2024-10-20 00:00:00  latest   /index.html           9
1115  2024-10-20 00:00:00  latest   /lecture_15.html      1
1116  2024-10-20 00:00:00  latest   /lecture_16.html      1

1117 rows × 4 columns

Cleaning#

analytics.dtypes
Date       object
Version    object
Path       object
Views       int64
dtype: object
analytics["Date"] = pd.to_datetime(analytics["Date"])
analytics
      Date        Version  Path              Views
0     2024-12-14  latest   /lecture_15.html      1
1     2024-12-14  latest   /index.html           3
2     2024-12-14  latest   /lab_12.html          1
3     2024-12-13  latest   /notebooks.html       1
4     2024-12-13  latest   /lecture_20.html      1
...   ...         ...      ...                 ...
1112  2024-10-21  latest   /index.html          28
1113  2024-10-20  latest   /project_1.html       1
1114  2024-10-20  latest   /index.html           9
1115  2024-10-20  latest   /lecture_15.html      1
1116  2024-10-20  latest   /lecture_16.html      1

1117 rows × 4 columns
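
As an alternative to converting the column after loading, `pd.read_csv` can parse dates while reading via its `parse_dates` parameter. The sketch below assumes a comma-separated export with the same four columns shown above; the two rows in `sample_csv` are made up for illustration and are not read from the real file.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same shape as the export (assumed comma-separated).
sample_csv = io.StringIO(
    "Date,Version,Path,Views\n"
    "2024-12-14 00:00:00,latest,/index.html,3\n"
    "2024-12-13 00:00:00,latest,/notebooks.html,1\n"
)

# parse_dates converts the column while reading, so no separate to_datetime step is needed.
sample = pd.read_csv(sample_csv, parse_dates=["Date"])
print(sample.dtypes)
```

Either approach should leave `Date` as a datetime column, which the resampling further down relies on.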

Sort by path and date so that the legend and the x-axis are in order.

analytics = analytics.sort_values(["Path", "Date"])

Only include real pages: paths that end in .html and are not redirects.

html_only = analytics["Path"].str.endswith(".html")
no_redirects = ~analytics["Path"].str.startswith("/redirects/")
analytics = analytics[html_only & no_redirects]
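
To see what the two masks do, here is a small sketch on made-up paths (the `paths` frame and its values are hypothetical, not taken from the dataset). Each mask is a boolean Series, `~` negates one, and `&` combines them element-wise.

```python
import pandas as pd

# Hypothetical paths; only the first and last are real pages.
paths = pd.DataFrame(
    {
        "Path": ["/index.html", "/redirects/old.html", "/_static/logo.png", "/lab_12.html"],
        "Views": [3, 1, 2, 1],
    }
)

html_only = paths["Path"].str.endswith(".html")
no_redirects = ~paths["Path"].str.startswith("/redirects/")

# Keep rows where both conditions hold.
real_pages = paths[html_only & no_redirects]
print(real_pages["Path"].tolist())
```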

Pageviews per day by page#

from datetime import datetime

import plotly.express as px


def add_important_dates(fig):
    """Mark key dates on a figure with dashed, annotated vertical lines."""
    # Workaround for annotated vertical lines on a date axis:
    # pass x as milliseconds since the epoch.
    # https://github.com/plotly/plotly.py/issues/3065#issuecomment-778652215

    # https://bulletin.columbia.edu/sipa/registration/
    fig.add_vline(
        x=datetime(2024, 11, 18).timestamp() * 1000,
        line_dash="dash",
        annotation_text="Registration start",
    )
    fig.add_vline(
        x=datetime(2024, 12, 3).timestamp() * 1000,
        line_dash="dash",
        annotation_text="Test",
    )

fig = px.line(
    analytics,
    x="Date",
    y="Views",
    color="Path",
    title="Pageviews per day",
)
add_important_dates(fig)
fig.show()
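
The `x` values passed to `add_vline` are milliseconds since the Unix epoch. `datetime.timestamp()` on a naive `datetime` interprets it as local time, so the exact number depends on the machine's timezone. If a timezone-independent value is wanted, one option is to pin the date to UTC. This is a minimal sketch; `to_epoch_ms` is a hypothetical helper and not part of the notebook.

```python
from datetime import datetime, timezone


def to_epoch_ms(year, month, day):
    """Hypothetical helper: milliseconds since the Unix epoch for midnight UTC on the given date."""
    return datetime(year, month, day, tzinfo=timezone.utc).timestamp() * 1000


print(to_epoch_ms(2024, 11, 18))
```

Whether UTC or local midnight is the better reference depends on how the dates in the data were recorded.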

Site-wide pageviews per day#

def aggregate_by(df, offset):
    # only sum the relevant column for each period
    aggregated = df.resample(offset, on="Date")[["Views"]].sum()
    # don't include the last period of the dataset, as it's incomplete and thus misleadingly low
    aggregated = aggregated.iloc[:-1]

    return aggregated

views_by_day = aggregate_by(analytics, "D")
views_by_day
Views
Date
2024-10-20 12
2024-10-21 39
2024-10-22 445
2024-10-23 59
2024-10-24 350
2024-10-25 484
2024-10-26 21
2024-10-27 42
2024-10-28 80
2024-10-29 513
2024-10-30 231
2024-10-31 420
2024-11-01 545
2024-11-02 124
2024-11-03 132
2024-11-04 157
2024-11-05 113
2024-11-06 244
2024-11-07 529
2024-11-08 727
2024-11-09 148
2024-11-10 79
2024-11-11 300
2024-11-12 317
2024-11-13 128
2024-11-14 411
2024-11-15 473
2024-11-16 103
2024-11-17 115
2024-11-18 99
2024-11-19 418
2024-11-20 247
2024-11-21 326
2024-11-22 615
2024-11-23 289
2024-11-24 80
2024-11-25 177
2024-11-26 353
2024-11-27 194
2024-11-28 221
2024-11-29 240
2024-11-30 185
2024-12-01 371
2024-12-02 841
2024-12-03 1248
2024-12-04 257
2024-12-05 472
2024-12-06 701
2024-12-07 373
2024-12-08 160
2024-12-09 167
2024-12-10 143
2024-12-11 22
2024-12-12 9
2024-12-13 12
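
The behavior of `resample` with `on="Date"` can be checked on a tiny frame. The rows below are made up for illustration. Two things are worth noting: views for several pages on the same day are added together, and a day with no rows still appears in the result with a sum of 0.

```python
import pandas as pd

# Hypothetical per-page rows; note that 2024-10-22 has no rows at all.
rows = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2024-10-20", "2024-10-20", "2024-10-21", "2024-10-23"]),
        "Path": ["/index.html", "/lab_1.html", "/index.html", "/index.html"],
        "Views": [9, 1, 28, 5],
    }
)

# One row per calendar day, summing views across pages.
daily = rows.resample("D", on="Date")[["Views"]].sum()
print(daily)

# Drop the final (possibly incomplete) period, as aggregate_by does.
complete = daily.iloc[:-1]
print(complete)
```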
fig = px.line(
    views_by_day,
    y="Views",
    title="Pageviews per day",
)
add_important_dates(fig)
fig.show()

Site-wide pageviews per week (total)#

views_by_week = aggregate_by(analytics, "W")

# "W" bins end on Sunday and are labeled with that end date, so name the index accordingly
# https://stackoverflow.com/a/19851521/358804
views_by_week.index.names = ["Week ending"]

views_by_week
Views
Week ending
2024-10-20 12
2024-10-27 1440
2024-11-03 2045
2024-11-10 1997
2024-11-17 1847
2024-11-24 2074
2024-12-01 1741
2024-12-08 4052
px.line(
    views_by_week,
    y="Views",
    title="Pageviews per week",
)
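
With the `"W"` offset, pandas anchors weekly bins to Sunday and, by default, closes and labels each bin on its right edge, so each label is the Sunday that ends the week. The sketch below uses made-up data to show that default next to one possible way to label weeks by their starting Monday instead. The `W-MON` variant is an assumed alternative, not what the notebook uses.

```python
import pandas as pd

# Hypothetical data: one view per day, Monday 2024-10-21 through Sunday 2024-10-27.
week = pd.DataFrame({"Date": pd.date_range("2024-10-21", periods=7, freq="D"), "Views": 1})

# Default "W" bins end on Sunday and are labeled with that closing date.
ending = week.resample("W", on="Date")["Views"].sum()

# Assumed alternative: Monday-anchored bins, closed and labeled on the left edge.
starting = week.resample("W-MON", on="Date", closed="left", label="left")["Views"].sum()

print(ending.index[0].date(), starting.index[0].date())
```

Both variants group the same seven days here; only the label attached to the bin differs.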