# Project 3
Complete Lab 12 before proceeding.
## Objectives
This project has two parts:

- Project site
- Data analysis

They can be done in either order, or in parallel.
## Project site
You don’t know how to do this yet. You will need to read the documentation to figure it out.
1. Make a homepage in your JupyterBook.
   - Suggest making it a Markdown file, `index.md`; `project_3` will no longer be the root.
   - Relevant documentation:
2. Add your Project 1 and 2 as pages (chapters) of your site.
3. Add the following near the top of each Project notebook (before any Plotly code), and Run All. More information.

   ```python
   import plotly.io as pio

   # Render Plotly figures in VS Code, JupyterLab, and the built HTML site
   pio.renderers.default = "vscode+jupyterlab+notebook_connected"
   ```

4. Build and preview as many times as you need to confirm things show up as expected.
Tip
You can press the up arrow on your keyboard to get to previously-run commands, rather than having to re-type them each time.
The built site should look something like this:
You are more than welcome to customize the site as much as you like, but it’s recommended that you complete the Project first.
Caution
Please do not include your Lab notebooks in your site, per the Academic Integrity Policy.
### Publish
You will be deploying the site to Read the Docs via GitHub.
1. Add an `environment.yml` file to specify the conda package dependencies for building the site.

   ```yaml
   name: computing-in-context
   channels:
     - default
     - conda-forge
   dependencies:
     - jupyter-book=1.*
     # https://github.com/sphinx-doc/sphinx/issues/10440#issuecomment-1556180835
     - sphinx>=6.2.0
   ```

2. Add a `.readthedocs.yml` file, matching the one from this site.
3. Commit and push the changes.
4. Go through the Read the Docs tutorial.
   - Skip “Preparing your repository on GitHub” - you’ve already done that.
   - Stop after “Checking the first build”.
## Data analysis
You can think of this as similar to the Project 2 requirements, but expanded. See the examples of Final Projects for Python for Public Policy; the result of this Project will be similar.
### Process
You’ll be working in the notebook created in Lecture 23.
1. Find a dataset that seems interesting.
   - To meet the requirement that your project “not be trivial,” you probably want a dataset that’s large enough that you can’t understand it at a glance.
   - If you’re only using one dataset, you probably want it to have 500+ rows.
2. Load the data into a DataFrame.
3. Inspect the data a bit. (A sketch of steps 2–3 follows this list.)
4. Fill out the prompt (below).
5. Work backwards: on a piece of paper / whiteboard, draw the visualization you imagine producing.
6. Use the data to answer the question.
   - If you end up answering your initial research question easily (and haven’t met the requirements yet), that’s fine; ask and answer follow-up question(s).
   - Go deep, not broad.
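A minimal sketch of steps 2–3, assuming your data is a local CSV file (the filename and the CSV format are placeholders; your dataset will differ):

```python
import pandas as pd

# Placeholder path; swap in your own dataset (CSV, Excel, API, etc.)
df = pd.read_csv("your_dataset.csv")

# Inspect the data a bit
print(df.shape)    # number of rows and columns
print(df.dtypes)   # column types
df.head()          # first few rows
df.describe()      # summary statistics for numeric columns
```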
### Prompt
Put the following in a Markdown cell in your notebook and fill it out:
- **Dataset(s) to be used:** [link]
- **Analysis question:** [question]
- **Columns that will (likely) be used:**
- [Column 1]
- [Column 2]
- [etc]
- (If you're using multiple datasets) **Columns to be used to merge/join them:**
- [Dataset 1] [column]
- [Dataset 2] [column]
- **Hypothesis:** [hypothesis]
- **Site URL:** [the `*.readthedocs.io` URL of your live site, from the Publish section]
The question should be:
- Specific
- Objectively answerable through the data available
- Just the right amount of ambitious (non-trivial)
If you want help/feedback, don’t hesitate to ask on Ed or come to office hours.
### Tips
- Don’t overthink it; getting up through filling out the prompt shouldn’t take more than a few hours.
- Your question/hypothesis doesn’t need to be something novel; confirming something you read / heard about is fine.
- The point of the prompt is to ensure you’ve dug into the data and that your project scope is reasonable. Think of it as a guide rather than something you’re locked into.
- Even the question can bake in assumptions.
  - Example: “What ZIP code has the highest number of food poisoning cases?” assumes a relationship between food-borne illness and geography.
  - What assumptions does your question make?
### Analysis requirements
Your submission should:
- Meet the general Project information
- Not be trivial - requiring:
  - At least 40 lines of code to come to a conclusion
    - That code should be relevant to answering your question. In other words, having 40 lines of `print("hello world")` wouldn’t count.
    - If you meet all the other requirements, you will likely be well over this number.
    - You can count them automatically using a tool like tokei (a rough Python alternative is sketched as the second example after this list).
  - Transforming data through grouping, merging, and/or reshaping of DataFrames (see the first sketch after this list)
    - Operations that aren’t easily done in a spreadsheet.
- Have a visualization (chart or map) of some kind (also shown in the first sketch after this list)
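Here is a minimal sketch of the kind of transformation and visualization this asks for, using made-up DataFrames and hypothetical column names (`zip_code`, `cases`, `population`); it is an illustration, not part of the assignment:

```python
import pandas as pd
import plotly.express as px

# Made-up example data; replace with your own DataFrames
cases = pd.DataFrame({
    "zip_code": ["10001", "10001", "10002", "10003"],
    "cases": [3, 5, 2, 7],
})
population = pd.DataFrame({
    "zip_code": ["10001", "10002", "10003"],
    "population": [25000, 30000, 20000],
})

# Grouping: total cases per ZIP code
totals = cases.groupby("zip_code", as_index=False)["cases"].sum()

# Merging: join the totals with the population lookup table
merged = totals.merge(population, on="zip_code")
merged["cases_per_10k"] = merged["cases"] / merged["population"] * 10_000

# Visualization: a simple Plotly bar chart
fig = px.bar(
    merged,
    x="zip_code",
    y="cases_per_10k",
    title="Cases per 10,000 residents by ZIP code",
)
fig.show()
```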
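If you’d rather not install tokei, one rough alternative (sketched here as an assumption, not something the course requires) is to count non-blank code lines in your notebook yourself; the filename below is a placeholder:

```python
import json

# Placeholder path; point this at your own notebook file
with open("project_3.ipynb") as f:
    notebook = json.load(f)

total = 0
for cell in notebook["cells"]:
    if cell["cell_type"] != "code":
        continue
    source = cell["source"]
    # nbformat stores cell source as either a list of lines or a single string
    lines = source.splitlines() if isinstance(source, str) else source
    total += sum(1 for line in lines if line.strip())

print(f"{total} non-blank code lines")
```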
## Submission
Make sure you have the prompt filled out, including your site URL. Then, submit.
If for some reason you’re unable to submit the full Project 3 notebook, you can submit a notebook / text file / etc. that just contains a link to your site.