Project 1

In this project, you will:

Steps

  1. Read the general Project information.

  2. Find a dataset.

    • It must have:

      • At least one numeric column

      • Between one thousand and one million rows

        • If it’s larger than that, you can filter it down.

    • Don’t spend too long on this step.

  3. If there’s more than one numeric column, pick one.

  4. Create a new notebook.

  5. Using pandas:

    1. Read in the data.

    2. Compute:

      • The mean

      • The median

      • The mode

  6. Repeat the previous step using only the Python standard library, a.k.a. the hard way.

  7. Create a data visualization, following the instructions below.

  8. Submit.
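The pandas step (step 5) might look roughly like the sketch below. It is only an illustration under stated assumptions: the column name `count`, the `MAX_ROWS` constant, and the inline CSV text are hypothetical stand-ins for your own dataset, and the values are toy numbers rather than real data.

```python
import io

import pandas as pd

# Toy CSV text standing in for a real file (made-up values).
# With a real dataset, pass a file path or URL to read_csv() instead.
csv_text = """year,count
2014,3162
2015,4985
2016,4091
2017,4985
"""

MAX_ROWS = 1_000_000  # hypothetical cap matching the dataset size limit

df = pd.read_csv(io.StringIO(csv_text))

# If the dataset is too large, one option is to keep a random sample of rows.
if len(df) > MAX_ROWS:
    df = df.sample(n=MAX_ROWS, random_state=0)

column = df["count"]

mean = column.mean()
median = column.median()
modes = column.mode()  # a Series, because several values can tie for most common

print("mean:", mean)
print("median:", median)
print("mode(s):", modes.tolist())
```

Note that `mode()` returns every value tied for the highest count, so decide how you want to report ties.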

The hard way

  • You may not use pandas, the statistics module, a spreadsheet program, etc.

    • You should be using the same dataset from the first step, but not accessing the DataFrame/Series.

      • In other words, if you put the code for this step in a totally separate notebook, it should still work.

  • You should be calculating the mean, median, and mode yourself, not using functions with those names (or equivalent).

    • Hint: Use a dictionary to keep track of value counts.
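One possible shape for the standard-library version is sketched below; treat it as a starting point rather than the expected solution. The column name `count`, the helper names, and the inline CSV text (toy values) are assumptions made for illustration. The mode calculation follows the hint above by counting values in a dictionary.

```python
import csv
import io

# Toy CSV text standing in for a real file (made-up values).
# With a real dataset, use open("your_file.csv", newline="") instead.
csv_text = """year,count
2014,3162
2015,4985
2016,4091
2017,4985
"""


def read_column(file_obj, name):
    """Return the named column as a list of floats, skipping blank cells."""
    values = []
    for row in csv.DictReader(file_obj):
        cell = (row[name] or "").strip()
        if cell:
            values.append(float(cell))
    return values


def mean(values):
    """Add everything up, then divide by how many values there are."""
    total = 0
    for value in values:
        total += value
    return total / len(values)


def median(values):
    """Middle value of the sorted list, or the average of the two middle values."""
    ordered = sorted(values)
    middle = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


def modes(values):
    """Most common value(s), found by counting occurrences in a dictionary."""
    counts = {}
    for value in values:
        counts[value] = counts.get(value, 0) + 1
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]


values = read_column(io.StringIO(csv_text), "count")
print("mean:", mean(values))
print("median:", median(values))
print("mode(s):", modes(values))
```

Comparing these results against the pandas results is a quick way to check both versions.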

Data visualization

Requirements:

  • The data/calculations can come through pandas, but the drawing code should only use the Python standard library.

    • In other words, don’t use plot(), plotly, or any other external packages.

  • The visualization should be visual, using shape, size, symbols, etc. to represent the values. Printing the numbers as they are isn’t sufficient.

  • Follow the general project requirements.

We’ll talk about data visualization in more detail in week 10, but you aren’t expected to know any of that to complete this project.

Example

Data that looks like this:

Rat sightings

Year    Count
2014    3,162
2015    4,985
2016    4,091

could be turned into a sparkline that looks like this:

Rat sightings, in thousands

2014: ***
2015: *****
2016: ****

Please don’t print 3,162 asterisks (*) 😉
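As a rough sketch, the chart above could be produced with nothing but string repetition. The `bar_chart` function and its `unit` and `symbol` parameters are hypothetical names chosen for this illustration; the counts are the ones from the example data.

```python
# Counts from the example data, keyed by year.
sightings = {2014: 3162, 2015: 4985, 2016: 4091}


def bar_chart(data, unit=1000, symbol="*"):
    """Build one line per entry, drawing one symbol per `unit` (rounded)."""
    lines = []
    for label, value in data.items():
        bar = symbol * round(value / unit)
        lines.append(f"{label}: {bar}")
    return "\n".join(lines)


print("Rat sightings, in thousands")
print()
print(bar_chart(sightings))
```

Dividing by `unit` before repeating the symbol is what keeps the bars short: each asterisk stands for roughly one thousand sightings.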

Tips

  • Start simple.

    • Start with the example above, get that working, then go from there.

    • Use only one or two columns of your dataset.

    • print()ing strings will probably be easiest, but you can get fancy and generate HTML if you want.

  • Making your chart vertical (one data point per line) will probably be easier than doing something horizontal.

  • Techniques that may be helpful:

  • Python strings can contain Unicode, including emoji 📈✨
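To illustrate a couple of these tips together, here is a minimal sketch that scales each bar relative to the largest value, so the chart fits a fixed width regardless of the size of the numbers, and draws with a Unicode block character. The `scaled_bars` function and its `width` and `symbol` parameters are hypothetical names for this illustration, and the counts are reused from the example data.

```python
# Counts reused from the example data, keyed by year.
sightings = {2014: 3162, 2015: 4985, 2016: 4091}


def scaled_bars(data, width=40, symbol="█"):
    """Return one line per entry, with the largest value drawn `width` symbols wide."""
    largest = max(data.values())
    label_width = max(len(str(label)) for label in data)
    lines = []
    for label, value in data.items():
        # Scale relative to the largest value; guard against dividing by zero.
        length = round(value / largest * width) if largest > 0 else 0
        lines.append(f"{str(label):>{label_width}} {symbol * length} {value:,}")
    return lines


for line in scaled_bars(sightings):
    print(line)
```

Appending the actual number after each bar keeps the chart readable while still letting the bar lengths carry the comparison.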

Rubric