Lecture 15: Introduction to the Policy context#
Aidan Feldman
Computing in Context (SIPA)
Structure for today#
Intro
Going over course info like the syllabus, tools, etc.
Rewind: Programming languages, data, and Jupyter
About me#
Coding since 2005 🖥
Government since 2014 🦅
Teaching since 2011 🎓
Also a modern dancer 💃 and cyclist 🚲
Day jobs#
Currently freelancing with the Colorado Behavioral Health Administration. In the past, have worked for…
Government#
Non-profits#
Tech companies#
Intros#
Name
Pronouns
Why you’re taking this class / what you want to do with it
The more specific, the better.
Access the course site#
You can also get there through CourseWorks.
Class structure#
Class materials walkthrough#
New context-specific stuff:
Files
Disclaimers#
Me#
Here to teach you to:
Do a lot with just a little code
Troubleshoot
Google stuff
Not a statistician
You#
Are not going to understand everything the first time
Will want to throw your computer out a window at one or many points in the class
Celebrate the little victories
Will get out of it what you put into it
Politics/protests/war#
⏪ Restart#
Spreadsheets vs. programming languages#
What do you like about spreadsheets?
Why spreadsheets#
The easy stuff is easy
Lots of people know how to use them
Mostly just have to point, click, and scroll
Data and logic live together as one
Why programming languages#
Data and logic don’t live together
Why might this matter?
More powerful, flexible, and expressive than spreadsheet formulas; don’t have to cram into a single line
=SUM(INDEX(C3:E9,MATCH(B13,C3:C9,0),MATCH(B14,C3:E3,0)))
Better at working with large data
Excel is capped at 1,048,576 rows per sheet and Google Sheets at 10 million cells per file, but both get slow long before that
Reusable code (packages)
Automation
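The spreadsheet formula above finds the cell where one row label and one column label meet. A rough pandas equivalent can be spread over several readable lines. This is only a sketch: the data and the names `scores`, `row_label`, and `column_label` are made up for illustration.

```python
import pandas as pd

# Hypothetical data standing in for the spreadsheet range C3:E9
scores = pd.DataFrame(
    {"2022": [10, 20, 30], "2023": [15, 25, 35]},
    index=["North", "South", "West"],
)

# These play the role of cells B13 and B14 in the formula
row_label = "South"
column_label = "2023"

# Look up the value where that row and that column meet
value = scores.loc[row_label, column_label]
print(value)
```

Each step gets its own line and its own name, so a mistake is easier to track down than inside a nested formula.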
Side-by-side [1]#
Task | Spreadsheets | Programming Languages |
---|---|---|
Loading data | Easy | Medium |
Viewing data | Easy | Medium |
Filtering data | Easy | Medium |
Manipulating data | Medium | Medium |
Joining data | Hard | Medium |
Complicated transforms | Impossible [2] | Medium |
Automation | Impossible [2] | Medium |
Making reusable | Limited [3] | Medium |
Large datasets | Impossible | Hard |
[1] These ratings are obviously subjective
[2] Not counting scripting, which includes Excel’s new Python+pandas support
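As one illustration of the "Joining data" row, here is a minimal pandas sketch of joining two tables on a shared column. The tables, column names, and values are all invented for the example.

```python
import pandas as pd

# Hypothetical tables that share a "county" column
population = pd.DataFrame(
    {"county": ["Alpha", "Beta", "Gamma"], "population": [500, 300, 700]}
)
clinics = pd.DataFrame(
    {"county": ["Alpha", "Gamma", "Gamma"], "clinic": ["North", "Central", "East"]}
)

# Attach the population to every clinic row, matching on county.
# how="left" keeps every row of `clinics`, even one with no match.
joined = clinics.merge(population, on="county", how="left")
print(joined)
```

In a spreadsheet, the same task usually means a lookup formula copied down a whole column.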
Python vs. other languages#
Good for general-purpose and data stuff
Widely used in both industry and academia
Relatively easy to learn
Open source
Where to Python#
Python can be run in:
A text file, using the python command
An integrated development environment (IDE) like Spyder or PyCharm
Jupyter notebooks
Various other tools are built around them
What we’ll be using for this class
Each can be on your computer (“local”), or in the cloud somewhere. All call python under the hood, more or less.
Packages#
a.k.a. “libraries”
Developers have created them to make code/functionality reusable and easily shareable
Software plugins that you import
Main packages we’ll use:
pandas
plotly
A module is a file containing Python definitions and statements.
https://docs.python.org/3/tutorial/modules.html
A module can be your own code, part of the standard library, or part of a package.
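A short sketch of what importing looks like in practice. It uses only standard-library modules so it runs anywhere. Third-party packages such as pandas follow the same pattern once installed (conventionally `import pandas as pd`).

```python
# Import a whole module from the standard library
import math

# Import a single name from a module
from statistics import mean

# Give a module a shorter alias, the same pattern as `import pandas as pd`
import datetime as dt

root = math.sqrt(16)
average = mean([1, 2, 3, 4])
year = dt.date(2024, 1, 15).year

print(root, average, year)
```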
Pandas#
Review from Lab 7
A Python package (bundled up code that you can reuse)
Very common for data science in Python
-
Both organize around “data frames”
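For a feel of what a data frame is, here is a minimal sketch with made-up data. The names `complaints`, `agency`, and `days_to_close` are hypothetical.

```python
import pandas as pd

# Made-up data: each dictionary key becomes a column
complaints = pd.DataFrame(
    {
        "agency": ["Parks", "Transit", "Parks", "Housing"],
        "days_to_close": [3, 10, 5, 7],
    }
)

# Keep only some rows, like a spreadsheet filter
parks_only = complaints[complaints["agency"] == "Parks"]

# Summarize a column
average_days = complaints["days_to_close"].mean()

print(parks_only)
print(average_days)
```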
Jupyter#
Web based programming environment
Supports Python by default, and other languages with added kernels
Nicely displays output of your code so you can check and share the results
Avoids using the command line
We’ll be using JupyterLab through the Anaconda Distribution.
Command line vs. Jupyter#
Jupyter basics#
A “cell” can be either code or Markdown (text). Raw Markdown looks like this:
## A heading
Plain text
[A link](https://somewhere.com)
Running#
You “run” a cell by either:
Pressing the ▶️ button
Pressing Control+Enter on your keyboard
Cells only run when you tell them to, in the order you run them (not necessarily top to bottom)
Generally, you want to do so from the top every time you open a notebook
Output#
The last thing in a code cell is what gets displayed when it’s run
The output gets saved as part of the notebook
Existing output from a cell doesn’t mean that cell has been run during this session
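To make the display rule concrete, imagine the block below is a single Jupyter code cell. The variable name is arbitrary.

```python
# Imagine this whole block is one Jupyter code cell.
subtotal = 2 + 3      # an assignment displays nothing
print("working...")   # print() shows its text wherever it appears
subtotal * 10         # the last line is an expression, so Jupyter displays 50
```

Run as a plain script with the python command, the last line would show nothing. Displaying the final expression is something the notebook does for you.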
Some pandas/Jupyter best practices#
Make variable names descriptive
Ignore that all examples use requests
Only do one thing per line
Makes troubleshooting easier
Make notebooks idempotent
Makes your work reproducible
Use Restart and run all (⏩ button in toolbar)
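A small sketch of what idempotent means for a notebook cell: running it once or five times gives the same result. The names and numbers here are invented for illustration.

```python
# Not idempotent: this overwrites the variable it reads from,
# so every rerun of the cell adds the increase again.
#   budgets = [budget + increase for budget in budgets]

# Idempotent: the inputs are never overwritten,
# so rerunning the cell always gives the same result.
original_budgets = [100, 200, 300]
increase = 50

increased_budgets = [budget + increase for budget in original_budgets]
print(increased_budgets)
```

The same snippet also follows the other two practices: descriptive variable names and one step per line.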