Meet the toolkit

Lecture 1

Dr. Greg Chism

University of Arizona
INFO 511 - Fall 2024

Announcements

  • If you have not yet completed the Getting to know you survey, please do so asap!

  • If you have not yet accepted the invite to join the course GitHub Organization, please do so asap!

  • Office hours linked at https://datasciaz.netlify.app/course-team.html

From last time…

Course homepage

Let’s take a tour!

Collaboration policy

  • Only work that is clearly assigned as team work should be completed collaboratively.

  • Homework must be completed individually. You may not directly share answers or code with others; however, you are welcome to discuss the problems in general and ask for advice.

  • Exams must be completed individually. You may not discuss any aspect of the exam with peers. If you have questions, post them as private questions on the course forum, where only the teaching team will see and answer them.

Sharing / reusing code policy

  • We are aware that a huge volume of code is available on the web, and many tasks may have solutions posted

  • Unless explicitly stated otherwise, this course’s policy is that you may make use of any online resources (e.g. RStudio Community, StackOverflow, etc.) but you must explicitly cite where you obtained any code you directly use or use as inspiration in your solution(s).

  • Any recycled code that is discovered and is not explicitly cited will be treated as plagiarism, regardless of source

Use of generative AI

  • Treat generative AI, such as ChatGPT, as an online resource.

  • Guiding principles:

    • (1) Cognitive dimension: Working with AI should not reduce your ability to think clearly. We will practice using AI to facilitate—rather than hinder—learning.

    • (2) Ethical dimension: Students using AI should be transparent about their use and make sure it aligns with academic integrity.

  • ✅ AI tools for code: You may make use of the technology for coding examples on assignments; if you do so, you must explicitly cite where you obtained the code.

  • ❌ AI tools for narrative: Unless instructed otherwise, you may not use generative AI to write narrative on assignments. You may use generative AI as a resource as you complete assignments, but not to produce your answers.

Academic integrity

To uphold the UArizona iSchool Community Standard:

  • I will not lie, cheat, or steal in my academic endeavors;
  • I will conduct myself honorably in all my endeavors; and
  • I will act if the Standard is compromised.

Most importantly!

Ask if you’re not sure if something violates a policy!

Five tips for success

  1. Complete all the preparation work before class.

  2. Ask questions.

  3. Do the readings.

  4. Do the lab.

  5. Don’t procrastinate – at least on a weekly basis!

Course toolkit

Course operation

Doing data science

  • Computing:
    • Python
    • VS Code
    • Quarto/Jupyter
  • Version control and collaboration:
    • Git
    • GitHub

Toolkit: Computing

Learning goals

By the end of the course, you will be able to…

  • gain insight from data
  • gain insight from data, reproducibly
  • gain insight from data, reproducibly, using modern programming tools and techniques
  • gain insight from data, reproducibly and collaboratively, using modern programming tools and techniques
  • gain insight from data, reproducibly (with literate programming and version control) and collaboratively, using modern programming tools and techniques

Reproducible data analysis

Reproducibility checklist

What does it mean for a data analysis to be “reproducible”?

Short-term goals:

  • Are the tables and figures reproducible from the code and data?
  • Does the code actually do what you think it does?
  • In addition to what was done, is it clear why it was done?

Long-term goals:

  • Can the code be used for other data?
  • Can you extend the code to do other things?

Toolkit for reproducibility

  • Scriptability \(\rightarrow\) Python
  • Literate programming (code, narrative, output in one place) \(\rightarrow\) Quarto/Jupyter
  • Version control \(\rightarrow\) Git / GitHub

Python and Jupyter

  • Python is an open-source general purpose programming language
  • With its scientific packages, Python is also an environment for statistical computing and graphics
  • It’s easily extensible with packages

  • Jupyter is a convenient notebook interface for Python; in this course we use it inside VS Code, an IDE (integrated development environment), e.g. “I write Python code in a Jupyter notebook in VS Code”
  • Jupyter is not a requirement for programming with Python, but it’s very commonly used by Python programmers and data scientists

Python vs. Jupyter

Figure: on the left, a car engine; on the right, a car dashboard. The figure labels the engine R and the dashboard RStudio; by analogy here, Python is the engine and Jupyter is the dashboard.

Python packages

  • Packages: Fundamental units of reproducible Python code, including reusable Python modules/functions, the documentation that describes how to use them, and sample data

  • As of 23 July 2024, there are 557,005 Python packages (projects) available on PyPI (the Python Package Index)

  • We’re going to work with a small (but important) subset of these!

Tour: Python + Jupyter (via VS Code)

Option 1:

Sit back and enjoy the show!

Option 2:

Clone the corresponding application exercise repo and follow along.

ae-01-meet-the-penguins

Go to the course GitHub organization and clone ae-01-meet-the-penguins to your environment.

Tour recap: Python + Jupyter (via VS Code)

A short list (for now) of Python essentials

  • Functions are (most often) verbs, followed by what they will be applied to in parentheses:
to_this.do_this()
to_that.do_that(to_this, with_those)
  • Packages are installed with the pip install command (run in the terminal, not in Python)…
pip install package_name
  • … and loaded with the import statement, once per session (usually with a shorthand “nickname”):
import package_name as pkg
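
The patterns above can be sketched with a short example. It is only an illustration: it uses the standard-library `statistics` module so that nothing needs to be installed first (a third-party package would need `pip install` in the terminal beforehand), and the variable names and values are made up.

```python
# Load a module once per session, with a shorthand nickname.
import statistics as stats

# Made-up values, for illustration only.
body_masses = [3750, 3800, 3250, 3450]

# Function call: the verb comes first, what it acts on goes in parentheses.
average = stats.mean(body_masses)

# Method call: the object comes first, then a dot, then the verb.
body_masses.append(3650)
label = "adelie".upper()

print(average)
print(body_masses)
print(label)
```

The nickname (`stats` here) is a free choice, but most packages have a conventional one, such as `pd` for pandas.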

Python essentials (continued)

  • Columns (variables) in data frames are accessed with ['']:
dataframe['var_name']
  • Object documentation can be accessed with help()
help(pd.Series.mean)
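
As a minimal sketch of both points, the example below builds a tiny data frame and pulls out one column. It assumes pandas is installed; the data frame name, column names, and values are hypothetical and not taken from the course data.

```python
import pandas as pd

# A tiny made-up data frame; names and values are illustrative only.
penguins = pd.DataFrame(
    {
        "species": ["Adelie", "Gentoo", "Chinstrap"],
        "body_mass_g": [3750, 5000, 3700],
    }
)

# Square brackets with a quoted column name return that column as a Series.
body_mass = penguins["body_mass_g"]

# Series methods such as mean() then summarize the column.
mean_mass = body_mass.mean()
print(mean_mass)

# help() prints the documentation for an object, here the Series.mean method.
help(pd.Series.mean)
```

Misspelling the column name raises a `KeyError`, so checking `penguins.columns` first is a quick way to see which names are available.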

pandas

pandas.pydata.org

  • pandas is a core Python package for data analysis and manipulation

Jupyter Notebooks

  • Fully reproducible reports – each time you run the full notebook, the analysis is run from the beginning
  • Code goes in code cells; narrative goes in markdown cells
  • A visual editor for a familiar, Google Docs-like editing experience
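
Under the hood, a notebook file (`.ipynb`) is JSON that stores code and narrative as separate entries in one document. The sketch below builds a simplified notebook by hand with the standard library to show that layout. Real notebooks carry more metadata, and in practice the editor writes this file for you; the cell contents here are made up.

```python
import json

# A simplified notebook in the version 4 layout: metadata plus a list of cells.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            # Narrative lives in a markdown entry.
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Meet the penguins\n", "Narrative goes here."],
        },
        {
            # Code lives in a code entry, which also stores its outputs.
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["print('Hello, penguins!')"],
        },
    ],
}

# Serialize to the text that would be saved in an .ipynb file.
ipynb_text = json.dumps(notebook, indent=1)

# Read it back and list the entry types in order.
cell_types = [cell["cell_type"] for cell in json.loads(ipynb_text)["cells"]]
print(cell_types)
```

Because the file is plain text, it can be tracked with Git like any other source file.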

Tour: Jupyter Notebooks

Option 1:

Sit back and enjoy the show!

Option 2:

Clone the corresponding application exercise repo and follow along.

ae-01-meet-the-penguins

Go to the course GitHub organization and clone ae-01-meet-the-penguins to your environment.

Tour recap: Jupyter Notebooks

RStudio IDE with a Quarto document, source code on the left and output on the right. Annotated to show the YAML, a link, a header, and a code chunk.

How will we use Jupyter Notebooks?

  • Every application exercise, lab, project, etc. is a Jupyter notebook
    • However, projects will be built with Quarto Websites (more later)
  • You’ll always have a template Jupyter notebook to start with
  • The amount of scaffolding in the template will decrease over the semester