Lab 5 - Probability and Simpson’s Paradox

Lab
Important

This is a Team Lab, meaning you can and should work with your project team on it.

Introduction

In this lab you’ll review and get practice with a variety of concepts, methods, and tools you’ve encountered thus far, with a focus on misrepresentation and ethics.

Learning objectives

By the end of the lab, you will…

  • Calculate single event, conditional, and “and” probabilities.
  • Interpret probabilities in the context of the problem.
  • Display a fundamental understanding of Simpson’s Paradox.
  • Practice teamwork and collaboration on GitHub.

Guidelines

As we’ve discussed in lecture, your plots should include an informative title, axes should be labeled, and careful consideration should be given to aesthetic choices.

Note

Remember that continuing to develop a sound workflow for reproducible data analysis is important as you complete the lab and other assignments in this course. There will be periodic reminders in this assignment to remind you to Run all, commit, and sync your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.

Getting started

All team members should clone the team GitHub repository for the lab. Then, one team member should edit the document Markdown cell by adding the team name to the Team name field and adding the names of the team members contributing to lab to the Authors field. Hopefully that’s everyone, but if someone doesn’t contribute during the lab session or throughout the week before the deadline, their name should not be added. If you have 4 members in your team, you can delete the line for the 5th team member. Then, this team member should render the document and commit and push the changes. All others should not touch the document at this stage.

# Lab 5 - Probability and Simpson’s Paradox
### Team name
### Authors: 
  - "Team member 1"
  - "Team member 2"
  - "Team member 3"
  - "Team member 4"
  - "Team member 5"
### Date: ADD DATE HERE

Questions

Important

Pick another member of the team write the answer to Part 1. All others should contribute to the discussion but only one person should type up the answer, render the document, commit, and push to GitHub. All others should not touch the document.

Part 1 - Probability and you

Question 1

We use probabilities all the time when making decisions. As a group, provide two real world examples of when you’ve used probability to make decisions in your every day life. Think critically. Be creative.

Important

After the team member working on Part 1 renders, commits, and pushes, another team member should pull their changes and render the document. Then, they should write the answer to Part 2. All others should contribute to the discussion but only one person should type up the answer, render the document, commit, and push to GitHub. All others should not touch the document.

Part 2 - Risk of coronary heart disease

This data set is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. We want to examine the relationship between various health characteristics and the risk of having heart disease.

Question 2

Load in the data set called education-disease and answer the following questions below.

Question 3

How many levels of education are there in these data? How many levels of disease are there? Hint: The unique() function might be helpful.

Question 4

Convert the data to a two-way table where each cell is the number of people falling into each combination of Disease and Education. Hint: Use groupby and pivot_table. Your answer should be a 4x3 data frame with counts in each cell.

# add code here.

Using the summary table you created above, answer the remaining questions. You do not have to use Python functions for your calculations, you can use Python as a calculator using the values from the summary table. Make sure to show your work, i.e., instead of reporting just the final answer, use Python to calculate that in a way we can see the counts you’ve used along the way.

Question 5

What is the probability of a random individual having high school or GED education and not being high risk for cardiovascular disease?

Question 6

What is the probability that a random individual who is already high risk for cardiovascular disease has a college education?

Important

After the team member working on Part 2 renders, commits, and pushes, another team member should pull their changes and render the document. Then, they should write the answer to Part 3. All others should contribute to the discussion but only one person should type up the answer, render the document, commit, and push to GitHub. All others should not touch the document.

Part 3 - Bike rentals

Bike sharing systems are new generation of traditional bike rentals where whole process from membership, rental and return back has become automatic. You are tasked to investigate the relationship between the temperature outside and the number of bikes rented in the Washington DC area between the years 2011 and 2022. You will be investigating data for the months June, July, September, and November.

Below is a list of variables and their definitions:

Variable Definition
season Numerical representation of Spring (2), Summer (3), and Fall (4)
year Numerical representation of 2011 (0) or 2012 (1)
month Month in which data were collected
holiday Indicator variable for whether data were collected on a holiday (1) or not (0)
weekday Numerical representation of day of week
temp Temperature in Celsius
count Number of bike rentals for that day

Question 7

  1. Read in the bike data. Then, create a scatter plot that investigates the relationship between the number of bikes rented and the temperature outside. Include a straight line of best fit to help discuss the discovered relationship. Summarize your findings in 2-3 sentences.

  2. Another researcher suggests to look at the relationship between bikes rented and temperature by each of the four months of interest. Recreate your plot in part a, and color the points by month. Include a straight line for each of the four months to help discuss each month’s relationship between bikes rented and temperature. In 3-4 sentences, summarize your findings.

Please watch the following video on Simpson’s Paradox here. After you do, please answer the following questions.

Question 8

In your own words, summarize Simpson’s Paradox in 2-3 sentences.

Question 9

Compare and contrast your findings from part (a) and part (b). What’s different?

Question 10

Think critically about your answer to part d. What other context from this study could be creating this paradox? That is, identify a potential confounding

Closing an issue with a commit

Go to your GitHub repository, you will see an issue with the title “Learn to close an issue with a commit”.

Your goal is to close this issue with a commit to practice this workflow, which is the workflow you will use when you are addressing feedback on your projects.

  • Go to the relevant section in your lab .ipynb file.
  • Delete the sentence that says “Delete me”.
  • Render the document.
  • Commit your changes from the git tab with the commit message “Delete sentence, closes #1.”
  • Push your changes to your repo and observe that the issue is now closed and the commit associated with this move is linked from the issue.

GitHub allows you to close an issue directly with commits if the commit uses one of the following keywords followed bu the issue number (which you can find next to the issue title): close, closes, closed, fix, fixes, fixed, resolve, resolves, and resolved.

Wrap-up

Submission

Warning

Before you wrap up the assignment, make sure all of your documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

You must turn in the .ipynb file by the submission deadline to be considered “on time”.

Checklist

Make sure you have:

  • attempted all questions
  • run all code in your Jupyter notebook
  • committed and pushed everything to your GitHub repository such that the Git pane in VS Code is empty

Grading

The lab is graded out of a total of 50 points.

On Questions 1 through 10, you can earn up to 5 points on each question:

  • 5: Response shows excellent understanding and addresses all or almost all of the rubric items.

  • 4: Response shows good understanding and addresses most of the rubric items.

  • 3: Response shows understanding and addresses a majority of the rubric items.

  • 2: Response shows effort and misses many of the rubric items.

  • 1: Response does not show sufficient effort or understanding and/or is largely incomplete.

  • 0: No attempt.