Lab 7 - Modeling II

Lab

Introduction

In this lab you’ll practice machine learning modeling and then venture on to model validation and quantifying uncertainty.

Packages

In this lab we will work with the pandas and sklearn modules.

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.metrics import confusion_matrix, accuracy_score
import seaborn as sns
import matplotlib.pyplot as plt

Guidelines

As we’ve discussed in lecture, your plots should include an informative title, axes should be labeled, and careful consideration should be given to aesthetic choices.

Note

Remember that continuing to develop a sound workflow for reproducible data analysis is important as you complete the lab and other assignments in this course. There will be periodic reminders in this assignment to remind you to Run all, commit, and sync your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.

Additionally, if you’re using functions that are not introduced in the course materials, you must cite your sources.

Important

Failure to cite outside resources used, including Large Language Models like Chat GPT, is a violation of the University of Arizona Code of Academic Integrity and will be treated as such.

Part 1 - Building a spam filter

The data come from incoming emails in David Diez’s (one of the authors of OpenIntro textbooks) Gmail account for the first three months of 2012. All personally identifiable information has been removed. The dataset is called email and it’s in your data folder.

The outcome variable is spam, which takes the value 1 if the email is spam, 0 otherwise.

Question 1

a. Fit a logistic regression model predicting spam from exclaim_mess (the number of exclamation points in the email message). Then, display the tidy output of the model.

b. Using this model and the Python function .predict_proba(), predict the probability the email is spam if it contains 10 exclamation points.

Question 2

a. Fit another logistic regression model predicting spam from exclaim_mess, winner (indicating whether “winner” appeared in the email), and urgent_subj (whether the word “urgent” is in the subject of the email). Then, display the output of the model intercept and coefficients.

b. Using this model, predict spam / not spam for all emails in the email dataset.

c. Using your data frame from the previous part, determine, in a single pipeline, the numbers of emails:

that are labelled as spam that are actually spam
that are not labelled as spam that are actually spam
that are labelled as spam that are actually not spam
that are not labelled as spam that are actually not spam

d. In a single pipeline, calculate the false positive and false negative rates.

In addition to these numbers showing in your Python output, you must write a sentence that explicitly states and identifies the two rates.

Question 3

a. Fit another logistic regression model predicting spam from exclaim_mess and another variable you think would be a good predictor. Provide a 1-sentence justification for why you chose this variable. Display the tidy output of the model.

b. Using this model, predict spam / not spam for all emails in the email dataset.

c. Using your data frame from the previous part, determine, in a single pipeline, the numbers of emails:

that are labelled as spam that are actually spam
that are not labelled as spam that are actually spam
that are labelled as spam that are actually not spam
that are not labelled as spam that are actually not spam

d. In a single pipeline, calculate the false positive and false negative rates.

In addition to these numbers showing in your Python output, you must write a sentence that explicitly states and identifies the two rates.

e. Based on the false positive and false negatives rates of this model, comment, in 1-2 sentences, on which model is preferable and why.

Part 2 - Model validation

In this section, we will explore how to evaluate the performance of your models using k-fold cross validation. This technique helps in assessing how the results of a statistical analysis will generalize to an independent dataset.

Question 4

Explain in your own words what k-fold cross validation is and why it is useful.
Implement 5-fold cross validation on the logistic regression model from Question 1a predicting spam from exclaim_mess. Use the cross_val_score from sklearn to assist with this.
Summarize the results of your cross validation. What is the average accuracy and standard deviation across the folds?

Question 5

Implement 10-fold cross validation on the logistic regression model from Question 2a predicting spam from exclaim_mess, winner, and urgent_subj.
Summarize the results of your cross validation. Compare the average accuracy and standard deviation across the folds to the results from Question 4. What do you observe?
Based on the cross-validation results, which model do you think performs better in terms of generalizability? Justify your answer.

Question 6

Plot the accuracy scores of each fold for both the 5-fold and 10-fold cross-validation.
Describe any patterns you observe in the plot.

Part 3 - Hotel cancellations

For this exercise, we will work with hotel cancellations. The data describe the demand of two different types of hotels. Each observation represents a hotel booking between July 1, 2015 and August 31, 2017. Some bookings were cancelled (is_canceled = 1) and others were kept, i.e., the guests checked into the hotel (is_canceled = 0). You can view the code book for all variables here.

The data can be found in the data folder: hotels.csv.

Question 7

Read the data and then explore attributes of bookings and summarize your findings in 5 bullet points. You must provide a visualization or summary supporting each finding.

Note

This is not meant to be an exhaustive exploration. We anticipate a wide variety of answers to this question.

Question 8

Using these data, we will try to answer the following question:

Do we expect reservations earlier in the month or later in the month to be cancelled?

Exploration: In a single pipeline, calculate the mean arrival date (arrival_date_day_of_month) for both booking that were cancelled and that were not cancelled.
Justification: In your own words, explain why we can not fit a linear model to model the relationship between if a hotel reservation was cancelled and the day of month for the booking.
Model fitting and interpretation:
- Fit the appropriate model and display a tidy summary of the model output.
- Interpret the slope coefficient in context of the data and the research question.
Predicted: Calculate the probability that the hotel reservation is cancelled if it the arrival date date is on the 26th of the month. Based on this probability, would you predict this booking would be cancelled or not cancelled. Explain your reasoning for your classification.

Part 4 - Get creative

Question 9

Your task is to make the following plot as ugly and as ineffective as possible. Change colors, axes, fonts, theme, or anything else you can think of in the code chunk below. You can also search online for other themes, fonts, etc. that you want to tweak. Try to make it as ugly as possible, the sky is the limit!

In 2-3 sentences, explain why the plot you created is ugly (to you, at least) and ineffective.

penguins = pd.read_csv("data/penguins.csv")

sns.set(style="darkgrid")
sns.scatterplot(data=penguins, x="flipper_length_mm", y="bill_length_mm", hue="species")
plt.show()

Part 5 - Harassment at work

The General Social Survey (GSS) gathers data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes. Hundreds of trends have been tracked since 1972. In addition, since the GSS adopted questions from earlier surveys, trends can be followed for up to 70 years. The GSS contains a standard core of demographic, behavioral, and attitudinal questions, plus topics of special interest. Among the topics covered are civil liberties, crime and violence, inter-group tolerance, morality, national spending priorities, psychological well-being, social mobility, and stress and traumatic events.

The data for this part comes from the gss16 data frame (containing data from the 2016 GSS) found in your data folder.

Question 10

In 2016, the GSS added a new question on harassment at work. The question is phrased as the following.

Over the past five years, have you been harassed by your superiors or co-workers at your job, for example, have you experienced any bullying, physical or psychological abuse?

Answers to this question are stored in the harass5 variable in our data set.

Create a subset of the data that only contains Yes and No answers for the harassment question. How many respondents chose each of these answers?
Describe how bootstrapping can be used to estimate the proportion of all Americans who have been harassed by their superiors or co-workers at their job.
Calculate a 95% bootstrap confidence interval for the proportion of Americans who have been harassed by their superiors or co-workers at their job. Use 1000 iterations when creating your bootstrap distribution. Interpret this interval in context of the data.

Wrap-up

Submission

Warning

Before you wrap up the assignment, make sure all of your documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

You must turn in the .ipynb file by the submission deadline to be considered “on time”.

Checklist

Make sure you have:

attempted all questions
run all code in your Jupyter notebook
committed and pushed everything to your GitHub repository such that the Git pane in VS Code is empty

Grading

The lab is graded out of a total of 50 points.

On Questions 1 through 10, you can earn up to 5 points on each question:

5: Response shows excellent understanding and addresses all or almost all of the rubric items.
4: Response shows good understanding and addresses most of the rubric items.
3: Response shows understanding and addresses a majority of the rubric items.
2: Response shows effort and misses many of the rubric items.
1: Response does not show sufficient effort or understanding and/or is largely incomplete.
0: No attempt.