import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import seaborn as sns
AE 12: Ultimate candy ranking
In this application exercise, we will:
Use backward elimination to do model selection. Make sure to show each step of the decision process (though you don’t have to interpret the models at each stage).
Yes, this is tedious. And yes, there are ways of automating it. But for now, go through the process “manually”, to get a good sense of how the model selection algorithm works.
Provide interpretations for the slopes for the final model you arrive at and create at least one visualization that supports your narrative.
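The backward elimination procedure described above can be sketched as a single round: fit the full model, refit it once for each term left out, and compare adjusted R-squared values. The block below is a minimal illustration on synthetic data, not the candy data. The column names x1, x2, and noise and the helper adjusted_r2 are hypothetical, and the fit uses plain NumPy least squares so the adjusted R-squared formula is visible. In the exercises, the same quantity is available as the rsquared_adj attribute of a fitted statsmodels OLS model.

```python
import numpy as np

def adjusted_r2(X, y):
    """Fit OLS with an intercept and return the adjusted R-squared."""
    n, p = X.shape  # p counts predictors, excluding the intercept
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Synthetic data (assumption for illustration): y depends on x1 and x2 only
rng = np.random.default_rng(0)
n = 100
data = {
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "noise": rng.normal(size=n),
}
y = 2 * data["x1"] - 3 * data["x2"] + rng.normal(size=n)

terms = list(data)
full_score = adjusted_r2(np.column_stack([data[t] for t in terms]), y)

# One round of backward elimination: refit with each term left out in turn
models = {}
for dropped in terms:
    kept = [t for t in terms if t != dropped]
    X_reduced = np.column_stack([data[t] for t in kept])
    models[f"drop {dropped}"] = adjusted_r2(X_reduced, y)

best = max(models, key=models.get)
print(f"Full model: {full_score:.3f}")
for name, score in models.items():
    print(f"{name}: {score:.3f}")
print(f"Best reduced model: {best}")
```

Under the usual adjusted R-squared criterion, the best reduced model replaces the current model only if its score is higher than the current model's; otherwise the procedure stops. Each later round repeats the same loop starting from the model just selected.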
Examine the data
- We will use the candy_rankings.csv dataset for this analysis.
candy_rankings = pd.read_csv('data/candy_rankings.csv')
candy_rankings.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85 entries, 0 to 84
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 competitorname 85 non-null object
1 chocolate 85 non-null bool
2 fruity 85 non-null bool
3 caramel 85 non-null bool
4 peanutyalmondy 85 non-null bool
5 nougat 85 non-null bool
6 crispedricewafer 85 non-null bool
7 hard 85 non-null bool
8 bar 85 non-null bool
9 pluribus 85 non-null bool
10 sugarpercent 85 non-null float64
11 pricepercent 85 non-null float64
12 winpercent 85 non-null float64
dtypes: bool(9), float64(3), object(1)
memory usage: 3.5+ KB
Exercises
Use the variables: chocolate, fruity, nougat, pricepercent, sugarpercent, sugarpercent*chocolate, pricepercent*fruity
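As a hedged illustration of how main effects and an interaction can be written in a statsmodels formula, the sketch below fits a small model on synthetic data and reads off the adjusted R-squared. The column names x, flag, and y are hypothetical and the data are simulated, so this is not the exercise solution. In the formula syntax, a:b adds only the interaction term, while a*b expands to a + b + a:b.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with hypothetical column names (not the candy data)
rng = np.random.default_rng(1)
n = 80
x = rng.uniform(size=n)
flag = rng.integers(0, 2, size=n)  # 0/1 indicator
y = 50 + 10 * x + 5 * flag + 8 * x * flag + rng.normal(scale=5, size=n)

df = pd.DataFrame({"x": x, "flag": flag.astype(bool), "y": y})

# Main effects written out explicitly, interaction added with ":"
fit = smf.ols("y ~ x + flag + x:flag", data=df).fit()

print(fit.params)  # one coefficient per term, plus the intercept
print(f"Adjusted R-squared: {fit.rsquared_adj:.3f}")
```

Writing the main effects and the interaction as separate terms makes it easy to drop one term at a time later, which is harder to do when the terms are bundled with the * shorthand.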
Exercise 1
Create the full model and show the \(R^2_{adj}\):
# add code here
Is the model a good fit of the data?
Add response here.
Exercise 2
Produce all possible models removing 1 term at a time from the full model. Describe what is being removed above each code cell.
# Blank dictionary to store new models
models = {}
- Add what is being removed here.
# add code here
- Add what is being removed here.
# add code here
- Add what is being removed here.
# add code here
- Add what is being removed here.
# add code here
- Add what is being removed here.
# add code here
- Add what is being removed here.
# add code here
Exercise 3
Compare all models using the framework below (use the same framework in the later exercises):
best_model_step1 = max(models, key=models.get)
print(f'Best model in Exercise 2: {best_model_step1} with Adjusted R-squared: {models[best_model_step1]}')
- Which model is best:
Add response here.
Exercise 4
Create all possible models removing 1 term at a time from the model selected in the previous exercise. Again, describe what is being removed above each code cell.
# Blank dictionary to store new models
models = {}
- Add what is being removed here.
# add code here
- Add what is being removed here.
# add code here
- Add what is being removed here.
# add code here
- Add what is being removed here.
# add code here
- Add what is being removed here.
# add code here
Exercise 5
Compare all models using the framework best_model_step2 = max(models, key=models.get):
# add code here
- Which model is best:
Add response here.
Exercise 6
Create all possible models removing 1 term at a time from the model selected in the previous step. Again, describe what is being removed above each code cell.
# Blank dictionary to store new models
models = {}
- Add what is being removed here.
# add code here
- Add what is being removed here.
# add code here
- Add what is being removed here.
# add code here
- Add what is being removed here.
# add code here
- Add what is being removed here.
# add code here
Exercise 7
Compare all models using the framework best_model_step3 = max(models, key=models.get):
# add code here
- Which model is best:
Add response here.
- Show the final model summary and coefficients:
# add code here