AE 12: Ultimate candy ranking

Application exercise
Answers

In this application exercise, we will:

  1. Load the Penguins Dataset: Import and explore the dataset to understand its structure and the features available for analysis.

  2. Preprocess the Data: Clean the data by handling missing values and standardize the numerical features for PCA.

  3. Perform PCA: Apply Principal Component Analysis to reduce the dimensionality of the data and extract the principal components.

  4. Visualize the PCA Result: Create a scatter plot of the principal components to visualize the clustering of different penguin species.

import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import seaborn as sns

Examine the data

  • We will use the candy_rankings.csv dataset for this analysis.
candy_rankings = pd.read_csv('data/candy_rankings.csv')
candy_rankings.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85 entries, 0 to 84
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   competitorname    85 non-null     object 
 1   chocolate         85 non-null     bool   
 2   fruity            85 non-null     bool   
 3   caramel           85 non-null     bool   
 4   peanutyalmondy    85 non-null     bool   
 5   nougat            85 non-null     bool   
 6   crispedricewafer  85 non-null     bool   
 7   hard              85 non-null     bool   
 8   bar               85 non-null     bool   
 9   pluribus          85 non-null     bool   
 10  sugarpercent      85 non-null     float64
 11  pricepercent      85 non-null     float64
 12  winpercent        85 non-null     float64
dtypes: bool(9), float64(3), object(1)
memory usage: 3.5+ KB

Exercises

Use the variables:

chocolate, fruity, nougat, pricepercent, sugarpercent, sugarpercent*chocolate, pricepercent*fruity

Exercise 1

Create the full model and show the \(R^2_{adj}\):

full_model = smf.ols('winpercent ~ chocolate * sugarpercent + fruity * pricepercent + nougat + pricepercent + sugarpercent', data=candy_rankings).fit()
print(f'Adjusted R-squared: {full_model.rsquared_adj}')
Adjusted R-squared: 0.42530696026839643

Is the model a good fit of the data?

The model moderately fits the data (42.5% variation explained).

Exercise 2

Produce all possible models removing 1 term at a time from the full model. Describe what is being removed above each code cell.

# Blank dictionary to store new models
models = {}
  • Remove chocolate and it’s associated interaction
model1 = smf.ols('winpercent ~ fruity * pricepercent + nougat + pricepercent + sugarpercent', data=candy_rankings).fit()
models['model1'] = model1.rsquared_adj
  • Remove fruity and its associated interaction
model2 = smf.ols('winpercent ~ chocolate * sugarpercent + nougat + pricepercent + sugarpercent', data=candy_rankings).fit()
models['model2'] = model2.rsquared_adj
  • Remove nougat
model3 = smf.ols('winpercent ~ chocolate * sugarpercent + fruity * pricepercent + pricepercent + sugarpercent', data=candy_rankings).fit()
models['model3'] = model3.rsquared_adj
  • Remove pricepercent and its associated interactions
model4 = smf.ols('winpercent ~ chocolate * sugarpercent + fruity + nougat + sugarpercent', data=candy_rankings).fit()
models['model4'] = model4.rsquared_adj
  • Remove sugarpercent*chocolate
model5 = smf.ols('winpercent ~ chocolate + fruity * pricepercent + nougat + pricepercent + sugarpercent', data=candy_rankings).fit()
models['model5'] = model5.rsquared_adj
  • Remove pricepercent*fruity
model6 = smf.ols('winpercent ~ chocolate * sugarpercent + fruity + nougat + pricepercent + sugarpercent', data=candy_rankings).fit()
models['model6'] = model6.rsquared_adj

Exercise 3

Compare all models using the framework best_model_step1 = max(models, key=models.get):

best_model_step1 = max(models, key=models.get)
print(f'Best model in Exercise 2: {best_model_step1} with Adjusted R-squared: {models[best_model_step1]}')
Best model in Exercise 2: model3 with Adjusted R-squared: 0.4318254523153029
  • Which model is best:

Of the models from Exercises 1 and 2, the model with the highest adjusted \(R^2\) is the one with nougat removed. Therefore, we will go to Exercise 4 eliminating one variable at a time from that model.

Exercise 4

Create all possible models removing 1 term at a time from the model selected in the previous exercise. Again, describe what is being removed above each code cell.

# Blank dictionary to store new models
models = {}
  • Remove chocolate and its associated interactions
model7 = smf.ols('winpercent ~ fruity * pricepercent + pricepercent + sugarpercent', data=candy_rankings).fit()
models['model7'] = model7.rsquared_adj
  • Remove fruity and its associated interactions
model8 = smf.ols('winpercent ~ chocolate * sugarpercent + pricepercent + sugarpercent', data=candy_rankings).fit()
models['model8'] = model8.rsquared_adj
  • Remove pricepercent and its associated interactions
model9 = smf.ols('winpercent ~ chocolate * sugarpercent + fruity + sugarpercent', data=candy_rankings).fit()
models['model9'] = model9.rsquared_adj
  • Remove sugarpercent*chocolate
model10 = smf.ols('winpercent ~ chocolate + fruity * pricepercent + pricepercent + sugarpercent', data=candy_rankings).fit()
models['model10'] = model10.rsquared_adj
  • Remove pricepercent*fruity
model11 = smf.ols('winpercent ~ chocolate * sugarpercent + fruity + pricepercent + sugarpercent', data=candy_rankings).fit()
models['model11'] = model11.rsquared_adj

Exercise 5

Compare all models using the framework best_model_step2 = max(models, key=models.get):

best_model_step2 = max(models, key=models.get)
print(f'Best model in Exercise 4: {best_model_step2} with Adjusted R-squared: {models[best_model_step2]}')
Best model in Exercise 4: model10 with Adjusted R-squared: 0.43488989967424574
  • Which model is best:

The model with sugarpercent*chocolate has the highest \(R^2\) of all the models we’ve tested so far, so we will now go to Exercise 6 eliminating one variable at a time from this model.

Exercise 6

Create all possible models removing 1 term at a time from the model selected in the previous step. Again, describe what is being removed above each code cell.

# Blank dictionary to store new models
models = {}
  • Remove chocolate
model12 = smf.ols('winpercent ~ fruity * pricepercent + pricepercent + sugarpercent', data=candy_rankings).fit()
models['model12'] = model12.rsquared_adj
  • Remove fruity and its associated interactions
model13 = smf.ols('winpercent ~ chocolate * sugarpercent + pricepercent + sugarpercent', data=candy_rankings).fit()
models['model13'] = model13.rsquared_adj
  • Remove sugarpercent
model14 = smf.ols('winpercent ~ chocolate + fruity * pricepercent + pricepercent + sugarpercent', data=candy_rankings).fit()
models['model14'] = model14.rsquared_adj
  • Remove pricepercent and its associated interactions
model15 = smf.ols('winpercent ~ chocolate + fruity + sugarpercent', data=candy_rankings).fit()
models['model15'] = model15.rsquared_adj
  • Remove pricepercent*fruity
model16 = smf.ols('winpercent ~ chocolate + fruity + sugarpercent + pricepercent', data=candy_rankings).fit()
models['model16'] = model16.rsquared_adj

Exercise 7

Compare all models using the framework best_model_step3 = max(models, key=models.get):

best_model_step3 = max(models, key=models.get)
print(f'Best model in Exercise 6: {best_model_step3} with Adjusted R-squared: {models[best_model_step3]}')
Best model in Exercise 6: model14 with Adjusted R-squared: 0.43488989967424574
  • Which model is best:

None of the models in Exercise 6 resulted in a higher adjusted \(R^2_{adj}\). Therefore our final model is the one selected in Exercise 5.

selected_model = smf.ols('winpercent ~ chocolate + fruity + sugarpercent + pricepercent + pricepercent*fruity', data=candy_rankings).fit()
print(selected_model.summary())

coefficients = selected_model.params
print(coefficients)
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             winpercent   R-squared:                       0.469
Model:                            OLS   Adj. R-squared:                  0.435
Method:                 Least Squares   F-statistic:                     13.93
Date:                Mon, 19 Aug 2024   Prob (F-statistic):           9.38e-10
Time:                        13:38:21   Log-Likelihood:                -321.79
No. Observations:                  85   AIC:                             655.6
Df Residuals:                      79   BIC:                             670.2
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
===============================================================================================
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
Intercept                      31.4753      4.271      7.369      0.000      22.974      39.977
chocolate[T.True]              20.7720      3.923      5.295      0.000      12.964      28.580
fruity[T.True]                 11.8091      5.130      2.302      0.024       1.598      22.020
sugarpercent                    8.1016      4.560      1.777      0.079      -0.974      17.177
pricepercent                    6.8970      6.770      1.019      0.311      -6.579      20.373
pricepercent:fruity[T.True]   -17.4218      9.943     -1.752      0.084     -37.213       2.369
==============================================================================
Omnibus:                        0.714   Durbin-Watson:                   1.634
Prob(Omnibus):                  0.700   Jarque-Bera (JB):                0.808
Skew:                           0.201   Prob(JB):                        0.668
Kurtosis:                       2.744   Cond. No.                         13.8
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Intercept                      31.475334
chocolate[T.True]              20.772023
fruity[T.True]                 11.809087
sugarpercent                    8.101581
pricepercent                    6.897010
pricepercent:fruity[T.True]   -17.421838
dtype: float64