import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import seaborn as snsAE 12: Ultimate candy ranking
In this application exercise, we will:
Load the Penguins Dataset: Import and explore the dataset to understand its structure and the features available for analysis.
Preprocess the Data: Clean the data by handling missing values and standardize the numerical features for PCA.
Perform PCA: Apply Principal Component Analysis to reduce the dimensionality of the data and extract the principal components.
Visualize the PCA Result: Create a scatter plot of the principal components to visualize the clustering of different penguin species.
Examine the data
- We will use the
candy_rankings.csvdataset for this analysis.
candy_rankings = pd.read_csv('data/candy_rankings.csv')
candy_rankings.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85 entries, 0 to 84
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 competitorname 85 non-null object
1 chocolate 85 non-null bool
2 fruity 85 non-null bool
3 caramel 85 non-null bool
4 peanutyalmondy 85 non-null bool
5 nougat 85 non-null bool
6 crispedricewafer 85 non-null bool
7 hard 85 non-null bool
8 bar 85 non-null bool
9 pluribus 85 non-null bool
10 sugarpercent 85 non-null float64
11 pricepercent 85 non-null float64
12 winpercent 85 non-null float64
dtypes: bool(9), float64(3), object(1)
memory usage: 3.5+ KB
Exercises
Use the variables:
chocolate, fruity, nougat, pricepercent, sugarpercent, sugarpercent*chocolate, pricepercent*fruity
Exercise 1
Create the full model and show the \(R^2_{adj}\):
full_model = smf.ols('winpercent ~ chocolate * sugarpercent + fruity * pricepercent + nougat + pricepercent + sugarpercent', data=candy_rankings).fit()
print(f'Adjusted R-squared: {full_model.rsquared_adj}')Adjusted R-squared: 0.42530696026839643
Is the model a good fit of the data?
The model moderately fits the data (42.5% variation explained).
Exercise 2
Produce all possible models removing 1 term at a time from the full model. Describe what is being removed above each code cell.
# Blank dictionary to store new models
models = {}- Remove
chocolateand it’s associated interaction
model1 = smf.ols('winpercent ~ fruity * pricepercent + nougat + pricepercent + sugarpercent', data=candy_rankings).fit()
models['model1'] = model1.rsquared_adj- Remove
fruityand its associated interaction
model2 = smf.ols('winpercent ~ chocolate * sugarpercent + nougat + pricepercent + sugarpercent', data=candy_rankings).fit()
models['model2'] = model2.rsquared_adj- Remove
nougat
model3 = smf.ols('winpercent ~ chocolate * sugarpercent + fruity * pricepercent + pricepercent + sugarpercent', data=candy_rankings).fit()
models['model3'] = model3.rsquared_adj- Remove
pricepercentand its associated interactions
model4 = smf.ols('winpercent ~ chocolate * sugarpercent + fruity + nougat + sugarpercent', data=candy_rankings).fit()
models['model4'] = model4.rsquared_adj- Remove
sugarpercent*chocolate
model5 = smf.ols('winpercent ~ chocolate + fruity * pricepercent + nougat + pricepercent + sugarpercent', data=candy_rankings).fit()
models['model5'] = model5.rsquared_adj- Remove
pricepercent*fruity
model6 = smf.ols('winpercent ~ chocolate * sugarpercent + fruity + nougat + pricepercent + sugarpercent', data=candy_rankings).fit()
models['model6'] = model6.rsquared_adjExercise 3
Compare all models using the framework best_model_step1 = max(models, key=models.get):
best_model_step1 = max(models, key=models.get)
print(f'Best model in Exercise 2: {best_model_step1} with Adjusted R-squared: {models[best_model_step1]}')Best model in Exercise 2: model3 with Adjusted R-squared: 0.4318254523153029
- Which model is best:
Of the models from Exercises 1 and 2, the model with the highest adjusted \(R^2\) is the one with nougat removed. Therefore, we will go to Exercise 4 eliminating one variable at a time from that model.
Exercise 4
Create all possible models removing 1 term at a time from the model selected in the previous exercise. Again, describe what is being removed above each code cell.
# Blank dictionary to store new models
models = {}- Remove
chocolateand its associated interactions
model7 = smf.ols('winpercent ~ fruity * pricepercent + pricepercent + sugarpercent', data=candy_rankings).fit()
models['model7'] = model7.rsquared_adj- Remove
fruityand its associated interactions
model8 = smf.ols('winpercent ~ chocolate * sugarpercent + pricepercent + sugarpercent', data=candy_rankings).fit()
models['model8'] = model8.rsquared_adj- Remove
pricepercentand its associated interactions
model9 = smf.ols('winpercent ~ chocolate * sugarpercent + fruity + sugarpercent', data=candy_rankings).fit()
models['model9'] = model9.rsquared_adj- Remove
sugarpercent*chocolate
model10 = smf.ols('winpercent ~ chocolate + fruity * pricepercent + pricepercent + sugarpercent', data=candy_rankings).fit()
models['model10'] = model10.rsquared_adj- Remove
pricepercent*fruity
model11 = smf.ols('winpercent ~ chocolate * sugarpercent + fruity + pricepercent + sugarpercent', data=candy_rankings).fit()
models['model11'] = model11.rsquared_adjExercise 5
Compare all models using the framework best_model_step2 = max(models, key=models.get):
best_model_step2 = max(models, key=models.get)
print(f'Best model in Exercise 4: {best_model_step2} with Adjusted R-squared: {models[best_model_step2]}')Best model in Exercise 4: model10 with Adjusted R-squared: 0.43488989967424574
- Which model is best:
The model with sugarpercent*chocolate has the highest \(R^2\) of all the models we’ve tested so far, so we will now go to Exercise 6 eliminating one variable at a time from this model.
Exercise 6
Create all possible models removing 1 term at a time from the model selected in the previous step. Again, describe what is being removed above each code cell.
# Blank dictionary to store new models
models = {}Remove chocolate
model12 = smf.ols('winpercent ~ fruity * pricepercent + pricepercent + sugarpercent', data=candy_rankings).fit()
models['model12'] = model12.rsquared_adj- Remove
fruityand its associated interactions
model13 = smf.ols('winpercent ~ chocolate * sugarpercent + pricepercent + sugarpercent', data=candy_rankings).fit()
models['model13'] = model13.rsquared_adj- Remove
sugarpercent
model14 = smf.ols('winpercent ~ chocolate + fruity * pricepercent + pricepercent + sugarpercent', data=candy_rankings).fit()
models['model14'] = model14.rsquared_adj- Remove
pricepercentand its associated interactions
model15 = smf.ols('winpercent ~ chocolate + fruity + sugarpercent', data=candy_rankings).fit()
models['model15'] = model15.rsquared_adj- Remove
pricepercent*fruity
model16 = smf.ols('winpercent ~ chocolate + fruity + sugarpercent + pricepercent', data=candy_rankings).fit()
models['model16'] = model16.rsquared_adjExercise 7
Compare all models using the framework best_model_step3 = max(models, key=models.get):
best_model_step3 = max(models, key=models.get)
print(f'Best model in Exercise 6: {best_model_step3} with Adjusted R-squared: {models[best_model_step3]}')Best model in Exercise 6: model14 with Adjusted R-squared: 0.43488989967424574
- Which model is best:
None of the models in Exercise 6 resulted in a higher adjusted \(R^2_{adj}\). Therefore our final model is the one selected in Exercise 5.
selected_model = smf.ols('winpercent ~ chocolate + fruity + sugarpercent + pricepercent + pricepercent*fruity', data=candy_rankings).fit()
print(selected_model.summary())
coefficients = selected_model.params
print(coefficients) OLS Regression Results
==============================================================================
Dep. Variable: winpercent R-squared: 0.469
Model: OLS Adj. R-squared: 0.435
Method: Least Squares F-statistic: 13.93
Date: Thu, 25 Sep 2025 Prob (F-statistic): 9.38e-10
Time: 13:13:23 Log-Likelihood: -321.79
No. Observations: 85 AIC: 655.6
Df Residuals: 79 BIC: 670.2
Df Model: 5
Covariance Type: nonrobust
===============================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------
Intercept 31.4753 4.271 7.369 0.000 22.974 39.977
chocolate[T.True] 20.7720 3.923 5.295 0.000 12.964 28.580
fruity[T.True] 11.8091 5.130 2.302 0.024 1.598 22.020
sugarpercent 8.1016 4.560 1.777 0.079 -0.974 17.177
pricepercent 6.8970 6.770 1.019 0.311 -6.579 20.373
pricepercent:fruity[T.True] -17.4218 9.943 -1.752 0.084 -37.213 2.369
==============================================================================
Omnibus: 0.714 Durbin-Watson: 1.634
Prob(Omnibus): 0.700 Jarque-Bera (JB): 0.808
Skew: 0.201 Prob(JB): 0.668
Kurtosis: 2.744 Cond. No. 13.8
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Intercept 31.475334
chocolate[T.True] 20.772023
fruity[T.True] 11.809087
sugarpercent 8.101581
pricepercent 6.897010
pricepercent:fruity[T.True] -17.421838
dtype: float64