import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import seaborn as sns
AE 12: Ultimate candy ranking
In this application exercise, we will:
Load the Penguins Dataset: Import and explore the dataset to understand its structure and the features available for analysis.
Preprocess the Data: Clean the data by handling missing values and standardize the numerical features for PCA.
Perform PCA: Apply Principal Component Analysis to reduce the dimensionality of the data and extract the principal components.
Visualize the PCA Result: Create a scatter plot of the principal components to visualize the clustering of different penguin species.
Examine the data
- We will use the
candy_rankings.csv
dataset for this analysis.
= pd.read_csv('data/candy_rankings.csv')
candy_rankings candy_rankings.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85 entries, 0 to 84
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 competitorname 85 non-null object
1 chocolate 85 non-null bool
2 fruity 85 non-null bool
3 caramel 85 non-null bool
4 peanutyalmondy 85 non-null bool
5 nougat 85 non-null bool
6 crispedricewafer 85 non-null bool
7 hard 85 non-null bool
8 bar 85 non-null bool
9 pluribus 85 non-null bool
10 sugarpercent 85 non-null float64
11 pricepercent 85 non-null float64
12 winpercent 85 non-null float64
dtypes: bool(9), float64(3), object(1)
memory usage: 3.5+ KB
Exercises
Use the variables:
chocolate
, fruity
, nougat
, pricepercent
, sugarpercent
, sugarpercent*chocolate
, pricepercent*fruity
Exercise 1
Create the full model and show the \(R^2_{adj}\):
= smf.ols('winpercent ~ chocolate * sugarpercent + fruity * pricepercent + nougat + pricepercent + sugarpercent', data=candy_rankings).fit()
full_model print(f'Adjusted R-squared: {full_model.rsquared_adj}')
Adjusted R-squared: 0.42530696026839643
Is the model a good fit of the data?
The model moderately fits the data (42.5% variation explained).
Exercise 2
Produce all possible models removing 1 term at a time from the full model. Describe what is being removed above each code cell.
# Blank dictionary to store new models
= {} models
- Remove
chocolate
and it’s associated interaction
= smf.ols('winpercent ~ fruity * pricepercent + nougat + pricepercent + sugarpercent', data=candy_rankings).fit()
model1 'model1'] = model1.rsquared_adj models[
- Remove
fruity
and its associated interaction
= smf.ols('winpercent ~ chocolate * sugarpercent + nougat + pricepercent + sugarpercent', data=candy_rankings).fit()
model2 'model2'] = model2.rsquared_adj models[
- Remove
nougat
= smf.ols('winpercent ~ chocolate * sugarpercent + fruity * pricepercent + pricepercent + sugarpercent', data=candy_rankings).fit()
model3 'model3'] = model3.rsquared_adj models[
- Remove
pricepercent
and its associated interactions
= smf.ols('winpercent ~ chocolate * sugarpercent + fruity + nougat + sugarpercent', data=candy_rankings).fit()
model4 'model4'] = model4.rsquared_adj models[
- Remove
sugarpercent*chocolate
= smf.ols('winpercent ~ chocolate + fruity * pricepercent + nougat + pricepercent + sugarpercent', data=candy_rankings).fit()
model5 'model5'] = model5.rsquared_adj models[
- Remove
pricepercent*fruity
= smf.ols('winpercent ~ chocolate * sugarpercent + fruity + nougat + pricepercent + sugarpercent', data=candy_rankings).fit()
model6 'model6'] = model6.rsquared_adj models[
Exercise 3
Compare all models using the framework best_model_step1 = max(models, key=models.get)
:
= max(models, key=models.get)
best_model_step1 print(f'Best model in Exercise 2: {best_model_step1} with Adjusted R-squared: {models[best_model_step1]}')
Best model in Exercise 2: model3 with Adjusted R-squared: 0.4318254523153029
- Which model is best:
Of the models from Exercises 1 and 2, the model with the highest adjusted \(R^2\) is the one with nougat
removed. Therefore, we will go to Exercise 4 eliminating one variable at a time from that model.
Exercise 4
Create all possible models removing 1 term at a time from the model selected in the previous exercise. Again, describe what is being removed above each code cell.
# Blank dictionary to store new models
= {} models
- Remove
chocolate
and its associated interactions
= smf.ols('winpercent ~ fruity * pricepercent + pricepercent + sugarpercent', data=candy_rankings).fit()
model7 'model7'] = model7.rsquared_adj models[
- Remove
fruity
and its associated interactions
= smf.ols('winpercent ~ chocolate * sugarpercent + pricepercent + sugarpercent', data=candy_rankings).fit()
model8 'model8'] = model8.rsquared_adj models[
- Remove
pricepercent
and its associated interactions
= smf.ols('winpercent ~ chocolate * sugarpercent + fruity + sugarpercent', data=candy_rankings).fit()
model9 'model9'] = model9.rsquared_adj models[
- Remove
sugarpercent*chocolate
= smf.ols('winpercent ~ chocolate + fruity * pricepercent + pricepercent + sugarpercent', data=candy_rankings).fit()
model10 'model10'] = model10.rsquared_adj models[
- Remove
pricepercent*fruity
= smf.ols('winpercent ~ chocolate * sugarpercent + fruity + pricepercent + sugarpercent', data=candy_rankings).fit()
model11 'model11'] = model11.rsquared_adj models[
Exercise 5
Compare all models using the framework best_model_step2 = max(models, key=models.get)
:
= max(models, key=models.get)
best_model_step2 print(f'Best model in Exercise 4: {best_model_step2} with Adjusted R-squared: {models[best_model_step2]}')
Best model in Exercise 4: model10 with Adjusted R-squared: 0.43488989967424574
- Which model is best:
The model with sugarpercent*chocolate
has the highest \(R^2\) of all the models we’ve tested so far, so we will now go to Exercise 6 eliminating one variable at a time from this model.
Exercise 6
Create all possible models removing 1 term at a time from the model selected in the previous step. Again, describe what is being removed above each code cell.
# Blank dictionary to store new models
= {} models
Remove chocolate
= smf.ols('winpercent ~ fruity * pricepercent + pricepercent + sugarpercent', data=candy_rankings).fit()
model12 'model12'] = model12.rsquared_adj models[
- Remove
fruity
and its associated interactions
= smf.ols('winpercent ~ chocolate * sugarpercent + pricepercent + sugarpercent', data=candy_rankings).fit()
model13 'model13'] = model13.rsquared_adj models[
- Remove
sugarpercent
= smf.ols('winpercent ~ chocolate + fruity * pricepercent + pricepercent + sugarpercent', data=candy_rankings).fit()
model14 'model14'] = model14.rsquared_adj models[
- Remove
pricepercent
and its associated interactions
= smf.ols('winpercent ~ chocolate + fruity + sugarpercent', data=candy_rankings).fit()
model15 'model15'] = model15.rsquared_adj models[
- Remove
pricepercent*fruity
= smf.ols('winpercent ~ chocolate + fruity + sugarpercent + pricepercent', data=candy_rankings).fit()
model16 'model16'] = model16.rsquared_adj models[
Exercise 7
Compare all models using the framework best_model_step3 = max(models, key=models.get)
:
= max(models, key=models.get)
best_model_step3 print(f'Best model in Exercise 6: {best_model_step3} with Adjusted R-squared: {models[best_model_step3]}')
Best model in Exercise 6: model14 with Adjusted R-squared: 0.43488989967424574
- Which model is best:
None of the models in Exercise 6 resulted in a higher adjusted \(R^2_{adj}\). Therefore our final model is the one selected in Exercise 5.
= smf.ols('winpercent ~ chocolate + fruity + sugarpercent + pricepercent + pricepercent*fruity', data=candy_rankings).fit()
selected_model print(selected_model.summary())
= selected_model.params
coefficients print(coefficients)
OLS Regression Results
==============================================================================
Dep. Variable: winpercent R-squared: 0.469
Model: OLS Adj. R-squared: 0.435
Method: Least Squares F-statistic: 13.93
Date: Mon, 19 Aug 2024 Prob (F-statistic): 9.38e-10
Time: 13:38:21 Log-Likelihood: -321.79
No. Observations: 85 AIC: 655.6
Df Residuals: 79 BIC: 670.2
Df Model: 5
Covariance Type: nonrobust
===============================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------
Intercept 31.4753 4.271 7.369 0.000 22.974 39.977
chocolate[T.True] 20.7720 3.923 5.295 0.000 12.964 28.580
fruity[T.True] 11.8091 5.130 2.302 0.024 1.598 22.020
sugarpercent 8.1016 4.560 1.777 0.079 -0.974 17.177
pricepercent 6.8970 6.770 1.019 0.311 -6.579 20.373
pricepercent:fruity[T.True] -17.4218 9.943 -1.752 0.084 -37.213 2.369
==============================================================================
Omnibus: 0.714 Durbin-Watson: 1.634
Prob(Omnibus): 0.700 Jarque-Bera (JB): 0.808
Skew: 0.201 Prob(JB): 0.668
Kurtosis: 2.744 Cond. No. 13.8
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Intercept 31.475334
chocolate[T.True] 20.772023
fruity[T.True] 11.809087
sugarpercent 8.101581
pricepercent 6.897010
pricepercent:fruity[T.True] -17.421838
dtype: float64