Prediction + uncertainty

Lecture 18

Dr. Greg Chism

University of Arizona
INFO 511 - Fall 2024

Setup

import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="whitegrid", rc={"figure.figsize": (8, 6), "axes.labelsize": 16, "xtick.labelsize": 14, "ytick.labelsize": 14})

Model selection

Data: Candy Rankings

candy_rankings = pd.read_csv("data/candy_rankings.csv")

candy_rankings.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85 entries, 0 to 84
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   competitorname    85 non-null     object 
 1   chocolate         85 non-null     bool   
 2   fruity            85 non-null     bool   
 3   caramel           85 non-null     bool   
 4   peanutyalmondy    85 non-null     bool   
 5   nougat            85 non-null     bool   
 6   crispedricewafer  85 non-null     bool   
 7   hard              85 non-null     bool   
 8   bar               85 non-null     bool   
 9   pluribus          85 non-null     bool   
 10  sugarpercent      85 non-null     float64
 11  pricepercent      85 non-null     float64
 12  winpercent        85 non-null     float64
dtypes: bool(9), float64(3), object(1)
memory usage: 3.5+ KB

Full model

What percent of the variability in win percentages is explained by the model?

full_model = smf.ols('winpercent ~ chocolate + fruity + caramel + peanutyalmondy + nougat + crispedricewafer + hard + bar + pluribus + sugarpercent + pricepercent', data=candy_rankings).fit()
print(f'R-squared: {full_model.rsquared.round(3)}')
print(f'Adjusted R-squared: {full_model.rsquared_adj.round(3)}')
R-squared: 0.54
Adjusted R-squared: 0.471
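The gap between the two values comes from the adjustment for model size: adjusted \(R^2\) shrinks \(R^2\) by a factor that depends on the number of predictors. A minimal sketch of that relationship, using synthetic data as a stand-in for the candy rankings:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (the candy data isn't bundled here)
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=85), "x2": rng.normal(size=85)})
df["y"] = 2 * df["x1"] + rng.normal(size=85)

model = smf.ols("y ~ x1 + x2", data=df).fit()

# Adjusted R-squared penalizes for the number of predictors p:
#   adj R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
n, p = model.nobs, model.df_model
adj = 1 - (1 - model.rsquared) * (n - 1) / (n - p - 1)
print(adj, model.rsquared_adj)  # the two agree
```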

Akaike Information Criterion

\[ AIC = -2\log(L) + 2k \]

  • \(L\): likelihood of the model
    • Likelihood of seeing these data given the estimated model parameters
    • Won’t go into calculating it in this course (but you will in future courses)
  • Used for model selection, lower the better
    • Value is not informative on its own
  • Applies a penalty for number of parameters in the model, \(k\)
    • Different penalty than adjusted \(R^2\) but similar idea
print(f'AIC: {full_model.aic}')
AIC: 655.2701107463729
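The number reported by `.aic` is exactly the formula above. A quick sketch verifying this on synthetic data (for OLS, statsmodels counts \(k\) as the number of estimated mean parameters, i.e. the slopes plus the intercept):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data as a stand-in for candy_rankings
rng = np.random.default_rng(1)
df = pd.DataFrame({"x": rng.normal(size=50)})
df["y"] = 3 * df["x"] + rng.normal(size=50)

model = smf.ols("y ~ x", data=df).fit()

# k = df_model + 1: one slope plus the intercept
k = model.df_model + 1
aic_by_hand = -2 * model.llf + 2 * k
print(aic_by_hand, model.aic)  # the two match
```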

Model selection – a little faster

selected_model = smf.ols('winpercent ~ chocolate + fruity + peanutyalmondy + crispedricewafer + hard + sugarpercent', data=candy_rankings).fit()
print(selected_model.summary2())
                     Results: Ordinary least squares
=========================================================================
Model:                  OLS                Adj. R-squared:       0.492   
Dependent Variable:     winpercent         AIC:                  647.5113
Date:                   2024-08-19 13:45   BIC:                  664.6099
No. Observations:       85                 Log-Likelihood:       -316.76 
Df Model:               6                  F-statistic:          14.54   
Df Residuals:           78                 Prob (F-statistic):   4.62e-11
R-squared:              0.528              Scale:                110.07  
-------------------------------------------------------------------------
                          Coef.  Std.Err.    t    P>|t|   [0.025   0.975]
-------------------------------------------------------------------------
Intercept                32.9406   3.5175  9.3647 0.0000  25.9377 39.9434
chocolate[T.True]        19.1470   3.5870  5.3379 0.0000  12.0059 26.2882
fruity[T.True]            8.8815   3.5606  2.4944 0.0147   1.7929 15.9701
peanutyalmondy[T.True]    9.4829   3.4464  2.7516 0.0074   2.6217 16.3440
crispedricewafer[T.True]  8.3851   4.4843  1.8699 0.0653  -0.5425 17.3127
hard[T.True]             -5.6693   3.2889 -1.7238 0.0887 -12.2170  0.8784
sugarpercent              7.9789   4.1289  1.9325 0.0569  -0.2410 16.1989
-------------------------------------------------------------------------
Omnibus:                 0.545           Durbin-Watson:             1.735
Prob(Omnibus):           0.761           Jarque-Bera (JB):          0.682
Skew:                    -0.093          Prob(JB):                  0.711
Kurtosis:                2.602           Condition No.:             6    
=========================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is
correctly specified.

Selected variables

variable          selected
chocolate         x
fruity            x
caramel
peanutyalmondy    x
nougat
crispedricewafer  x
hard              x
bar
pluribus
sugarpercent      x
pricepercent

Coefficient interpretation

Interpret the slopes of chocolate and sugarpercent in context of the data.

Intercept                   32.940573
chocolate[T.True]           19.147029
fruity[T.True]               8.881496
peanutyalmondy[T.True]       9.482858
crispedricewafer[T.True]     8.385138
hard[T.True]                -5.669297
sugarpercent                 7.978930
dtype: float64
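For an indicator like `chocolate`, the slope is the difference in predicted win percentage between two candies that are identical except for that one attribute. A small sketch of this, on synthetic data with hypothetical coefficients (not the fitted candy model):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: a boolean predictor and a continuous one
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "chocolate": rng.random(60) < 0.5,
    "sugarpercent": rng.random(60),
})
df["winpercent"] = 40 + 15 * df["chocolate"] + 8 * df["sugarpercent"] + rng.normal(size=60)

model = smf.ols("winpercent ~ chocolate + sugarpercent", data=df).fit()

# Two identical candies except for chocolate
pair = pd.DataFrame({"chocolate": [True, False], "sugarpercent": [0.5, 0.5]})
preds = model.predict(pair)

# The difference in predictions equals the chocolate slope: chocolate
# candies are predicted to win that many more percentage points,
# holding all other variables constant
print(preds[0] - preds[1], model.params["chocolate[T.True]"])
```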

AIC

As expected, the selected model has a smaller AIC than the full model. In fact, the selected model has the minimum AIC of all possible main effects models.

print(f'AIC: {full_model.aic}')
AIC: 655.2701107463729
print(f'AIC: {selected_model.aic}')
AIC: 647.5113366053722
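The "minimum AIC of all possible main effects models" claim can be checked by brute force: fit every subset of the candidate predictors and keep the model with the smallest AIC. A sketch of that search on synthetic data with four illustrative predictors (with the candy data's 11 predictors there would be \(2^{11} = 2048\) models to fit):

```python
from itertools import combinations

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: only a and b actually matter
rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(size=(85, 4)), columns=["a", "b", "c", "d"])
df["y"] = 2 * df["a"] - df["b"] + rng.normal(size=85)

predictors = ["a", "b", "c", "d"]
best_formula, best_aic = None, np.inf

# Fit every main-effects model (all subsets of predictors) and keep the
# one with the smallest AIC; the intercept-only model is 'y ~ 1'
for r in range(len(predictors) + 1):
    for subset in combinations(predictors, r):
        formula = "y ~ " + (" + ".join(subset) if subset else "1")
        aic = smf.ols(formula, data=df).fit().aic
        if aic < best_aic:
            best_formula, best_aic = formula, aic

print(best_formula, round(best_aic, 2))
```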

Parsimony

Look at the variables in the full and the selected model. Can you guess why some of them may have been dropped? Remember: We like parsimonious models.

variable          selected
chocolate         x
fruity            x
caramel
peanutyalmondy    x
nougat
crispedricewafer  x
hard              x
bar
pluribus
sugarpercent      x
pricepercent

Prediction

New observation

To make a prediction for a new observation we need to create a data frame with that observation.

Suppose we want to make a prediction for a candy that contains chocolate, isn’t fruity, has no peanuts or almonds, has a wafer, isn’t hard, and has a sugar content in the 20th percentile.


The following will result in an incorrect prediction. Why? How would you correct it?

new_candy = pd.DataFrame({'chocolate': [1], 'fruity': [0], 'peanutyalmondy': [0], 'crispedricewafer': [1], 'hard': [0], 'sugarpercent': [20]})

New observation, corrected

new_candy = pd.DataFrame({'chocolate': [1], 'fruity': [0], 'peanutyalmondy': [0], 'crispedricewafer': [1], 'hard': [0], 'sugarpercent': [0.20]})
new_candy
chocolate fruity peanutyalmondy crispedricewafer hard sugarpercent
0 1 0 0 1 0 0.2

Prediction

prediction = selected_model.predict(new_candy)
print(f'Prediction: {prediction}')
Prediction: 0    62.068526
dtype: float64
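Under the hood, `predict` just evaluates the fitted equation at the new row: intercept plus each coefficient times the corresponding value. A minimal sketch on synthetic data (variable names here are illustrative, not the candy model):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with two continuous predictors
rng = np.random.default_rng(4)
df = pd.DataFrame({"x1": rng.random(40), "x2": rng.random(40)})
df["y"] = 1 + 2 * df["x1"] + 3 * df["x2"] + rng.normal(size=40)

model = smf.ols("y ~ x1 + x2", data=df).fit()
new = pd.DataFrame({"x1": [0.2], "x2": [0.8]})

# predict() evaluates: intercept + b1 * x1 + b2 * x2
b = model.params
by_hand = b["Intercept"] + b["x1"] * 0.2 + b["x2"] * 0.8
print(by_hand, model.predict(new)[0])  # the two agree
```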

Uncertainty around prediction

  • Confidence interval around \(\hat{y}\) for new data (average win percentage for candy types with the given characteristics):
confidence_interval = selected_model.get_prediction(new_candy).conf_int()
print(f'Confidence Interval: {confidence_interval}')
Confidence Interval: [[53.65186346 70.48518939]]
  • Prediction interval around \(\hat{y}\) for new data (predicted win percentage for an individual candy with the given characteristics):
prediction_interval = selected_model.get_prediction(new_candy).summary_frame(alpha=0.05)
print(f'Prediction Interval: {prediction_interval}')
Prediction Interval:
        mean   mean_se  mean_ci_lower  mean_ci_upper  obs_ci_lower  obs_ci_upper
0  62.068526  4.227679      53.651863      70.485189     39.549428     84.587625
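The prediction interval (`obs_ci_*`) is always wider than the confidence interval (`mean_ci_*`): it adds the variability of an individual observation on top of the uncertainty in the estimated mean. A sketch of that comparison on synthetic data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data as a stand-in for the candy model
rng = np.random.default_rng(5)
df = pd.DataFrame({"x": rng.random(60)})
df["y"] = 10 + 5 * df["x"] + rng.normal(size=60)

model = smf.ols("y ~ x", data=df).fit()
frame = model.get_prediction(pd.DataFrame({"x": [0.5]})).summary_frame(alpha=0.05)

# Width of the confidence interval for the mean vs. the prediction
# interval for a single new observation
ci_width = frame["mean_ci_upper"][0] - frame["mean_ci_lower"][0]
pi_width = frame["obs_ci_upper"][0] - frame["obs_ci_lower"][0]
print(ci_width, pi_width)  # the prediction interval is wider
```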