Linear regression with multiple predictors

Lecture 16

Dr. Greg Chism

University of Arizona
INFO 511 - Spring 2025

Model selection and overfitting

R-squared (\(R^2\))

R-squared is a statistical measure that represents the proportion of the variance for a dependent variable that’s explained by an independent variable or variables in a regression model.

\[ R^2 = 1 - \frac{RSS}{TSS} \]

\(R^2\) broken down

  • Residuals are the differences between the observed values and the predicted values from a regression model.

  • If \(y_i\) is an observed value and \(\hat{y}_i\) is the predicted value, the residual \(e_i\) is given by:

    \(e_i = y_i - \hat{y}_i\)

  • The mean \(\bar{y}\) is the average of all observed values, calculated as

    \(\bar{y}=\frac{1}{n} \sum_{i=1}^{n}y_i\)

  • Where \(n\) is the number of observations.

  • These are measures of variability within the data set.

  • Residual Sum of Squares (RSS):

    \(RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\)

  • This measures the total deviation of the predicted values from the observed values.
  • Total Sum of Squares (TSS):

    \(TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2\)

  • This measures the total deviation of the observed values from their mean.

  • R-squared is calculated as:

    \(R^2 = 1 - \frac{RSS}{TSS}\)

  • This value ranges from 0 to 1 and indicates how well the independent variables explain the variability of the dependent variable.

Adjusted R-squared (\(R^2_{adj}\))

\[ R^2_{adj} = 1 - \frac{RSS / df_{res}}{TSS / df_{tot}} \]

  • \(df_{res}\) represents the degrees of freedom of the residuals, which is the number of observations minus the number of predictors minus one.

  • \(df_{tot}\)​ represents the degrees of freedom of the total variability, which is the number of observations minus one.

  • Penalizes Complexity: Adjusted R-squared decreases when unnecessary predictors are added to the model, discouraging overfitting.

  • Comparability: It is more reliable than R-squared for comparing models with different numbers of predictors.

  • Value Range: Unlike R-squared, adjusted R-squared can be negative if the model is worse than a simple mean model, though it typically ranges from 0 to 1.

  • \(df_{res}\): Degrees of freedom related to the estimate of the population variance around the model’s predictions.

  • \(df_{tot}\)​: Degrees of freedom related to the estimate of the population variance around the mean of the observed values.

In pursuit of Occam’s Razor

  • Occam’s Razor states that among competing hypotheses that predict equally well, the one with the fewest assumptions should be selected.

  • Model selection follows this principle.

  • We only want to add another variable to the model if the addition of that variable brings something valuable in terms of predictive power to the model.

  • In other words, we prefer the simplest best model, i.e. parsimonious model.