Logistic regression

Lecture 17

Dr. Greg Chism

University of Arizona
INFO 511 - Spring 2025


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
import statsmodels.formula.api as smf

sns.set_theme(style="whitegrid", rc={"figure.figsize": (10, 6), "axes.labelsize": 16, "xtick.labelsize": 14, "ytick.labelsize": 14})

Recap: Modeling Loans

  • What is the practical difference between a model with parallel and non-parallel lines?

  • What is the definition of R-squared?

  • Why do we choose models based on adjusted R-squared and not R-squared?

Predict interest rate…

from credit utilization and homeownership

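# Predictors: credit utilization (numeric) and homeownership (categorical)
X = loans[['credit_util', 'homeownership']]
# One-hot encode homeownership; drop_first=True drops the first level, which becomes the baseline
X = pd.get_dummies(X, drop_first=True).astype(float)
y = loans['interest_rate']

# Add an intercept column and fit by ordinary least squares
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()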
                   Results: Ordinary least squares
Model:                OLS              Adj. R-squared:     0.068     
Dependent Variable:   interest_rate    AIC:                59859.3779
Date:                 2024-08-19 13:43 BIC:                59888.2185
No. Observations:     9998             Log-Likelihood:     -29926.   
Df Model:             3                F-statistic:        243.7     
Df Residuals:         9994             Prob (F-statistic): 1.25e-152 
R-squared:            0.068            Scale:              23.309    
                       Coef.  Std.Err.    t    P>|t|   [0.025  0.975]
const                  9.9250   0.1401 70.8498 0.0000  9.6504 10.1996
credit_util            5.3356   0.2074 25.7266 0.0000  4.9291  5.7421
homeownership_Mortgage 0.6956   0.1208  5.7590 0.0000  0.4588  0.9323
homeownership_Own      0.1283   0.1552  0.8266 0.4085 -0.1760  0.4326
Omnibus:              1150.070       Durbin-Watson:          1.981   
Prob(Omnibus):        0.000          Jarque-Bera (JB):       1616.376
Skew:                 0.900          Prob(JB):               0.000   
Kurtosis:             3.800          Condition No.:          6       
[1] Standard Errors assume that the covariance matrix of the errors
is correctly specified.


  • Intercept: Loan applicants who rent and have 0 credit utilization are predicted to receive an interest rate of 9.93%, on average.


  • All else held constant, for each additional percentage point of credit utilization, interest rate is predicted to be higher, on average, by 0.0534 percentage points.

  • All else held constant, the model predicts that loan applicants who have a mortgage for their home receive an interest rate 0.696 percentage points higher than those who rent their home, on average.

  • All else held constant, the model predicts that loan applicants who own their home receive an interest rate 0.128 percentage points higher than those who rent their home, on average.
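
As a quick check on these interpretations, the fitted coefficients can be plugged in by hand. This is a minimal sketch: the function name `predict_interest_rate` and the example applicants are hypothetical, and `credit_util` is assumed to be a proportion between 0 and 1, which is what the 0.0534 interpretation implies.

```python
# Coefficients copied from the OLS output above (baseline level: Rent).
COEFS = {
    "const": 9.9250,
    "credit_util": 5.3356,
    "homeownership_Mortgage": 0.6956,
    "homeownership_Own": 0.1283,
}


def predict_interest_rate(credit_util, homeownership="Rent"):
    """Predicted interest rate (%) for a hypothetical applicant."""
    rate = COEFS["const"] + COEFS["credit_util"] * credit_util
    # Renters are the baseline, so they get no homeownership adjustment.
    if homeownership != "Rent":
        rate += COEFS[f"homeownership_{homeownership}"]
    return rate


renter = predict_interest_rate(0.0, "Rent")  # equals the intercept
mortgage = predict_interest_rate(0.5, "Mortgage")
# Change in the prediction when utilization rises by one percentage point.
one_point = predict_interest_rate(0.51, "Mortgage") - mortgage

print(f"Renter, 0% utilization:    {renter:.2f}%")
print(f"Mortgage, 50% utilization: {mortgage:.2f}%")
print(f"Effect of +1 percentage point of utilization: {one_point:.4f}")
```

The last line recovers the slope divided by 100, which is why the per-percentage-point effect is about 0.0534 rather than 5.34.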


Predict log(interest rate)

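# Single predictor: number of credit checks, plus an intercept column
X_log = loans[['credit_checks']]
X_log = sm.add_constant(X_log)
# Model the natural log of interest rate instead of the raw rate
y_log = np.log(loans['interest_rate'])

model_log = sm.OLS(y_log, X_log).fit()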


                            OLS Regression Results                            
Dep. Variable:          interest_rate   R-squared:                       0.020
Model:                            OLS   Adj. R-squared:                  0.020
Method:                 Least Squares   F-statistic:                     202.2
Date:                Mon, 19 Aug 2024   Prob (F-statistic):           1.91e-45
Time:                        13:43:48   Log-Likelihood:                -4912.6
No. Observations:                9998   AIC:                             9829.
Df Residuals:                    9996   BIC:                             9844.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
const             2.3947      0.005    467.428      0.000       2.385       2.405
credit_checks     0.0236      0.002     14.220      0.000       0.020       0.027
Omnibus:                      329.756   Durbin-Watson:                   2.002
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              152.256
Skew:                          -0.010   Prob(JB):                     8.67e-34
Kurtosis:                       2.396   Cond. No.                         4.17

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

\[ \widehat{\log(interest~rate)} = 2.39 + 0.0236 \times credit~checks \]


For each additional credit check, the log of interest rate is predicted to be higher, on average, by 0.0236.


\[ \log(interest~rate_{x+1}) - \log(interest~rate_{x}) = 0.0236 \]

\[ \log\left(\frac{interest~rate_{x+1}}{interest~rate_{x}}\right) = 0.0236 \]

\[ e^{\log\left(\frac{interest~rate_{x+1}}{interest~rate_{x}}\right)} = e^{0.0236} \]

\[ \frac{interest~rate_{x+1}}{interest~rate_{x}} \approx 1.024 \]

For each additional credit check, interest rate is predicted to be higher, on average, by a factor of 1.024.
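
The back-transformation can be checked numerically. The sketch below uses only the two coefficients printed in the output; the variable names and the example values of `credit_checks` are chosen for illustration.

```python
import numpy as np

# Coefficients copied from the log-model output above.
b0, b1 = 2.3947, 0.0236

# Exponentiating the slope gives the multiplicative change per credit check.
factor = np.exp(b1)
pct_change = (factor - 1) * 100

# Predictions on the log scale are back-transformed with exp().
checks = np.array([0, 1, 2])
pred_rate = np.exp(b0 + b1 * checks)
ratios = pred_rate[1:] / pred_rate[:-1]

print(f"Multiplicative factor: {factor:.3f}")
print(f"Percent change per additional check: {pct_change:.2f}%")
print("Ratios of successive predictions:", np.round(ratios, 3))
```

Every ratio of successive predictions equals the same factor, so the effect of one more credit check is multiplicative on the original scale rather than additive.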

Logistic regression

What is logistic regression?

  • Similar to linear regression, but…

  • It is a modeling tool for when our response is categorical

Modeling binary outcomes

  • Variables with binary outcomes follow the Bernoulli distribution:

    • \(y_i \sim Bern(p)\)

    • \(p\): Probability of success

    • \(1-p\): Probability of failure

  • We can’t model \(y\) directly, so instead we model \(p\)
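
To make the distinction between \(y\) and \(p\) concrete, here is a small simulation. It is only a sketch: the value `p = 0.3`, the seed, and the sample size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=511)  # arbitrary seed, for reproducibility

p = 0.3  # hypothetical probability of success
# A Bernoulli(p) draw is a Binomial draw with a single trial.
y = rng.binomial(n=1, p=p, size=10_000)

print("First ten outcomes:", y[:10])
print("Distinct observed values:", np.unique(y))
print(f"Sample proportion of successes: {y.mean():.3f}")
```

Each individual \(y_i\) is only ever 0 or 1, while the proportion of successes settles near \(p\). That underlying probability is the quantity the model targets.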

Linear model

\[ p_i = \beta_0 + \beta_1 \times X_{1i} + \cdots + \epsilon \]

  • But remember that \(p\) must be between 0 and 1

  • We need a link function that transforms the linear model to have an appropriate range
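
The range problem is easy to see with numbers. In the sketch below the coefficients and the values of `x` are hypothetical, picked only so that the straight line leaves the interval from 0 to 1.

```python
import numpy as np

# Hypothetical coefficients, chosen only to illustrate the range problem.
beta_0, beta_1 = 0.2, 0.15

x = np.array([-5, 0, 2, 5, 10])
p_linear = beta_0 + beta_1 * x  # "probabilities" from a plain linear model

for xi, pi in zip(x, p_linear):
    flag = "" if 0 <= pi <= 1 else "  <- not a valid probability"
    print(f"x = {xi:>3}: p = {pi:+.2f}{flag}")
```

A straight line is unbounded, so for small or large enough `x` it returns values below 0 or above 1.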

This isn’t exactly what we need, though…

  • Recall, the goal is to take values between \(-\infty\) and \(\infty\) and map them to probabilities.

  • We need the opposite of the link function… or the inverse

  • Taking the inverse of the logit (log-odds) function will map arbitrary real values back to the range (0, 1)
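
The two directions of the mapping can be written as a pair of small functions. This is a minimal sketch in NumPy; the names `logit` and `inv_logit` are our own, and the input values are arbitrary.

```python
import numpy as np


def logit(p):
    """Link function: maps a probability in (0, 1) to the real line."""
    return np.log(p / (1 - p))


def inv_logit(z):
    """Inverse link: maps any real number to a probability in (0, 1)."""
    # Algebraically the same as exp(z) / (1 + exp(z)).
    return 1 / (1 + np.exp(-z))


z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
p = inv_logit(z)

print("inverse logit of z:", np.round(p, 4))
print("logit of the result:", np.round(logit(p), 4))
```

Large negative inputs land near 0, large positive inputs land near 1, and applying `logit` to the result recovers the original values.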

Generalized linear model

  • We model the logit (log-odds) of \(p\):

\[ logit(p_i) = \log \bigg( \frac{p_i}{1 - p_i} \bigg) = \beta_0 + \beta_1 \times X_{1i} + \cdots + \beta_k \times X_{ki} \]

  • Then take the inverse to obtain the predicted \(p\):

\[ p_i = \frac{e^{\beta_0 + \beta_1 \times X_{1i} + \cdots + \beta_k \times X_{ki}}}{1 + e^{\beta_0 + \beta_1 \times X_{1i} + \cdots + \beta_k \times X_{ki}}} \]

A logistic model visualized


  • Generalized linear models allow us to fit models to predict non-continuous outcomes

  • Predicting binary outcomes requires modeling the log-odds of success, where \(p\) is the probability of success