Logistic Regression in Python and R

What Now If It’s Not Linear?
R
python
data
analysis
test
Published

February 2, 2023

What is Logistic Regression?

Instead of predicting continuous values like Linear Regression, Logistic Regression predicts the probability that an outcome belongs to a specific category. Note that the default assumption is binomial, meaning the outcome is one of two classes. You can also model more than two classes, but that requires a different package which we won’t be discussing today.
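
As a quick refresher on the standard form (my notation, not tied to either package below), the model pushes a linear combination of the predictors through the logistic function so the output lands between 0 and 1:

$$
P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
$$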

This is a simple and common way to solve Classification problems which we’ll be looking at next.

Logistic Regression in Python

Like most modeling problems, we’ll be falling back to the sklearn package. We’ll initialize a LogisticRegression() object, drop the columns we’re ignoring, split the data, and then fit the model:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

# I got this from an online class:
df = pd.read_csv("_data/loan.csv")
logit = LogisticRegression()


# Columns we care about:
df = df[['loan_default', 'loan_amount', 'debt_to_income', 'annual_income']]
# Split the data apart:
X,y = df.drop('loan_default', axis=1), df.loan_default.astype('category')


X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=421)
logit.fit(X_train, y_train);

… and we can check how well the model fits. For a classifier, score() returns the mean accuracy on the test set:

logit.score(X_test, y_test)
0.5963302752293578
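
Since the whole point of Logistic Regression is probabilities, it’s worth noting you can pull them out directly too. A minimal sketch, reusing logit and X_test from above:

# Predicted probability of each class for the test set;
# columns line up with logit.classes_ ("no" and "yes" here).
probs = logit.predict_proba(X_test)

# Hard predictions are just those probabilities thresholded at 0.5:
preds = logit.predict(X_test)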

You may see some posts saying you need to run the response through LabelEncoder() first, but with the category type in Pandas you shouldn’t need this. This is because the Categorical data type is the equivalent of a Factor in R, per the documentation:

All values of categorical data are either in categories or np.nan. Order is defined by the order of categories, not lexical order of the values. Internally, the data structure consists of a categories array and an integer array of codes which point to the real value in the categories array. cf: Docs
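
If you’re curious what those integer codes look like, here’s a quick illustrative one-liner on the same column (not something the model needs, just a peek under the hood):

# Peek at the integer codes the category dtype keeps internally;
# categories default to sorted order, so 'no' -> 0 and 'yes' -> 1.
df['loan_default'].astype('category').cat.codes.head()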

We’ll step through it anyway just to show that it ends up the same. We’ll create the encoder, convert the column, and then check the score below:

# generate encoder
le = preprocessing.LabelEncoder();

# "fit" the column to the encoder:
le.fit(df['loan_default']);
# convert the column for the response:
y_2 = le.transform(df['loan_default']);

# Same split as normal:
X_train, X_test, y_train, y_test = train_test_split(X,y_2, random_state=421)

logit.fit(X_train, y_train);

And, the score:

logit.score(X_test, y_test)
0.5963302752293578

And it’s the same, so skip the extra step and just use the category dtype.

Logistic Regression in R

Adding to the list of reasons I like using R more, this is how simple it is.

# install.packages(c('ISLR', 'caret'))
library(tidyverse)
library(caret)
Loading required package: lattice

Attaching package: 'caret'
The following object is masked from 'package:purrr':

    lift
default = read_csv('_data/loan.csv')
Rows: 872 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): loan_default, loan_purpose, missed_payment_2_yr
dbl (5): loan_amount, interest_rate, installment, annual_income, debt_to_income

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary( default )
 loan_default       loan_purpose       missed_payment_2_yr  loan_amount   
 Length:872         Length:872         Length:872          Min.   : 1000  
 Class :character   Class :character   Class :character    1st Qu.:10000  
 Mode  :character   Mode  :character   Mode  :character    Median :15700  
                                                           Mean   :17469  
                                                           3rd Qu.:25000  
                                                           Max.   :40000  
 interest_rate     installment      annual_income    debt_to_income  
 Min.   : 4.720   Min.   :  36.19   Min.   :  3120   Min.   :  0.00  
 1st Qu.: 7.492   1st Qu.: 279.48   1st Qu.: 46000   1st Qu.: 11.37  
 Median :10.220   Median : 451.46   Median : 67000   Median : 17.80  
 Mean   :10.762   Mean   : 517.35   Mean   : 79334   Mean   : 19.59  
 3rd Qu.:13.250   3rd Qu.: 737.44   3rd Qu.: 96000   3rd Qu.: 25.60  
 Max.   :20.000   Max.   :1566.59   Max.   :780000   Max.   :215.38  
default %>% head
# A tibble: 6 × 8
  loan_default loan_purpose      misse…¹ loan_…² inter…³ insta…⁴ annua…⁵ debt_…⁶
  <chr>        <chr>             <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1 no           debt_consolidati… no        25000    5.47    855.   62823   39.4 
2 yes          medical           no        10000   10.2     364.   40000   24.1 
3 no           small_business    no        13000    6.22    442.   65000   14.0 
4 no           small_business    no        36000    5.97   1152.  125000    8.09
5 yes          small_business    yes       12000   11.8     308.   65000   20.1 
6 yes          medical           no        13000   13.2     333.   87000   18.4 
# … with abbreviated variable names ¹​missed_payment_2_yr, ²​loan_amount,
#   ³​interest_rate, ⁴​installment, ⁵​annual_income, ⁶​debt_to_income
log_model <- default %>%
    mutate(did_default=as.factor(loan_default)) %>%
    glm(
        did_default ~ loan_amount + debt_to_income + annual_income,
        data = .,
        family = binomial
    )

summary( log_model )

Call:
glm(formula = did_default ~ loan_amount + debt_to_income + annual_income, 
    family = binomial, data = .)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.3610  -0.9659  -0.8107   1.2511   3.0721  

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -1.033e+00  2.155e-01  -4.796 1.62e-06 ***
loan_amount     2.518e-05  7.869e-06   3.200  0.00137 ** 
debt_to_income  2.898e-02  6.915e-03   4.192 2.77e-05 ***
annual_income  -5.783e-06  1.885e-06  -3.068  0.00215 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1163.5  on 871  degrees of freedom
Residual deviance: 1113.9  on 868  degrees of freedom
AIC: 1121.9

Number of Fisher Scoring iterations: 4
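
A quick aside on reading that table (my note, not part of the glm output): the estimates are on the log-odds scale, so exponentiating one gives an odds ratio. Taking the debt_to_income coefficient from the summary above:

$$
e^{\hat{\beta}_{\text{debt\_to\_income}}} = e^{0.02898} \approx 1.029
$$

So each additional point of debt-to-income multiplies the odds of default by roughly 1.03, holding the other predictors constant.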

There we go! Simple and easy as always.