import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
# I got this from an online class:
df = pd.read_csv("_data/loan.csv")
logit = LogisticRegression()
# Columns we care about:
df = df[['loan_default', 'loan_amount', 'debt_to_income', 'annual_income']]
# Split the data apart:
X,y = df.drop('loan_default', axis=1), df.loan_default.astype('category')
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=421)
logit.fit(X_train, y_train);
What is Logistic Regression?
Instead of predicting the actual value like Linear Regression, Logistic Regression predicts the probability that an outcome belongs to a specific category. The default assumption is binomial, meaning a choice between two classes. Logistic regression can also be extended to more than two classes, but we won't be discussing that today.
This is a simple and common way to solve classification problems, which we'll be looking at next.
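To make "predicts the probability" a bit more concrete, here is a minimal sketch on made-up data. The arrays and names (`X_toy`, `y_toy`, `toy_model`) are illustrative assumptions, not part of the loan dataset. It shows that the probability for the positive class is the sigmoid of the model's linear score, and that the predicted label is that probability cut at 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical): one feature, binary outcome
X_toy = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y_toy = np.array([0, 0, 0, 1, 0, 1, 1, 1])

toy_model = LogisticRegression().fit(X_toy, y_toy)

# Linear score (log-odds) for each row: intercept + coefficient * x
log_odds = toy_model.decision_function(X_toy)

# The sigmoid squeezes the log-odds into a probability between 0 and 1
sigmoid_probs = 1.0 / (1.0 + np.exp(-log_odds))

# predict_proba returns one column per class; column 1 is P(y == 1)
model_probs = toy_model.predict_proba(X_toy)[:, 1]

# For two classes, predict() amounts to thresholding that probability at 0.5
labels_from_probs = (model_probs > 0.5).astype(int)
```

So the model itself only ever produces a probability; the yes/no answer comes from where we draw the line.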
Logistic Regression in Python
Like most model problems, we'll be falling back to using the sklearn package. We'll initialize the LogisticRegression() object, drop the columns we will be ignoring, and then get our results:
… and we can check how well the model fits:
logit.score(X_test, y_test)
0.6192660550458715
You may see some posts saying you need to label-encode the response using the LabelEncoder(), but with the category type in Pandas you shouldn't need this. This is because the Categorical data type is the equivalent of a Factor in R per the Documentation:
All values of categorical data are either in categories or np.nan. Order is defined by the order of categories, not lexical order of the values. Internally, the data structure consists of a categories array and an integer array of codes which point to the real value in the categories array. cf: Docs
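A short sketch of what that quote describes, using a made-up column rather than the real loan data (the `loan_default` Series here is a hypothetical stand-in):

```python
import pandas as pd

# Hypothetical response column, similar in shape to loan_default
loan_default = pd.Series(["no", "yes", "no", "no", "yes"])

as_cat = loan_default.astype("category")
default_categories = as_cat.cat.categories.tolist()  # ['no', 'yes'] (sorted by default)
default_codes = as_cat.cat.codes.tolist()            # [0, 1, 0, 0, 1]

# An explicit category list sets the order, overriding the sorted default
custom = pd.Categorical(loan_default, categories=["yes", "no"], ordered=True)
custom_codes = custom.codes.tolist()                 # [1, 0, 1, 1, 0]

# Values outside the category list become NaN, stored as code -1
with_unknown = pd.Categorical(["yes", "maybe"], categories=["yes", "no"])
unknown_codes = with_unknown.codes.tolist()          # [0, -1]
```

The integer codes are the part that behaves like an R factor's underlying integers, and the categories array plays the role of the levels.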
We'll step through it anyway just to show that it ends up the same. We'll import, convert, and then give the score below:
# generate encoder
le = preprocessing.LabelEncoder();
# "fit" the column to the encoder:
le.fit(df['loan_default']);
# convert the column for the response:
y_2 = le.transform(df['loan_default']);
# Same split as normal:
X_train, X_test, y_train, y_test = train_test_split(X,y_2, random_state=421)
logit.fit(X_train, y_train);
And, the score:
logit.score(X_test, y_test)
0.6192660550458715
And, it's the same so skip it and just use categories.
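If you want to see why the two runs agree, here is a small sketch on toy data (the arrays and variable names are made up for illustration). It assumes the default category order, which is sorted, just like the encoder's classes. Under that assumption the encoder's integers are the same as the categorical's internal codes, and both fits end up with the same coefficients.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Toy data (hypothetical), standing in for the loan columns
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_text = pd.Series(["no", "no", "yes", "no", "yes", "yes"])

# Option 1: pandas category dtype, as in the first fit
y_cat = y_text.astype("category")

# Option 2: integer labels from LabelEncoder, as in the second fit
y_enc = LabelEncoder().fit_transform(y_text)

# The encoder's integers match the categorical's internal codes
same_codes = np.array_equal(y_enc, y_cat.cat.codes)

model_cat = LogisticRegression().fit(X_toy, y_cat)
model_enc = LogisticRegression().fit(X_toy, y_enc)

# Both fits learn the same coefficients; only the class names differ
same_coef = np.allclose(model_cat.coef_, model_enc.coef_)
```

The one visible difference is in `classes_`: the first model reports `'no'` and `'yes'`, the second reports `0` and `1`.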
Logistic Regression in R
Adding to the list of reasons I like using R more, this is how simple it is.
# install.packages(c('ISLR', 'caret'))
library(tidyverse)
library(caret)
Loading required package: lattice
Attaching package: 'caret'
The following object is masked from 'package:purrr':
    lift
default = read_csv('_data/loan.csv')
Rows: 872 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): loan_default, loan_purpose, missed_payment_2_yr
dbl (5): loan_amount, interest_rate, installment, annual_income, debt_to_income
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary( default )
 loan_default       loan_purpose       missed_payment_2_yr  loan_amount   
 Length:872         Length:872         Length:872          Min.   : 1000  
 Class :character   Class :character   Class :character    1st Qu.:10000  
 Mode  :character   Mode  :character   Mode  :character    Median :15700  
                                                           Mean   :17469  
                                                           3rd Qu.:25000  
                                                           Max.   :40000  
 interest_rate     installment      annual_income    debt_to_income  
 Min.   : 4.720   Min.   :  36.19   Min.   :  3120   Min.   :  0.00  
 1st Qu.: 7.492   1st Qu.: 279.48   1st Qu.: 46000   1st Qu.: 11.37  
 Median :10.220   Median : 451.46   Median : 67000   Median : 17.80  
 Mean   :10.762   Mean   : 517.35   Mean   : 79334   Mean   : 19.59  
 3rd Qu.:13.250   3rd Qu.: 737.44   3rd Qu.: 96000   3rd Qu.: 25.60  
 Max.   :20.000   Max.   :1566.59   Max.   :780000   Max.   :215.38  
default %>% head
# A tibble: 6 × 8
  loan_default loan_purpose       missed_payment_2_yr loan_amount interest_rate
  <chr>        <chr>              <chr>                     <dbl>         <dbl>
1 no           debt_consolidation no                        25000          5.47
2 yes          medical            no                        10000         10.2 
3 no           small_business     no                        13000          6.22
4 no           small_business     no                        36000          5.97
5 yes          small_business     yes                       12000         11.8 
6 yes          medical            no                        13000         13.2 
# ℹ 3 more variables: installment <dbl>, annual_income <dbl>,
#   debt_to_income <dbl>
log_model <- default %>%
    mutate(did_default=as.factor(loan_default)) %>%
    glm(
        did_default ~ loan_amount + debt_to_income + annual_income,
        data = .,
        family = binomial
    )
summary( log_model )
Call:
glm(formula = did_default ~ loan_amount + debt_to_income + annual_income, 
    family = binomial, data = .)
Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.3610  -0.9659  -0.8107   1.2511   3.0721  
Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -1.033e+00  2.155e-01  -4.796 1.62e-06 ***
loan_amount     2.518e-05  7.869e-06   3.200  0.00137 ** 
debt_to_income  2.898e-02  6.915e-03   4.192 2.77e-05 ***
annual_income  -5.783e-06  1.885e-06  -3.068  0.00215 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 1163.5  on 871  degrees of freedom
Residual deviance: 1113.9  on 868  degrees of freedom
AIC: 1121.9
Number of Fisher Scoring iterations: 4
There we go! Simple and easy as always.
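If you want to read those coefficients as something more intuitive, a common trick is to exponentiate them into odds ratios. The sketch below just does that arithmetic in Python on the estimates printed by summary( log_model ). It assumes the modeled outcome is "yes" (R treats the first factor level, here "no", as the baseline), and the 10,000-unit rescaling is my own choice for readability.

```python
import math

# Coefficient estimates copied from the summary(log_model) output above
coefficients = {
    "loan_amount": 2.518e-05,
    "debt_to_income": 2.898e-02,
    "annual_income": -5.783e-06,
}

# exp(coefficient) is the multiplicative change in the odds of default
# for a one-unit increase in that predictor, holding the others fixed
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

# A one-unit change in income is tiny, so also look at a 10,000-unit change
odds_ratio_per_10k_income = math.exp(coefficients["annual_income"] * 10_000)

for name, ratio in odds_ratios.items():
    print(f"{name}: {ratio:.4f}")
print(f"annual_income per 10,000: {odds_ratio_per_10k_income:.4f}")
```

An odds ratio above 1 means the odds of default go up as the predictor increases, and below 1 means they go down, which matches the signs in the table.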