import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

# I got this from an online class:
df = pd.read_csv("_data/loan.csv")
logit = LogisticRegression()

# Columns we care about:
df = df[['loan_default', 'loan_amount', 'debt_to_income', 'annual_income']]

# Split the data apart:
X, y = df.drop('loan_default', axis=1), df.loan_default.astype('category')
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=421)

logit.fit(X_train, y_train)
What is Logistic Regression?
Instead of predicting actual values like Linear Regression, Logistic Regression predicts the probability that an outcome belongs to a specific category. Note that the default assumption is Binomial, meaning the outcome falls into one of two classes. You can also work with more than two classes, but that will require a different package which we won't be discussing today.
This is a simple and common way to solve Classification problems which we’ll be looking at next.
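For intuition, the model computes a linear score from the features and then squashes it through the logistic (sigmoid) function so the result always lands between 0 and 1. A minimal sketch of that mapping, with made-up coefficient values purely for illustration:

import numpy as np

def sigmoid(z):
    # squashes any real-valued score into the (0, 1) range
    return 1 / (1 + np.exp(-z))

# hypothetical linear score for one borrower: intercept + coefficient * feature
z = -1.0 + 0.03 * 20        # e.g. a debt_to_income of 20
p_default = sigmoid(z)      # about 0.40, read as the probability of the positive class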
Logistic Regression in Python
Like most modeling problems, we'll be falling back on the sklearn package. In the code at the top of the post we initialize the LogisticRegression() object, drop the columns we're ignoring, split the data, and fit the model… and then we can check how well the model fits:
logit.score(X_test, y_test)
0.5963302752293578
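Since the whole point is that the model outputs probabilities, you can also pull those out directly instead of just the hard class labels; a quick sketch using the fitted model from above:

# probability estimates for each class, columns ordered by logit.classes_
probs = logit.predict_proba(X_test)
probs[:5]

# predict() just thresholds these probabilities at 0.5 in the binary case
preds = logit.predict(X_test)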
You may see some posts saying you need to encode the response using LabelEncoder(), but with the category type in Pandas you shouldn't need this. This is because the Categorical data type is the equivalent of a Factor in R, per the documentation:
All values of categorical data are either in categories or np.nan. Order is defined by the order of categories, not lexical order of the values. Internally, the data structure consists of a categories array and an integer array of codes which point to the real value in the categories array. cf: Docs
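In other words, pandas already keeps an integer code per row behind the scenes. A quick look at the pieces the docs describe, using the loan_default column from the example data:

col = df['loan_default'].astype('category')
col.cat.categories   # the array of unique category values
col.cat.codes        # the integer codes that point back into the categories array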
We'll step through it anyway just to show that this ends up the same. We'll import, convert, and then get the score below:
# generate encoder
le = preprocessing.LabelEncoder()

# "fit" the column to the encoder:
le.fit(df['loan_default'])

# convert the column for the response:
y_2 = le.transform(df['loan_default'])

# Same split as normal:
X_train, X_test, y_train, y_test = train_test_split(X, y_2, random_state=421)
logit.fit(X_train, y_train)
And, the score:
logit.score(X_test, y_test)
0.5963302752293578
And it's the same, so skip it and just use categories.
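If you want to convince yourself, the integer codes from the two approaches line up; a quick check, assuming the df and le objects from the snippets above:

import numpy as np

# both the Categorical codes and LabelEncoder map the sorted unique labels to 0..n-1
np.array_equal(df['loan_default'].astype('category').cat.codes.to_numpy(),
               le.transform(df['loan_default']))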
Logistic Regression in R
Adding to the list of reasons I like using R more, this is how simple it is.
# install.packages(c('ISLR', 'caret'))
library(tidyverse)
library(caret)
Loading required package: lattice
Attaching package: 'caret'
The following object is masked from 'package:purrr':
lift
default = read_csv('_data/loan.csv')
Rows: 872 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): loan_default, loan_purpose, missed_payment_2_yr
dbl (5): loan_amount, interest_rate, installment, annual_income, debt_to_income
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary( default )
loan_default loan_purpose missed_payment_2_yr loan_amount
Length:872 Length:872 Length:872 Min. : 1000
Class :character Class :character Class :character 1st Qu.:10000
Mode :character Mode :character Mode :character Median :15700
Mean :17469
3rd Qu.:25000
Max. :40000
interest_rate installment annual_income debt_to_income
Min. : 4.720 Min. : 36.19 Min. : 3120 Min. : 0.00
1st Qu.: 7.492 1st Qu.: 279.48 1st Qu.: 46000 1st Qu.: 11.37
Median :10.220 Median : 451.46 Median : 67000 Median : 17.80
Mean :10.762 Mean : 517.35 Mean : 79334 Mean : 19.59
3rd Qu.:13.250 3rd Qu.: 737.44 3rd Qu.: 96000 3rd Qu.: 25.60
Max. :20.000 Max. :1566.59 Max. :780000 Max. :215.38
default %>% head
# A tibble: 6 × 8
loan_default loan_purpose misse…¹ loan_…² inter…³ insta…⁴ annua…⁵ debt_…⁶
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 no debt_consolidati… no 25000 5.47 855. 62823 39.4
2 yes medical no 10000 10.2 364. 40000 24.1
3 no small_business no 13000 6.22 442. 65000 14.0
4 no small_business no 36000 5.97 1152. 125000 8.09
5 yes small_business yes 12000 11.8 308. 65000 20.1
6 yes medical no 13000 13.2 333. 87000 18.4
# … with abbreviated variable names ¹missed_payment_2_yr, ²loan_amount,
# ³interest_rate, ⁴installment, ⁵annual_income, ⁶debt_to_income
log_model <- default %>%
  mutate(did_default = as.factor(loan_default)) %>%
  glm(
    did_default ~ loan_amount + debt_to_income + annual_income,
    data = .,
    family = binomial
  )
summary( log_model )
Call:
glm(formula = did_default ~ loan_amount + debt_to_income + annual_income,
family = binomial, data = .)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.3610 -0.9659 -0.8107 1.2511 3.0721
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.033e+00 2.155e-01 -4.796 1.62e-06 ***
loan_amount 2.518e-05 7.869e-06 3.200 0.00137 **
debt_to_income 2.898e-02 6.915e-03 4.192 2.77e-05 ***
annual_income -5.783e-06 1.885e-06 -3.068 0.00215 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1163.5 on 871 degrees of freedom
Residual deviance: 1113.9 on 868 degrees of freedom
AIC: 1121.9
Number of Fisher Scoring iterations: 4
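One way to read those coefficients: they are on the log-odds scale, so exponentiating gives odds ratios. For example, the loan_amount estimate of 2.518e-05 works out to exp(2.518e-05 × 10000) ≈ 1.29, roughly 29% higher odds of default for every extra $10,000 borrowed, and the debt_to_income estimate gives exp(0.02898) ≈ 1.03, about 3% higher odds per additional point of debt-to-income.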
There we go! Simple and easy as always.