# Logistic Regression in Python and R

What Now If It’s Not Linear?

Published February 2, 2023

# What is Logistic Regression?

Instead of predicting actual values like Linear Regression does, Logistic Regression predicts the probability that an outcome belongs to a specific category. Note that the default assumption is Binomial, meaning the outcome falls into one of two classes. You can also model multiple classes, but that requires a different package which we won't be discussing today.

This is a simple and common way to solve Classification problems which we’ll be looking at next.
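Under the hood, Logistic Regression pushes a linear predictor through the logistic (sigmoid) function, which squeezes any real value into the (0, 1) range so it can be read as a probability. A minimal sketch of just that function:

```python
import math

def sigmoid(z):
    """Map a linear predictor onto a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

# A large negative score maps near 0, zero maps to exactly 0.5,
# and a large positive score maps near 1:
print(sigmoid(-4))  # ~0.018
print(sigmoid(0))   # 0.5
print(sigmoid(4))   # ~0.982
```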

## Logistic Regression in Python

Like most modeling problems, we’ll be falling back to the `sklearn` package. We’ll initialize the `LogisticRegression()` object, drop the columns we’re ignoring, and then get our results:

``````import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

# I got this from an online class:
logit = LogisticRegression()

# Load the loan data and keep only the columns we care about:
df = pd.read_csv('_data/loan.csv')
df = df[['loan_default', 'loan_amount', 'debt_to_income', 'annual_income']]

# Split the data apart:
X, y = df.drop('loan_default', axis=1), df.loan_default.astype('category')

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=421)
logit.fit(X_train, y_train);``````

… and we can check how well the model fits:

``logit.score(X_test, y_test)``
``0.5963302752293578``
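That score is just the accuracy on the held-out test set. If you want the probabilities the intro talked about, `predict_proba()` is the method to reach for. A quick sketch on synthetic data (the features and labels here are made up stand-ins, not the loan dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the loan data (illustration only):
rng = np.random.default_rng(421)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200) > 0).astype(int)

logit = LogisticRegression().fit(X, y)

# .predict() returns the class label; .predict_proba() returns the
# probability of each class, which is what logistic regression models:
proba = logit.predict_proba(X[:5])
print(proba.shape)        # (5, 2) -- one column per class
print(proba.sum(axis=1))  # each row sums to 1
```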

You may see some posts saying you should encode the labels with `LabelEncoder()`, but with the `category` type in Pandas you shouldn’t need this. This is because the Categorical data type is the equivalent of a Factor in R, per the documentation:

> All values of categorical data are either in categories or np.nan. Order is defined by the order of categories, not lexical order of the values. Internally, the data structure consists of a categories array and an integer array of codes which point to the real value in the categories array. cf: Docs
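You can see that categories-plus-codes structure directly. A tiny sketch (the `yes`/`no` labels mirror the loan data's `loan_default` column, but the Series itself is made up):

```python
import pandas as pd

s = pd.Series(['yes', 'no', 'yes', 'no']).astype('category')

# The categories array holds the unique labels (lexical order by default)...
print(list(s.cat.categories))  # ['no', 'yes']

# ...and the codes are the integer positions into that array:
print(list(s.cat.codes))       # [1, 0, 1, 0]
```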

We’ll step through it anyway just to show that it ends up the same. We’ll import, convert, and then get the score below:

``````# generate encoder
le = preprocessing.LabelEncoder()

# "fit" the column to the encoder:
le.fit(df['loan_default']);``````
``````# convert the column for the response:
y_2 = le.transform(df['loan_default']);

# Same split as normal:
X_train, X_test, y_train, y_test = train_test_split(X,y_2, random_state=421)

logit.fit(X_train, y_train);``````

And, the score:

``logit.score(X_test, y_test)``
``0.5963302752293578``

And it’s the same, so skip the encoder and just use the `category` type.
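As a sanity check you can run without the loan file, here's a sketch on a made-up Series showing the two sets of integer codes line up, since both sort the unique labels the same way:

```python
import pandas as pd
from sklearn import preprocessing

labels = pd.Series(['yes', 'no', 'no', 'yes'])  # made-up stand-in labels

# LabelEncoder's integer codes...
le = preprocessing.LabelEncoder()
encoded = le.fit_transform(labels)

# ...match the codes pandas assigns under the 'category' dtype:
cat_codes = labels.astype('category').cat.codes

print((encoded == cat_codes.to_numpy()).all())  # True
```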

## Logistic Regression in R

Adding to the list of reasons I like using R more, here’s how simple it is:

``````# install.packages(c('tidyverse', 'caret'))
library(tidyverse)
library(caret)``````
``Loading required package: lattice``
``````
Attaching package: 'caret'``````
``````The following object is masked from 'package:purrr':

lift``````
``default = read_csv('_data/loan.csv')``
``Rows: 872 Columns: 8``
``````── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): loan_default, loan_purpose, missed_payment_2_yr
dbl (5): loan_amount, interest_rate, installment, annual_income, debt_to_income

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.``````
``summary( default )``
`````` loan_default       loan_purpose       missed_payment_2_yr  loan_amount
Length:872         Length:872         Length:872          Min.   : 1000
Class :character   Class :character   Class :character    1st Qu.:10000
Mode  :character   Mode  :character   Mode  :character    Median :15700
Mean   :17469
3rd Qu.:25000
Max.   :40000
interest_rate     installment      annual_income    debt_to_income
Min.   : 4.720   Min.   :  36.19   Min.   :  3120   Min.   :  0.00
1st Qu.: 7.492   1st Qu.: 279.48   1st Qu.: 46000   1st Qu.: 11.37
Median :10.220   Median : 451.46   Median : 67000   Median : 17.80
Mean   :10.762   Mean   : 517.35   Mean   : 79334   Mean   : 19.59
3rd Qu.:13.250   3rd Qu.: 737.44   3rd Qu.: 96000   3rd Qu.: 25.60
Max.   :20.000   Max.   :1566.59   Max.   :780000   Max.   :215.38  ``````
``default %>% head``
``````# A tibble: 6 × 8
loan_default loan_purpose      misse…¹ loan_…² inter…³ insta…⁴ annua…⁵ debt_…⁶
<chr>        <chr>             <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1 no           debt_consolidati… no        25000    5.47    855.   62823   39.4
2 yes          medical           no        10000   10.2     364.   40000   24.1
3 no           small_business    no        13000    6.22    442.   65000   14.0
4 no           small_business    no        36000    5.97   1152.  125000    8.09
5 yes          small_business    yes       12000   11.8     308.   65000   20.1
6 yes          medical           no        13000   13.2     333.   87000   18.4
# … with abbreviated variable names ¹​missed_payment_2_yr, ²​loan_amount,
#   ³​interest_rate, ⁴​installment, ⁵​annual_income, ⁶​debt_to_income``````
``````log_model <- default %>%
  mutate(did_default = as.factor(loan_default)) %>%
  glm(
    did_default ~ loan_amount + debt_to_income + annual_income,
    data = .,
    family = binomial
  )

summary( log_model )``````
``````
Call:
glm(formula = did_default ~ loan_amount + debt_to_income + annual_income,
family = binomial, data = .)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-3.3610  -0.9659  -0.8107   1.2511   3.0721

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)    -1.033e+00  2.155e-01  -4.796 1.62e-06 ***
loan_amount     2.518e-05  7.869e-06   3.200  0.00137 **
debt_to_income  2.898e-02  6.915e-03   4.192 2.77e-05 ***
annual_income  -5.783e-06  1.885e-06  -3.068  0.00215 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1163.5  on 871  degrees of freedom
Residual deviance: 1113.9  on 868  degrees of freedom
AIC: 1121.9

Number of Fisher Scoring iterations: 4``````
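Those coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are easier to talk about. A quick sketch (in Python, to keep one language for the examples) using the estimates from the summary above:

```python
import math

# Coefficient estimates copied from the glm() summary (log-odds scale):
coefs = {
    'loan_amount': 2.518e-05,
    'debt_to_income': 2.898e-02,
    'annual_income': -5.783e-06,
}

# exp() turns a log-odds coefficient into an odds ratio per unit increase:
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}

# Each extra point of debt-to-income raises the odds of default by ~2.9%:
print(odds_ratios['debt_to_income'])  # ~1.029
```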

There we go! Simple and easy as always.