What is Linear Regression?

Linear Regression is a reliable and common method of measuring the relationship between numeric variables in data. The simplest usage is over two numeric variables and attempts to draw a line between all the available points. This is a bit of an oversimplification since you can also do categorical variables - at least for R, anyways - but this post is about the the How and not the Why.

Linear Regression in Python (Scikit-Learn)

Many languages have packages to solve problems like this for us and Python is no exception. The pacakges for this in Python are scikit-learn and statsmodels. Mostly commonly, you will see scikit-learn so we’ll talk about that first. To install it, you’ll want to run python3 -m pip install scikit-learn.

Most of the models you’ll want to build will come standard but you’ll need to use the documentation to find which one you’re after. We’ll be importing from linear_model to get LinearRegression:

# Grab this here:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

import seaborn as sns

Next we’ll get the data from the the seaborn package since it conveniently contains common datasets - which we’ll be using now:


# Get the data for both:
iris = sns.load_dataset('iris')
X = iris.sepal_length
y = iris.petal_length

# Normally, you'd want to do this but I'm ignoring it for the purpose of example
# Set the size of the split based on your needs
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.8, random_state=1111)

Fitting a model in scikit-learn is extremly easy. All you need to do is initialize the model object, fit the model using the cleaned data and then run predictions.

lm = LinearRegression()
# This is one sample so you'll need to reshape; will throw error otherwise.
lm.fit(X.values.reshape(-1,1), y);

Finally, you can get the slope and the intercept of the line which is often what you’ll be after; we’ll need it for comparison later.

intercept, slope = lm.intercept_, lm.coef_[0]
print(f"The Slope and Intercept of the line are: Slope is {round(slope,2)} and Intercept is {round(intercept, 2)}")

The Slope and Intercept of the line are: Slope is 1.86 and Intercept is -7.1

Linear Regression in R

R is a language which is built around Statistics and as such much of the statistical tooling is built right into the default langauge. To build a Linear Model in R, one would use the lm() function passing using formula notation. If you’ve never seen this before, R has something called Formula Notation which allows you to specify a relationship in simple text and then gets parsed into meaningful input for the function. So, if we want to define a relationship between x and y then we’d say y ~ x. There are other choices for these as well as different functions using this notation in different ways but this is how we’re going to use it here.

Asking for a Linear Regression Model is as simple as:

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.3.0      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

model = lm(Petal.Length ~ Sepal.Length, data = iris)
print(model)


Call:
lm(formula = Petal.Length ~ Sepal.Length, data = iris)

Coefficients:
 (Intercept)  Sepal.Length  
      -7.101         1.858

summary(model)


Call:
lm(formula = Petal.Length ~ Sepal.Length, data = iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.47747 -0.59072 -0.00668  0.60484  2.49512 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -7.10144    0.50666  -14.02   <2e-16 ***
Sepal.Length  1.85843    0.08586   21.65   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8678 on 148 degrees of freedom
Multiple R-squared:   0.76, Adjusted R-squared:  0.7583 
F-statistic: 468.6 on 1 and 148 DF,  p-value: < 2.2e-16

Notice that we get the same Slope and Intercept as we did in Python. When transferring across languages, it’s a good idea to use what you know to check against what is new for you.

Linear Regression in Python (statsmodels)

If you’re more comfortable with the R notation - like I happen to be - then using the statsmodel package is the way to go. This allows you to use the formula notation instead of the Object Oriented way that scikit-learn does. Since we’re already covered the forumula notation then we don’t have to review that; you’ll want the API interface for formulas but otherwise it’s the same as R.

import statsmodels.formula.api as smf
results = smf.ols('petal_length ~ sepal_length', data=iris).fit()
results.summary()

OLS Regression Results
Dep. Variable:	petal_length	R-squared:	0.760
Model:	OLS	Adj. R-squared:	0.758
Method:	Least Squares	F-statistic:	468.6
Date:	Sat, 28 Jan 2023	Prob (F-statistic):	1.04e-47
Time:	12:53:11	Log-Likelihood:	-190.57
No. Observations:	150	AIC:	385.1
Df Residuals:	148	BIC:	391.2
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	-7.1014	0.507	-14.016	0.000	-8.103	-6.100
sepal_length	1.8584	0.086	21.646	0.000	1.689	2.028

Omnibus:	0.253	Durbin-Watson:	1.204
Prob(Omnibus):	0.881	Jarque-Bera (JB):	0.386
Skew:	-0.082	Prob(JB):	0.824
Kurtosis:	2.812	Cond. No.	43.4

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

I really like this output and also because you can simply pull out the parts you need using dot notation:

results.params

Intercept      -7.101443
sepal_length    1.858433
dtype: float64

Thoughts

Linear Regression is ubiquitous and even people who are not statistically trained can understand that the line is a kind of Trend Line for the diretion and strength of the relationship. The real concern with tools such as these is that it is very easy to come up with results without understanding why they could be incorrect. I’ll go over the assumptions of a linear model in another post in the future.