```python
# Grab this here:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
import seaborn as sns
```
## What is Linear Regression?
Linear Regression is a reliable and common method of measuring the relationship between numeric variables in data. The simplest usage is over two numeric variables: it attempts to draw the line that best fits all the available points. This is a bit of an oversimplification, since you can also include categorical variables - at least in R, anyways - but this post is about the How and not the Why.
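Concretely, the simple two-variable case is the textbook line-plus-noise model (standard notation, not specific to any package):

$$ y = \beta_0 + \beta_1 x + \varepsilon $$

where $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\varepsilon$ is random error. Ordinary least squares picks the estimates that minimize the squared residuals, which for one predictor has a closed form:

$$ \hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} $$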
## Linear Regression in Python (Scikit-Learn)
Many languages have packages to solve problems like this for us, and Python is no exception. The packages for this in Python are `scikit-learn` and `statsmodels`. Most commonly, you will see `scikit-learn`, so we'll talk about that first. To install it, you'll want to run `python3 -m pip install scikit-learn`.
Most of the models you'll want to build come standard, but you'll need to use the documentation to find which one you're after. We'll be importing `LinearRegression` from `linear_model`, as shown in the imports at the top of this post.
Next we’ll get the data from the the seaborn
package since it conveniently contains common datasets - which we’ll be using now:
```python
# Get the data for both:
iris = sns.load_dataset('iris')
X = iris.sepal_length
y = iris.petal_length

# Normally, you'd want to do this but I'm ignoring it for the purpose of example.
# Set the size of the split based on your needs:
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.8, random_state=1111)
```
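For completeness, here's a minimal sketch of what that split-and-evaluate workflow might look like, using the `train_test_split` and `mean_absolute_error` we imported at the top; the `test_size`, `random_state`, and variable names here are illustrative, not recommendations:

```python
# Hold out a test set, fit on the rest, and score on rows the model never saw.
X_train, X_test, y_train, y_test = train_test_split(
    X.values.reshape(-1, 1), y, test_size=0.2, random_state=1111
)

held_out_lm = LinearRegression()
held_out_lm.fit(X_train, y_train)

# Mean absolute error on the held-out data:
print(mean_absolute_error(y_test, held_out_lm.predict(X_test)))
```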
Fitting a model in `scikit-learn` is extremely easy. All you need to do is initialize the model object, fit the model using the cleaned data, and then run predictions.
```python
lm = LinearRegression()
# X is a single feature, so you'll need to reshape it to a 2-D array; scikit-learn will throw an error otherwise.
lm.fit(X.values.reshape(-1, 1), y);
```
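Predictions come from the fitted object's `predict` method, which expects the same 2-D shape as `fit`; the sepal lengths below are arbitrary example values:

```python
# Predict petal length for two example sepal lengths:
print(lm.predict([[5.0], [6.5]]))
```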
Finally, you can get the slope and the intercept of the line, which is often what you'll be after; we'll need them for comparison later.
```python
intercept, slope = lm.intercept_, lm.coef_[0]
print(f"The Slope and Intercept of the line are: Slope is {round(slope, 2)} and Intercept is {round(intercept, 2)}")
```

```
The Slope and Intercept of the line are: Slope is 1.86 and Intercept is -7.1
```
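As a sanity check, you can reproduce those numbers by hand from the closed-form formulas given earlier; a quick sketch, assuming `numpy` is available:

```python
import numpy as np

# slope = cov(x, y) / var(x); intercept = mean(y) - slope * mean(x)
b1 = np.cov(X, y, ddof=1)[0, 1] / np.var(X, ddof=1)
b0 = y.mean() - b1 * X.mean()
print(round(b1, 2), round(b0, 2))  # matches the 1.86 and -7.1 above
```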
## Linear Regression in R
R is a language built around statistics, and as such much of the statistical tooling is built right into the default language. To build a linear model in R, one would use the `lm()` function with formula notation. If you've never seen this before, R has something called formula notation which allows you to specify a relationship as simple text, which then gets parsed into meaningful input for the function. So, if we want to define a relationship between x and y, we'd say `y ~ x`. There are other choices, and different functions use this notation in different ways, but this is how we're going to use it here.
Asking for a Linear Regression Model is as simple as:
```r
library(tidyverse)
```
```r
model = lm(Petal.Length ~ Sepal.Length, data = iris)
print(model)
```
```
Call:
lm(formula = Petal.Length ~ Sepal.Length, data = iris)

Coefficients:
 (Intercept)  Sepal.Length
      -7.101         1.858
```
```r
summary(model)
```
```
Call:
lm(formula = Petal.Length ~ Sepal.Length, data = iris)

Residuals:
     Min       1Q   Median       3Q      Max
-2.47747 -0.59072 -0.00668  0.60484  2.49512

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  -7.10144    0.50666  -14.02   <2e-16 ***
Sepal.Length  1.85843    0.08586   21.65   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8678 on 148 degrees of freedom
Multiple R-squared:   0.76,	Adjusted R-squared:  0.7583
F-statistic: 468.6 on 1 and 148 DF,  p-value: < 2.2e-16
```
Notice that we get the same Slope and Intercept as we did in Python. When transferring across languages, it’s a good idea to use what you know to check against what is new for you.
## Linear Regression in Python (statsmodels)
If you’re more comfortable with the R notation - like I happen to be - then using the statsmodel
package is the way to go. This allows you to use the formula notation instead of the Object Oriented way that scikit-learn
does. Since we’re already covered the forumula notation then we don’t have to review that; you’ll want the API interface for formulas but otherwise it’s the same as R.
```python
import statsmodels.formula.api as smf

results = smf.ols('petal_length ~ sepal_length', data=iris).fit()
results.summary()
```
```
                            OLS Regression Results
==============================================================================
Dep. Variable:           petal_length   R-squared:                       0.760
Model:                            OLS   Adj. R-squared:                  0.758
Method:                 Least Squares   F-statistic:                     468.6
Date:                Sat, 28 Jan 2023   Prob (F-statistic):           1.04e-47
Time:                        12:53:11   Log-Likelihood:                -190.57
No. Observations:                 150   AIC:                             385.1
Df Residuals:                     148   BIC:                             391.2
Df Model:                           1
Covariance Type:            nonrobust
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       -7.1014      0.507    -14.016      0.000      -8.103      -6.100
sepal_length     1.8584      0.086     21.646      0.000       1.689       2.028
==============================================================================
Omnibus:                        0.253   Durbin-Watson:                   1.204
Prob(Omnibus):                  0.881   Jarque-Bera (JB):                0.386
Skew:                          -0.082   Prob(JB):                        0.824
Kurtosis:                       2.812   Cond. No.                         43.4
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
```
I really like this output, and I also like that you can simply pull out the parts you need using dot notation:
```python
results.params
```

```
Intercept      -7.101443
sepal_length    1.858433
dtype: float64
```
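Other pieces of the fit come out the same way; this is just a sample of what's on the fitted results object, not an exhaustive list:

```python
print(results.rsquared)    # R-squared, matching the 0.760 in the summary
print(results.pvalues)     # p-values for each term
print(results.conf_int())  # 95% confidence intervals for the coefficients
```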
## Thoughts
Linear Regression is ubiquitous, and even people who are not statistically trained can understand that the line is a kind of trend line showing the direction and strength of the relationship. The real concern with tools such as these is that it is very easy to produce results without understanding why they could be incorrect. I'll go over the assumptions of a linear model in a future post.