WTF is a Regression


wtf is a regression? – MIT News and I

1. a return to a former or less developed state.
2. STATISTICS a measure of the relation between the mean value of one variable (e.g., output) and corresponding values of other variables (e.g., time and cost).

Regression analysis is commonly cited in the data science community (and science in general), is a building block of statistics, and is routinely referenced within the rapidly growing machine learning movement. So what is this mysterious math sorcery? Did Isaac Newton use regression analysis? These magical regressions seem important. We shall dig deeper. Here’s what I’ve found, thanks to MIT News, circa 2010:

To grasp the basic concept, take the simplest form of a regression: a linear, bivariate regression, which describes an unchanging relationship between two (and not more) phenomena. Now suppose you are wondering if there is a connection between the time high school students spend doing French homework, and the grades they receive. These types of data can be plotted as points on a graph, where the x-axis is the average number of hours per week a student studies, and the y-axis represents exam scores out of 100. Together, the data points will typically scatter a bit on the graph. The regression analysis creates the single line that best summarizes the distribution of points.

OK, so it’s a correlation tool. Sounds useful. Additionally, consider how a regression is represented as a mathematical equation:

Mathematically, the line representing a simple linear regression is expressed through a basic equation: Y = a0 + a1 X. Here X is hours spent studying per week, the “independent variable.” Y is the exam scores, the “dependent variable,” since — we believe — those scores depend on time spent studying. Additionally, a0 is the y-intercept (the value of Y when X is zero) and a1 is the slope of the line, characterizing the relationship between the two variables. (source: MIT News)
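To make the equation concrete, here is a minimal sketch in plain Python that finds a0 and a1 with the standard least-squares formulas. The hours and scores are made-up numbers for illustration, and fit_line is a hypothetical helper name, not something from MIT News or any library.

```python
# Hypothetical data: weekly study hours (X) and exam scores out of 100 (Y).
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 58.0, 61.0, 70.0, 74.0]


def fit_line(xs, ys):
    """Return (a0, a1) for the least-squares line Y = a0 + a1 * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how X and Y vary together, divided by how much X varies.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx
    # Intercept: the fitted line always passes through (mean_x, mean_y).
    a0 = mean_y - a1 * mean_x
    return a0, a1


a0, a1 = fit_line(hours, scores)
print(f"intercept a0 = {a0:.2f}, slope a1 = {a1:.2f}")
print(f"predicted score for 6 hours of study: {a0 + a1 * 6:.2f}")
```

With this toy data the slope says how many extra exam points go with each extra hour of study, and the intercept is the score the line predicts for zero hours.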

Geesh, that’s kinda dense. But can we do it with Python? Ah, yes we can. Below is code from Towards Data Science that runs a basic regression and generates predictions from the Boston house values dataset within scikit-learn.

Install scikit-learn and pandas first by entering the following in the terminal:

  1. pip install -U scikit-learn
  2. python -m pip install pandas

Now copy this code, save it as a .py file, and run it from your terminal or command prompt:

from sklearn import linear_model
from sklearn import datasets  # imports datasets from scikit-learn
import pandas as pd

# Note: load_boston was removed in scikit-learn 1.2, so this script needs an older release.
data = datasets.load_boston()  # loads Boston dataset from datasets library

# define the data/predictors as the pre-set feature names
df = pd.DataFrame(data.data, columns=data.feature_names)

# Put the target (housing value -- MEDV) in another DataFrame
target = pd.DataFrame(data.target, columns=["MEDV"])

X = df
y = target["MEDV"]

lm = linear_model.LinearRegression()
model = lm.fit(X, y)  # fit the model to the predictors and target
predictions = lm.predict(X)
print(predictions[0:5])  # print the first 5 predictions for y

print(lm.score(X, y))  # R² score of our model: the proportion of variance in y that the model explains
print(lm.coef_)  # check coefficients
# More info:

source: Towards Data Science
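The R² score mentioned in the code can also be worked out by hand, which may make it less mysterious. The sketch below uses made-up study-hours data and a hypothetical r_squared helper (not part of scikit-learn). The coefficients a0 and a1 are assumed to be the least-squares fit for this toy data.

```python
# Hypothetical data: weekly study hours and exam scores out of 100.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 58.0, 61.0, 70.0, 74.0]

# Assumed least-squares coefficients for the toy data above.
a0, a1 = 46.2, 5.6
predictions = [a0 + a1 * x for x in hours]


def r_squared(ys, preds):
    """R² = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))  # what the line misses
    ss_tot = sum((y - mean_y) ** 2 for y in ys)  # total spread around the mean
    return 1 - ss_res / ss_tot


r2 = r_squared(scores, predictions)
print(f"R² = {r2:.3f}")
```

A value near 1 means the line accounts for most of the spread in the scores. A value near 0 means it does little better than always guessing the average score.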

The above example is one of many predictive models. Logistic regressions and random forests are examples of other models. This book dives deeper into them.

So there you have it: regressions… part tool for measuring the correlation between two variables, part fancy mathematical formula, part prediction generator that scikit-learn graciously affords us for making predictions about future data. Alright regressions, you can stay. 🙂
