# WTF is a Regression

re·gres·sion

/rəˈɡreSH(ə)n/
noun
2. STATISTICS a measure of the relation between the mean value of one variable (e.g., output) and corresponding values of other variables (e.g., time and cost).

Regression analysis is commonly cited in the data science community (and in science in general), is a building block of statistics, and is routinely referenced within the rapidly growing machine learning movement. So what is this mysterious math sorcery? Did Isaac Newton use regression analysis? These magical regressions seem important, so let’s dig deeper. Here’s what I’ve found, thanks to MIT News, circa 2010:

To grasp the basic concept, take the simplest form of a regression: a linear, bivariate regression, which describes an unchanging relationship between two (and not more) phenomena. Now suppose you are wondering if there is a connection between the time high school students spend doing French homework, and the grades they receive. These types of data can be plotted as points on a graph, where the x-axis is the average number of hours per week a student studies, and the y-axis represents exam scores out of 100. Together, the data points will typically scatter a bit on the graph. The regression analysis creates the single line that best summarizes the distribution of points.
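To make that "single line" a bit more concrete, here is a minimal sketch in plain Python. The hours and scores are made-up numbers, `fit_line` is a hypothetical helper of my own, and I'm assuming ordinary least squares as the fitting method (the usual choice, though the MIT piece doesn't name one).

```python
# Made-up data for illustration: weekly study hours and exam scores out of 100
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 58, 61, 70, 74, 81]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how much x varies on its own
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(hours, scores)
print(f"score = {intercept:.2f} + {slope:.2f} * hours")
# prints: score = 45.80 + 5.77 * hours
```

With these invented numbers, the fitted line says each extra weekly hour goes with roughly 5.8 more exam points. The points still scatter around the line; the line is just the best straight-line summary of them.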

OK, so it’s a tool for measuring correlation. Sounds useful. Now consider how a regression is represented mathematically:

Mathematically, the line representing a simple linear regression is expressed through a basic equation: Y = a0 + a1 X. Here X is hours spent studying per week, the “independent variable.” Y is the exam scores, the “dependent variable,” since — we believe — those scores depend on time spent studying. Additionally, a0 is the y-intercept (the value of Y when X is zero) and a1 is the slope of the line, characterizing the relationship between the two variables. (source: MIT News)
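To see what a0 and a1 actually do, here is a tiny sketch that evaluates the equation. The coefficient values are hypothetical, picked only so the arithmetic is easy to follow, and `predict_score` is my own illustrative name.

```python
# Hypothetical coefficients, chosen only to illustrate Y = a0 + a1 X
a0 = 50.0  # y-intercept: predicted score with zero hours of study
a1 = 5.0   # slope: predicted points gained per extra weekly hour


def predict_score(hours):
    """Evaluate Y = a0 + a1 * X for a given number of weekly study hours."""
    return a0 + a1 * hours


for hours in (0, 2, 4):
    print(hours, predict_score(hours))
```

At zero hours the prediction is just a0, and every additional hour moves the prediction up by exactly a1. That is all "intercept" and "slope" mean here.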

Geesh, that’s kinda dense. But can we run one on real data with Python? Ah, yes we can. Below is code gathered from Towards Data Science that runs a basic regression and generates predictions from the Boston house values dataset within scikit-learn.

Install scikit-learn and pandas first by entering the following in the terminal:

1. `pip install -U scikit-learn`
2. `python -m pip install pandas`

Now copy this code, save it as a .py file, and run it from your terminal or command prompt:

```python
from sklearn import linear_model
from sklearn import datasets ## imports datasets from scikit-learn
import pandas as pd

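# Load the Boston housing data. Note: load_boston was removed in
# scikit-learn 1.2, so this call assumes an older release.
data = datasets.load_boston()

# define the data/predictors as the pre-set feature names
df = pd.DataFrame(data.data, columns=data.feature_names)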

# Put the target (housing value -- MEDV) in another DataFrame
target = pd.DataFrame(data.target, columns=["MEDV"])

X = df
y = target["MEDV"]

lm = linear_model.LinearRegression()
model = lm.fit(X,y)
predictions = lm.predict(X)
print(predictions[0:5]) # print the first 5 predictions for y

# R² score of the model: the proportion of variance in y that the model explains
print(lm.score(X, y))
print(lm.coef_)       # fitted coefficients, one per feature
print(lm.intercept_)  # fitted intercept