### Linear Regression: Theory

Linear regression is a supervised machine learning technique for predicting a continuous output, under the assumption of a linear (constant-slope) relationship between the inputs and the output.

**There are two main types of linear regression:**

**1. Simple Regression:**

In simple linear regression we predict the response using a single feature.

Recall the line equation (**y = mx + c**) we studied in school. Let's see what the corresponding parameters mean and how this equation works in linear regression.

Y = β₀ + β₁X + ε

where:

- Y = the dependent variable (the variable we predict)
- X = the independent variable (the variable we use to make the prediction)
- β₀ = the intercept term: the predicted value of Y when X = 0
- β₁ = the slope term: the change in Y when X changes by 1 unit
- ε = the residual: the difference between the actual and predicted values
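As a minimal sketch, β₀ and β₁ can be estimated in closed form with NumPy. The data here is made up purely for illustration (Y is generated from a known line plus noise, so we can see the estimates recover it):

```python
import numpy as np

# Hypothetical toy data: Y roughly follows 2 + 3*X plus noise
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 50)
Y = 2.0 + 3.0 * X + rng.normal(0, 1, size=X.size)

# Closed-form least-squares estimates for simple linear regression:
# beta1 = cov(X, Y) / var(X),  beta0 = mean(Y) - beta1 * mean(X)
beta1 = np.cov(X, Y, bias=True)[0, 1] / np.var(X)
beta0 = Y.mean() - beta1 * X.mean()

Y_pred = beta0 + beta1 * X
residuals = Y - Y_pred  # the epsilon term: actual minus predicted

print(f"beta0 ~ {beta0:.2f}, beta1 ~ {beta1:.2f}")
```

With enough data the estimates land close to the true values (2 and 3) used to generate Y.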

**2. Multivariable Regression:**

This is an extension of simple linear regression: it models the relationship between two or more features and a response by fitting a linear equation to the observed data.

A multivariable linear equation might look like this, where each w represents a coefficient, or weight, that our model will try to learn:

f(x, y, z) = w₁x + w₂y + w₃z

**Let's understand it with an example.**

For sales prediction in a company, the features might include the company's advertising spend on radio, TV, and newspapers:

Sales = w₁·Radio + w₂·TV + w₃·News
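A minimal sketch of fitting those weights with NumPy's least-squares solver; the advertising data and the "true" weights are simulated for illustration, not real figures:

```python
import numpy as np

# Hypothetical advertising data: columns are Radio, TV, News spend
rng = np.random.default_rng(1)
ads = rng.uniform(0, 100, size=(200, 3))
true_w = np.array([0.5, 1.2, 0.1])            # weights the model should recover
sales = ads @ true_w + rng.normal(0, 2, 200)  # noisy linear response

# Least-squares fit of Sales = w1*Radio + w2*TV + w3*News
w, *_ = np.linalg.lstsq(ads, sales, rcond=None)
print("learned weights:", np.round(w, 2))
```

The learned weights come out close to the ones used to simulate the sales, which is exactly what "the model tries to learn the w's" means.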

**Linear regression: geometric representation**

**So the goal of a linear regression model is:**

Find a line or plane that best fits the data points. Here "best fit" means minimizing the sum of squared errors across the training data.

**Types of deliverables in linear regression:**

Typically, a business wants answers to the following questions:

- Prediction: what will the sales or profit be?
- Drivers: what drives the sales? (All variables with a significant beta.)
- Which factors are detrimental, and which are incremental?
- Of all the drivers, which one should be targeted first? (The variable with the highest absolute impact.)

To answer the last question, calculate **beta × X** for each X variable and choose the variable with the highest value as your driver; then convince the business why that particular driver was chosen.

**So now the question arises: how do we calculate the beta values?**

To calculate the betas we use the **OLS** (Ordinary Least Squares) method.
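A minimal sketch of OLS via the normal equations, β = (XᵀX)⁻¹XᵀY, with an intercept column added; the helper name `ols_beta` and the sanity-check data are made up for illustration:

```python
import numpy as np

def ols_beta(X, y):
    """Ordinary least squares: solve (X'X) beta = X'y,
    with an intercept column prepended to X."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return np.linalg.solve(X1.T @ X1, X1.T @ y)

# Sanity check on noiseless data generated from y = 1 + 2x
x = np.arange(10.0)
y = 1.0 + 2.0 * x
beta = ols_beta(x, y)
print(beta)
```

On noiseless data OLS recovers the intercept and slope exactly; on noisy data it returns the betas that minimize the sum of squared errors, as described above.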

**Assumptions of Linear Regression:**

**1. The X variables (explanatory variables) should be linearly related to Y (the response variable):**

**Meaning:**

If you plot a scatter plot between an X variable and Y, most of the data points should lie around a straight line.

**How to check?**

Draw a scatter plot between each X variable and Y.

**What happens if the assumption is violated?**

The MSE (mean squared error) will be high.

**What to do if a variable is not linear?**

- Drop the variable, but in that case we lose its information.
- Apply a log(x + 1) transform to the X variable.
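A quick numerical sketch of why the log(x + 1) transform helps. Here Y is simulated to depend on log(x), so the raw relationship is curved; Pearson correlation is used as a rough proxy for "how linear" each version is (the data is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(1, 1000, 300)
y = 5 * np.log(x) + rng.normal(0, 0.5, 300)  # nonlinear in x, linear in log(x)

# Pearson correlation as a rough linearity check
r_raw = np.corrcoef(x, y)[0, 1]
r_log = np.corrcoef(np.log1p(x), y)[0, 1]    # log(x + 1) transform
print(f"corr with raw x: {r_raw:.2f}, with log(x+1): {r_log:.2f}")
```

The transformed feature correlates far more strongly with Y, which is exactly the kind of improvement a scatter plot would show visually.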

**2. The residuals (or the Y variable) should be normally distributed:**

**Meaning:**

The residuals (errors), or Y, should produce a bell-shaped curve when plotted as a histogram.

**How to check?**

Plot a histogram of Y; if it produces a bell-shaped curve, normality holds.

Alternatively, use a Q-Q plot (quantile-quantile plot) of the residuals.

**What happens if the assumption is violated?**

The p-values will be calculated incorrectly and cannot be trusted.

**What to do if assumption is violated?**

We need to transform Y so that it becomes normal; a common way to do that is to take the log of Y.
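Alongside the histogram or Q-Q plot, one common numerical check is the Shapiro-Wilk test (here via SciPy, which is an assumption of this sketch). The simulated Y below is right-skewed, so the raw test rejects normality while the log-transformed version looks far more normal:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
y = rng.lognormal(mean=0.0, sigma=1.0, size=500)  # right-skewed, not normal

# Shapiro-Wilk test: a low p-value means "reject normality"
p_raw = stats.shapiro(y).pvalue
p_log = stats.shapiro(np.log(y)).pvalue           # after taking log of Y
print(f"p-value raw: {p_raw:.2e}, after log: {p_log:.2e}")
```

This mirrors the remedy above: the log transform turns a skewed Y into something close to normal.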

**3. There should not be any relationship between the X variables (i.e. no multicollinearity):**

**Meaning:**

The X variables should not have any linear relationship among themselves. It's obvious why: we don't want the same information repeated.

**How to check?**

- Calculate the correlation between every X variable and every other X variable.
- Alternatively, calculate the VIF (Variance Inflation Factor).

**What happens if the assumption is violated?**

The beta estimates become unstable, and their signs may fluctuate.

**What to do if assumption is violated?**

Drop the X variables whose VIF is greater than 10 (VIF > 10).
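A minimal sketch of computing VIF with plain NumPy: VIF for column j is 1 / (1 − R²), where R² comes from regressing column j on all the other columns. The helper name `vif` and the simulated data (where one column is nearly a copy of another) are illustrative:

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of X:
    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j on the remaining columns (plus an intercept)."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(4)
a = rng.normal(size=300)
b = rng.normal(size=300)
c = a + 0.05 * rng.normal(size=300)  # c is nearly a duplicate of a
X = np.column_stack([a, b, c])
print(np.round(vif(X), 1))
```

The near-duplicate columns get very large VIFs (well above the 10 cutoff), while the independent column stays near 1, illustrating the drop rule above.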

**4. The variance of the errors should remain constant across values of Y (homoscedasticity / no heteroscedasticity):**

**Meaning:**

The spread of the residuals should remain constant across values of Y.

**How to check?**

Draw a scatter plot of the residuals vs. Y.

**What happens if the assumption is violated?**

The p-values will not be accurate.

**What to do if assumption is violated?**

As with the normality assumption, transform Y, typically by taking the log of Y, so that the variance of the errors stabilizes.

**5. There should not be any autocorrelation between the residuals:**

**Meaning:**

Autocorrelation is the correlation of the residuals with the lead residuals, where a lead residual simply means the next residual in the sequence (which we will see in the next chapter).

**How to check?**

Use the DW statistic (Durbin-Watson statistic).

If the DW statistic is approximately 2, there is no autocorrelation.
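A minimal sketch of the Durbin-Watson statistic computed by hand with NumPy (the helper name `durbin_watson` and the simulated residual series are illustrative):

```python
import numpy as np

def durbin_watson(resid):
    """DW statistic: sum of squared successive differences of the
    residuals over their sum of squares; ~2 means no autocorrelation."""
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(5)
white = rng.normal(size=1000)               # independent residuals
trended = np.cumsum(rng.normal(size=1000))  # strongly autocorrelated (random walk)

print(f"independent residuals:    DW = {durbin_watson(white):.2f}")
print(f"autocorrelated residuals: DW = {durbin_watson(trended):.2f}")
```

Independent residuals give a DW near 2, while strongly autocorrelated residuals push it toward 0, matching the rule of thumb above.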

**What happens if the assumption is violated?**

The p-values will not be accurate.

**What to do if assumption is violated?**

First, understand why it is happening:

- If the autocorrelation is due to Y, then a linear regression model cannot be built.
- If the autocorrelation is due to an X variable, then drop that X variable.

**In the next lecture we will see how to implement linear regression in Python.**
