## Simple linear regression from raw data

Let `$(x_i, y_i),\ i=1,2, \cdots , n$` be $n$ pairs of observations.

The simple linear regression model of $Y$ on $X$ is

$$y_i=\beta_0 + \beta_1x_i +e_i$$

where,

- $y_i$ is the $i$-th value of the dependent variable $Y$,
- $x_i$ is the $i$-th value of the independent variable $X$,
- $\beta_0$ is the intercept,
- $\beta_1$ is the slope,
- $e_i$ is the error term.

## Formula

By the method of least squares, the regression coefficients $\beta_0$ (intercept) and $\beta_1$ (slope) can be estimated as

`$\hat{\beta}_1 = \frac{n \sum xy - (\sum x)(\sum y)}{n(\sum x^2) -(\sum x)^2}$`

`$\hat{\beta}_0=\overline{y}-\hat{\beta}_1\overline{x}$`

where,

- `$\overline{x}=\dfrac{1}{n}\sum_{i=1}^n x_i$` is the sample mean of $X$,
- `$\overline{y}=\dfrac{1}{n}\sum_{i=1}^n y_i$` is the sample mean of $Y$,
- $n$ is the number of data points.
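The two estimators translate directly into code. The following is a minimal Python sketch, assuming the observations arrive as two equal-length lists; the function name `fit_simple_linear_regression` and the toy data are illustrative and not part of the original text.

```python
def fit_simple_linear_regression(x, y):
    """Return (b0, b1): least-squares intercept and slope of y on x."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("at least two observations are required")

    # The four sums that appear in the slope formula.
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)

    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    b1 = (n * sum_xy - sum_x * sum_y) / denominator  # slope estimate
    b0 = sum_y / n - b1 * sum_x / n                  # intercept: y-bar - b1 * x-bar
    return b0, b1


# Toy data lying exactly on y = 1 + 2x (made up for illustration).
b0, b1 = fit_simple_linear_regression([0, 1, 2, 3], [1, 3, 5, 7])
print(b0, b1)  # 1.0 2.0
```

The guard on the denominator covers the case where every $x_i$ is the same, in which case the slope formula divides by zero.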

## Example

A study of the amount of rainfall and the quantity of air pollution removed produced the following data:

Daily Rainfall (0.01 cm) | 4.3 | 4.5 | 5.9 | 5.6 | 6.1 | 5.2 | 3.8 | 2.1 | 7.5 |
---|---|---|---|---|---|---|---|---|---|
Particulate Removed ($\mu g/m^3$) | 126 | 121 | 116 | 118 | 114 | 118 | 132 | 141 | 108 |

a. Find the equation of the regression line to predict the particulate removed from the amount of daily rainfall.

b. Estimate the amount of particulate removed when the daily rainfall is $x=4.8$ (in units of 0.01 cm).

### Solution

Let $x$ denote the daily rainfall and $y$ denote the particulate removed.

Let the simple linear regression model of $Y$ on $X$ be

$$y=\beta_0 + \beta_1x +e$$

By the method of least squares, the estimates of $\beta_1$ and $\beta_0$ are respectively

```
$$
\begin{aligned}
\hat{\beta}_1 & = \frac{n \sum xy - (\sum x)(\sum y)}{n(\sum x^2) -(\sum x)^2}
\end{aligned}
$$
```

and

```
$$
\begin{aligned}
\hat{\beta}_0&=\overline{y}-\hat{\beta}_1\overline{x}
\end{aligned}
$$
```

Obs. | $x$ | $y$ | $x^2$ | $y^2$ | $xy$ |
---|---|---|---|---|---|
1 | 4.3 | 126 | 18.49 | 15876 | 541.8 |
2 | 4.5 | 121 | 20.25 | 14641 | 544.5 |
3 | 5.9 | 116 | 34.81 | 13456 | 684.4 |
4 | 5.6 | 118 | 31.36 | 13924 | 660.8 |
5 | 6.1 | 114 | 37.21 | 12996 | 695.4 |
6 | 5.2 | 118 | 27.04 | 13924 | 613.6 |
7 | 3.8 | 132 | 14.44 | 17424 | 501.6 |
8 | 2.1 | 141 | 4.41 | 19881 | 296.1 |
9 | 7.5 | 108 | 56.25 | 11664 | 810.0 |
Total | 45.0 | 1094 | 244.26 | 133786 | 5348.2 |
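The column totals in the table can be checked with a few lines of Python. This is a small sketch using the rainfall data from the example; the variable names are chosen here for illustration.

```python
# Data from the example: daily rainfall (0.01 cm) and particulate removed.
rainfall = [4.3, 4.5, 5.9, 5.6, 6.1, 5.2, 3.8, 2.1, 7.5]  # x
removed = [126, 121, 116, 118, 114, 118, 132, 141, 108]   # y

# One total per column of the computation table.
totals = {
    "x": sum(rainfall),
    "y": sum(removed),
    "x^2": sum(x * x for x in rainfall),
    "y^2": sum(y * y for y in removed),
    "xy": sum(x * y for x, y in zip(rainfall, removed)),
}

for name, value in totals.items():
    print(f"{name:>4}: {value:.2f}")
```

Rounded to two decimals, the printed values should agree with the Total row.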

The sample mean of $x$ is
```
$$
\begin{aligned}
\overline{x}&=\frac{1}{n} \sum_{i=1}^n x_i\\
&=\frac{45}{9}\\
&=5
\end{aligned}
$$
```

The sample mean of $y$ is
```
$$
\begin{aligned}
\overline{y}&=\frac{1}{n} \sum_{i=1}^n y_i\\
&=\frac{1094}{9}\\
&=121.5556
\end{aligned}
$$
```

The estimate of $\beta_1$, denoted $b_1$, is given by
```
$$
\begin{aligned}
b_1 & = \frac{n \sum xy - (\sum x)(\sum y)}{n(\sum x^2) -(\sum x)^2}\\
& = \frac{9\times 5348.2-(45)(1094)}{9\times(244.26)-(45)^2}\\
&= \frac{-1096.2}{173.34}\\
&= -6.324.
\end{aligned}
$$
```

The estimate of the intercept $\beta_0$, denoted $b_0$, is

```
$$
\begin{aligned}
b_0&=\overline{y}-b_1\overline{x}\\
&=121.5556-(-6.324)\times 5\\
&=153.1756.
\end{aligned}
$$
```

The fitted simple linear regression model for predicting particulate removed from daily rainfall is
```
$$
\begin{aligned}
\hat{y} &= 153.1756 - 6.324\,x
\end{aligned}
$$
```

The estimated amount of particulate removed when the daily rainfall is $4.8$ (0.01 cm) is

```
$$
\begin{aligned}
\hat{y}&=153.1756 + (-6.324)\times 4.8\\
&= 122.8204\quad \mu g/m^3
\end{aligned}
$$
```
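The arithmetic above can be reproduced from the table totals alone. The sketch below plugs the totals into the same formulas and then evaluates the fitted line at $x=4.8$; the helper name `predict` is illustrative. Because the code carries the slope at full precision rather than rounding it to $-6.324$ first, the intercept may differ from the hand calculation in the last decimal place.

```python
# Totals taken from the computation table in the solution.
n = 9
sum_x, sum_y = 45.0, 1094.0
sum_x2, sum_xy = 244.26, 5348.2

# Least-squares estimates, using the same formulas as the hand calculation.
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = sum_y / n - b1 * sum_x / n


def predict(x):
    """Estimated particulate removed for a daily rainfall of x (0.01 cm)."""
    return b0 + b1 * x


print(f"slope      = {b1:.4f}")
print(f"intercept  = {b0:.4f}")
print(f"prediction = {predict(4.8):.4f}")
```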

**Computation of the various sums of squares**

Obs. | $x$ | $y$ | $\hat{y}$ | $(y-\hat{y})^2$ | $(\hat{y}-\overline{y})^2$ | $(y-\overline{y})^2$ |
---|---|---|---|---|---|---|
1 | 4.3 | 126 | 125.9823 | 0.0003 | 19.5961 | 19.7531 |
2 | 4.5 | 121 | 124.7175 | 13.8198 | 9.9979 | 0.3086 |
3 | 5.9 | 116 | 115.8640 | 0.0185 | 32.3938 | 30.8642 |
4 | 5.6 | 118 | 117.7612 | 0.0570 | 14.3971 | 12.6420 |
5 | 6.1 | 114 | 114.5992 | 0.3590 | 48.3909 | 57.0864 |
6 | 5.2 | 118 | 120.2908 | 5.2478 | 1.5996 | 12.6420 |
7 | 3.8 | 132 | 129.1443 | 8.1550 | 57.5890 | 109.0864 |
8 | 2.1 | 141 | 139.8951 | 1.2208 | 336.3389 | 378.0864 |
9 | 7.5 | 108 | 105.7456 | 5.0823 | 249.9547 | 183.7531 |
Total | 45.0 | 1094 | 1094.0000 | 33.9605 | 770.2580 | 804.2222 |

- Explained variation

```
$$
\begin{aligned}
SSR &= \sum(\hat{y}-\overline{y})^2\\
&=770.258
\end{aligned}
$$
```

- Unexplained variation

```
$$
\begin{aligned}
SSE &= \sum (y-\hat{y})^2\\
&=33.9605
\end{aligned}
$$
```

- Total variation

```
$$
\begin{aligned}
SST &= \sum (y-\overline{y})^2\\
&=804.2222
\end{aligned}
$$
```

- Coefficient of determination

```
$$
\begin{aligned}
R^2 &=\frac{SSR}{SST}\\
&= \frac{770.258}{804.2222}\\
&=0.9578
\end{aligned}
$$
```

That is, about $95.78$ percent of the variation in particulate removed (the dependent variable) is explained by daily rainfall (the independent variable).

- Standard error of estimate

```
$$
\begin{aligned}
S_e &= \sqrt{\frac{\sum(y-\hat{y})^2}{n-2}}\\
&=\sqrt{\frac{SSE}{n-2}}\\
&=\sqrt{\frac{33.9605}{7}}\\
&=2.2026
\end{aligned}
$$
```
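The sums of squares, $R^2$ and $S_e$ can also be computed directly from the raw data. The following Python sketch uses the same formulas as above; the variable names are illustrative. Since it works with unrounded fitted values, the results may differ slightly in the last digits from the table, which sums rounded entries.

```python
import math

# Data from the example: daily rainfall (0.01 cm) and particulate removed.
rainfall = [4.3, 4.5, 5.9, 5.6, 6.1, 5.2, 3.8, 2.1, 7.5]  # x
removed = [126, 121, 116, 118, 114, 118, 132, 141, 108]   # y
n = len(rainfall)

# Least-squares slope and intercept.
sum_x, sum_y = sum(rainfall), sum(removed)
sum_xy = sum(x * y for x, y in zip(rainfall, removed))
sum_x2 = sum(x * x for x in rainfall)
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
mean_y = sum_y / n
b0 = mean_y - b1 * sum_x / n

# Fitted values and the three sums of squares.
fitted = [b0 + b1 * x for x in rainfall]
sse = sum((y - f) ** 2 for y, f in zip(removed, fitted))  # unexplained
ssr = sum((f - mean_y) ** 2 for f in fitted)              # explained
sst = sum((y - mean_y) ** 2 for y in removed)             # total

r_squared = ssr / sst
std_error = math.sqrt(sse / (n - 2))

print(f"SSR = {ssr:.4f}, SSE = {sse:.4f}, SST = {sst:.4f}")
print(f"R^2 = {r_squared:.4f}, S_e = {std_error:.4f}")
```

A useful sanity check is that `ssr + sse` equals `sst` up to floating-point error, which holds for a least-squares fit that includes an intercept.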