## Simple linear regression from summarized data

Let `$(x_i, y_i), i=1,2, \cdots , n$` be $n$ pairs of observations.

The simple linear regression model of $Y$ on $X$ is

$$y_i=\beta_0 + \beta_1x_i +e_i$$

where

- $y_i$ is the value of the dependent variable for the $i$-th observation,
- $x_i$ is the value of the independent variable for the $i$-th observation,
- $\beta_0$ is the intercept,
- $\beta_1$ is the slope,
- $e_i$ is the error term.

## Formula

The parameters $\beta_0$ and $\beta_1$ of the simple linear regression model can be estimated using the method of least squares.

The regression coefficient $\beta_1$ (slope) can be estimated as

`$\hat{\beta}_1 = \frac{Cov(x,y)}{V(x)}=\dfrac{s_{xy}}{s_x^2}=r\dfrac{s_y}{s_x}$`

The regression coefficient $\beta_0$ (intercept) can be estimated as

`$\hat{\beta}_0=\overline{y}-\hat{\beta}_1\overline{x}$`

where

- `$\overline{x}=\dfrac{1}{n}\sum_{i=1}^n x_i$` is the sample mean of $X$,
- `$\overline{y}=\dfrac{1}{n}\sum_{i=1}^n y_i$` is the sample mean of $Y$,
- `$V(x) = s_x^2$` is the variance of $X$,
- `$V(y) = s_y^2$` is the variance of $Y$,
- `$Cov(x,y) = s_{xy}$` is the covariance between $X$ and $Y$,
- `$r=\dfrac{Cov(x,y)}{\sqrt{V(x)V(y)}}$` is the correlation coefficient between $X$ and $Y$,
- $n$ is the number of data points.
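The two estimators above need only the means, the variance of $X$, and the covariance. A minimal Python sketch follows; the function name `fit_from_summary` and its parameter names are illustrative assumptions, not part of any library.

```python
def fit_from_summary(mean_x, mean_y, var_x, cov_xy):
    """Return (intercept, slope) of the least squares line from summary statistics.

    Hypothetical helper: the name and signature are chosen for illustration only.
    """
    if var_x <= 0:
        raise ValueError("var_x must be positive")
    slope = cov_xy / var_x               # beta_1 hat = s_xy / s_x^2
    intercept = mean_y - slope * mean_x  # beta_0 hat = y-bar - beta_1 hat * x-bar
    return intercept, slope


# Toy numbers chosen for easy mental checking.
intercept, slope = fit_from_summary(mean_x=2.0, mean_y=5.0, var_x=4.0, cov_xy=6.0)
print(intercept, slope)
```

Note that $n$ does not enter the estimates, provided the variance and the covariance were computed with the same divisor ($n$ or $n-1$).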

## Example

For 42 firms, the number of employees $x$ and the amount of expenses for continuing education $y$ (in EUR) were recorded. The statistical summary of the data set is given by:

| Summary | Variable $x$ | Variable $y$ |
|---|---|---|
| Mean | 47 | 251 |
| Variance | 103 | 1960 |

The covariance between $X$ and $Y$ is equal to 377.6.

Estimate the expected amount of money spent for continuing education by a firm with 39 employees using least squares regression.

### Solution

Let $X$ denote the number of employees and $Y$ denote the amount of expenses for continuing education.

Given that $n=42$, the mean of $X$ is $\overline{x}=47$, the mean of $Y$ is $\overline{y}=251$, the variance of $X$ is `$V(x)=s_x^2=103$`, the variance of $Y$ is `$V(y)=s_y^2=1960$`, and the covariance between $X$ and $Y$ is `$Cov(x,y) =s_{xy} = 377.6$`.

The correlation coefficient between $X$ and $Y$ is

```
$$
\begin{aligned}
r&= \frac{Cov(x,y)}{\sqrt{V(x)V(y)}}\\
&=\frac{377.6}{\sqrt{103\times 1960}}\\
&=\frac{377.6}{\sqrt{201880}}\\
&=\frac{377.6}{449.3106}\\
&=0.8404
\end{aligned}
$$
```

The slope of the simple linear regression model is
```
$$
\begin{aligned}
\hat{\beta}_1 &= r \cdot \frac{s_y}{s_x}\\
&= 0.8404 \cdot \sqrt{\frac{1960}{103}}\\
&= 3.666.
\end{aligned}
$$
```

or, equivalently,
```
$$
\begin{aligned}
\hat{\beta}_1 &=\frac{Cov(x,y)}{V(x)}\\
&=\frac{s_{xy}}{s_x^2}\\
&=\frac{377.6}{103}\\
&= 3.666.
\end{aligned}
$$
```
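The two routes to the slope must agree, since $r\,s_y/s_x = s_{xy}/s_x^2$. As a quick numerical check, here is a short Python sketch using the variances and the covariance of 377.6 from the solution; the variable names are illustrative.

```python
import math

# Summary statistics as used in the solution.
var_x, var_y, cov_xy = 103.0, 1960.0, 377.6

r = cov_xy / math.sqrt(var_x * var_y)        # correlation coefficient
slope_from_r = r * math.sqrt(var_y / var_x)  # r * s_y / s_x
slope_from_cov = cov_xy / var_x              # s_xy / s_x^2

print(round(r, 4), round(slope_from_r, 3), round(slope_from_cov, 3))
```

Both expressions give the same value up to floating-point error, so in practice the covariance form is the shorter calculation when $s_{xy}$ is already known.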

The intercept of the simple linear regression model is
```
$$
\begin{aligned}
\hat{\beta}_0 &= \overline{y} - \hat{\beta}_1 \cdot \overline{x}\\
&=251 - 3.666 \cdot 47\\
&= 78.698.
\end{aligned}
$$
```

Using the estimates of $\beta_0$ and $\beta_1$, the fitted simple linear regression model for predicting the amount of expenses for continuing education from the number of employees is
```
$$
\begin{aligned}
\hat{y} &= \hat{\beta}_0 + \hat{\beta}_1 x\\
&=78.698+ 3.666\, x
\end{aligned}
$$
```

The expected amount of expenses for continuing education when the number of employees $x=39$ is

```
$$
\begin{aligned}
\hat{y}&=78.698 + (3.666)\times 39\\
&= 221.672\quad \text{ EUR}
\end{aligned}
$$
```
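The whole calculation can be cross-checked in a few lines of Python. This is a sketch under the assumption that the covariance is 377.6, as used throughout the solution; the variable names are illustrative.

```python
# Summary statistics from the example.
mean_x, mean_y = 47.0, 251.0
var_x, cov_xy = 103.0, 377.6

slope = cov_xy / var_x               # beta_1 hat
intercept = mean_y - slope * mean_x  # beta_0 hat

x_new = 39                           # firm with 39 employees
y_hat = intercept + slope * x_new    # predicted expenses in EUR

print(f"slope={slope:.3f}, intercept={intercept:.3f}, y_hat={y_hat:.3f}")
```

Because the slope is not rounded before it is reused, the intercept comes out near 78.697 and not 78.698. The prediction still agrees with the hand calculation to three decimal places.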