## Karl Pearson's Correlation Coefficient

Let $(x_i, y_i)$, $i=1,2, \cdots , n$, be $n$ pairs of observations. Karl Pearson's coefficient of correlation between two variables $X$ and $Y$ is denoted by $r_{xy}$ or $r$ and is given by

`$r = \dfrac{Cov(x,y)}{\sqrt{V(x) V(y)}} = \dfrac{s_{xy}}{\sqrt{s_x^2 s_y^2}}$`

where,

- the sample covariance between $x$ and $y$ is

`$$ \begin{aligned} Cov(x,y) =s_{xy}&=\frac{1}{n-1}\sum_{i=1}^{n}(x_i -\overline{x})(y_i-\overline{y})\\ &= \frac{1}{n-1}\bigg(\sum_{i=1}^n x_iy_i - \frac{(\sum_{i=1}^n x_i)(\sum_{i=1}^n y_i)}{n}\bigg) \end{aligned} $$`

- the sample variance of $x$ is

`$$ \begin{aligned} V(x) =s_{x}^2 &=\frac{1}{n-1}\sum_{i=1}^{n}(x_i -\overline{x})^2\\ &= \frac{1}{n-1}\bigg(\sum_{i=1}^n x_i^2 - \frac{(\sum_{i=1}^n x_i)^2}{n}\bigg) \end{aligned} $$`

- the sample variance of $y$ is

`$$ \begin{aligned} V(y) =s_{y}^2 &=\frac{1}{n-1}\sum_{i=1}^{n}(y_i -\overline{y})^2\\ &= \frac{1}{n-1}\bigg(\sum_{i=1}^n y_i^2 - \frac{(\sum_{i=1}^n y_i)^2}{n}\bigg) \end{aligned} $$`

- the sample mean of $x$ is

`$$ \begin{aligned} \overline{x}&=\frac{1}{n}\sum_{i=1}^n x_i \end{aligned} $$`

- the sample mean of $y$ is

`$$ \begin{aligned} \overline{y}&=\frac{1}{n}\sum_{i=1}^n y_i \end{aligned} $$`
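The definitions above translate directly into a few lines of code. The following is a minimal sketch, assuming plain Python lists as input; the function name `pearson_r` is chosen here for illustration and is not part of any library. It uses the sum-based (computational) forms of the covariance and variances.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample Pearson correlation from the sum-based formulas (n - 1 denominator)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")

    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    # Sample covariance and variances, exactly as in the formulas above.
    s_xy = (sum_xy - sum_x * sum_y / n) / (n - 1)
    s_x2 = (sum_x2 - sum_x ** 2 / n) / (n - 1)
    s_y2 = (sum_y2 - sum_y ** 2 / n) / (n - 1)

    # If either variable is constant, a variance is zero and r is undefined;
    # this sketch lets the resulting ZeroDivisionError propagate.
    return s_xy / sqrt(s_x2 * s_y2)
```

Note that the factor $\frac{1}{n-1}$ cancels in the ratio, so $r$ is the same whether the divisor is $n-1$ or $n$, provided it is used consistently in all three quantities.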

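The correlation coefficient $r$ cannot exceed unity in absolute value, i.e., $|r|\leq 1$, or equivalently $-1 \leq r \leq +1$.

Two independent variables are uncorrelated, but the converse is not necessarily true: uncorrelated variables can still be dependent, because $r$ measures only linear association.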
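A small illustration of why zero correlation does not imply independence: in the sketch below, $y$ is completely determined by $x$ (here $y = x^2$, a data set made up for illustration), yet the sample covariance, and hence $r$, is zero because the relationship is symmetric rather than linear.

```python
from math import sqrt

# Illustrative data: y depends on x exactly, but not linearly.
xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Deviation-form sample covariance and variances.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
s_x2 = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
s_y2 = sum((y - mean_y) ** 2 for y in ys) / (n - 1)

r = s_xy / sqrt(s_x2 * s_y2)
print(r)  # zero: positive and negative products of deviations cancel
```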

## Interpretation

- If $r = 0$, then there is no linear correlation between the variables.
- If $r > 0$, then there is a positive correlation between the variables.
  - If $r = 1$, then there is a perfect positive correlation between the variables.
  - If $0 < r < 1$, then there is a partial positive correlation between the variables.
- If $r < 0$, then there is a negative correlation between the variables.
  - If $r = -1$, then there is a perfect negative correlation between the variables.
  - If $-1 < r < 0$, then there is a partial negative correlation between the variables.
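The cases above can be sketched as a small helper. The function name `describe_r` and the tolerance `tol` are assumptions made for this illustration; a tolerance is used because a computed $r$ rarely equals $0$ or $\pm 1$ exactly in floating-point arithmetic.

```python
from math import isclose


def describe_r(r, tol=1e-9):
    """Map a correlation coefficient to the verbal categories listed above."""
    if not (-1 - tol <= r <= 1 + tol):
        raise ValueError("r must lie in [-1, 1]")
    if isclose(r, 0.0, abs_tol=tol):
        return "no linear correlation"
    sign = "positive" if r > 0 else "negative"
    if isclose(abs(r), 1.0, abs_tol=tol):
        return f"perfect {sign} correlation"
    return f"partial {sign} correlation"
```

For instance, `describe_r(0.9756)` falls in the partial positive case and `describe_r(-1.0)` in the perfect negative case.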

## Example 1

A study was conducted to analyze the relationship between advertising expenditure and sales. The following data were recorded:

X Advertising (in \$) | 20 | 24 | 30 | 32 | 35 |
---|---|---|---|---|---|
Y Sales (in \$) | 310 | 340 | 400 | 420 | 490 |

Compute the correlation coefficient between advertising expenditure and sales.

### Solution

Let $x$ denote the advertising expenditure and $y$ denote the sales.

$i$ | $x$ | $y$ | $x^2$ | $y^2$ | $xy$ |
---|---|---|---|---|---|
1 | 20 | 310 | 400 | 96100 | 6200 |
2 | 24 | 340 | 576 | 115600 | 8160 |
3 | 30 | 400 | 900 | 160000 | 12000 |
4 | 32 | 420 | 1024 | 176400 | 13440 |
5 | 35 | 490 | 1225 | 240100 | 17150 |
Total | 141 | 1960 | 4125 | 788200 | 56950 |

The sample variance of $x$ is

`$$ \begin{aligned} s_{x}^2 & = \frac{1}{n-1}\bigg(\sum x^2 - \frac{(\sum x)^2}{n}\bigg)\\ & = \frac{1}{5-1}\bigg(4125-\frac{(141)^2}{5}\bigg)\\ &= \frac{1}{4}\bigg(4125-\frac{19881}{5}\bigg)\\ &= \frac{1}{4}\bigg(4125-3976.2\bigg)\\ &= \frac{148.8}{4}\\ &= 37.2. \end{aligned} $$`

The sample variance of $y$ is

`$$ \begin{aligned} s_{y}^2 & = \frac{1}{n-1}\bigg(\sum y^2 - \frac{(\sum y)^2}{n}\bigg)\\ & = \frac{1}{5-1}\bigg(788200-\frac{(1960)^2}{5}\bigg)\\ &= \frac{1}{4}\bigg(788200-\frac{3841600}{5}\bigg)\\ &= \frac{1}{4}\bigg(788200-768320\bigg)\\ &= \frac{19880}{4}\\ &= 4970. \end{aligned} $$`

The sample covariance between $x$ and $y$ is

`$$ \begin{aligned} s_{xy} & = \frac{1}{n-1}\bigg(\sum xy - \frac{(\sum x)(\sum y)}{n}\bigg)\\ & = \frac{1}{5-1}\bigg(56950-\frac{(141)(1960)}{5}\bigg)\\ &= \frac{1}{4}\bigg(56950-\frac{276360}{5}\bigg)\\ &= \frac{1}{4}\bigg(56950-55272\bigg)\\ &= \frac{1678}{4}\\ &= 419.5. \end{aligned} $$`

Karl Pearson's sample correlation coefficient between **advertising expenditure** and **sales** is

`$$ \begin{aligned} r_{xy} & = \frac{Cov(x,y)}{\sqrt{V(x) V(y)}}\\ &= \frac{s_{xy}}{\sqrt{s_x^2s_y^2}}\\ &=\frac{419.5}{\sqrt{37.2\times 4970}}\\ &=\frac{419.5}{\sqrt{184884}}\\ &=0.9756. \end{aligned} $$`

The correlation coefficient between **advertising expenditure** and **sales** is $0.9756$. Since the value is positive and close to $1$, there is a strong positive linear relationship between advertising expenditure and sales.
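As a cross-check of the hand computation, the sketch below recomputes the same quantities from the raw data using the deviation form of the formulas (rather than the sum-based form used above). The variable names are chosen for illustration.

```python
from math import sqrt

x = [20, 24, 30, 32, 35]       # advertising expenditure
y = [310, 340, 400, 420, 490]  # sales
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Deviation-form sample variances and covariance (n - 1 denominator).
s_x2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
s_y2 = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

r = s_xy / sqrt(s_x2 * s_y2)
print(f"s_x^2 = {s_x2:.4f}, s_y^2 = {s_y2:.4f}, s_xy = {s_xy:.4f}, r = {r:.4f}")
```

Both forms of the formulas should agree up to rounding, so the printed values can be compared with $37.2$, $4970$, $419.5$ and $0.9756$ from the solution.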

## Example 2

A study of the amount of rainfall and the quantity of air pollution removed produced the following data:

Daily Rainfall (0.01 cm) | 4.3 | 4.5 | 5.9 | 5.6 | 6.1 | 5.2 | 3.8 | 2.1 | 7.5 |
---|---|---|---|---|---|---|---|---|---|
Particulate Removed ($\mu g/m^3$) | 126 | 121 | 116 | 118 | 114 | 118 | 132 | 141 | 108 |

Calculate the correlation coefficient between daily rainfall and particulate removed.

### Solution

Let $x$ denote the daily rainfall (0.01 cm) and $y$ denote the particulate removed ($\mu g/m^3$).

$i$ | $x$ | $y$ | $x^2$ | $y^2$ | $xy$ |
---|---|---|---|---|---|
1 | 4.3 | 126 | 18.49 | 15876 | 541.8 |
2 | 4.5 | 121 | 20.25 | 14641 | 544.5 |
3 | 5.9 | 116 | 34.81 | 13456 | 684.4 |
4 | 5.6 | 118 | 31.36 | 13924 | 660.8 |
5 | 6.1 | 114 | 37.21 | 12996 | 695.4 |
6 | 5.2 | 118 | 27.04 | 13924 | 613.6 |
7 | 3.8 | 132 | 14.44 | 17424 | 501.6 |
8 | 2.1 | 141 | 4.41 | 19881 | 296.1 |
9 | 7.5 | 108 | 56.25 | 11664 | 810.0 |
Total | 45.0 | 1094 | 244.26 | 133786 | 5348.2 |

The sample variance of $x$ is

`$$ \begin{aligned} s_{x}^2 & = \frac{1}{n-1}\bigg(\sum x^2 - \frac{(\sum x)^2}{n}\bigg)\\ & = \frac{1}{9-1}\bigg(244.26-\frac{(45)^2}{9}\bigg)\\ &= \frac{1}{8}\bigg(244.26-\frac{2025}{9}\bigg)\\ &= \frac{1}{8}\bigg(244.26-225\bigg)\\ &= \frac{19.26}{8}\\ &= 2.4075. \end{aligned} $$`

The sample variance of $y$ is

`$$ \begin{aligned} s_{y}^2 & = \frac{1}{n-1}\bigg(\sum y^2 - \frac{(\sum y)^2}{n}\bigg)\\ & = \frac{1}{9-1}\bigg(133786-\frac{(1094)^2}{9}\bigg)\\ &= \frac{1}{8}\bigg(133786-\frac{1196836}{9}\bigg)\\ &= \frac{1}{8}\bigg(133786-132981.7778\bigg)\\ &= \frac{804.2222}{8}\\ &= 100.5278. \end{aligned} $$`

The sample covariance between $x$ and $y$ is

`$$ \begin{aligned} s_{xy} & = \frac{1}{n-1}\bigg(\sum xy - \frac{(\sum x)(\sum y)}{n}\bigg)\\ & = \frac{1}{9-1}\bigg(5348.2-\frac{(45)(1094)}{9}\bigg)\\ &= \frac{1}{8}\bigg(5348.2-\frac{49230}{9}\bigg)\\ &= \frac{1}{8}\bigg(5348.2-5470\bigg)\\ &= \frac{-121.8}{8}\\ &= -15.225. \end{aligned} $$`

Karl Pearson's sample correlation coefficient between **daily rainfall** and **particulate removed** is

`$$ \begin{aligned} r_{xy} & = \frac{Cov(x,y)}{\sqrt{V(x) V(y)}}\\ &= \frac{s_{xy}}{\sqrt{s_x^2s_y^2}}\\ &=\frac{-15.225}{\sqrt{2.4075\times 100.5278}}\\ &=\frac{-15.225}{\sqrt{242.0207}}\\ &=-0.9787. \end{aligned} $$`

The correlation coefficient between **daily rainfall** and **particulate removed** is $-0.9787$. Since the value is negative and close to $-1$, there is a strong negative linear relationship between daily rainfall and particulate removed.
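The column totals and the final value can be checked with a short script. This is a sketch that follows the same sum-based steps as the solution; the variable names are illustrative, and small floating-point differences from the hand-rounded figures are to be expected.

```python
from math import sqrt

rain = [4.3, 4.5, 5.9, 5.6, 6.1, 5.2, 3.8, 2.1, 7.5]    # x, daily rainfall
removed = [126, 121, 116, 118, 114, 118, 132, 141, 108]  # y, particulate removed
n = len(rain)

# Column totals, matching the "Total" row of the table.
sum_x = sum(rain)
sum_y = sum(removed)
sum_x2 = sum(x * x for x in rain)
sum_y2 = sum(y * y for y in removed)
sum_xy = sum(x * y for x, y in zip(rain, removed))

# Sum-based sample variances and covariance (n - 1 denominator).
s_x2 = (sum_x2 - sum_x ** 2 / n) / (n - 1)
s_y2 = (sum_y2 - sum_y ** 2 / n) / (n - 1)
s_xy = (sum_xy - sum_x * sum_y / n) / (n - 1)

r = s_xy / sqrt(s_x2 * s_y2)
print(f"s_x^2 = {s_x2:.4f}, s_y^2 = {s_y2:.4f}, s_xy = {s_xy:.4f}, r = {r:.4f}")
```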