Calculate Correlation and P-Value

Statistics, Tutorials

11 Feb 2024
by alasbahimoha

To calculate the correlation and p-value between two variables, x and y, we first need to understand what these terms mean. Correlation measures how closely two variables move together. If you think of x and y as friends, how often do they move in the same direction when they play? The p-value tells us how likely it is that we would see a friendship (correlation) this strong just by chance, if x and y were not actually related.

Let’s imagine x and y are scores in two different games. To find out how much they move together, we can use a formula, but for now, let’s just think about the steps:

Collect Scores: Write down x and y’s scores in each game they played.
Calculate Correlation: Use a tool or formula to see how much they move together. A calculator or computer can help us here.
Find P-value: This step also involves a formula or tool to see if their movement together is real or just by chance.

Let’s start with some simple scores for x and y. Suppose x and y played 5 games, and their scores are as follows:

Game 1: x = 2, y = 3
Game 2: x = 4, y = 5
Game 3: x = 6, y = 7
Game 4: x = 8, y = 9
Game 5: x = 10, y = 11

Calculate the Correlation Coefficient (r):

This measures the strength and direction of a linear relationship between two variables. To compute r, take each pair of scores, subtract the respective means, multiply the two differences together, and sum the products. Then divide that sum by the square root of the product of the summed squared deviations of x and of y. (Equivalently, divide by the product of the sample standard deviations of x and y times the number of pairs minus one.) This gives us a value between -1 and 1.

r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}

Here’s what each symbol means:

x_i and y_i are the individual sample points for variables x and y, respectively.
\bar{x} and \bar{y} are the means (averages) of the x and y samples, respectively.
The summation (∑) indicates that you sum over all sample points.

Let’s calculate r step by step for our example:

Game scores were:
- x=[2,4,6,8,10]
- y=[3,5,7,9,11]

First, we calculate the means of x and y, then apply the formula for r:

Mean of x (\bar{x}) = (2 + 4 + 6 + 8 + 10) / 5 = 6.0
Mean of y (\bar{y}) = (3 + 5 + 7 + 9 + 11) / 5 = 7.0

Using these means in our formula, the correlation coefficient (r) comes out to 1.0. This shows a perfect positive linear relationship between x and y: every step increase in x is matched by an equal step increase in y.

# Game scores
x = [2, 4, 6, 8, 10]
y = [3, 5, 7, 9, 11]

# Calculate the means
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Calculate the components of the correlation formula
numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
denominator_x = sum((xi - mean_x)**2 for xi in x)
denominator_y = sum((yi - mean_y)**2 for yi in y)
denominator = (denominator_x * denominator_y)**0.5

# Calculate r
r_calculated = numerator / denominator

print(mean_x, mean_y, r_calculated)  # 6.0 7.0 1.0
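
To see where the value 1.0 comes from, the same arithmetic can be written out by hand for the five games, using the means 6 and 7:

```latex
\begin{aligned}
x_i - \bar{x} &: \; -4,\ -2,\ 0,\ 2,\ 4 \\
y_i - \bar{y} &: \; -4,\ -2,\ 0,\ 2,\ 4 \\
\sum (x_i - \bar{x})(y_i - \bar{y}) &= 16 + 4 + 0 + 4 + 16 = 40 \\
\sum (x_i - \bar{x})^2 &= 40, \qquad \sum (y_i - \bar{y})^2 = 40 \\
r &= \frac{40}{\sqrt{40 \times 40}} = \frac{40}{40} = 1
\end{aligned}
```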

Transform the Correlation to a T-Statistic:

Once we have the correlation coefficient (r), we transform it into a t-statistic. This is done using the formula:

t = r \times \sqrt{\frac{n-2}{1-r^2}}

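where n is the number of pairs (5 in our case), r is the correlation coefficient, and t is the t-statistic. This t-statistic tells us how much the observed correlation deviates from no correlation (0) in units of standard error. Note that when r is exactly 1 or -1, as in our five-game example, the denominator 1 - r^2 is zero, so t is unbounded and the p-value is effectively 0.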
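
The transformation can be sketched in a few lines of Python. The function name `correlation_t_statistic` and the example values r = 0.8, n = 10 are hypothetical, chosen only for illustration. Returning infinity for a perfect correlation is one possible way to handle the division by zero, not the only one.

```python
import math

def correlation_t_statistic(r, n):
    """Convert a Pearson correlation r from n pairs into a t-statistic."""
    if n < 3:
        raise ValueError("need at least 3 pairs so that n - 2 > 0")
    if abs(r) >= 1.0:
        # 1 - r^2 is zero for a perfect correlation, so t is unbounded
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Hypothetical example: r = 0.8 from n = 10 pairs
t_example = correlation_t_statistic(0.8, 10)
print(round(t_example, 3))

# The five-game example has r = 1.0, so t is infinite
t_games = correlation_t_statistic(1.0, 5)
print(t_games)
```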

Calculate the P-value:

The t-statistic is then used to calculate the p-value. The p-value is the probability of observing a correlation as strong as the one calculated (or stronger) if there were actually no correlation between the variables. This involves comparing the t-statistic to a t-distribution (a type of probability distribution used in statistics) with n−2 degrees of freedom. The area under the curve of this distribution beyond the t-statistic gives us the p-value: the area in one tail for a one-tailed test, or in both tails (beyond +|t| and below −|t|) for a two-tailed test.
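
The "area beyond the t-statistic" can be approximated numerically without any statistics library. The sketch below integrates the t-distribution's density with Simpson's rule. The function names and the example values (t = 2.5, 8 degrees of freedom) are illustrative assumptions, and dedicated statistical software will generally be more accurate and convenient.

```python
import math

def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    # lgamma avoids overflow of the gamma function for large df
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value: area in both tails beyond |t| (Simpson's rule)."""
    t = abs(t)
    if t == 0:
        return 1.0
    if math.isinf(t):
        return 0.0
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * (h / 3) * total  # area between -|t| and +|t|
    return 1 - central_area

# Hypothetical example: t = 2.5 with 8 degrees of freedom
p = two_tailed_p(2.5, 8)
print(round(p, 4))
```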

Use a T-Distribution Table

With the t-statistic calculated and your degrees of freedom determined, the next step is to use a t-distribution table to find the p-value. T-distribution tables provide critical values for t-tests at different significance levels (e.g., α=0.05, α=0.01) and degrees of freedom.

Find the P-value

To find the p-value by hand:

Locate the row in the t-distribution table that corresponds to your degrees of freedom (df=n−2).
Look across this row to find the range within which your calculated t-statistic falls.
The columns represent significance levels. Your p-value is the smallest level of significance at which you would still reject the null hypothesis.

For a two-tailed test (testing for any correlation, positive or negative), you might need to double the one-tail p-value you find, depending on how the table is formatted.

Simplified Example

Let’s say you calculated a t-statistic of 2.5 with 8 degrees of freedom. In a t-table, you’d find the row for 8 degrees of freedom and look for the two critical values that bracket 2.5. For a two-tailed test, that row lists 2.306 for α=0.05 and 3.355 for α=0.01. Since 2.5 falls between them, your p-value is between 0.01 and 0.05. For more precision, statistical software or a calculator with statistical functions is recommended.
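
The table lookup itself can be sketched as a small function. The dictionary below holds the two-tailed critical values for 8 degrees of freedom as printed in standard t-tables; the function name `bracket_p_value` is hypothetical. Because this row also includes the α=0.02 column, the bracket it returns is slightly tighter than the 0.01 to 0.05 range above.

```python
# Two-tailed critical values for df = 8, as printed in standard t-tables
CRITICAL_VALUES_DF8 = {0.10: 1.860, 0.05: 2.306, 0.02: 2.896, 0.01: 3.355}

def bracket_p_value(t, table):
    """Return (lower, upper) bounds on the two-tailed p-value from a table row."""
    t = abs(t)
    upper = 1.0  # p is at most 1 if t is below every critical value
    lower = 0.0  # p can be arbitrarily small if t exceeds every critical value
    # Walk from the largest alpha (smallest critical value) to the smallest
    for alpha in sorted(table, reverse=True):
        if t >= table[alpha]:
            upper = alpha  # t reaches this critical value, so p <= alpha
        else:
            lower = alpha  # t falls short of it, so p > alpha
            break
    return lower, upper

low, high = bracket_p_value(2.5, CRITICAL_VALUES_DF8)
print(f"{low} < p <= {high}")
```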

Sample size and the P-value:

When n, the sample size, is more than 30, the decision between using a t-statistic or another method to calculate the p-value depends on what you’re testing and the assumptions you can make about your data.

For correlation and p-value calculations specifically, the method doesn’t change much between small and large samples. The formula to calculate the correlation coefficient (r) and its significance (p-value) remains the same. However, the interpretation and the distribution used to calculate the p-value might adjust slightly based on the size of your sample and the normality of your data.

Small Samples (n < 30): For small samples, the t-distribution is used because it accounts for the additional uncertainty in the estimate of the standard deviation. The t-distribution has heavier tails than the normal distribution, reflecting the increased variability expected in smaller samples.
Large Samples (n ≥ 30): For larger samples, the difference between the t-distribution and the normal distribution becomes negligible, because the t-distribution itself approaches the normal distribution as the degrees of freedom grow. (The cutoff of 30 is a common rule of thumb, not a sharp boundary.) In practice, for correlation coefficients and many other statistics, the method of calculating the p-value doesn’t change significantly between small and large samples because the calculation already accounts for the sample size.

The formula for transforming the correlation coefficient into a t-statistic and then using that to find a p-value is applicable for both small and large samples because it inherently adjusts for the size of the sample through the degrees of freedom (n – 2 in the formula).
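
To illustrate how the degrees of freedom do the adjusting, the sketch below compares two-tailed p-values for the same test statistic under t-distributions with different degrees of freedom and under the normal distribution. The statistic t = 2.0 and the chosen degrees of freedom are hypothetical, and the t-distribution area is approximated with Simpson's rule rather than taken from a statistics library.

```python
import math

def t_two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value under Student's t, via Simpson's rule on the density."""
    t = abs(t)
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)

    def pdf(u):
        return coef * (1 + u * u / df) ** (-(df + 1) / 2)

    h = t / steps  # steps must be even for Simpson's rule
    total = pdf(0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 1 - 2 * (h / 3) * total

def normal_two_tailed_p(z):
    """Two-tailed p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

t_value = 2.0  # hypothetical test statistic
for df in (3, 28, 1000):
    print("df =", df, "p =", round(t_two_tailed_p(t_value, df), 4))
print("normal p =", round(normal_two_tailed_p(t_value), 4))
```

The p-value shrinks toward the normal-distribution value as the degrees of freedom increase, which is why the same formula serves both small and large samples.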

For very large datasets, the p-value can become very small for even trivial differences or correlations because the statistical tests have a lot of power to detect even tiny effects. Therefore, it’s also important to consider the practical significance of the findings, not just the statistical significance indicated by the p-value.
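
A small numerical sketch makes this concrete. It holds a practically negligible correlation fixed at r = 0.05 (a hypothetical value) and increases the sample size. As a simplifying assumption, the p-value is approximated with the normal distribution instead of the t-distribution, which is only reasonable for large n.

```python
import math

def approx_p_value(r, n):
    """Approximate two-tailed p-value for a correlation r from n pairs.

    Assumption: uses the normal approximation to the t-distribution,
    which is reasonable only when n is large.
    """
    t = r * math.sqrt((n - 2) / (1 - r ** 2))
    return math.erfc(abs(t) / math.sqrt(2))

r = 0.05  # hypothetical, practically negligible correlation
for n in (100, 1_000, 10_000):
    print("n =", n, "p ≈", approx_p_value(r, n))
```

The correlation never changes, yet the p-value drops sharply as n grows, which is why the size of r should be judged alongside its p-value.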

Key Highlights: