To calculate the correlation and p-value between two variables, x and y, we first need to understand what these terms mean. Correlation measures how closely two variables move together. If you think of x and y as friends, how often do they move in the same direction when they play? The p-value tells us how likely it is that we would see a friendship (correlation) at least this strong purely by chance if x and y were actually unrelated.
Let’s imagine x and y are scores in two different games. To find out how much they move together, we can use a formula, but for now, let’s just think about the steps:
Let’s start with some simple scores for x and y. Suppose x and y played 5 games, and their scores are as follows:
The correlation coefficient (r) measures the strength and direction of a linear relationship between two variables. Calculating it is the first step, and it is the value we already obtained. The formula for r involves taking each pair of scores, subtracting their means, multiplying these differences together, summing the products, and then dividing by the product of the standard deviations of x and y times the number of pairs minus one. This gives us a value between -1 and 1.
r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}
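Here’s what each symbol means: x_i and y_i are the individual scores in the i-th pair, \bar{x} and \bar{y} are the means of x and y, and ∑ means we add up the expression over all pairs.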
Let’s calculate r step by step for our example:
First, we calculate the means of x and y (\bar{x} and \bar{y}), then plug them into the formula for r.
To calculate the correlation coefficient (r) step by step, we first found the means:
Using these means in our formula, we calculated the correlation coefficient (r) and found it to be 1.0. This confirms our earlier result using a statistical function and shows a perfect positive linear relationship between x and y. Every increase in x is matched by a proportional increase in y.
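# x and y are the lists of scores from the example above
# Calculate the means
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)
# Calculate the components of the correlation formula
# Numerator: sum of the products of paired deviations from the means
numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
# Denominator: square root of the product of the two sums of squared deviations
denominator_x = sum((xi - mean_x)**2 for xi in x)
denominator_y = sum((yi - mean_y)**2 for yi in y)
denominator = (denominator_x * denominator_y)**0.5
# Calculate r
r_calculated = numerator / denominator
print(mean_x, mean_y, r_calculated)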
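The description of r given earlier (dividing by the standard deviations times the number of pairs minus one) and the sum-of-squares formula are two ways of writing the same quantity. The sketch below checks this numerically. The scores in it are hypothetical values chosen only for illustration, not the scores from the example, and the variable names `r_std_form` and `r_sum_form` are made up for this check.

```python
import math
import statistics

# Hypothetical scores, chosen only for illustration (not the article's data).
x = [2, 4, 5, 7, 9]
y = [1, 3, 6, 6, 10]

n = len(x)
mean_x = statistics.mean(x)
mean_y = statistics.mean(y)

# Sum of the products of paired deviations from the means.
sum_products = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Form 1: divide by (n - 1) times the two sample standard deviations.
r_std_form = sum_products / ((n - 1) * statistics.stdev(x) * statistics.stdev(y))

# Form 2: divide by the square root of the product of the sums of squares.
ss_x = sum((xi - mean_x) ** 2 for xi in x)
ss_y = sum((yi - mean_y) ** 2 for yi in y)
r_sum_form = sum_products / math.sqrt(ss_x * ss_y)

print(r_std_form, r_sum_form)
```

The two printed values should agree up to floating-point rounding, because the (n − 1) factors inside the sample standard deviations cancel against the (n − 1) in the denominator.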
Once we have the correlation coefficient (r), we transform it into a t-statistic. This is done using the formula:
t = r \times \sqrt{\frac{n-2}{1-r^2}}
where n is the number of pairs (5 in our case), r is the correlation coefficient, and t is the t-statistic. This t-statistic tells us how much the observed correlation deviates from no correlation (0) in units of standard error. Note that when r is exactly 1 or -1, as in our example, 1 − r² is zero, so t is unbounded and the p-value is effectively 0.
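As a minimal sketch of this transformation, the function below applies the formula directly. The name `t_statistic` and the choice to return a signed infinity when r is exactly ±1 are assumptions made for this illustration; other tools may handle that edge case differently.

```python
import math

def t_statistic(r: float, n: int) -> float:
    """Convert a correlation coefficient r from n pairs into a t-statistic."""
    if n < 3:
        raise ValueError("need at least 3 pairs so that n - 2 > 0")
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    if abs(r) == 1.0:
        # 1 - r**2 is zero here, so the ratio is unbounded;
        # this sketch reports a signed infinity instead of dividing by zero.
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Hypothetical inputs: a moderate correlation from 27 pairs,
# and the perfect correlation from 5 pairs discussed in the text.
print(t_statistic(0.5, 27))
print(t_statistic(1.0, 5))
```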
The t-statistic is then used to calculate the p-value. The p-value is the probability of observing a correlation as strong as the one calculated (or stronger) if there were actually no correlation between the variables. This involves comparing the t-statistic to a t-distribution (a type of probability distribution used in statistics) with n−2 degrees of freedom. The area under the curve of this distribution beyond the t-statistic gives us the p-value; for a two-tailed test, that means the area in both tails, beyond +|t| and beyond −|t|.
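To make the "area under the curve" idea concrete, here is a rough sketch that uses only the Python standard library. It integrates the t-distribution density numerically with Simpson's rule instead of calling a statistics package, which is an assumption of this illustration. The function names `t_pdf` and `two_tailed_p` and the `steps` parameter are hypothetical.

```python
import math

def t_pdf(t: float, df: int) -> float:
    """Density of Student's t-distribution with df degrees of freedom."""
    # lgamma avoids overflow of the gamma function for large df.
    log_coeff = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coeff = math.exp(log_coeff) / math.sqrt(df * math.pi)
    return coeff * (1 + t * t / df) ** (-(df + 1) / 2)

def two_tailed_p(t_stat: float, df: int, steps: int = 10_000) -> float:
    """Two-tailed p-value: total area in both tails beyond |t_stat|."""
    upper = abs(t_stat)
    if math.isinf(upper):
        return 0.0
    if steps % 2:
        steps += 1  # Simpson's rule needs an even number of intervals
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3
    # The density is symmetric, so the two tails hold 1 - 2 * area(0..|t|).
    return max(0.0, 1.0 - 2.0 * area_zero_to_t)

# Hypothetical inputs: t = 2.5 with 8 degrees of freedom.
print(two_tailed_p(2.5, 8))
```

In practice a statistics library would normally be used for this step; the numerical integration here only shows where the number comes from.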
With the t-statistic calculated and your degrees of freedom determined, the next step is to use a t-distribution table to find the p-value. T-distribution tables provide critical values for t-tests at different significance levels (e.g., α=0.05, α=0.01) and degrees of freedom.
To find the p-value by hand:
For a two-tailed test (testing for any correlation, positive or negative), you might need to double the one-tail p-value you find, depending on how the table is formatted.
Let’s say you calculated a t-statistic of 2.5 with 8 degrees of freedom. In a t-table, you’d find the row for 8 degrees of freedom and look for the value closest to 2.5. If 2.5 falls between the critical values for α=0.05 and α=0.01, then your p-value is between 0.01 and 0.05. For more precision, statistical software or a calculator with statistical functions is recommended.
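The table lookup just described can be sketched in a few lines. The critical values below are the two-tailed values for 8 degrees of freedom as they appear in standard t-tables; it is worth checking them against the table you actually use. The dictionary `CRITICAL_T_DF8` and the function `bracket_p_value` are hypothetical names for this illustration.

```python
# Two-tailed critical values of the t-distribution for 8 degrees of freedom
# (alpha -> critical t), as listed in standard t-tables.
CRITICAL_T_DF8 = {0.10: 1.860, 0.05: 2.306, 0.02: 2.896, 0.01: 3.355}

def bracket_p_value(t_stat, table):
    """Return (lower, upper) bounds on the two-tailed p-value from a table row."""
    t_abs = abs(t_stat)
    # If |t| reaches the critical value for alpha, then p <= alpha.
    exceeded = [alpha for alpha, crit in table.items() if t_abs >= crit]
    # If |t| falls short of the critical value for alpha, then p > alpha.
    not_exceeded = [alpha for alpha, crit in table.items() if t_abs < crit]
    upper = min(exceeded) if exceeded else 1.0
    lower = max(not_exceeded) if not_exceeded else 0.0
    return lower, upper

low, high = bracket_p_value(2.5, CRITICAL_T_DF8)
print(f"p-value is between {low} and {high}")
```

Because this table also has an α=0.02 column, the bracket it produces for t = 2.5 is a little tighter than the 0.01 to 0.05 range you would get from a table with only the α=0.05 and α=0.01 columns.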
When n, the sample size, is more than 30, the decision between using a t-statistic or another method to calculate the p-value depends on what you’re testing and the assumptions you can make about your data.
For correlation and p-value calculations specifically, the method doesn’t change much between small and large samples. The formula to calculate the correlation coefficient (r) and its significance (p-value) remains the same. However, as the sample grows, the t-distribution with n−2 degrees of freedom gets closer and closer to the standard normal distribution, so for large n a normal approximation gives nearly the same p-value. The test also assumes the data are roughly normally distributed, and that assumption matters most when the sample is small.
The formula for transforming the correlation coefficient into a t-statistic and then using that to find a p-value is applicable for both small and large samples because it inherently adjusts for the size of the sample through the degrees of freedom (n – 2 in the formula).
For very large datasets, the p-value can become very small for even trivial differences or correlations because the statistical tests have a lot of power to detect even tiny effects. Therefore, it’s also important to consider the practical significance of the findings, not just the statistical significance indicated by the p-value.
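The effect of sample size can be seen by holding a tiny correlation fixed and letting n grow. The sketch below uses a hypothetical r of 0.01 and a normal approximation to the p-value, which is an assumption that is only reasonable because the sample sizes here are large. The function names are made up for this example.

```python
import math

def t_statistic(r, n):
    """t-statistic for a correlation r computed from n pairs (|r| < 1)."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

def approx_two_tailed_p(t_stat):
    """Normal approximation to the two-tailed p-value (for large n only)."""
    return math.erfc(abs(t_stat) / math.sqrt(2))

r = 0.01  # hypothetical, practically negligible correlation
results = {}
for n in (100, 10_000, 1_000_000):
    t = t_statistic(r, n)
    results[n] = approx_two_tailed_p(t)
    print(f"n={n:>9,}  t={t:.3f}  approx p={results[n]:.3g}")
```

The correlation is the same in every row, yet the p-value shrinks as n grows, which is why a small p-value on a very large dataset says little about whether the relationship is large enough to matter.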