Calculate Correlation and P-Value

  • 11 Feb 2024
  • by alasbahimoha

To calculate correlation and p-value between two variables, x and y, we first need to understand what these terms mean. Correlation measures how closely two variables move together. If you think of x and y as friends, how often do they move in the same direction when they play? The p-value tells us if the friendship (correlation) we see is real or just happened by chance.

Let’s imagine x and y are scores in two different games. To find out how much they move together, we can use a formula, but for now, let’s just think about the steps:

  1. Collect Scores: Write down x and y’s scores in each game they played.
  2. Calculate Correlation: Use a tool or formula to see how much they move together. A calculator or computer can help us here.
  3. Find P-value: This step also involves a formula or tool to see if their movement together is real or just by chance.

Let’s start with some simple scores for x and y. Suppose x and y played 5 games, and their scores are as follows:

  • Game 1: x = 2, y = 3
  • Game 2: x = 4, y = 5
  • Game 3: x = 6, y = 7
  • Game 4: x = 8, y = 9
  • Game 5: x = 10, y = 11

Calculate the Correlation Coefficient (r):

This measures the strength and direction of a linear relationship between two variables. The formula for r takes each pair of scores, subtracts the respective means, multiplies these differences together, sums the products, and divides by the square root of the product of the two sums of squared differences. (Equivalently, you can divide the summed products by the standard deviations of x and y times the number of pairs minus one.) This gives us a value between -1 and 1.

r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}

Here’s what each symbol means:

  • x_i and y_i are the individual sample points for variables x and y, respectively.
  • x̄ and ȳ are the means (averages) of the x and y samples, respectively.
  • The summation (∑) indicates that you sum over all sample points.
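The equivalent form mentioned above (summed products divided by the two sample standard deviations times n − 1) can be checked with a short sketch. The function name `pearson_r_via_stdev` is made up for illustration, and the sketch assumes the sample standard deviation (the n − 1 version) that Python's `statistics.stdev` computes.

```python
from statistics import mean, stdev

def pearson_r_via_stdev(x, y):
    """Pearson r as summed cross-products over (n - 1) * s_x * s_y.

    Hypothetical helper for illustration; stdev() is the sample
    standard deviation (divides by n - 1).
    """
    n = len(x)
    mean_x, mean_y = mean(x), mean(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cross / ((n - 1) * stdev(x) * stdev(y))

# Game scores from the article
x = [2, 4, 6, 8, 10]
y = [3, 5, 7, 9, 11]
r = pearson_r_via_stdev(x, y)
print(round(r, 6))
```

Both forms give the same r up to floating-point rounding, because the (n − 1) factors cancel against the ones hidden inside the standard deviations.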

Let’s calculate r step by step for our example:

  • Game scores were:
    • x=[2,4,6,8,10]
    • y=[3,5,7,9,11]

First, we calculate the means of x and y:

  • Mean of x (x̄) = 6.0
  • Mean of y (ȳ) = 7.0

Using these means in the formula gives r = 1.0, which shows a perfect positive linear relationship between x and y. Every step increase in x is matched by an equal step increase in y.
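For readers who want to see the arithmetic, here is the same calculation written out term by term for the five games. Nothing is assumed beyond the scores listed earlier.

```latex
\bar{x} = \frac{2+4+6+8+10}{5} = 6, \qquad \bar{y} = \frac{3+5+7+9+11}{5} = 7

\sum (x_i-\bar{x})(y_i-\bar{y}) = (-4)(-4) + (-2)(-2) + (0)(0) + (2)(2) + (4)(4) = 40

\sum (x_i-\bar{x})^2 = 16 + 4 + 0 + 4 + 16 = 40, \qquad \sum (y_i-\bar{y})^2 = 16 + 4 + 0 + 4 + 16 = 40

r = \frac{40}{\sqrt{40 \times 40}} = \frac{40}{40} = 1
```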

# Game scores from the example above
x = [2, 4, 6, 8, 10]
y = [3, 5, 7, 9, 11]

# Calculate the means
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Calculate the components of the correlation formula
numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
denominator_x = sum((xi - mean_x)**2 for xi in x)
denominator_y = sum((yi - mean_y)**2 for yi in y)
denominator = (denominator_x * denominator_y)**0.5

# Calculate r
r_calculated = numerator / denominator

# Show the means and the correlation coefficient
print(mean_x, mean_y, r_calculated)

Transform the Correlation to a T-Statistic:

Once we have the correlation coefficient (r), we transform it into a t-statistic. This is done using the formula:

t = r \times \sqrt{\frac{n-2}{1-r^2}}

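where n is the number of pairs (5 in our case), r is the correlation coefficient, and t is the t-statistic. This t-statistic tells us how far the observed correlation is from no correlation (0), in units of standard error. Note that for our example r = 1, so 1 − r² = 0 and the formula divides by zero: a perfect correlation gives an infinitely large t and a p-value of 0.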
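As a rough sketch, the transformation can be written as a small function. The name `t_from_r` and the example values r = 0.5, n = 10 are hypothetical, chosen only to show a finite result next to the perfect-correlation case.

```python
import math

def t_from_r(r, n):
    """Convert a Pearson correlation r from n pairs into a t-statistic."""
    if n < 3:
        raise ValueError("need at least 3 pairs so that n - 2 > 0")
    if abs(r) >= 1.0:
        # 1 - r**2 is zero here, so t is unbounded; return signed infinity
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1.0 - r ** 2))

t_example = t_from_r(0.5, 10)  # hypothetical r and n, for illustration
t_perfect = t_from_r(1.0, 5)   # the five-game data, where r = 1
print(t_example, t_perfect)
```

Returning infinity for |r| = 1 is one possible design choice; raising an error would be just as defensible.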

Calculate the P-value:

The t-statistic is then used to calculate the p-value. The p-value is the probability of observing a correlation as strong as the one calculated (or stronger) if there were actually no correlation between the variables. This involves comparing the t-statistic to a t-distribution (a type of probability distribution used in statistics) with n−2 degrees of freedom. The area under the curve of this distribution beyond the t-statistic (in both tails, for a two-tailed test) gives us the p-value.
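The "area beyond the t-statistic" idea can be sketched with the standard library alone by integrating the t-distribution's density numerically. The function names below are illustrative, and Simpson's rule is just one convenient way to get the area; statistical libraries such as SciPy provide this calculation directly.

```python
import math

def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_coef = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef - (df + 1) / 2 * math.log1p(t * t / df))

def two_tailed_p(t_stat, df, steps=2000):
    """Two-tailed p-value: area under the t curve beyond +/-|t_stat|."""
    t_abs = abs(t_stat)
    if math.isinf(t_abs):
        return 0.0  # perfect correlation: no area left in the tails
    # Simpson's rule for the area between 0 and |t| (steps must be even)
    h = t_abs / steps
    total = t_pdf(0.0, df) + t_pdf(t_abs, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * (total * h / 3)  # area between -|t| and +|t|
    return max(0.0, 1.0 - central_area)

# Hypothetical example: t = 2.5 with 8 degrees of freedom
p_example = two_tailed_p(2.5, 8)
print(p_example)
```

For t = 2.5 and 8 degrees of freedom, this gives a two-tailed p-value between 0.03 and 0.04.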

Use a T-Distribution Table

With the t-statistic calculated and your degrees of freedom determined, the next step is to use a t-distribution table to bracket the p-value. T-distribution tables provide critical values for t-tests at different significance levels (e.g., α=0.05, α=0.01) and degrees of freedom, so a table gives you a range for the p-value rather than an exact number.

Find the P-value

To find the p-value by hand:

  • Locate the row in the t-distribution table that corresponds to your degrees of freedom (df=n−2).
  • Look across this row to find the range within which your calculated t-statistic falls.
  • The columns represent significance levels. Your p-value is the smallest level of significance at which you would still reject the null hypothesis.

For a two-tailed test (testing for any correlation, positive or negative), you might need to double the one-tail p-value you find, depending on how the table is formatted.

Example Simplification

Let’s say you calculated a t-statistic of 2.5 with 8 degrees of freedom. In a t-table, you’d find the row for 8 degrees of freedom and look for the critical values on either side of 2.5. For a two-tailed test, these are 2.306 (α=0.05) and 3.355 (α=0.01). Since 2.5 falls between them, your p-value is between 0.01 and 0.05. For more precision, statistical software or a calculator with statistical functions is recommended.
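The table lookup in this example can be mimicked in a few lines. The row below holds the two-tailed critical values for 8 degrees of freedom as printed in standard t-tables; the names `T_TABLE_DF8` and `bracket_p_value` are made up for this sketch.

```python
# Two-tailed critical values for df = 8, listed as (alpha, critical t),
# ordered from the largest alpha to the smallest.
T_TABLE_DF8 = [(0.10, 1.860), (0.05, 2.306), (0.02, 2.896), (0.01, 3.355)]

def bracket_p_value(t_stat, row):
    """Return (lower, upper) bounds on the two-tailed p-value from one table row."""
    t_abs = abs(t_stat)
    upper = 1.0
    for alpha, critical in row:
        if t_abs >= critical:
            upper = alpha          # significant at this level, so p <= alpha
        else:
            return alpha, upper    # not significant here, so p > alpha
    return 0.0, upper              # beyond every critical value in the row

low, high = bracket_p_value(2.5, T_TABLE_DF8)
print(f"{low} < p <= {high}")
```

With the extra α=0.02 column, the table narrows the range to between 0.02 and 0.05, which is consistent with the wider 0.01 to 0.05 range above.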

Sample size and the P-value:

When n, the sample size, is more than 30, the decision between using a t-statistic or another method to calculate the p-value depends on what you’re testing and the assumptions you can make about your data.

For correlation and p-value calculations specifically, the method doesn’t change much between small and large samples. The formula to calculate the correlation coefficient (r) and its significance (p-value) remains the same. However, the interpretation and the distribution used to calculate the p-value might adjust slightly based on the size of your sample and the normality of your data.

  1. Small Samples (n < 30): For small samples, the t-distribution is used because it accounts for the additional uncertainty in the estimate of the standard deviation. The t-distribution is wider than the normal distribution, reflecting the increased variability expected in smaller samples.
  2. Large Samples (n ≥ 30): For larger samples, the difference between the t-distribution and the normal distribution becomes negligible: as the degrees of freedom grow, the standard deviation is estimated more precisely and the t-distribution converges to the normal distribution. (The central limit theorem, which states that the distribution of sample means will approximate a normal distribution as the sample size becomes large, is what makes such tests more robust to non-normal data in large samples.) In practice, for correlation coefficients and many other statistics, the method of calculating the p-value doesn’t change significantly between small and large samples because the calculation already accounts for the sample size.

The formula for transforming the correlation coefficient into a t-statistic and then using that to find a p-value is applicable for both small and large samples because it inherently adjusts for the size of the sample through the degrees of freedom (n – 2 in the formula).
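To see how the degrees of freedom absorb the sample size, the sketch below compares two-tailed p-values for the same t-statistic under t-distributions with 5, 30, and 1000 degrees of freedom and under the normal distribution. The value t = 2.0 and the function names are assumptions made for illustration, and the t-distribution area is obtained by simple numerical integration.

```python
import math
from statistics import NormalDist

def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_coef = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef - (df + 1) / 2 * math.log1p(t * t / df))

def t_two_tailed_p(t_stat, df, steps=2000):
    """Two-tailed p-value from the t-distribution via Simpson's rule."""
    t_abs = abs(t_stat)
    h = t_abs / steps
    total = t_pdf(0.0, df) + t_pdf(t_abs, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2 * (total * h / 3))

def normal_two_tailed_p(z):
    """Two-tailed p-value from the standard normal distribution."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

t_stat = 2.0  # hypothetical test statistic
p_by_df = {df: t_two_tailed_p(t_stat, df) for df in (5, 30, 1000)}
p_normal = normal_two_tailed_p(t_stat)

for df, p in p_by_df.items():
    print(f"df={df}: p={p:.4f}")
print(f"normal: p={p_normal:.4f}")
```

The p-value shrinks toward the normal-distribution value as the degrees of freedom increase, which is why the same formula works for both small and large samples.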

For very large datasets, the p-value can become very small for even trivial differences or correlations because the statistical tests have a lot of power to detect even tiny effects. Therefore, it’s also important to consider the practical significance of the findings, not just the statistical significance indicated by the p-value.
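A quick sketch makes this concrete: the same weak correlation is tested at three sample sizes. The value r = 0.05 and the sample sizes are hypothetical, and the p-value uses the normal approximation to the t-distribution, which is only a rough stand-in for the smallest sample here.

```python
import math
from statistics import NormalDist

def approx_p_from_r(r, n):
    """Approximate two-tailed p-value for a correlation r from n pairs.

    Uses t = r * sqrt((n - 2) / (1 - r^2)) and, as a simplifying
    assumption, the normal distribution in place of the t-distribution.
    """
    t = r * math.sqrt((n - 2) / (1 - r ** 2))
    return 2 * (1 - NormalDist().cdf(abs(t)))

r_small = 0.05  # a weak, practically trivial correlation
p_values = {n: approx_p_from_r(r_small, n) for n in (100, 10_000, 1_000_000)}

for n, p in p_values.items():
    print(f"n={n}: p={p:.3g}")
```

With 100 pairs the correlation is nowhere near significant, while with 10,000 pairs or more it is highly significant, even though r explains only 0.25% of the variance (r² = 0.0025) in every case.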

Key Highlights: