In this notebook we'll go over some of the theory and mechanics of calculating p-values and confidence intervals for hypothesis testing in the two-sample case involving continuous (cardinal) data.
import pandas as pd
import io
import requests
from IPython.display import display, HTML
import numpy as np
import matplotlib.pyplot as plt
import scipy
%matplotlib inline
pd.options.display.max_columns = 50
From Section 8.2 of Fundamentals of Biostatistics by Bernard Rosner, 8th Edition
x1 = np.array([115,112,107,119,115,138,126,105,104,115])
x2 = np.array([128,115,106,128,122,145,132,109,102,117])
diff = x2 - x1
print(diff)
[13 3 -1 9 7 7 6 4 -2 2]
# note: np.std defaults to the population standard deviation (ddof=0);
# pass ddof=1 for the sample standard deviation
print(np.std(x1), np.std(x2))
9.77957054271812 12.547509713086498
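One caveat worth flagging (my own aside, not from the text): `np.std` divides by n by default, which gives the population standard deviation; the sample standard deviation divides by n − 1. A quick demonstration:

```python
import numpy as np

x1 = np.array([115, 112, 107, 119, 115, 138, 126, 105, 104, 115])

pop_sd = np.std(x1)             # divides by n (ddof=0, the default)
sample_sd = np.std(x1, ddof=1)  # divides by n - 1, the usual sample estimator
print(pop_sd, sample_sd)

# the sample estimator is always slightly larger for finite n
assert sample_sd > pop_sd
```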
from scipy.stats import ttest_rel
For a paired t-test
H0: ∆ = 0
H1: ∆ != 0
Perform a two-sided t test
statistic, pvalue = ttest_rel(x1, x2)
print(statistic, pvalue)
-3.324651095085193 0.008874336881492044
p < 0.05, so at the 5% significance level we can conclude that the two (related) samples come from different populations, and that, in this case, the oral contraceptive does indeed seem to affect blood pressure levels. In other words, we reject the null hypothesis in favor of the alternative.
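A useful identity worth checking here (my own sanity check, not from the text): a paired t-test on x1 and x2 is exactly a one-sample t-test on the vector of differences against a population mean of 0.

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp

x1 = np.array([115, 112, 107, 119, 115, 138, 126, 105, 104, 115])
x2 = np.array([128, 115, 106, 128, 122, 145, 132, 109, 102, 117])

# paired t-test vs. one-sample t-test on the differences
t_rel, p_rel = ttest_rel(x2, x1)
t_1s, p_1s = ttest_1samp(x2 - x1, 0)

# the two formulations give identical statistics and p-values
assert np.isclose(t_rel, t_1s) and np.isclose(p_rel, p_1s)
```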
(x1 - x2).mean()
-4.8
So (x1 - x2).mean() is -4.8: the average blood pressure increased by 4.8 units from sample x1 to sample x2.
len(x1)
10
Even though our sample size is only 10, if the underlying random variable for the blood pressure of this population of women that the sample is taken from is normally distributed, or if our sample size is large enough that the CLT holds, then we can use this t test.
I'm assuming that Dr. Rosner is assuming that the underlying random variable is normally distributed, since our sample size is only 10 here...
From page 282 of Fundamentals of Biostatistics, 8th Edition, by Bernard Rosner:
95% CI = $(\hat{d} − t_{n−1,1−α/2} s_d/\sqrt n, \hat{d} + t_{n−1,1−α/2} s_d/\sqrt n)$
$s_{d}=\sqrt{\sum_{i=1}^{n} (d_{i}-\hat{d})^{2} /(n-1)}$
def std_paired(a, b):
    '''Sample standard deviation of the paired differences b - a.'''
    assert(len(a) == len(b))
    n = len(a)
    d = b - a
    var = np.sum((d - d.mean())**2) / (n - 1)
    std = np.sqrt(var)
    return std
# calculate the sample standard deviation of the paired data
std_paired(x1, x2)
4.565571644870382
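As a quick cross-check (my own addition), s_d is just the sample standard deviation of the difference vector, so `np.std` with `ddof=1` should agree with `std_paired`:

```python
import numpy as np

x1 = np.array([115, 112, 107, 119, 115, 138, 126, 105, 104, 115])
x2 = np.array([128, 115, 106, 128, 122, 145, 132, 109, 102, 117])

# s_d is the sample standard deviation (ddof=1) of the differences
d = x2 - x1
print(np.std(d, ddof=1))  # ~4.5656, matching std_paired(x1, x2)
```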
def ci_paired(a, b, alpha=0.95):
    '''Calculates the confidence interval corresponding to confidence level {alpha} for two related (i.e. "paired")
    vectors a and b.
    Parameters:
    a -- {np.array}
    b -- {np.array}
    alpha -- {float} confidence level, e.g. 0.95 for a 95% CI
    Returns:
    mean_diff -- {float}
    ci -- {tuple}
    '''
    # make sure input vectors are the same length
    assert(len(a) == len(b))
    n = len(a)
    # calculate mean difference
    mean_diff = (b - a).mean()
    # calculate standard deviation of paired samples
    s_d = std_paired(a, b)
    # generate a t distributed random variable
    dof = n - 1 # calculate degrees of freedom
    rv = scipy.stats.t(dof) # instantiate a Student's t continuous random variable
    # get the t statistic for the degrees of freedom and specified confidence level
    t = np.abs(rv.ppf((1 - alpha) / 2))
    # calculate ci
    print(mean_diff, t, s_d, np.sqrt(n))
    ci = (mean_diff - t * (s_d / np.sqrt(n)), mean_diff + t * (s_d / np.sqrt(n)))
    return mean_diff, ci
ci_paired(x1, x2, alpha=0.95)
4.8 2.2621571627409915 4.565571644870382 3.1622776601683795
(4.8, (1.5339867942207275, 8.066013205779273))
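As a sanity check on `ci_paired` (again my own addition, not from the text), scipy can build the same interval directly: `scipy.stats.t.interval` with the mean difference as `loc` and the standard error s_d/√n as `scale`:

```python
import numpy as np
from scipy import stats

x1 = np.array([115, 112, 107, 119, 115, 138, 126, 105, 104, 115])
x2 = np.array([128, 115, 106, 128, 122, 145, 132, 109, 102, 117])
d = x2 - x1
n = len(d)

# stats.sem computes the standard error s_d / sqrt(n) (ddof=1 by default)
lo, hi = stats.t.interval(0.95, df=n - 1, loc=d.mean(), scale=stats.sem(d))
print(lo, hi)  # matches ci_paired's (1.534, 8.066)
```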
From Section 8.4 of Fundamentals of Biostatistics by Bernard Rosner, 8th Edition
Hypertension Suppose a sample of eight 35- to 39-year-old nonpregnant, premenopausal OC users who work in a company and have a mean systolic blood pressure (SBP) of 132.86 mm Hg and sample standard deviation of 15.34 mm Hg are identified. A sample of 21 nonpregnant, premenopausal, non-OC users in the same age group are similarly identified who have mean SBP of 127.44 mm Hg and sample standard deviation of 18.23 mm Hg. What can be said about the underlying mean difference in blood pressure between the two groups?
Assume SBP is normally distributed in the first group with mean $\mu_1$ and variance $\sigma_1^2$ and in the second group with mean $\mu_2$ and variance $\sigma_2^2$. We want to test the hypothesis $H_0: \mu_1 = \mu_2$ vs. $H_1: \mu_1 \neq \mu_2$. Assume in this section that the underlying variances in the two groups are the same (that is, $\sigma_1^2 = \sigma_2^2 = \sigma^2$). The means and variances in the two samples are denoted by $\bar{x}_1$, $\bar{x}_2$, $s_1^2$, $s_2^2$, respectively.
from scipy.stats import ttest_ind, ttest_ind_from_stats
# sample data
mean1 = 132.86
std1 = 15.34
nobs1 = 8
mean2 = 127.44
std2 = 18.23
nobs2 = 21
print(mean2 - mean1)
-5.420000000000016
statistic, pvalue = ttest_ind_from_stats(mean1, std1, nobs1, mean2, std2, nobs2)
print(statistic, pvalue)
0.7443175718105018 0.46311371461667283
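Worth noting (my own aside): `ttest_ind_from_stats` assumes equal variances by default, which matches Rosner's assumption in this section; passing `equal_var=False` gives Welch's t-test instead, which drops that assumption. A quick comparison on the same summary statistics:

```python
from scipy.stats import ttest_ind_from_stats

# pooled (equal-variance) test, as in the text
t_pooled, p_pooled = ttest_ind_from_stats(132.86, 15.34, 8, 127.44, 18.23, 21)

# Welch's test, which does not assume equal variances
t_welch, p_welch = ttest_ind_from_stats(132.86, 15.34, 8, 127.44, 18.23, 21,
                                        equal_var=False)
print(p_pooled, p_welch)  # both well above 0.05 for this example
```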
Even though the mean differences between the two samples in the two examples are similar, we get very different p-values - the paired t-test example rejects H0, whereas the independent t-test example fails to reject H0.
What this illustrates for me is that the paired t-test is much more sensitive to differences than the independent t-test. In other words, you need a much larger difference between two independent samples to detect an effect than you would if the samples were paired. This makes intuitive sense because for paired data, we've (supposedly) accounted for many confounding variables, so we can attribute much more of the variance in the measurement of interest between the two (paired) samples to the intervention.
Estimating the population variance by combining the sample variances - the pooled variance is a weighted average (not a simple sum) of the individual variances, weighted by each sample's degrees of freedom:
pooled estimate of the variance (from page 287 of Fundamentals of Biostatistics):
$$s^{2}=\frac{\left(n_{1}-1\right) s_{1}^{2}+\left(n_{2}-1\right) s_{2}^{2}}{n_{1}+n_{2}-2}$$
take the square root to get the combined standard deviation:
$$s = \sqrt{ \frac{\left(n_{1}-1\right) s_{1}^{2}+\left(n_{2}-1\right) s_{2}^{2}}{n_{1}+n_{2}-2}}$$
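Plugging the summary statistics from the example above into this formula (a quick sketch of my own, not from the text):

```python
import numpy as np

# summary statistics from Rosner's Example 8.4
n1, s1 = 8, 15.34    # OC users
n2, s2 = 21, 18.23   # non-OC users

# pooled variance: weighted average of the sample variances
pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
s = np.sqrt(pooled_var)
print(s)  # ~17.53
```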
As usual, the CI is calculated by taking the estimate (in this case the difference in means from the two samples) and adding/subtracting the t statistic multiplied by the standard error of the difference. (for a full derivation, see page 287 from the text).
95% CI = $(\hat{x}_1 - \hat{x}_2 − t_{n_1+n_2-2,1−α/2} \cdot s\sqrt{1/n_1 + 1/n_2},\ \hat{x}_1 - \hat{x}_2 + t_{n_1+n_2-2,1−α/2} \cdot s\sqrt{1/n_1 + 1/n_2})$
def ci_ind(a, b, alpha=0.95):
    '''Calculates the confidence interval corresponding to confidence level {alpha} for two independent
    samples a and b.
    Parameters:
    a -- {np.array}
    b -- {np.array}
    alpha -- {float} confidence level, e.g. 0.95 for a 95% CI
    Returns:
    mean_diff -- {float}
    ci -- {tuple}
    '''
    # independent samples need not be the same length
    n1 = len(a)
    n2 = len(b)
    # sample standard deviations (ddof=1 divides by n - 1)
    s1 = np.std(a, ddof=1)
    s2 = np.std(b, ddof=1)
    # pooled variance: weighted average of the sample variances
    var = ((n1 - 1)*s1**2 + (n2 - 1)*s2**2) / (n1 + n2 - 2)
    std = np.sqrt(var)
    # calculate mean difference
    diff = b.mean() - a.mean()
    # generate a t distributed random variable
    dof = n1 + n2 - 2 # calculate degrees of freedom
    rv = scipy.stats.t(dof) # instantiate a Student's t continuous random variable
    t = np.abs(rv.ppf((1 - alpha) / 2))
    # calculate ci; the standard error is s * sqrt(1/n1 + 1/n2), so we multiply by the root
    half_width = t * std * np.sqrt((1 / n1) + (1 / n2))
    ci = (diff - half_width, diff + half_width)
    return diff, ci
ci_ind(x1, x2)
(4.8, (-6.34, 15.94))
As expected, our CI is much wider than the paired one and in this case includes 0, reaffirming our non-significant result (based on the p-value from our independent t-test).
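For completeness, the same pooled-variance CI can be computed directly from the summary statistics of the Rosner Section 8.4 example (my own sketch; the endpoints below come from this calculation, not quoted from the text):

```python
import numpy as np
from scipy import stats

# summary statistics from Rosner's Example 8.4
n1, mean1, s1 = 8, 132.86, 15.34    # OC users
n2, mean2, s2 = 21, 127.44, 18.23   # non-OC users

# pooled standard deviation
s = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
diff = mean1 - mean2

# 95% CI: diff +/- t * s * sqrt(1/n1 + 1/n2)
t = stats.t.ppf(0.975, n1 + n2 - 2)
half = t * s * np.sqrt(1 / n1 + 1 / n2)
print(diff - half, diff + half)  # roughly (-9.52, 20.36); the interval contains 0
```

The interval containing 0 agrees with the non-significant p-value from `ttest_ind_from_stats` above.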