
2020-10-14-two-sample-t-tests-and-confidence-intervals

Two-sample t-tests and confidence intervals: the paired and unpaired cases

In this notebook we'll go over some of the theory and mechanics of calculating p-values and confidence intervals for hypothesis tests in the two-sample case involving continuous (cardinal) data.

Assumptions for the paired t-test

  1. The underlying distribution is normal, or the central limit theorem can be assumed to hold (large enough sample size, n > 30)
  2. Samples are related to each other.

From Section 8.2 of Fundamentals of Biostatistics by Bernard Rosner, 8th Edition

For a paired t-test
H0: ∆ = 0
H1: ∆ ≠ 0

Perform a two-sided t test
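As a minimal sketch of the mechanics, the two-sided paired test can be computed by hand from the differences and checked against `scipy.stats.ttest_rel`. The readings `x1` and `x2` below are hypothetical placeholder values, not the textbook data, so the resulting p-value is not the one discussed in this notebook.

```python
import numpy as np
from scipy import stats

# Hypothetical paired readings (illustrative values only, NOT the textbook data):
# x1 = measurement before the intervention, x2 = measurement after, same subjects.
x1 = np.array([100, 105, 110, 115, 120, 125, 130, 135, 140, 145], dtype=float)
x2 = x1 + np.array([4, 6, 5, 3, 7, 5, 4, 6, 5, 5], dtype=float)

d = x2 - x1                      # per-subject differences
n = d.size
d_bar = d.mean()                 # sample mean of the differences
s_d = d.std(ddof=1)              # sample standard deviation of the differences

# Test statistic under H0: delta = 0, with n - 1 degrees of freedom
t_manual = d_bar / (s_d / np.sqrt(n))
# Two-sided p-value: twice the upper-tail area beyond |t|
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Same test via SciPy
t_scipy, p_scipy = stats.ttest_rel(x2, x1)

print(f"manual: t = {t_manual:.3f}, p = {p_manual:.3g}")
print(f"scipy:  t = {t_scipy:.3f}, p = {p_scipy:.3g}")
```

The manual and library results should agree, since `ttest_rel` is a one-sample t test on the differences.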

p < 0.05, so we reject H0 at the 5% significance level: the data suggest that the two (related) samples come from populations with different means and that, in this case, oral contraceptive use does indeed seem to affect blood pressure levels. In other words, we accept the alternative hypothesis.

So the average blood pressure decreased from sample x1 to sample x2, with a mean difference of -4.8 units.

Even though our sample size is only 10, we can still use this t test if the underlying random variable for the blood pressure of the population of women that the sample is taken from is normally distributed. Alternatively, if the sample size were large enough, we could assume the CLT holds and use the t test on that basis.

I'm assuming that Dr. Rosner is assuming that the underlying random variable is normally distributed, since our sample size is only 10 here.

Paired 95% CI for the True Difference Between the Underlying Means of Two Paired Samples (Two-Sided)

From page 282 of Fundamentals of Biostatistics, 8th Edition, by Bernard Rosner:

95% CI = $(\bar{d} - t_{n-1,1-\alpha/2} \, s_d/\sqrt{n}, \; \bar{d} + t_{n-1,1-\alpha/2} \, s_d/\sqrt{n})$

where $\bar{d}$ is the sample mean of the differences $d_i$ and $s_{d}=\sqrt{\sum_{i=1}^{n} (d_{i}-\bar{d})^{2} /(n-1)}$
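The interval above can be sketched in a few lines. The helper name `paired_ci` and the readings are hypothetical (illustrative values, not the textbook data); the critical value comes from `scipy.stats.t.ppf`.

```python
import numpy as np
from scipy import stats

def paired_ci(x1, x2, alpha=0.05):
    """Two-sided (1 - alpha) CI for the mean of the paired differences x2 - x1."""
    d = np.asarray(x2, dtype=float) - np.asarray(x1, dtype=float)
    n = d.size
    d_bar = d.mean()                                # mean difference
    s_d = d.std(ddof=1)                             # sample SD of the differences
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)   # t_{n-1, 1-alpha/2}
    margin = t_crit * s_d / np.sqrt(n)
    return d_bar - margin, d_bar + margin

# Hypothetical paired readings (illustrative only)
x1 = [100, 105, 110, 115, 120, 125, 130, 135, 140, 145]
x2 = [104, 111, 115, 118, 127, 130, 134, 141, 145, 150]

lo, hi = paired_ci(x1, x2)
print(f"95% CI for the mean difference: ({lo:.2f}, {hi:.2f})")
```

If the interval excludes 0, the two-sided paired t test at the same level rejects H0, and vice versa.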

Assumptions for the independent t test

  1. Underlying population variances that the two samples are drawn from are equal
  2. Underlying populations are normally distributed

From Section 8.4 of Fundamentals of Biostatistics by Bernard Rosner, 8th Edition

Hypertension: Suppose a sample of eight 35- to 39-year-old nonpregnant, premenopausal OC users who work in a company and have a mean systolic blood pressure (SBP) of 132.86 mm Hg and sample standard deviation of 15.34 mm Hg are identified. A sample of 21 nonpregnant, premenopausal, non-OC users in the same age group are similarly identified who have mean SBP of 127.44 mm Hg and sample standard deviation of 18.23 mm Hg. What can be said about the underlying mean difference in blood pressure between the two groups?

Assume SBP is normally distributed in the first group with mean $\mu_1$ and variance $\sigma_1^2$ and in the second group with mean $\mu_2$ and variance $\sigma_2^2$. We want to test the hypothesis $H_0: \mu_1 = \mu_2$ vs. $H_1: \mu_1 \neq \mu_2$. Assume in this section that the underlying variances in the two groups are the same (that is, $\sigma_1^2 = \sigma_2^2 = \sigma^2$). The means and variances in the two samples are denoted by $\bar{x}_1$, $\bar{x}_2$, $s_1^2$, $s_2^2$, respectively.
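Because only summary statistics are given, one way to run this equal-variance test is `scipy.stats.ttest_ind_from_stats`. This is a sketch using the means, standard deviations, and sample sizes quoted in the example; the variable names are my own.

```python
from scipy import stats

# Summary statistics quoted in the example (SBP in mm Hg)
mean_oc, sd_oc, n_oc = 132.86, 15.34, 8          # OC users
mean_non, sd_non, n_non = 127.44, 18.23, 21      # non-OC users

# Two-sided independent-samples t test with pooled variance (equal_var=True)
t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=mean_oc, std1=sd_oc, nobs1=n_oc,
    mean2=mean_non, std2=sd_non, nobs2=n_non,
    equal_var=True,
)

df = n_oc + n_non - 2
print(f"t = {t_stat:.2f} on {df} degrees of freedom, p = {p_value:.2f}")
```

With these inputs the p-value is well above 0.05, so H0 is not rejected.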

Even though the mean difference between the two samples is similar in both experiments, we get wildly different p-values: the paired t-test example rejects H0, whereas the independent t-test example fails to reject H0.

What this illustrates for me is that the paired t-test can be much more sensitive to differences than the independent t-test. In other words, you need a much larger difference (relative to the spread of the data) between two independent samples to detect an effect than you would if the samples were paired. This makes intuitive sense: for paired data each subject serves as their own control, so we've (supposedly) accounted for many confounding variables and removed the between-subject variability, and we can attribute much more of the variance in the measurement of interest between the two (paired) samples to the intervention.
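A small sketch of this effect: the same made-up readings are analysed once as paired data and once as if the two columns were independent samples. The values are hypothetical and chosen so that the between-subject spread is large compared with the within-subject change.

```python
import numpy as np
from scipy import stats

# Hypothetical readings (illustrative only): subjects differ a lot from one
# another, but each subject's change is small and consistent.
baseline = np.array([100, 105, 110, 115, 120, 125, 130, 135, 140, 145], dtype=float)
followup = baseline + np.array([4, 6, 5, 3, 7, 5, 4, 6, 5, 5], dtype=float)

# Paired analysis: works on per-subject differences, so between-subject
# variability cancels out.
paired = stats.ttest_rel(followup, baseline)

# Independent analysis of the same numbers: between-subject variability
# stays in the standard error.
unpaired = stats.ttest_ind(followup, baseline, equal_var=True)

print(f"paired:      t = {paired.statistic:.2f}, p = {paired.pvalue:.3g}")
print(f"independent: t = {unpaired.statistic:.2f}, p = {unpaired.pvalue:.3g}")
```

The mean difference is identical in both analyses; only the standard error changes, and with it the p-value.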

Calculating the 95% CI

We estimate the common population variance by combining the two sample variances. The combined variance is just a weighted average of the individual variances, with weights $n_1 - 1$ and $n_2 - 1$.
Pooled estimate of the variance (from page 287 of Fundamentals of Biostatistics): $$s^{2}=\frac{\left(n_{1}-1\right) s_{1}^{2}+\left(n_{2}-1\right) s_{2}^{2}}{n_{1}+n_{2}-2}$$

Take the square root to get the combined standard deviation:
$$s = \sqrt{ \frac{\left(n_{1}-1\right) s_{1}^{2}+\left(n_{2}-1\right) s_{2}^{2}}{n_{1}+n_{2}-2}}$$
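As a quick numeric check, the pooled standard deviation for the SBP example can be computed directly from the quoted summary statistics. The function name `pooled_sd` is hypothetical.

```python
import math

def pooled_sd(s1, n1, s2, n2):
    """Pooled standard deviation of two samples, assuming equal population variances."""
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return math.sqrt(pooled_var)

# Summary statistics quoted in the SBP example (mm Hg)
s = pooled_sd(s1=15.34, n1=8, s2=18.23, n2=21)
print(f"pooled SD = {s:.2f} mm Hg")
```

The result lies between the two sample standard deviations and closer to that of the larger group, as expected from the weighting.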

As usual, the CI is calculated by taking the point estimate (in this case the difference in means between the two samples) and adding/subtracting the critical t value multiplied by the standard error, i.e. the standard deviation of the sampling distribution of the difference in means, $s\sqrt{1/n_1 + 1/n_2}$ (for a full derivation, see page 287 of the text).

95% CI = $(\bar{x}_1 - \bar{x}_2 - t_{n_1+n_2-2,1-\alpha/2} \, s \sqrt{1/n_1 + 1/n_2}, \; \bar{x}_1 - \bar{x}_2 + t_{n_1+n_2-2,1-\alpha/2} \, s \sqrt{1/n_1 + 1/n_2})$
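A sketch of this interval for the SBP example, computed from the quoted summary statistics. Note that the standard error used here is the pooled standard deviation multiplied by $\sqrt{1/n_1 + 1/n_2}$. The helper name `pooled_ci` is hypothetical.

```python
import math
from scipy import stats

def pooled_ci(mean1, s1, n1, mean2, s2, n2, alpha=0.05):
    """Two-sided (1 - alpha) CI for mu1 - mu2, assuming equal population variances."""
    df = n1 + n2 - 2
    # Pooled standard deviation
    s = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df)
    # Standard error of the difference in sample means
    se = s * math.sqrt(1 / n1 + 1 / n2)
    t_crit = stats.t.ppf(1 - alpha / 2, df=df)   # t_{n1+n2-2, 1-alpha/2}
    diff = mean1 - mean2
    return diff - t_crit * se, diff + t_crit * se

# Summary statistics quoted in the SBP example (mm Hg)
ci_low, ci_high = pooled_ci(132.86, 15.34, 8, 127.44, 18.23, 21)
print(f"95% CI for mu1 - mu2: ({ci_low:.2f}, {ci_high:.2f}) mm Hg")
```

An interval that contains 0 corresponds to a two-sided test that does not reject H0 at the same level.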

As expected, our CI is much wider and in this case includes 0, which agrees with the non-significant result (based on the p-value from our independent t-test).