Chapter 2 Introduction to Hypothesis Testing
2.1 Hypothesis test and p-value
A hypothesis test is a method of statistical inference, used to determine whether the collected data provide sufficient evidence to reject a particular hypothesis.
2.1.1 Extreme events and random chance
Example 1: In the 1930s, Sir Ronald Fisher conducted a famous experiment, “The Lady Tasting Tea”, which aimed to determine whether a lady (Dr. Muriel Bristol) could reliably distinguish whether milk or tea was added first when making English milk tea. In the experiment, eight cups—four with milk added first and four with tea added first—were shuffled before being presented to the lady. She correctly identified the order in all eight cups. This raises the question: does the lady genuinely have the ability to distinguish whether milk or tea was added first? In hypothesis testing, we can consider two hypotheses:
* \(H_0\): The lady has no ability to distinguish between milk-first and tea-first; her correct identifications are due to chance.
* \(H_A\): The lady has some ability to distinguish; her correct identifications are better than what would be expected by chance.
Table 1. Contingency table for the lady tasting tea experiment.
| | Predicted tea first | Predicted milk first | Total |
|---|---|---|---|
| Tea first | a = 4 | b = 0 | a + b = 4 |
| Milk first | c = 0 | d = 4 | c + d = 4 |
| Total | a + c = 4 | b + d = 4 | a + b + c + d = 8 |
Let’s consider a different perspective. How likely is it for this to happen purely by chance?
Using our mathematical skills (or the choose() function in R), we find that there are 70 possible combinations. Out of these, only one combination results in an outcome as extreme as the lady’s identification. Therefore, if the results were purely random, the probability of observing a completely correct identification is 1 out of 70, or approximately 0.014.
In other words, the chance of being fooled by randomness is 1/70 ≈ 0.014.
How to calculate the number of combinations with R and math:
R
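In R, this count can be checked with the built-in choose() function (a one-line confirmation of the 70 combinations stated above):
# number of ways to choose which 4 of the 8 cups are "tea first"
choose(8, 4)
#> [1] 70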
Math: Consider ordered picks, where the first pick can be any of the 8 cups, the second any of the remaining 7, and so on for 4 picks. However, we must also account for the orderings within the selected cups, that is, the \(4! = 24\) arrangements among the chosen 4 cups. Therefore, the number of combinations can be calculated as follows.
\[\frac{8 \times 7 \times 6 \times 5} {4 \times 3 \times 2 \times 1} = 70\]
Example 2: Your arch-nemesis, Blofeld, always seems to win at Ludo, leading you to suspect that he might be using a loaded die. You have observed the following outcomes from 100 rolls of his die:
Based on these observations, you want to evaluate two hypotheses:
* \(H_0\): probability of rolling a 6 is p = 1/6
* \(H_A\): probability of rolling a 6 is p > 1/6
data = c(6, 1, 5, 6, 2, 6, 4, 3, 4, 6, 1, 2, 5, 6, 6, 3, 6, 2, 6, 4, 6, 2,
5, 4, 2, 3, 3, 6, 6, 1, 2, 5, 6, 4, 6, 2, 1, 3, 6, 5, 4, 5, 6, 3,
6, 6, 1, 4, 6, 6, 6, 6, 6, 2, 3, 1, 6, 4, 3, 6, 2, 4, 6, 6, 6, 5,
6, 2, 1, 6, 6, 4, 3, 6, 5, 6, 6, 2, 6, 3, 6, 6, 1, 4, 6, 4, 2, 6,
6, 5, 2, 6, 6, 4, 3, 1, 6, 6, 5, 5)
stat_k = sum(data == 6)
stat_k
#> [1] 43
After counting, you find that the number 6 appears 43 times out of 100 rolls. You then ask: what is the probability of observing this many or more sixes if the die is fair? Assuming the die is unbiased, you can calculate the probability of getting a 6 exactly \(k\) times out of 100 trials via the binomial distribution.
\[\binom{100}{k} p^k (1-p)^{(100-k)}; \quad p=\frac{1}{6}\] You can also calculate this in R with the function dbinom(k, size=100, prob=1/6).
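For instance, as a quick illustration of the formula above, the probability of seeing exactly 43 sixes from a fair die can be evaluated with the same dbinom() call:
# probability of exactly k = 43 sixes out of 100 rolls of a fair die
dbinom(43, size = 100, prob = 1/6)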
Now, let’s visualize the probability of each possible number of sixes under the null hypothesis.
Visualizing with R:
# plot the probability mass function of null distribution
x = seq(0, 100)
pmf = dbinom(x, size=100, prob=1/6)
df = data.frame(x=x, pmf=pmf, extreme=(x >= stat_k))
library(ggplot2)
fig4_dice = ggplot(df, aes(x=x)) +
geom_point(aes(y=pmf, color=extreme)) +
scale_color_manual(values=c("black", "red")) +
xlab('Number of sixes') +
ylab('Probability Mass Function') +
ggtitle('Distribution of n_six under the null hypothesis')
Here, if we sum up the probabilities of the events at least as extreme as the one observed, namely seeing 43 or more sixes, we obtain a probability (or p-value; see definition below), using math or the R function 1 - pbinom(42, 100, 1/6).
\[ p(k \geq 43 \mid p=1/6) = \sum_{k=43}^{100} \binom{100}{k} p^k (1-p)^{(100-k)} = 5.4\times 10^{-10} \]
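As a quick check in R, the same tail sum can be obtained with the built-in pbinom() (the cumulative distribution function of the binomial):
# probability of observing 43 or more sixes under a fair die
1 - pbinom(42, size = 100, prob = 1/6)
#> [1] 5.439087e-10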
Therefore, the chance of seeing a result this extreme under a fair die is very low (5.4e-10)!
Random chance to blame: The purpose of a hypothesis test is to help us assess whether random chance might be responsible for the observations. Think:
In Example 1, can the lady genuinely distinguish whether the milk or tea was poured first? To what extent should we attribute her correct identification to random chance?
In Example 2, to what extent should we consider the occurrence of 43 or more sixes a result of random chance?
N.B., random chance does not always follow a uniform or normal distribution; the distribution can sometimes be quite complex.
2.1.2 Hypothesis test for statistical decision
Is there a formal way for us to make such a statistical decision? Yes, hypothesis testing is a formal way for decision making. Let us see how it is defined.
Definition of hypothesis test: A hypothesis test is a formal procedure in statistical inference used to assess whether the observed data provide sufficient evidence to reject a specific null hypothesis. The process typically involves calculating a test statistic, which is a scalar value derived from the data. This observed test statistic is then compared to its distribution under the null hypothesis, or alternatively, a p-value is computed from the test statistic. Based on this comparison or p-value, a decision is made to either reject or fail to reject the null hypothesis.
Null and alternative hypotheses:
Null hypothesis (\(H_0\)): The hypothesis that random chance is to blame.
Alternative hypothesis (\(H_1\) or \(H_a\)): Counterpart to the null; namely the hypothesis you want to prove.
Example 3: Consider an A/B test examining COVID-19 recovery times, comparing patients who receive drug D (Group A) to those who do not (Group B). The question is: can drug D actually reduce recovery time? Or, alternatively, is the observed difference between the two groups simply due to random chance?
* \(H_0\): The drug has no effect on Covid-19 recovery time, i.e., \(\mu_A = \mu_B\)
* \(H_1\): The drug does have an effect on recovery time, i.e., \(\mu_A \neq \mu_B\)
One reason why we need hypothesis tests
We generally do not know the distribution of the test statistic under the alternative hypothesis, but we can work out (or approximate) how it behaves when the null hypothesis is true.
If we knew both distributions, we could instead frame the problem as classification—think about the difference.
Therefore, the main goal of hypothesis testing is: given our null and alternative hypotheses, we aim to determine whether the observed data provide enough evidence to reject the null hypothesis. This involves three key steps:
Define the test statistic. First, select an appropriate test statistic, such as the difference in means between groups A and B, the difference in medians, or the variance of the group means.
Approximate the null distribution. Next, under the null hypothesis, determine the distribution of the test statistic—either through resampling methods or analytical calculations—referred to as the null distribution.
Calculate the p-value. Last, from the null distribution we can calculate the probability of seeing a test statistic at least as extreme as the observed value, termed the p-value.
Definition of p-value: Finally, compute the probability of observing a test statistic as extreme or more extreme than the one obtained, assuming the null hypothesis is true. This probability is known as the p-value.
The p-value represents the probability of observing data as extreme or more extreme than what was actually observed, assuming the null hypothesis is true. In other words, it measures how surprising the observed data is under the null hypothesis.
Therefore, the p-value can help guide our statistical decision:
If the p-value is very small, it indicates that such an extreme result would be unlikely if the null hypothesis were true, suggesting that the null hypothesis may be false.
If the p-value is large, it implies that the observed data is consistent with the null hypothesis, providing not enough evidence to reject it.
Definition of significance level: significance level, denoted by \(\alpha\), is the probability of rejecting the null hypothesis when it is actually true. In other words, it represents the threshold for determining when a result is considered statistically significant.
In practical terms, statistical significance measures whether an observed result is more extreme than what would typically be expected due to chance alone.
The significance level \(\alpha\) (alpha) is a predefined cutoff for “unusualness.” If the p-value falls below \(\alpha\), the observed result is deemed statistically significant, suggesting that it is unlikely to have occurred under the null hypothesis.
We reject \(H_0\) when \(p<\alpha\). Since \(\alpha\) is the probability of making a Type I error (incorrectly rejecting a true null hypothesis), choosing a smaller \(\alpha\) makes the test more conservative.
Example: if \(\alpha = 0.05, p = 0.004\), then since \(p < \alpha\), we reject \(H_0\).
Key steps in performing a hypothesis test
- Propose a research question;
- Formulate the null hypothesis \(H_0\) and alternative hypothesis \(H_1\);
- Choose an appropriate statistical test, including defining the test statistic and its null distribution;
- Choose an appropriate significance level, \(\alpha\);
- Calculate the test statistic based on the observed data;
- Compute the p-value corresponding to the test statistic;
- Make a decision: reject \(H_0\) if \(p < \alpha\), otherwise, there is not enough evidence to reject \(H_0\).
2.2 Basic hypothesis test methods
A crucial step in conducting a hypothesis test is selecting an appropriate method, which involves defining the test statistic and approximating the null distribution. Here, we will introduce several commonly used methods as an overview.
2.2.1 Permutation test
The permutation test is a resampling method that approximates the null distribution by repeatedly shuffling the group labels, i.e., resampling without replacement from the pooled data. In this context, we will use an A/B test as an example, where the goal is to determine whether the means differ between groups A and B. Let’s begin with the following example.
Example 4: The difference in birth weights between babies born to nonsmoking and heavy-smoking mothers was analyzed. Birth weights (in kilograms) were measured for a sample of mothers divided into two groups: nonsmokers (n=13) and heavy smokers (n=15).
data_heavysmoking = c(3.18, 2.84, 2.90, 3.27, 3.85, 3.52, 3.23, 2.76,
3.60, 3.75, 3.59, 3.63, 2.38, 2.34, 2.44)
data_nonsmoking = c(3.99, 3.79, 3.60, 3.73, 3.21, 3.60, 4.08, 3.61,
3.83, 3.31, 4.13, 3.26, 3.54)
c(mean(data_nonsmoking), mean(data_heavysmoking))
#> [1] 3.667692 3.152000
Now, let’s apply the seven steps outlined in the previous section to perform a hypothesis test.
## 1) Research question:
# We want to know whether there is a significant difference in mean birth weight
# between the two groups.
## 2) Write down the hypotheses
# H0: there is no difference in mean birth weight between groups: d == 0
# H1: there is a difference, d != 0
## 3) Testing method
# Method: Permutation test
# Test statistic: difference of group means
# Approximate null distribution: resampling method
## 4) Choose a significance level
# alpha = 0.05
## 5) Calculate test statistic
# d = mean(heavy smoking) - mean(nonsmoking) ≈ -0.52
## 6) Compute the p-value
# To be determined below
## 7) Make a decision
# To be determined below, depending on the p-value
## 5) Calculate test statistic: difference of group mean
stat_mu = mean(data_heavysmoking) - mean(data_nonsmoking)
stat_mu
#> [1] -0.5156923
2.2.1.1 Resampling method approximates null distribution
Here, we will introduce a sampling-based method to approximate the null distribution. To facilitate reuse, we will implement it as an R function.
#' Simple function to generate permutation distribution
#' @param x1 A vector with length n1, the observed values in group 1
#' @param x2 A vector with length n2, the observed values in group 2
#' @param n_permute A scalar, the number of permutation to perform
#' @return a vector of n_permute mean differences between two permuted groups
get_permutation_null <- function(x1, x2, n_permute=1000) {
n1 = length(x1)
n2 = length(x2)
# pool data sets
x_pool = c(x1, x2)
null_distr = rep(0, n_permute)
for (i in seq(n_permute)) {
# split
idx = sample(n1 + n2, size=n1)
x1_perm = x_pool[idx]
x2_perm = x_pool[-idx]
# calculate test statistic
null_distr[i] = mean(x1_perm) - mean(x2_perm)
}
return(null_distr)
}
To visualize the null distribution and the test statistic, let’s run the get_permutation_null() function.
set.seed(1)
perm_null = get_permutation_null(data_heavysmoking, data_nonsmoking)
head(perm_null, 5)
#> [1] 0.004102564 -0.087794872 0.096000000 -0.229948718 0.216615385
We can visualize the null distribution by plotting its histogram obtained through resampling. Additionally, we can add lines indicating the observed statistic value (e.g., mu) and any other extreme values. Depending on the hypothesis test, we may consider one tail or both tails as representing extreme values.
R scripts for the histogram and vertical lines (try it yourself first):
df_perm = data.frame(perm_null = perm_null)
fig4_perm_twoside = ggplot(df_perm, aes(x=perm_null)) +
geom_histogram(bins=20) +
geom_vline(xintercept=stat_mu, linetype="dashed", color="tomato") +
geom_vline(xintercept=-stat_mu, linetype="dashed", color="tomato") +
xlab('Difference of group mean') +
ylab('Resampling frequency') +
ggtitle('Distribution of mu under the null hypothesis')
Now, we can finish steps 6 and 7 above by computing the p-value (0.003, see below) and making the decision to reject the null hypothesis.
2.2.1.2 Two-tailed or one-tailed alternative hypothesis
In the above case, we consider both \(d \leq -0.52\) and \(d \geq 0.52\) as extreme events relative to the observed value. However, in some situations, we may be interested in extreme events in only one direction—for example, \(d \leq -0.52\). In such cases, the p-value is calculated based on a one-tailed test.
There are more specific definitions for one-tailed and two-tailed p-values, and below is how we can compute the one-tailed p-value:
Definition of one-tailed and two-tailed p-values
- One-tailed: Corresponds to the probability of observing a test statistic as extreme as or more extreme than the observed value in a specific direction, based on the alternative hypothesis (e.g., \(H_a: \mu < 0\)).
- Two-tailed: Corresponds to the probability of observing a test statistic as extreme as or more extreme than the observed value in either direction, without specifying a direction in the alternative hypothesis (e.g., \(H_a: \mu \neq 0\)).
## Two tailed p value
p_two_tailed = mean(abs(perm_null) >= abs(stat_mu))
print(paste("Two tailed p value:", round(p_two_tailed, 5)))
#> [1] "Two tailed p value: 0.003"
## Left-tailed p value
p_one_tailed = mean(perm_null < stat_mu)
print(paste("One (left) tailed p value:", round(p_one_tailed, 5)))
#> [1] "One (left) tailed p value: 0.002"
2.2.2 t-test and regression-based test
In the above example, we saw how resampling methods enable us to approximate the null distribution of the difference between group means. Alternatively, there are analytical methods that can directly calculate the null distribution, often based on certain assumptions.
- Resampling methods, like the permutation test, are one-size-fits-all and have become increasingly popular with advances in computational power;
- Analytic methods, or formula-based approaches, are generally faster and remain precise when the underlying assumptions are reasonably met and not heavily violated.
In this section, we will introduce the t-test, which uses the t-distribution to approximate the null distribution. The t-distribution was first described in a paper published in Biometrika in 1908 by an author under the pseudonym “Student,” which is why the test is also known as Student’s t-test and the distribution as Student’s t-distribution.
We will not delve into the mathematical details of the t-distribution here. However, to summarize, it is a distribution that resembles and extends the Gaussian (normal) distribution by incorporating an additional parameter called the degree of freedom (df). Typically, the degree of freedom represents the number of samples minus the number of parameters estimated, meaning how many variables can independently change.
When the degree of freedom is low (e.g., df=3), the t-distribution has longer tails than the normal distribution. As the degree of freedom increases, the t-distribution gradually approaches the Gaussian distribution, with minimal differences observed when df≥30.
In R, there are multiple implementations of the t-distribution. The most common one is ?stats::TDist, for the standardized t-distribution (i.e., location 0, scale 1). For the non-standardized version with location and scale parameters, we may consider using ?extraDistr::LocationScaleT.
R scripts to plot the t-distributions:
## Comparison between standard t and normal distributions.
# for non-standardized variable y = (x - loc) / scale, its pdf is dt(y, df) / scale
x = seq(-4, 4, 0.01)
p1 = dt(x, df=1)
p3 = dt(x, df=3)
p10 = dt(x, df=10)
p30 = dt(x, df=30)
p_z = dnorm(x)
models = factor(
rep(c("t(df=1)", "t(df=3)", "t(df=10)", "t(df=30)", "normal"), each = length(x)),
levels=c("t(df=1)", "t(df=3)", "t(df=10)", "t(df=30)", "normal")
)
df = data.frame(x=rep(x, 5), PDF=c(p1, p3, p10, p30, p_z), models = models)
fig_tdist = ggplot(df, aes(x = x, y = PDF, color = models)) + geom_line(linewidth = 1)
2.2.2.1 Null distribution approximated by \(t\) distribution
Let’s revisit Example 4, where our goal is to approximate the null distribution of the difference between group means, this time using the t-distribution.
Before proceeding, it is important to acknowledge certain assumptions necessary for applying the t-distribution for this approximation:
- The two groups (non-smokers and heavy-smokers) are measured independently of each other.
- Both groups are drawn from normal distributions.
- The two groups are assumed to have the same unknown variance.
N.B., Assumption 2) can be relaxed if the sample sizes are sufficiently large, for example, greater than 30. This is because we are comparing the means of the two groups instead of the samples themselves, and with larger samples, the normality assumption becomes less critical, as guaranteed by the Central limit theorem.
Law of Large Numbers and Central Limit Theorem
In simple terms, the Law of Large Numbers states that the sample mean converges to the true population mean as the sample size increases. Additionally, the Central Limit Theorem indicates that the distribution of the sample mean is asymptotically normal, centered at the population mean \(\mu\), with a variance of \(\sigma^2 / n\), where \(\sigma^2\) is the population variance and \(n\) is the sample size.
However, when the sample size is small—typically less than 30—Assumption 2) becomes more critical. Moreover, in such low-sample-size cases, using the t-distribution to approximate the distribution of the sample mean is more accurate than using a normal distribution.
Thinking: Why do we talk about the distribution of the sample mean? You might ask this since, in practice, we only have a single sample set \(S^{(1)} = \{X_1, X_2, ..., X_n\}\), which yields one sample mean. When we refer to the distribution of the sample mean, we are actually considering many hypothetical sample sets of the same size—say, \(S^{(1)}, S^{(2)}, ...\)—all drawn from the same population. The distribution of these sample means describes the variability we would observe across many such samples.
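To make this concrete, here is a small simulation sketch; the population mean (3.2 kg) and standard deviation (0.5 kg) are hypothetical values chosen only for illustration. It draws many hypothetical sample sets of size 13 and compares the spread of their means with the Central Limit Theorem prediction \(\sigma/\sqrt{n}\):
# simulate many hypothetical sample sets and inspect the spread of their means
set.seed(1)
n = 13
sample_means = replicate(5000, mean(rnorm(n, mean = 3.2, sd = 0.5)))
# empirical sd of the sample means vs. the CLT prediction sigma / sqrt(n)
c(empirical_sd = sd(sample_means), theoretical_sd = 0.5 / sqrt(n))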
2.2.2.2 One-group t-test
If we only have one group \(S = \{X_1, X_2, ..., X_n\}\)— for example, the nonsmoking group — and we want to test whether the population mean is equal to a specific value \(\mu_0\) (for example 3.2kg), we can directly apply the Central Limit Theorem to approximate the distribution of the sample mean. In a hypothesis testing framework, we set up the null and alternative hypotheses as follows:
- \(H_0\): \(\mu - \mu_0 = 0\) (the population mean equals to \(\mu_0\));
- \(H_a\): \(\mu - \mu_0 \neq 0\) (the population mean does not equal to \(\mu_0\))
Our test statistic is \(\bar{X} - \mu_0\), with an observed value of 0.468 in this example. If the null hypothesis is true, we can approximate the null distribution with a t-distribution: \(\bar{X} - \mu_0 \sim t(0, \hat\sigma^2, \text{df}=n-1)\), where the only parameter we need to estimate is \(\hat{\sigma}\). When the sample size \(n\) is large, the Central Limit Theorem justifies using this approximation.
\[\hat{\sigma}^2 \approx \frac{\mathtt{Var}(X)}{n} = \frac{\frac{1}{n-1} \sum_{i=1}^n{(X_i-\bar{X})}^2}{n}\]
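As a quick illustration, the observed statistic can be computed directly from the data (matching the value of 0.468 quoted above), and the built-in one-sample t.test() provides a shortcut that should closely match the manual calculation below:
# observed test statistic: sample mean minus the hypothesized mean mu_0 = 3.2 kg
mean(data_nonsmoking) - 3.2
#> [1] 0.4676923
# built-in one-sample t-test against mu_0 = 3.2
t.test(data_nonsmoking, mu = 3.2)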
R scripts to plot the one-group null distribution via the t-distribution:
## Install extraDistr pkg for using location-scale version of the t-distribution
if (!require("extraDistr")) {
install.packages("extraDistr")
}
data_nonsmoking = c(3.99, 3.79, 3.60, 3.73, 3.21, 3.60, 4.08, 3.61,
3.83, 3.31, 4.13, 3.26, 3.54)
sigma_hat = sqrt(var(data_nonsmoking) / length(data_nonsmoking))
x = seq(-1.0, 1.0, 0.01)
p = extraDistr::dlst(x, df = length(data_nonsmoking) - 1, sigma = sigma_hat)
#p = dt(x / sigma_hat, df = length(data_nonsmoking) - 1) / sigma_hat
events = rep("non-extreme", length(x))
events[x <= -0.468] = "extreme-left"
events[x >= 0.468] = "extreme-right"
df = data.frame(x=x, PDF=p, events=events)
fig_1group = ggplot(data.frame(x=x, PDF=p), aes(x = x, y = PDF, color=events)) +
geom_line(linewidth = 1) + scale_color_manual(values = c("red", "orange", "black"))
cdf_left = extraDistr::plst(-0.468, df = length(data_nonsmoking) - 1, sigma = sigma_hat)
cdf_right = 1 - extraDistr::plst(0.468, df = length(data_nonsmoking) - 1, sigma = sigma_hat)
print(c(cdf_left, cdf_right))
#> [1] 5.250456e-05 5.250456e-05
print(paste("p-value:", cdf_left + cdf_right))
#> [1] "p-value: 0.000105009117330513"
fig_1group
2.2.2.3 Two-group t-test
Aside from comparing a single sample mean to a specific value, a more common scenario involves comparing the means between two groups—for example, a treatment group (such as disease status or intervention) versus a control group. This approach is often referred to as an A/B test, which assesses whether the sample means of groups A and B are statistically different.
Let’s revisit Example 4, where we compare birth weights between two groups: heavy smokers and non-smokers. The observed difference in mean birth weight is -0.516 kg. To analyze this, we can set up the following hypotheses, similar to the permutation test:
- H0: there is no difference in mean birth weight between groups: \(d = 0\)
- H1: there is a difference, \(d \neq 0\)
This time, instead of using resampling, we will now use the t-distribution to approximate the distribution of the difference in sample means, \(d\), under the null hypothesis. Assuming the null hypothesis is true, we can model the distribution of the observed difference using a t-distribution as follows.
\[ d \sim t(\mu=0, \hat\sigma^2, \text{df}=n_1 + n_2 - 2) \] Under the null hypothesis, the t-distribution has a mean of \(\mu=0\), with degrees of freedom df = \(n_1+n_2-2\), since estimating the two sample means uses up two degrees of freedom. Similar to the one-sample test, the only unknown parameter to estimate is \(\hat\sigma\). Under Assumption 3—that the two groups have the same unknown variance—we can approximate this parameter using the following calculation:
\[\hat\sigma^2 \approx s_p^2 \cdot \left(\frac{1}{n_{1}} + \frac{1}{n_{2}}\right),\]
where \(s_p^2\) is the pooled variance, computed from \(s_1^2\) and \(s_2^2\), the sample variances (i.e., the unbiased estimators of the population variance) in groups 1 and 2, as follows:
\[ s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}; \quad s_1^2 = \frac{\sum_i {(X_i^{(1)} - \bar{X}^{(1)})^2}}{n_1 - 1} ; \quad s_2^2 = \frac{\sum_i {(X_i^{(2)} - \bar{X}^{(2)})^2}}{n_2 - 1} \] Note, if the variances of the two groups differ, the calculation needs to be adjusted by using Welch’s t-test.
R scripts to calculate the t-distribution for the two-group comparison:
## Install extraDistr pkg for using location-scale version of the t-distribution
if (!require("extraDistr")) {
install.packages("extraDistr")
}
d_obs = mean(data_heavysmoking) - mean(data_nonsmoking)
n_ns = length(data_nonsmoking)
n_hs = length(data_heavysmoking)
s_ns = sd(data_nonsmoking) # degree of freedom: n-1
s_hs = sd(data_heavysmoking)
# the pooled standard deviation
sp = sqrt(((n_ns - 1)*s_ns**2 + (n_hs - 1)*s_hs**2) / (n_ns + n_hs - 2))
print(paste0("Pooled standard deviation:", sp))
#> [1] "Pooled standard deviation:0.428057812829366"
d_sigma_hat = sp * sqrt(1/n_ns + 1/n_hs)
print(paste("Estimated standard error of mean difference:", d_sigma_hat))
#> [1] "Estimated standard error of mean difference: 0.162204962956089"
x = seq(-1.0, 1.0, 0.01)
p = extraDistr::dlst(x, df = n_ns + n_hs - 2, sigma = d_sigma_hat)
events = rep("non-extreme", length(x))
events[x <= -abs(d_obs)] = "extreme-left"
events[x >= abs(d_obs)] = "extreme-right"
df = data.frame(x=x, PDF=p, events=events)
fig_2group = ggplot(data.frame(x=x, PDF=p), aes(x = x, y = PDF, color=events)) +
geom_line(linewidth = 1) + scale_color_manual(values = c("red", "orange", "black"))
cdf_left = extraDistr::plst(-abs(d_obs), df = n_ns + n_hs - 2, sigma = d_sigma_hat)
cdf_right = 1 - extraDistr::plst(abs(d_obs), df = n_ns + n_hs - 2, sigma = d_sigma_hat)
print(c(cdf_left, cdf_right))
#> [1] 0.00189629 0.00189629
print(paste("p-value:", cdf_left + cdf_right))
#> [1] "p-value: 0.00379258038486997"
fig_2group
Transform to a standardized \(t\)-distribution
In addition to using the non-standardized t-distribution with its location and scale parameters, it is common to transform the t-statistic into a standardized form. Specifically, if we compute the t-statistic as \(t_{\text{statistic}} = d / \hat{\sigma}\), it follows a standardized t-distribution with the same degrees of freedom. This allows us to report both the t-statistic and the corresponding p-value for hypothesis testing.
\[ t_{\text{statistic}} = \frac{\bar{X}_{1} - \bar{X}_{2}}{s_p \cdot \sqrt{\frac{1}{n_{1}} + \frac{1}{n_{2}}}} \]
We can visualize the null distribution and the extreme values in a standardized form as follows:
R scripts to calculate the standardized t-distribution for the two-group comparison:
x = seq(-1.0, 1.0, 0.01) / d_sigma_hat
p = stats::dt(x, df = n_ns + n_hs - 2)
events = rep("non-extreme", length(x))
events[x <= -abs(d_obs / d_sigma_hat)] = "extreme-left"
events[x >= abs(d_obs / d_sigma_hat)] = "extreme-right"
df = data.frame(x=x, PDF=p, events=events)
fig_2group_st = ggplot(data.frame(x=x, PDF=p), aes(x = x, y = PDF, color=events)) +
geom_line(linewidth = 1) + scale_color_manual(values = c("red", "orange", "black"))
cdf_left = stats::pt(-abs(d_obs / d_sigma_hat), df = n_ns + n_hs - 2)
cdf_right = 1 - stats::pt(abs(d_obs / d_sigma_hat), df = n_ns + n_hs - 2)
print(c(cdf_left, cdf_right))
#> [1] 0.00189629 0.00189629
print(paste("p-value:", cdf_left + cdf_right))
#> [1] "p-value: 0.00379258038486997"
fig_2group_st
2.2.2.4 Direct use of t.test()
In the course and most of your future analyses, you can directly use the built-in t.test() function. If you assume that the variances in the two groups are the same, you can set var.equal = TRUE; note that the default value is FALSE, which performs Welch’s t-test.
# Note, we assumed the variance in both groups is the same,
# so we need to set var.equal = TRUE
t_res = t.test(data_nonsmoking, data_heavysmoking, var.equal = TRUE)
t_res$p.value
#> [1] 0.00379258
t_res
#>
#> Two Sample t-test
#>
#> data: data_nonsmoking and data_heavysmoking
#> t = 3.1793, df = 26, p-value = 0.003793
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> 0.1822752 0.8491094
#> sample estimates:
#> mean of x mean of y
#> 3.667692 3.152000
2.2.2.5 Regression-based test
We can also perform a t-test within a linear regression setting (a special case of the Generalized Linear Model, GLM) to test whether a coefficient is zero or not; with a binary group covariate, this is equivalent to the equal-variance two-sample t-test above.
Here, we use the marketing dataset as an example; we can load the data and visualize it as follows using R scripts.
R scripts to load the data:
# Install datarium library if you haven't
if (!requireNamespace("datarium", quietly = TRUE)) {
install.packages("datarium")
}
library(datarium)
# Load data: then we will have a data.frame with name marketing
data(marketing)
head(marketing)
#> youtube facebook newspaper sales
#> 1 276.12 45.36 83.04 26.52
#> 2 53.40 47.16 54.12 12.48
#> 3 20.64 55.08 83.16 11.16
#> 4 181.80 49.56 70.20 22.20
#> 5 216.96 12.96 70.08 15.48
#> 6 10.44 58.68 90.00 8.64
ggplot(marketing, aes(x=newspaper, y=sales)) +
geom_point() + geom_smooth(method=lm, formula = 'y ~ x')
We can further use the lm() function to fit the regression model, which will estimate the coefficients for each covariate as well as the intercept. In addition to the estimates, the function provides the standard errors of each coefficient, allowing us to derive the null distribution of each coefficient in the form of a t-distribution. By transforming the observed statistic—dividing the estimated coefficient by its standard error (similar to the two-group t-test described earlier)—we obtain the t-statistic and its corresponding p-value. All of these results are accessible through the output of the lm() function, which returns a list containing each value.
library(datarium)
data(marketing)
## Fit linear regression
res.lm <- lm(sales ~ newspaper, data = marketing)
## We can check the test via the summary() function
summary(res.lm)
#>
#> Call:
#> lm(formula = sales ~ newspaper, data = marketing)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -13.473 -4.065 -1.007 4.207 15.330
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 14.82169 0.74570 19.88 < 2e-16 ***
#> newspaper 0.05469 0.01658 3.30 0.00115 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 6.111 on 198 degrees of freedom
#> Multiple R-squared: 0.05212, Adjusted R-squared: 0.04733
#> F-statistic: 10.89 on 1 and 198 DF, p-value: 0.001148
## Access to the p-value
summary.lm <- summary(res.lm)
print("p-values can be access from the summary function:")
#> [1] "p-values can be access from the summary function:"
summary.lm$coefficients[, "Pr(>|t|)"]
#> (Intercept) newspaper
#> 4.713507e-49 1.148196e-03
glm_t_val = summary(res.lm)$coefficients["newspaper", "t value"]
xx = seq(-5, 5, 0.01)
yy = dt(xx, 198)
df_ttest <- data.frame(x=xx, PDF=yy)
ggplot(df_ttest, aes(x=x, y=PDF)) +
geom_line() +
geom_vline(xintercept = glm_t_val, linetype="dashed", color="tomato") +
geom_vline(xintercept = -glm_t_val, linetype="dashed", color='tomato')
2.2.3 Fisher’s exact test
Fisher’s exact test is a widely used hypothesis testing method for comparing categorical variables between groups. In genomics, an example of its application is gene set enrichment analysis for a specific annotated pathway. This involves testing whether a gene set of interest contains a significantly higher proportion of genes from the pathway compared to the background (non-interest) genes.
2.2.3.1 Binomial test
A similar, often simpler, approach is the binomial test. Let’s revisit Example 2, where we suspect that a die is biased toward rolling a six. We can formulate the hypotheses as follows:
- \(H_0\) (Null hypothesis): The die is fair, with a probability of rolling a six of \(p=1/6\).
- \(H_a\) (Alternative hypothesis): The die is unfair, with a probability of rolling a six greater than 1/6, namely \(p>1/6\).
In this scenario, the exact null distribution follows a binomial distribution with success probability \(p=1/6\), meaning we do not need to approximate the null distribution. This is also why the binomial test and Fisher’s test are called exact tests.
Therefore, in this example with 43 sixes observed in 100 rolls, the events at least as extreme as the observed number are 43, 44, …, 100. Using the definition of the binomial distribution, we can calculate the p-value as follows,
\[\text{p-value} = \sum_{k=43}^{100}{\binom{100}{k} p^{k} (1-p)^{100-k}}; \quad p=1/6\]
Alternatively, in R, we can calculate the p-value manually with choose() for the number of combinations, or directly use binom.test() for the hypothesis test, as follows.
# manual calculation with extreme values
x = seq(43, 100)
p_val = sum(choose(100, x) * (1/6)^(x) * (1 - 1/6)^(100-x))
p_val
#> [1] 5.439087e-10
# using built-in function
binom.test(43, 100, p=1/6, alternative = "greater")
#>
#> Exact binomial test
#>
#> data: 43 and 100
#> number of successes = 43, number of trials = 100, p-value = 5.439e-10
#> alternative hypothesis: true probability of success is greater than 0.1666667
#> 95 percent confidence interval:
#> 0.3460073 1.0000000
#> sample estimates:
#> probability of success
#> 0.43
2.2.3.2 Hypergeometric distribution
In a certain sense, you may feel that the binomial test is somewhat similar to a one-sample t-test, as it assesses whether the success rate (or proportion) equals a specified value, using the number of successes as the test statistic. (Note, binomial test focuses on categorical variables and the t-test on continuous variables).
Similarly, a more common scenario involves comparing success rates (or hit rates) between two conditions—such as treated versus untreated groups. Let’s consider Example 5:
Example 5 Is handedness associated with sex? To investigate this, we set up the hypotheses:
* \(H_0\) (null hypothesis): The proportion of right-handed individuals is the same for males and females, i.e., \(p_{male} = p_{female}\).
* \(H_a\) (alternative hypothesis): The proportion of right-handed individuals differs between males and females, i.e., \(p_{male} \neq p_{female}\).
Table 2. Contingency table for handedness between male and female.
| | Right-handed | Left-handed | Total |
|---|---|---|---|
| Male | a = 43 | b = 9 | a + b = 52 |
| Female | c = 44 | d = 4 | c + d = 48 |
| Total | a + c = 87 | b + d = 13 | a + b + c + d = 100 |
Before diving into hypothesis testing, let’s briefly review the hypergeometric distribution. In this context, we can arbitrarily choose any of the variables a, b, c, or d as the test statistic; without loss of generality, we choose \(a\). To avoid confusion between variables and observed values, we’ll use \(x\) to denote the test statistic—in this case, the number of right-handed individuals among males (corresponding to \(a\)).
The key requirement for the hypergeometric distribution is that the totals a+b (number of males), c+d (number of females), and a+c (number of right-handed individuals) are fixed. For illustration, suppose we randomly assign the 87 right-handed individuals into 52 males and 48 females. (Think about how this can be simulated.)
Notice that if \(x\) (the number of right-handed males) changes, the values of b, c, and d must also change accordingly, given that the totals a+b, c+d, and a+c are fixed. To ensure all counts are non-negative, \(x\) is bounded within a specific range:
\[\text{max}(0, (a+c)-(c+d)) \leq x \leq \text{min}(a+b, a+c)\] In our example, this translates to \(39 \leq x \leq 52\). Under the null hypothesis, the most probable values of \(x\) are around 45 and 46. For a two-tailed test, the extreme values are those from 39 to 43 and, symmetrically, from 48 to 52.
If you were thinking about how to run the simulation and obtain the null distribution, you may realize that the probability mass function is the number of ways to choose \(a\) right-handed individuals out of the \(a+b\) males, multiplied by the number of ways to choose \(c\) right-handed individuals out of the \(c+d\) females, divided by the number of ways to choose \(a+c\) right-handed individuals out of all \(a+b+c+d\) individuals. We can write it as follows, \[p(x; m=a+b, n=c+d, k=a+c) = \frac{\binom{a+b}{x} \binom{c+d}{(a+c)-x}}{\binom{a+b+c+d}{a+c}} \]
This distribution is called the hypergeometric distribution. In the terminology used above (for \(m, n, k\)), it corresponds to the R function stats::Hypergeometric, which allows us to compute the probability mass function (PMF) for specific values.
With the extreme events and null distribution defined, we can now calculate the two-sided p-value by manual calculation, using the built-in dhyper() function, or directly using fisher.test() for the test.
## extreme events
x_extreme = c(39:43, 48:52)
## manual calculation
p_manual = sum(choose(52, x_extreme) * choose(48, 87 - x_extreme) / choose(100, 87))
p_manual
#> [1] 0.239157
## hypergeometric distribution
p_hyper = sum(dhyper(x_extreme, m = 52, n = 48, k = 87))
p_hyper
#> [1] 0.239157
## fisher's exact test
fisher.test(matrix(c(43, 44, 9, 4), nrow = 2), alternative = "two.sided")
#>
#> Fisher's Exact Test for Count Data
#>
#> data: matrix(c(43, 44, 9, 4), nrow = 2)
#> p-value = 0.2392
#> alternative hypothesis: true odds ratio is not equal to 1
#> 95 percent confidence interval:
#> 0.09150811 1.71527769
#> sample estimates:
#> odds ratio
#> 0.4378606
Exercise: try it yourself by calculating the one-tailed p-value for the alternative hypothesis that the right-handed proportion is lower in male than female, namely p_male < p_female.
2.3 Evaluation of hypothesis testing
2.3.1 Types of errors
As a hypothesis test is about decision-making, there will be errors. Let’s start with some examples, similar to Example 3.
Testing whether gene expression changes with versus without treatment:
- 30 COVID-19 patients, half with drug A and half without drug
- There are 10,000 genes to test, namely 10,000 hypotheses to perform
What errors in each of these 10,000 decisions?
- False positive (type I error): Genes are genuinely not different, but we thought they were (we rejected the null hypothesis)
- False negative (type II error): Genes are genuinely different, but we missed it (we didn’t reject the null hypothesis)
A Type I error is generally more concerning, as we worry more about being fooled by random chance.
2.3.1.1 Definition of common metrics
Assuming we have the ground truth label of the testing events, we can categorize the events into the following four groups:
- True Positive (TP), hits
- False Positive (FP), type I error, false alarm, overestimation
- False Negative (FN), type II error, miss, underestimation
- True Negative (TN), Correct rejection
Table 3. Definition of TP, FN, FP, and TN.
| | Predicted Positive (PP) | Predicted Negative (PN) |
|---|---|---|
| Actual Positive (P) | True Positive (TP) | False Negative (FN) |
| Actual Negative (N) | False Positive (FP) | True Negative (TN) |
Based on the false positive and false negative events, multiple assessment metrics are commonly used, with a specific definition as follows.
- True positive rate (a.k.a., Power, Sensitivity, Hit rate, and Recall): TPR= TP / (TP + FN)
- True negative rate (a.k.a., Specificity): TNR = TN / (TN + FP)
- Precision (a.k.a., Positive Predictive Value, or 1 - false discovery rate): Precision = 1 - FDR = TP / (TP + FP)
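As a small numeric illustration of these formulas, with made-up (hypothetical) confusion-matrix counts:
# hypothetical confusion-matrix counts, purely for illustrating the formulas
TP = 80; FN = 20; FP = 10; TN = 90
TPR = TP / (TP + FN)          # sensitivity / recall / power
TNR = TN / (TN + FP)          # specificity
Precision = TP / (TP + FP)    # positive predictive value = 1 - FDR
c(TPR = TPR, TNR = TNR, Precision = Precision)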
2.3.2 Multiple testing and correction
Let’s think of an example related to Example 3, but this time we test whether gene expression changes with versus without treatment:
- 30 COVID-19 patients, half with drug A and half without drug A
- There are 10,000 genes to test, namely 10,000 hypotheses to perform
Q1: What would be the lowest p-value if treatment does not make any difference to any of these genes? 1, 0.5, 0.1, or 0.0001
Q2: What is the distribution of the p-value if the null model is true?
- p-value follows the same distribution as the test statistic
- p-value is always 1.
- p-value follows a uniform distribution between 0 and 1.
To answer these two questions, we need to understand how the p-value is distributed (over many repeated tests) when the null hypothesis is true.
Let’s use a hypothetical null distribution for the test statistic. For example, we use the \(\chi^2\) (chi-squared) distribution as the null distribution here. Feel free to try any null distribution by modifying the example codes below.
## Example null distributions
# t, normal or anything. we use chi-squared distribution as an example
x_random = rchisq(n=1000, df=3)
any_null_dist = dchisq(x_random, df=3)
pvals_null = 1 - pchisq(x_random, df=3)
2.3.2.1 Null distribution of test statistic
We can visualize the null distribution of the test statistic by using a histogram.
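For example, a base-R histogram of the simulated statistics generated above serves as this visualization:
# histogram of the simulated test statistics under the null (chi-squared, df = 3)
hist(x_random, breaks = 30, xlab = "Test statistic",
     main = "Null distribution of the test statistic")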
2.3.2.2 Null distribution of p-value
We can also visualize the distribution of the p-value if the null hypothesis is correct. As you can see, it follows a uniform distribution closely.
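A histogram of pvals_null makes the (approximately) uniform shape visible:
# histogram of p-values under the null hypothesis: approximately Uniform(0, 1)
hist(pvals_null, breaks = 20, xlab = "p-value",
     main = "Null distribution of the p-value")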
To understand this, let’s use the left-tailed test as an example. We can obtain the cumulative probability function of a certain p-value \(p_\text{obs}\), as follows.
\[P(p \leq p_\text{obs})= P(x \leq x_\text{obs}) = p_\text{obs}\] The relation \(P(p \leq p_\text{obs}) = p_\text{obs}\) is exactly the cumulative distribution function of a uniform distribution: \(P(X \leq k) = k\).
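As an optional sanity check, a Kolmogorov–Smirnov test can compare the simulated p-values with a Uniform(0, 1) distribution; a large p-value here indicates no detectable deviation from uniformity:
# compare the null p-values with the Uniform(0, 1) distribution
ks.test(pvals_null, "punif")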
2.3.2.3 Minimal p values in 10 tests
If we examine the p-value of an individual test, it follows a uniform distribution under the null. This also means that if we choose a statistical significance level \(\alpha\) (for example 0.05), our false positive rate is expected to be at most \(\alpha\) for one test. However, if we look at the top hit among multiple tests (the one with the lowest p-value), the false positive rate for that top hit is much higher than the significance level \(\alpha\).
For example, if the null hypothesis is true for all 10 tests, the chance of seeing no test with p-value < 0.05 equals the chance that all tests have p-value ≥ 0.05, namely the joint probability \((1 - 0.05)^{10} \approx 0.6\). In other words, the chance of seeing at least one test with p < 0.05, namely the false positive rate (for the top-1 hit), is
\[\text{FPR}_\text{top1} = 1 - (1 - 0.05)^{10} = 0.4\] See a scientific comment:
Altman, N., Krzywinski, M. P values and the search for significance. Nat Methods 14, 3–4 (2017). https://doi.org/10.1038/nmeth.4120
# We group the 1000 null p-values into a matrix with 10 tests per column (100 columns)
# We then use apply() to calculate the min p-value (the top hit) within each column
pval_null_mtx = matrix(pvals_null, nrow=10)
p_min_in10 = apply(pval_null_mtx, MARGIN=2, FUN=min)
print(paste('Proportion of tests with min(p) < 0.05:', mean(p_min_in10 < 0.05)))
#> [1] "Proportion of tests with min(p) < 0.05: 0.43"
hist(p_min_in10, breaks = 15)
2.3.2.4 Correction method
As stated above, if we are performing multiple tests, the false positive rate for the top-1 hit will be much higher than the significance level \(\alpha\). Therefore, people proposed several methods to correct the raw p-value and transform it into an adjusted p-value. Then, we use this adjusted p-value to compare with the significance level, e.g., 0.05, with the hope that a false positive rate among top-1 hits can still be within the significance level.
While there is no perfect correction method, one commonly used method is called the FDR method, which controls the false discovery rate in the multiple testing setting. A popular procedure of this type of method is the Benjamini–Hochberg method (see the original paper in Journal of the Royal Statistical Society: Series B 1995).
- Sort the raw p-value from small to large
- For a given FDR level \(\alpha\), find the largest \(k\) such that the \(k\)th smallest p-value satisfies \(p_k \leq \frac{k}{m} \alpha\), where \(m\) is the total number of tests.
- Reject the null hypothesis (i.e., declare discoveries) for all tests from i = 1, …, k.
Alternatively, you can adjust it with the following formula for the p-value of the \(i\)th test (ranked by the p-value):
\[p_{adj} ^{(i)} = \text{FDR}^{(i)}_\text{BH} = \text{min}(\frac{m}{i} p^{(i)}, 1) \]
Then, you can decide by comparing \(p_{adj} ^{(i)} < \alpha\). If true, we can reject the null hypothesis (i.e., declare discoveries).
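In practice, you rarely need to implement this by hand: the built-in p.adjust() applies the Benjamini–Hochberg procedure directly. A minimal sketch, reusing the simulated null p-values from above:
# Benjamini-Hochberg adjustment of the simulated null p-values
p_adj = p.adjust(pvals_null, method = "BH")
# number of "discoveries" at alpha = 0.05, before and after adjustment
c(raw = sum(pvals_null < 0.05), BH_adjusted = sum(p_adj < 0.05))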
2.3.3 Power analysis and sample size
As mentioned in the previous section, power in statistics simply means sensitivity (or true positive rate). This part aims to discuss the relation between power and related factors in experimental design, particularly the sample size.
Here, we will not discuss this part in detail for now. Instead, you may play with this simulation experiment to see whether, with the same effect size \(\Delta\mu = 0.1\), a larger sample size makes it easier to reject the null hypothesis.
Simulation 1
1. Make a simulation of the score: group A and B
2. B follows normal(mean=0, std=1); A follows normal(mean=0.1, std=1)
3. Generate 100 samples for each group, and do a t-test, is the difference significant? Please use set.seed(0) beforehand.
4. Try 3) again, but generate more samples this time (e.g., 1,000 and then 10,000, as suggested in the code comment below). What do you find? Think about the relation between power and sample size.
set.seed(0)
n_sample = 100 # change this value to 1000 and 10000
xB = rnorm(n_sample)
xA = rnorm(n_sample, mean=0.1)
t.test(xA, xB, var.equal = TRUE)
#>
#> Two Sample t-test
#>
#> data: xA and xB
#> t = 0.24294, df = 198, p-value = 0.8083
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> -0.2261882 0.2897482
#> sample estimates:
#> mean of x mean of y
#> 0.05444844 0.02266845
If you see the results and wonder why, you may revisit the Central Limit Theorem, which describes the distribution of the sample mean, including how its variance shrinks in proportion to the sample size.
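As a complementary, analytical way to explore the same question, the built-in power.t.test() computes the power of a two-sample t-test for a given sample size, effect size, and significance level; the settings below mirror the simulation above:
# analytical power for detecting a mean difference of 0.1 (sd = 1) at alpha = 0.05
power.t.test(n = 100,  delta = 0.1, sd = 1, sig.level = 0.05)
power.t.test(n = 1000, delta = 0.1, sd = 1, sig.level = 0.05)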
2.4 Summary
A statistical hypothesis test is a method of statistical inference used to determine whether the data provide sufficient evidence to reject a specific hypothesis.
If this is the first time that you study hypothesis testing or the concept of p-values, you will encounter them repeatedly throughout your journey in scientific study and research. Their importance cannot be overstated, as misusing or misunderstanding these concepts can easily lead to errors in scientific conclusions.
Before performing a hypothesis test, it is essential to understand the rationale behind it—why do we need a statistical approach for decision-making?
When should you perform a hypothesis test?
- When you’re uncertain whether observed results are due to random variation (since randomness is pervasive, but understanding or acknowledging it is not always straightforward)
What are the key elements of a hypothesis test?
- Selecting an appropriate test statistic
- Approximating the null distribution of the test statistic
What are the main challenges?
- Understanding the role of randomness in certain variables within the system or process
- Estimating the distribution of the test statistic under the null hypothesis
- Finding a more effective test statistic (or null distribution estimation)
Here are a few key concepts to re-ask yourself if you fully understand before you conclude this chapter.
What exactly is the null hypothesis? For example, if it is about the sample mean to be the same between two groups, you are not testing if other statistics or properties are the same between the two populations.
How do you approximate the null distribution? If the mathematical method requires long derivations, start trying permutation or resampling, and think about the mathematical summary of the procedures.
What is the meaning of the p-value? The p-value is a cumulative probability over a set of events, namely those at least as extreme as the one observed, in terms of the test statistic.
Understand the intrinsic errors.
- False positives are naturally there; if the null is true, you still have a chance of \(\alpha\) (the significance level) of making a false positive error.
- False discovery is a concern when you perform multiple tests. In a single test, the chance to have one false positive is upper-bounded by significance level \(\alpha\), but in multiple tests, the chance to see (at least) one false positive is much higher than the significance level.
- False negative is a common challenge when the sample size is small. Intuitively, if we don’t have any data, it is a 100% false negative.
In the future, you may take more statistical courses, e.g., STAT2601 for the mathematics of common variable distributions (or its more application-focused alternative BIOF2013) and STAT2602 for advanced skills in deriving analytic or asymptotic null distributions.
2.4.1 Exercises
Q1. In the Fisher’s exact test section, try it yourself by calculating the one-tailed p-value for the alternative hypothesis that the right-handed proportion is lower in male than female, namely p_male < p_female.
2.4.2 Acknowledgement
Some examples were adapted from Dr John Pinney’s materials with link here.