GEOG 250: Research Methods in Geography, Colgate University

Learning objectives

1. Understanding z-scores and sample probabilities

2. Testing hypotheses with t-test and ANOVA

This week we will continue working with our field data from Siberia and learn how to use basic statistics to test hypotheses.

To begin, create a new script - this will form the basis of your participation grade for today’s class.

1. Samples and Confidence Intervals

Since we first began using R we have primarily been focused on using descriptive statistics to characterize our sample data sets. Now that we have a better understanding of our data, as well as some key concepts related to probability and distributions, we can begin to think about using inferential statistics to make inferences about populations using our sample data sets.

Characterizing the confidence interval for a mean is a useful place to start. Here, we will rely on the central limit theorem, which states that the mean of a sample comprised of a series of independent observations of a random variable tends toward a normal distribution. In other words, if we were to collect many samples of permafrost thaw depth at a particular research site, each comprised of 30 observations, the mean thaw depth values of all of those samples would have a normal distribution centered around the true mean for the site. Because of this, we can use our knowledge of the standard normal distribution to quantify how confident we are that the true mean for a site lies within a given interval of the sample mean.

We can explore these ideas more using our thaw depth data set.
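If you have not already created these summary objects, here is a minimal sketch of one way to do so, assuming the thaw data frame from previous weeks with a thaw depth column td and a site column site (the values shown below correspond to the WWS site):

# mean, standard deviation, and sample size of thaw depth at the wws site
x <- mean(thaw$td[thaw$site=="wws"])
s <- sd(thaw$td[thaw$site=="wws"])
n <- length(thaw$td[thaw$site=="wws"])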



x
## [1] 46.37624
s
## [1] 11.09581
n
## [1] 101

Before we continue further we need to go over one more descriptive statistic that we haven’t discussed in detail before today: the z-score. This is a way to characterize an observation from a sample data set in terms of how many standard deviations it is from the mean, and it is calculated as the difference between the observation value and the sample mean, divided by the standard deviation. For example, we could do this for the maximum thaw depth value from our WWS site.

# determine the maximum value from the wws site
m <- max(thaw$td[thaw$site=="wws"])

# now calculate the z-score
z <- (m-x)/s

# use the round function to round the z-score to the nearest hundredth for use in Table A2
z <- round(z,digits=2)

What does this value mean? Use this value to look up the associated probability in Table A2 of the textbook, or use this online z-table.

What is the probability from the z-score Table, and what does this mean?

It turns out that last week we were actually using the z-score to determine the probability associated with a particular value, given the standard normal distribution. For example, we can think of the likelihood of getting a value greater than ours as indicated below.

# plot the standard normal distribution
plot(seq(-3,3,0.1),dnorm(seq(-3,3,0.1),0,1),
     xlab = "",ylab = "Probability", type = "l", lwd = 2)


# shade the area under the curve to the right of our z-score
polygon(c(seq(z,3,0.01),seq(3,z,-0.01)),
        c(dnorm(seq(z,3,0.01),0,1),rep(0,length(seq(3,z,-0.01)))),
        col = "blue")



Lastly, remember that we can also use the pnorm function, which gives cumulative probabilities for a normal distribution, to calculate these values as well.

# probability of a value less than or equal to m, assuming a normal distribution
pnorm(m,x,s)
## [1] 0.9792723
# or the probability of a value greater than m
1-pnorm(m,x,s)
## [1] 0.0207277

So, basically, we’re using our z-scores to determine the probability of exceeding a certain value (or not) by characterizing that value in terms of the mean and standard deviation, and then assigning a probability based on the assumption that the data are normally distributed.

So, if we bring this back to the context of the sample mean versus the population mean, there are just a few more key pieces of info that we need. First, it turns out that, for a standard normal distribution 95% of the data fall within 1.96 standard deviations of the mean, which looks like this graphically.
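As a quick sketch, we can draw that picture with the same plotting approach we used above, shading the region between -1.96 and 1.96:

# plot the standard normal distribution
plot(seq(-3,3,0.1),dnorm(seq(-3,3,0.1),0,1),
     xlab = "",ylab = "Probability", type = "l", lwd = 2)

# shade the region within 1.96 standard deviations of the mean
polygon(c(seq(-1.96,1.96,0.01),seq(1.96,-1.96,-0.01)),
        c(dnorm(seq(-1.96,1.96,0.01),0,1),rep(0,length(seq(1.96,-1.96,-0.01)))),
        col = "blue")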

Remember that this accounts for 95% of the area under the curve. And remember also that the variability, or standard error, of sample means is equal to the standard deviation divided by the square root of the sample size. (Remember we calculated this as the Standard Error of the Mean.)

So, we can calculate the margin of error for the 95% confidence interval as 1.96 times the standard deviation divided by the square root of the sample size.

(1.96*s)/sqrt(n)
## [1] 2.163986

What does this number mean to us?
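To put that number in context, we can construct the full interval around the sample mean; a minimal sketch using the x, s, and n objects defined above:

# lower and upper bounds of the 95% confidence interval for the mean thaw depth
c(x-(1.96*s)/sqrt(n), x+(1.96*s)/sqrt(n))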

Note that the z-value applies only when the sample size (n) is sufficiently large, and in the case of small samples the t-distribution is used. A sample size of 30 is typically sufficient.
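To see the difference between the two approaches, we can compare the critical values directly; a quick sketch using the qnorm and qt functions, with a hypothetical small sample of n = 10 (so df = 9):

# 95% (two-sided) critical value from the standard normal distribution
qnorm(0.975)

# the equivalent critical value from the t-distribution for a small sample
qt(0.975, df = 9)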

We can use the same approach to construct a confidence interval for the difference between two sample means. For example, suppose we define the mean, standard deviation, and sample size of the thaw depth data from the BFG site as x2, s2, and n2. We can then calculate the difference between the two site means, along with the 95% margin of error for that difference.
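If you have not already created these objects, a minimal sketch of one way to do so (again assuming the thaw data frame):

# mean, standard deviation, and sample size of thaw depth at the bfg site
x2 <- mean(thaw$td[thaw$site=="bfg"])
s2 <- sd(thaw$td[thaw$site=="bfg"])
n2 <- length(thaw$td[thaw$site=="bfg"])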

# difference between the two sample means
x-x2
## [1] 2.75719
# 95% margin of error for the difference in means
1.96*sqrt(s^2/n+s2^2/n2)
## [1] 3.724203

How do we interpret these values?

Note that we can also calculate confidence intervals for proportional data using a similar approach. But in the interest of time we will not step through that here.

2. Hypothesis Testing

Now, using these assumptions related to the normal distribution of sample data, we can test hypotheses. We’ll continue on using our thaw depth data as an example and begin by performing a one-sample z-test of the mean. The first step is to develop a null hypothesis.

For example, the null case might be that the true mean thaw depth at WWS is 45 cm, which is the long-term mean for similar ecosystems in the region.

H0: \(\mu\)=45

The null hypothesis can be viewed as a default, in that failing to reject it indicates that our sample is representative of the regional average.

Next, we establish an alternative hypothesis. Alternative hypotheses can be one-sided (e.g. the true mean thaw depth is greater than 45 cm),

HA: \(\mu\)>45

or two-sided (e.g. the true mean thaw depth is not equal to 45 cm).

HA: \(\mu\neq\) 45

In the latter case we are saying that the true mean thaw depth can be either less than or greater than the hypothesized value of 45 cm.

Here we also need to choose a significance level, or \(\alpha\), which corresponds to the likelihood of making a Type I error, where a true null hypothesis is rejected (e.g. the commonly used \(\alpha\) = 0.05 means that a true null is rejected 5% of the time). The alternative is a Type II error, which is when we fail to reject a false null hypothesis; we cannot control this directly (though its likelihood is inversely related to \(\alpha\)).
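As an aside, we can see what \(\alpha\) means in practice with a quick simulation; this is just a sketch (the sample size of 30 and standard deviation of 10 are arbitrary choices), and it previews the t.test function we will use later in class:

# simulate 10,000 samples of size 30 from a population whose true mean really is 45
set.seed(1)
p <- replicate(10000, t.test(rnorm(30, mean = 45, sd = 10), mu = 45)$p.value)

# proportion of true nulls rejected at alpha = 0.05; this should be close to 0.05
mean(p < 0.05)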

Now we can calculate our z-score as follows:

z = \(\frac{x-\mu}{s/\sqrt{n}}\)

or

# z-score for testing the null hypothesis that the true mean is 45
((x-45)/(s/sqrt(n)))
## [1] 1.246508

Using the z-score Table, what is our p-value?

What does this mean, and how do we interpret this result?

This type of testing can be useful if you are interested in knowing whether your sample is representative of, or deviates from, a larger population. Note too that as with our previous examples, the t-score can be used in cases with a small sample size, and there are also variants of this analysis that can be used with proportional data.
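As a check on the table lookup, we can also calculate the p-value directly with pnorm; a minimal sketch for the two-sided alternative:

# z-score for the one-sample test of the null mean of 45
z <- (x-45)/(s/sqrt(n))

# two-sided p-value (area in both tails beyond the z-score)
2*(1-pnorm(abs(z)))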

Before we move on, it is important to take a moment to consider p-values. Traditionally, p-values have been used to determine whether or not a result is statistically significant. This has come to be seen as problematic for several reasons, including the arbitrary selection of \(\alpha\) values, and the potential for misinterpretation of results. As an alternative to this, some scientists now simply report p-values, without definitively rejecting or failing to reject hypotheses, and let the scientific community interpret the results and determine their importance.

Now that we’ve considered that, let’s look at one final example: testing for differences in means. Here we will use a t-test, which you may be familiar with and which is often referred to as the Student’s t-test. As you’ve likely noted from the many hours spent poring over the textbook, our math is becoming slightly more complex. Luckily for us, there happens to be a t.test function for us to use in R.

Let’s give it a whirl.

t.test(thaw$td[thaw$site=="wws"],
       thaw$td[thaw$site=="bfg"],
       var.equal = T,
       conf.level = 0.01)
## 
##  Two Sample t-test
## 
## data:  thaw$td[thaw$site == "wws"] and thaw$td[thaw$site == "bfg"]
## t = 1.4855, df = 162, p-value = 0.1393
## alternative hypothesis: true difference in means is not equal to 0
## 1 percent confidence interval:
##  2.733891 2.780489
## sample estimates:
## mean of x mean of y 
##  46.37624  43.61905
# or we can specify our variables a little differently,
# and change our confidence interval
# and our assumption about the variance
t.test(td~site,
       data = thaw,
       var.equal = F,
       conf.level = 0.05)
## 
##  Welch Two Sample t-test
## 
## data:  td by site
## t = -1.4511, df = 121.71, p-value = 0.1493
## alternative hypothesis: true difference in means is not equal to 0
## 5 percent confidence interval:
##  -2.876585 -2.637795
## sample estimates:
## mean in group bfg mean in group wws 
##          43.61905          46.37624
# we can even use this for one sample testing
t.test(thaw$td[thaw$site=="wws"],
       mu=45,
       conf.level = 0.01)
## 
##  One Sample t-test
## 
## data:  thaw$td[thaw$site == "wws"]
## t = 1.2465, df = 100, p-value = 0.2155
## alternative hypothesis: true mean is not equal to 45
## 1 percent confidence interval:
##  46.36237 46.39011
## sample estimates:
## mean of x 
##  46.37624


As the textbook described, we can use an Analysis of Variance (commonly referred to as ANOVA) to test for differences between three or more samples. Stated simply, an ANOVA tests whether three or more samples are drawn from different populations by quantifying whether the variance between sample groups is greater than the variance within sample groups. Functionally it is very similar to the t-test. To illustrate this, let’s perform a t-test and an ANOVA with our thaw depth data. Note that the function for ANOVA is aov, and that I am creating objects from each output. It is important to note that your grouping variable (site in the thaw data set) should be a categorical variable.
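Stated as an equation, the ANOVA test statistic is the ratio of the between-group to the within-group variance, \(F = \frac{MS_{between}}{MS_{within}}\); large values of F indicate that the differences between group means are large relative to the variability within groups.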

# perform a t-test assuming equal variance
t1 <- t.test(td~site,
          data = thaw,
          var.equal = T,
          conf.level = 0.05)

# perform an Analysis of Variance with the same data
t2 <- aov(td~site,
          data = thaw)

Examine the t-test results.

# examine the results
t1
## 
##  Two Sample t-test
## 
## data:  td by site
## t = -1.4855, df = 162, p-value = 0.1393
## alternative hypothesis: true difference in means is not equal to 0
## 5 percent confidence interval:
##  -2.873757 -2.640623
## sample estimates:
## mean in group bfg mean in group wws 
##          43.61905          46.37624

Examine the ANOVA results. Note we have to use the summary function to see the results, and the TukeyHSD function to look at differences between means.

# we have to use the summary function for the ANOVA results
summary(t2)
##              Df Sum Sq Mean Sq F value Pr(>F)
## site          1    295   294.9   2.207  0.139
## Residuals   162  21653   133.7
# to look at differences between means
# and related confidence intervals
# use the TukeyHSD function
TukeyHSD(t2,
         conf.level = 0.05)
##   Tukey multiple comparisons of means
##     5% family-wise confidence level
## 
## Fit: aov(formula = td ~ site, data = thaw)
## 
## $site
##            diff      lwr      upr     p adj
## wws-bfg 2.75719 2.640623 2.873757 0.1393494

How do these results compare to one another? Do both tests reach the same conclusion? What are the noticeable differences between the outputs?

Homework 4

The following questions will build on some of your code from Homework 2. You should create a new script for this assignment named hw4_script.R and save it in your GEOG250 project. You should copy relevant code from HW2 to get started, and be sure to correct any mistakes you made. Submit a Word/PDF document with your written answers, along with your R script, to Moodle before class on Tuesday Oct 27th.

In the textbook, the author discusses an example ANOVA that tests whether there are significant differences in precipitation from day to day in New York City. The underlying hypothesis is that air pollution associated with industrial emissions could generate more cloud condensation nuclei, which in turn could lead to more precipitation. The results indicate greater precipitation on Fridays and Saturdays, which could possibly be explained by a time lag in precipitation.

Most of the climate data we have been working with come from rural locations that likely do not experience systematic day-to-day variation in air pollutants throughout the week. To understand the problem better, we will analyze day-to-day variation in precipitation in our data set to see if there are any patterns.