GEOG 250: Research Methods in Geography, Colgate University

Learning objectives

1. Handling data in R

2. Intro to probability and summary statistics

3. Characterizing weather events

Continued work with climate data in R

Let’s begin by reloading our climate data from HW2 and converting it to an easy-to-use format.

#read our file as a dataframe
datW <- read.csv("data/2011124.csv", stringsAsFactors = T)

#specify a column with a proper date format
#note the notation for referencing a column: dataframe$column
datW$dateF <- as.Date(datW$DATE, "%Y-%m-%d")

#create a date column by reformatting the date to only include years
#and indicating that it should be treated as numeric data
datW$year <- as.numeric(format(datW$dateF,"%Y"))

Descriptive Statistics and Histograms

Let’s run some descriptive statistics on the weather data. If you look at the snapshot of the metadata below, you will find information about the observations in each column. Note that I downloaded the data with the metric option.

Let’s start by looking at the data using some basic summary statistics. Recall that the mean is a measure of central tendency of our data. You have probably calculated it at some point by adding all of your observations and dividing by the total number of observations. In some fields, this may be referred to as an average for samples, and mean may be used more specifically for a probability distribution (more on that later!). There is a built-in function in R to calculate a mean. We will also want to calculate the standard deviation. Recall that this measures the spread of the observations around the mean, and keeps the units the same as the mean.

We could start by looking at the mean of a single site using the mean function. We can check all site names using the levels function. We can subset to a single site by using square brackets and indicating that we want a vector of all TMAX values where the NAME is equal to ABERDEEN, WA US.

#find out all unique site names
levels(datW$NAME)
## [1] "ABERDEEN, WA US"                  "LIVERMORE, CA US"                
## [3] "MANDAN EXPERIMENT STATION, ND US" "MORMON FLAT, AZ US"              
## [5] "MORRISVILLE 6 SW, NY US"
#look at the mean maximum temperature for Aberdeen
mean(datW$TMAX[datW$NAME == "ABERDEEN, WA US"])
## [1] NA

You get an NA value here. That’s because there is missing data in this data set. NA is a special value indicating that an observation is missing and we should not expect a value. R handles NA differently from other values: it is neither a number nor a character. Luckily there is an argument in mean that allows us to ignore NAs in the calculation.

#look at the mean maximum temperature for Aberdeen
#with na.rm argument set to TRUE to ignore NA
mean(datW$TMAX[datW$NAME == "ABERDEEN, WA US"], na.rm=TRUE)
## [1] 14.68515

Now you will see the average daily maximum temperature in Aberdeen is 14.68 °C. Since this is the average across many decades of observations, this value can help tell us about the climate of the site. Before we move on, note that the average daily temperature is often more useful for evaluating temperature than the maximum alone. We will approximate it as the value halfway between the daily minimum and maximum, and calculate it as follows:

#calculate the average daily temperature
#This temperature will be halfway between the minimum and maximum temperature
datW$TAVE <- datW$TMIN + ((datW$TMAX-datW$TMIN)/2)

Let’s take a look at how often different temperature values are observed at each site. You’ll use a graphical tool called a histogram. A histogram shows the frequency of temperature observations in different bins.

#make a histogram for the first site in our levels
#main= is the title name argument.
#Here you want to paste the actual name of the factor not the numeric index
#since that will be more meaningful. 
hist(datW$TAVE[datW$NAME == "ABERDEEN, WA US"],
        freq=FALSE, 
        main = paste(levels(datW$NAME)[1]),
        xlab = "Average daily temperature (degrees C)", 
        ylab="Relative frequency",
        col="grey50",
        border="white")

To get a better idea of how the summary statistics describe the data, let’s add lines to the plot to designate where the mean and standard deviation are located. Note we’ll use the standard deviation function, which R abbreviates as sd. The abline function allows us to add lines to a plot. The v argument in this function adds a vertical line. I’ll add a solid red line for the mean and dotted red lines one standard deviation above and below the mean.

#make a histogram for the first site in our levels, Aberdeen
#main= is the title name argument.
hist(datW$TAVE[datW$NAME == "ABERDEEN, WA US"],
        freq=FALSE, 
        main = paste(levels(datW$NAME)[1]),
        xlab = "Average daily temperature (degrees C)", 
        ylab="Relative frequency",
        col="grey50",
        border="white")
#add mean line with red (tomato3) color
#and thickness of 3
abline(v = mean(datW$TAVE[datW$NAME == "ABERDEEN, WA US"],na.rm=TRUE), 
                col = "tomato3",
                lwd = 3)
#add standard deviation line below the mean with red (tomato3) color
#and thickness of 3
abline(v = mean(datW$TAVE[datW$NAME == "ABERDEEN, WA US"],na.rm=TRUE) - sd(datW$TAVE[datW$NAME == "ABERDEEN, WA US"],na.rm=TRUE), 
              col = "tomato3", 
              lty = 3,
              lwd = 3)
#add standard deviation line above the mean with red (tomato3) color
#and thickness of 3
abline(v = mean(datW$TAVE[datW$NAME == "ABERDEEN, WA US"],na.rm=TRUE) + sd(datW$TAVE[datW$NAME == "ABERDEEN, WA US"],na.rm=TRUE), 
              col = "tomato3", 
              lty = 3,
              lwd = 3)

You’ll notice that one standard deviation encompasses much of the data. If we went 2 standard deviations out, almost all of the data would be included.
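
If you’d like to check this empirically, here is a quick sketch; aberdeen.tave is a name I’m introducing for convenience:

#store the Aberdeen average temperatures in a named vector
aberdeen.tave <- datW$TAVE[datW$NAME == "ABERDEEN, WA US"]
#proportion of observations within 1 standard deviation of the mean
mean(abs(aberdeen.tave - mean(aberdeen.tave, na.rm=TRUE)) <= sd(aberdeen.tave, na.rm=TRUE), na.rm=TRUE)
#proportion of observations within 2 standard deviations of the mean
mean(abs(aberdeen.tave - mean(aberdeen.tave, na.rm=TRUE)) <= 2*sd(aberdeen.tave, na.rm=TRUE), na.rm=TRUE)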

Probability Distributions

Discrete probability distributions

Let’s return to our thaw depth data set for a moment; it happens to include a discrete variable. Let’s read it in and have a look. Which variable in this data set is discrete?

thaw <- read.csv("data/thaw_depth.csv") 

You may realize that our discrete variable only occurs at one field site represented in this data set (wws), so before we can go further we need to create a subset. Let’s do that now using matrix notation and the site variable in a logical statement. Define your data subset as an object called w.thaw.
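
If you get stuck, a minimal sketch might look like the following; note this assumes the site column in thaw is named site:

#subset to the wws site using a logical statement in square brackets
#note: this assumes the site column is named site
w.thaw <- thaw[thaw$site == "wws",]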

For the sake of simplicity let’s assume that our variable, tussock, which indicates the presence or absence of a tussock at our sampling location, has a probability of 0.5 for either outcome. In this case we are interested in knowing where tussocks are co-located with our thaw depth measurements along the sampling transects. You can see in the photo below that tussocks create an accumulation of soil that sits above the surrounding vegetation, which might affect our measurements.

Cross section of a beautiful Siberian tussock in the wild! (photo M. Loranty)

If we assume that the probability (p) of tussock occurrence is constant, and that tussocks occur independently of each other, then tussock is a binomial variable that describes a binomial process. Subsequently our data can be described using a binomial distribution, which we can use to find the probability that particular outcomes will occur. We won’t go through the mathematics of these calculations here, but you should know that in order to determine the probability of a particular outcome (e.g. observing 25 tussocks) we need only the total number of independent trials (n) and the probability of success (p), which is the same for every trial. So in our case we want to know the likelihood of observing 25 or fewer tussocks given p = 0.5 and n = 101. The dbinom function returns the probability of exactly k successes, so summing it over 0 to 25 gives the cumulative probability of 25 or fewer. We can compute this in R as follows.

sum(dbinom(0:25,101,0.5))
## [1] 1.861571e-07

In this case the probability that we will have 25 or fewer out of 101 measurements with tussocks is quite low, which makes sense: given the probability of 0.5, our expected value or theoretical mean would be 101 x 0.5, or 50.5.
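
If instead you want the probability of exactly 25 tussocks, dbinom evaluated at a single value gives that directly, and pbinom returns the same cumulative probability we just computed:

#probability of exactly 25 tussocks out of 101 trials
dbinom(25,101,0.5)
#cumulative probability of 25 or fewer tussocks
#equivalent to sum(dbinom(0:25,101,0.5))
pbinom(25,101,0.5)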

Calculate the probability of having 50 tussocks out of 101 measurements. What is the answer?

We can also use the rbinom function to create a random data set from our distribution, and plot it as a histogram.

#create a histogram for a randomly generated binomial data set
#breaks from 0 to 101 cover the full range of possible outcomes
hist(rbinom(101,101,0.5),breaks = 0:101,
     xlab = "Number of tussocks",main="")

# add a line showing the expected counts from the binomial distribution
# (the probability of each outcome multiplied by the number of draws)
lines(1:101,dbinom(1:101,101,0.5)*101,
      lwd = 2, lty = "dashed")

Continuous probability distributions

Let’s return to the temperature data. The distribution we viewed earlier has a very particular shape. Temperature observations are most frequent around the mean, and we rarely observe data 2 standard deviations from the mean. The distribution is also symmetrical. We can describe this formally with a probability distribution. Probability distributions have many useful mathematical properties, and we use parameters to describe the shape of a data distribution. This temperature data follows a normal distribution. This type of distribution is very common and relies on two parameters, the mean and standard deviation, to describe the data. Below is an image of the normal distribution where zero represents the mean of the data and tick marks are designated in standard deviations.

Probability distributions all have functions in R. Below, I show code for using the dnorm function to calculate the probability density for a range of temperature values and add it to the plot.

#make a histogram for the first site in our levels
#main= is the title name argument.
#Here you want to paste the actual name of the factor not the numeric index
#since that will be more meaningful. 
#note I've named the histogram so I can reference it later
h1 <- hist(datW$TAVE[datW$NAME == "ABERDEEN, WA US"],
        freq=FALSE, 
        main = paste(levels(datW$NAME)[1]),
        xlab = "Average daily temperature (degrees C)", 
        ylab="Relative frequency",
        col="grey50",
        border="white")
#the seq function generates a sequence of numbers that we can use to plot the normal across the range of temperature values
x.plot <- seq(-10,30, length.out = 100)

#the dnorm function will produce the probability density based on a mean and standard deviation
y.plot <- dnorm(x.plot,
             mean(datW$TAVE[datW$NAME == "ABERDEEN, WA US"],na.rm=TRUE),
             sd(datW$TAVE[datW$NAME == "ABERDEEN, WA US"],na.rm=TRUE))

#scale the normal density so that its maximum matches the maximum of the histogram density
#!!! this is helpful for putting multiple things on the same plot
#!!! It might seem confusing at first. It simply makes the maximum value the same for both curves on the plot. Here both share zero as a minimum.
y.scaled <- (max(h1$density)/max(y.plot)) * y.plot
   
#points function adds points or lines to a graph  
#the first two arguments are the x coordinates and the y coordinates.

points(x.plot,
       y.scaled, 
       type = "l", 
       col = "royalblue3",
       lwd = 4, 
       lty = 2)

You can now see the blue dashed line overlaid on the histogram. This is the normal distribution using the mean and standard deviation calculated from the data. You’ll notice the normal distribution does a good job of modeling our data. Sometimes it underestimates a data range and at other points it overestimates it, but overall it mirrors the distribution of our data. This means we can rely on properties of the normal to help describe our data statistically!

Now let’s examine how we can use probability to think about the occurrence of different air temperature ranges. For a given value of the data, the normal distribution has a probability density associated with observing that value. On its own, the probability density at a single value is not a probability, and we can’t really do anything with that particular number. However, when the normal distribution is integrated across a range of values, it yields the probability that values in that range occur. For those of you that haven’t had calculus, integrating is essentially taking the area under the curve between a range of numbers. We have to keep in mind that the range of the normal distribution extends from -\(\infty\) to \(\infty\). Let’s start by taking a look at all values below freezing in the normal distribution for our Aberdeen weather data. Technically this is the probability of all temperatures from zero down to -\(\infty\). Functionally we know some low temperatures would be impossible to observe on Earth, and the probability of observing values closer to -\(\infty\) will be minuscule. Below is a graphical representation of the area under the curve that describes the probability of observing a daily temperature below zero.

Luckily we don’t have to do any of the work of calculating the probability. R has a built-in suite of functions for working with probability distributions. Below is an image of the documentation for all functions related to the normal distribution. Pull up the documentation for dnorm to see them all:

help(dnorm)
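
The naming convention is consistent across distributions in R: d gives the probability density, p gives the cumulative probability, q gives the quantile, and r generates random draws. For the normal distribution:

#dnorm() - probability density at a value
#pnorm() - cumulative probability (area under the curve at or below a value)
#qnorm() - quantile: the value associated with a cumulative probability
#rnorm() - random draws from the distribution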

R uses p to designate cumulative probability. Let’s calculate the probability of below-freezing temperatures. Don’t forget that probabilities always range from 0 to 1. We would have to integrate across -\(\infty\) to \(\infty\) to get a probability of 1 in the normal.

#pnorm(value to evaluate at (note this will evaluate for all values and below),mean, standard deviation)
pnorm(0,
      mean(datW$TAVE[datW$NAME == "ABERDEEN, WA US"],na.rm=TRUE),
      sd(datW$TAVE[datW$NAME == "ABERDEEN, WA US"],na.rm=TRUE))
## [1] 0.01682526

You can see below-freezing temperatures are rare at this site; we only expect them to occur less than 2% of the time. I can take advantage of the properties of the distribution and add and subtract areas under the curve to better tailor my ranges of numbers. For example, I might be interested in identifying how often temperatures between 0 and 5 degrees occur.

#pnorm with 5 gives me all probability (area under the curve) below 5 
pnorm(5,
      mean(datW$TAVE[datW$NAME == "ABERDEEN, WA US"],na.rm=TRUE),
      sd(datW$TAVE[datW$NAME == "ABERDEEN, WA US"],na.rm=TRUE))
## [1] 0.1343358

However, if I subtract the area below 0 from that number, I will get the probability of temperatures in the range of 0-5.

#subtracting the area below 0 from the area below 5
#gives the probability of temperatures between 0 and 5
pnorm(5,
      mean(datW$TAVE[datW$NAME == "ABERDEEN, WA US"],na.rm=TRUE),
      sd(datW$TAVE[datW$NAME == "ABERDEEN, WA US"],na.rm=TRUE))- pnorm(0,
      mean(datW$TAVE[datW$NAME == "ABERDEEN, WA US"],na.rm=TRUE),
      sd(datW$TAVE[datW$NAME == "ABERDEEN, WA US"],na.rm=TRUE))
## [1] 0.1175105

Now let’s evaluate the probability of high temperatures. Knowing that the entire distribution adds up to 1, we can also find the area above a value. For example, let’s look at the probability of temperatures above 20 degrees C.

#pnorm of 20 gives me all probability (area under the curve) below 20 
#subtracting from one leaves me with the area above 20
1 - pnorm(20,
      mean(datW$TAVE[datW$NAME == "ABERDEEN, WA US"],na.rm=TRUE),
      sd(datW$TAVE[datW$NAME == "ABERDEEN, WA US"],na.rm=TRUE))
## [1] 0.02569572
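
Note that pnorm also has a lower.tail argument; setting lower.tail = FALSE returns the area above the value directly, which should give the same result as subtracting from one:

#equivalent to 1 - pnorm(20, ...)
pnorm(20,
      mean(datW$TAVE[datW$NAME == "ABERDEEN, WA US"],na.rm=TRUE),
      sd(datW$TAVE[datW$NAME == "ABERDEEN, WA US"],na.rm=TRUE),
      lower.tail = FALSE)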

The qnorm function does the reverse of pnorm: it returns the value associated with a given cumulative probability, meaning the probability of observing values at or below it equals that probability. Let’s use this to evaluate extreme weather events. Let’s assume everything that occurs less than 10% of the time, either hot or cold (so anything above the 95th percentile or below the 5th percentile), is unusual. Let’s examine where unusually high temperatures in Aberdeen start:

#qnorm gives the temperature at the 95th percentile
#temperatures above this value occur less than 5% of the time
qnorm(0.95,
      mean(datW$TAVE[datW$NAME == "ABERDEEN, WA US"],na.rm=TRUE),
      sd(datW$TAVE[datW$NAME == "ABERDEEN, WA US"],na.rm=TRUE))
## [1] 18.51026
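
The same approach works for the cold tail; using 0.05 returns the temperature below which unusually cold days fall:

#qnorm gives the temperature at the 5th percentile
#temperatures below this value occur less than 5% of the time
qnorm(0.05,
      mean(datW$TAVE[datW$NAME == "ABERDEEN, WA US"],na.rm=TRUE),
      sd(datW$TAVE[datW$NAME == "ABERDEEN, WA US"],na.rm=TRUE))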

Homework 3

Note: Throughout all of my code examples, you’ll notice that I continued to copy and paste the same code for calculating the mean for Aberdeen: mean(datW$TAVE[datW$NAME == "ABERDEEN, WA US"],na.rm=TRUE). While I did this to help you remember what was going into the function, it gets confusing and messy in longer code. This is a perfect example of why we name variables (with short, clear names!) to refer to later on. As this course progresses and you get more comfortable with R, we’ll continue to work on writing clean code.
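
For example, we did this once above when we created the aberdeen.tave vector; reusing a named vector makes the qnorm call much easier to read:

#reuse the named vector in place of the repeated subsetting code
qnorm(0.95, mean(aberdeen.tave, na.rm=TRUE), sd(aberdeen.tave, na.rm=TRUE))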

Submit a Word/PDF document with your written answers, and your R script, to Moodle before class on Tuesday.