This week we will continue working with our field data from Siberia and learn how examine and quantify relationships between variables.
To begin, let’s create a new script, fill in the appropriate header information, and set our working directory.
#####################
# correlation script
# GEOG250 F19
# MML 11/05/19
#####################
# set working directory to my folder on the server
setwd("Z:/Geog250_F19/loranty/")
# remember that you should be writing good descriptive comments here - these are your notes!
Last week we learned to calculate the covariance between two variables in order to qunatify the linear association between them. More specifically, we use the Pearson’s correlation coefficient (r) as a means of normalizing the covariance so that it scales from -1 to 1, and can be interpreted as the strength of the relatiohship between the two variables.
This week we will extend this process by generating an equation that describes the relationship between the variables of interest. It is important to note that regression analysis assumes a linear relationship between two variables: the dependent or reposnse variable (y), and the independent or explanatory variable (x). Stated plainly, we can use regression analysis to quantify how the magnitude of change in our dependent variable (y), given a change our independent variable (x). For example, how much will leaf mass change in response to an increase in shrub basal area. We should note that we will consider only simple linear regression, which considers two variables, as opposed to a multiple regression that may include more than one independent or explanatory variable.
For a linear relationship between two variables, we can use the following equation to represent the best fit line between the data points. You can think of this line as a model capable of predicting values of the dependent variable (\(\hat{y}\)), for a given observation of the independent variable (x), the intercept of the line (a), and the slope of the line (b).
\(\hat{y}\)=\(a + bx\)
We will not delve into the mathematics of regression analysis here, but there are several important points to consider related to regression analysis. To help illustrate this points we can look again at some of our own data. As we did last week, let’s read in our shrub allometry data set and create an object names shr
.
Next, subset the data to include only Alder shrubs, and create the following plot.
The first question we ought to consider is whether or not there appears to be a linear relationship between our variables? What do you think? If the answer is yes, then we may proceed, otherwise we could either transform our data to make it linear, or use a non-linear regression. We will not cover non-linear regression this semester, but it is good to know that this is an option that exists.
If our relationship appears linear, we can go ahead and perform a regression analysis in R using the lm
function.
aln.reg <-lm(aln$mass~aln$diameter)
Now we can use the summary
function to learn a bit more about our regression analysis, and the abline
function to add the regression line to our plot. Let’s step through the summary of our regression to understand what is happening.
In particular it is important to understand what the concept of minimizing the residuals means.
summary(aln.reg)
##
## Call:
## lm(formula = aln$mass ~ aln$diameter)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.921 -4.196 -1.060 3.753 14.126
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14.623 3.797 -3.851 0.00128 **
## aln$diameter 22.795 2.082 10.948 4.04e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.021 on 17 degrees of freedom
## Multiple R-squared: 0.8758, Adjusted R-squared: 0.8685
## F-statistic: 119.9 on 1 and 17 DF, p-value: 4.042e-09
plot(aln$diameter,aln$mass,
pch = 16,
xlab = "Basal Diameter (cm)",
ylab = "Leaf Mass (g)")
abline(aln.reg,lty="dashed")
Let’s look at how our predictions compare to the observed values.
summary(aln.reg)
##
## Call:
## lm(formula = aln$mass ~ aln$diameter)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.921 -4.196 -1.060 3.753 14.126
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14.623 3.797 -3.851 0.00128 **
## aln$diameter 22.795 2.082 10.948 4.04e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.021 on 17 degrees of freedom
## Multiple R-squared: 0.8758, Adjusted R-squared: 0.8685
## F-statistic: 119.9 on 1 and 17 DF, p-value: 4.042e-09
plot(aln$mass,predict(aln.reg),
pch = 16,
xlab = "Observed Leaf Mass (g)",
ylab = "Predicted Leaf Mass (g)")
abline(a = 0, b = 1)
We can do the same thing for our Betula data. In thise case the regression look quite good, however the data are slightly non-linear at low values.
##
## Call:
## lm(formula = bet$mass ~ bet$diameter)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4075 -0.6924 -0.0194 0.4416 5.5294
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.5645 0.7492 -3.423 0.00244 **
## bet$diameter 7.6208 0.8516 8.949 8.75e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.583 on 22 degrees of freedom
## Multiple R-squared: 0.7845, Adjusted R-squared: 0.7747
## F-statistic: 80.08 on 1 and 22 DF, p-value: 8.746e-09
So, here we might try to log-transform our data. What difference does this make?
bet <- shr[shr$species=="bet",]
plot(bet$diameter,log(bet$mass),
pch = 16,
xlab = "Basal Diameter (cm)",
ylab = "log Leaf Mass (g)")
abline(lm(log(bet$mass)~bet$diameter))
summary(lm(log(bet$mass)~bet$diameter))
##
## Call:
## lm(formula = log(bet$mass) ~ bet$diameter)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7400 -0.3205 0.0353 0.3392 0.6865
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.7861 0.1874 -4.195 0.000375 ***
## bet$diameter 2.0799 0.2130 9.764 1.86e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.396 on 22 degrees of freedom
## Multiple R-squared: 0.8125, Adjusted R-squared: 0.804
## F-statistic: 95.33 on 1 and 22 DF, p-value: 1.861e-09
This week we are going to branch out a bit and use a publicly available data set to do the following:
Download this dataset describing soil properties and permafrost thaw depths from the Arctic Data Center website (hint - you can use the read.table
function and the data file URL to read the data directly into R)
Aggregate the data by site.
Calculate a linear regression to describe the relationship between soil organic layer depth and permafrost thaw depth.
Create a scatter plot of your data that includes the best fit regression line.
Write a brief paragraph the describes the results of your analysis. This should include key regression parameters and statistics, a copy of your plot, and a brief discussion of the relationship between your data. For the discussion, you should consider why these variables may or may not be related, and what other important factors could be missing. (See this paper for help)
In addition see the following document outlining your class research project. Before class next week you should articulate a draft of your research question.
This assignment is due by noon on Tuesday 19 Nov.