GEOG 250: Research Methods in Geography, Colgate University

Learning objectives

1. Learn to use RStudio Cloud

2. Introduction to R

3. Handling data in R

Introduction to R & RStudio Cloud

R is a powerful set of tools for manipulating, analyzing, and visualizing data. These tools are open source, which means that they are and always will be free to use, and they are developed by a worldwide community of users. The R programming language is at the core of these tools. Using the R language you can write commands that will tell your computer what you want it to do with your data. While this may seem a bit daunting there are a number of advantages associated with using tools like R to work with data, instead of using spreadsheet programs where you point and click, for example. In particular you can perform complex calculations with large amounts of data very quickly, and more importantly you can save a record of your commands to make everything you do transparent and reproducible. This is very important in science!

For our class we are going to use RStudio Cloud to access R. RStudio is something we call an Integrated Development Environment, or IDE for short. It is important to remember that R is the programming language and computer platform, you can use R by itself, but you cannot use RStudio without R. RStudio is simply a nicer interface for interacting with R that includes handy features for keeping your work space organized. As you might have guessed, there is a desktop version of RStudio you can install on your computer, or a cloud-based version that we’ll be using. Let’s get started.

This semester we will use a shared work space that I’ve created for our class. Click the link I’ve shared with you to join our class work space. You should see something like this:



Click the START button to create a copy of the project for yourself. It might take a moment to load the project, and then you will see the RStudio IDE.



There are a few important things to point out here. On the left side of the window you will see the Console. This is where we type commands that tell R what to do. In the upper right is the Environment. Once we start working with different data it will appear here. In the bottom right is the Files window. It only has some basic files associated with our project, but as we add data, scripts, and create graphs they will appear here. Notice in all of these panes there are tabs at the top that allow to see other important things as well. We’ll learn more about these a little later.

Using R to create and manipulate data

Let’s get a feel for how R works more. Since R is just like a sophisticated calculator, you can type in numerical operations and will get the calculations as output. Type a few different operations like the one below. Note I’ve included both the code and outputs here for an example.

# remember that this is a comment that R will ignore
# it's extremely important to use comments for documenting your code and writing notes
# let's calculate 6 raised to the 6 power
6^6
## [1] 46656
#234 + 8
234+8
## [1] 242
#15-23
15-23
## [1] -8

Because R is a programming language, we need to learn its syntax in order for the computer to understand us. For example try entering the following two pieces of code.

# if you enter an incomplete command R will show '+' until you finish it
5-
  
# R will give you an error message if it doesn't understand the command you've entered
3 % 5 
## Error: <text>:5:3: unexpected input
## 4: # R will give you an error message if it doesn't understand the command you've entered
## 5: 3 % 5 
##      ^

There’s a few things to pay attention to in the console. The > symbol always indicates a new line of code. The + symbol means your line is continuing. Your results will have numbers in brackets to describe the output. [1] Means I have a vector and my output starts on the first element of my vector (in this case its a vector of one).

You can also create objects with assigned names. Below is an example where I know I will want to use the number 2446 many times so I will give it a shorter name. R conducts all operations on that number. The <- symbol means you are assigning an object a variable name.

# name my number
a <- 6234
# multiply my number by 5
a*5
## [1] 31170
# divide my number by 3
a/3
## [1] 2078

In data can be stored in a vector, which is a sequence of data elements with the same basic type. We will talk about data types more letter, so for now let’s just think of data as numbers. R automatically does vector operations, meaning that you can assign two numbers to a variable name and do math operations on them. First you have to let R know you are making a vector of multiple numbers using c(,…,)

# make a vector of numbers
b <- c(2395,82,2947)
#divide all numbers by 2
b/2
## [1] 1197.5   41.0 1473.5

Scripts

We can use an R script to keep a record of our calculations and any other things we do with our data in R. A script is essentially a text file with a record all of our commands in it. In an R script a hashtag # tells R to ignore that particular line. This is called commenting, and it is useful for leaving notes for yourself and others in your script. In fact, commenting is an important aspect of reproducibility, and you will be graded on how well your code is commented.

To make a script in RStudio click on File > New File > R Script. By default this will open in the upper left corner of RStudio, above the console. Before we do anything, save your script you your project directory with the file name 01_intro. Why do we choose this filename?



Now you can type commands in your script and have a record of what we are doing. Try typing a command - what happens? Note that commands will not execute after you hit enter, instead your cursor will just go to the next line. There are a few ways to execute commands in a script. You can press the Run button in the upper right of your script window, press CTRL ENTER - either of these options will execute only the line that your cursor is on. You can highlight multiple lines, and use either of these methods to execute them. Pressing the Source button will run every command in your script.


More on working with data in R

Vectors

Common types of objects in R include vectors, matrices, and dataframes. Last class, we briefly learned about creating vectors in R. You assign a name to an object using the arrow: <- and can make a vector using c(). If you remember, R is vectorized so we can automatically apply vector or matrix math to an object. Running the name of an object will print out the object.

#make a vector of tree heights in meters
heights <- c(30,41,20,22)
#convert to cm
heights_cm <- heights*100
heights_cm
## [1] 3000 4100 2000 2200

R uses a very specific notation around vectors. It is helpful for subsetting. Anytime you see square brackets [ ] recognize that this is subsetting an object in R. Also note that R always counts starting at one (no zeros like other programming languages!). Whenever you see a colon :, know that you are indexing through multiple values.

#look at the first tree height
heights[1]
## [1] 30
#look at the 2nd and 3rd tree heights
heights[2:3]
## [1] 41 20

Matrices

I like to think of matrices as essentially multiple columns of vectors in R. You can set up a matrix using the matrix() function. Functions are built in to R and will always have parenthesis () following the function name. Anything inside of the parentheses is called an argument. Here to build a matrix, you need a vector of numbers that fills in the matrix and specify the number of columns or rows. You can also specify how to fill in the matrix (do you go through all columns in each row?). If you ever aren’t sure of arguments for a function, you can use help or ?.

#get more info on the matrix function
help(matrix)
## starting httpd help server ... done

Now, let’s make some example matrices. Note the differences in filling in by column vs row.

#set up a matrix with 2 columns and fill in by rows
#first argument is the vector of numbers to fill in the matrix
Mat<-matrix(c(1,2,3,4,5,6), ncol=2, byrow=TRUE)
Mat
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    6
#set up a matrix that fills in by columns
#first argument is the vector of numbers to fill in the matrix
Mat.bycol<-matrix(c(1,2,3,4,5,6), ncol=2, byrow=FALSE)
Mat.bycol
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

Notice when the both the matrices are run, the output now has square bracket labels with two spots. This notation will always be formatted [row,column]. You can use this notation to refer to specific spots in a matrix using the name of the matrix.

#subset the matrix to look at row 1, column2
Mat.bycol[1,2]
## [1] 4

An opening in a square bracket after or before a comma means refer to all items in that slot. You’ll notice the output is a vector when subsetting.

#look at all values in row 1
Mat.bycol[1,]
## [1] 1 4
#look at all values in column 2
Mat.bycol[,2]
## [1] 4 5 6

Functions in R

There are a wide variety of functions available to use in R. Functions generally perform some sort of mathematical operation, and can be fairly sophisticated. Functions are followed by parentheses, and you typically write and argument inside the parentheses. For example, imagine that we have a vector of numbers from 1 to 6, and we want to find the mean of these numbers. It turns out there is a mean function, and if we supply our vector as the argument R will calculate the mean of all numbers in the vector.

mean(1:6)
## [1] 3.5

This is convenient, but you may be asking how do I know what functions exist and what the arguments are? For that we can use the help function in R. If you know which function you want to use you can see the documentation in the help window by typing a question mark followed by the function name in the console.

?round

Reading data from a file

Most data that you will work with is too large to be typed into R scripts. A lot of the data will be available in spreadsheet or text file format. You will want to work with this type of data as a dataframe. Dataframes are are matrices where columns have names and rows contain observations for the same entity. This works a lot like a spreadsheet, but we can apply our vector and matrix utilities in R. For example, we will read in a data file for weather data where columns contain data for temperature and precipitation, and each row has an observation for the same day and weather station. You may be used to working with data in excel spreadsheets with a .xlsx extension that is proprietary to Microsoft. This is not ideal to work with, and we will work with .csv (comma separated values file). This file contains the text version of an excel spreadsheet where the values for each cell are included in the file and the designation of a new cell starts with a comma. You can always resave an .xlsx file as a .csv. Let’s read in a file.

To begin, in your file window, click the New Folder icon and create a new folder named data. Next, click the upload button, and then the Choose File button, and navigate to the thaw_depth.csv file that I sent you.

This data file contains measurements of permafrost thaw depth for several field sites in northeastern Siberia. Permafrost is perennially frozen ground that occurs throughout the Arctic and sub-Arctic. During the summer each year the top soil layers thaw, and these measurements quantify the depth of this thawed layer. These measurements can help understand permafrost thaw, which has has important climate change implications



We can use the read.csv() function to read in data files that are in csv format. (Note that we can use the read.table() function to read other types of files. All of these functions require the the file name as an argument. Here we have to tell R that it is located within the data folder in my working directory.

thaw <- read.csv("data/thaw_depth.csv")

Use the help function to find out what additional arguments you might use with the read.csv function.

We can use the head and tail functions to look at the first and last few rows of this dataset, respectively.

head(thaw)
##   site   date transect location td tussock
## 1  bfg 8/4/19        1        0 42       0
## 2  bfg 8/4/19        1        1 50       0
## 3  bfg 8/4/19        1        2 51       0
## 4  bfg 8/4/19        1        3 43       0
## 5  bfg 8/4/19        1        4 36       0
## 6  bfg 8/4/19        1        5 19       0
tail(thaw)
##     site   date transect location td tussock
## 159  wws 8/8/19        1       95 68       0
## 160  wws 8/8/19        1       96 48       0
## 161  wws 8/8/19        1       97 53       0
## 162  wws 8/8/19        1       98 39       0
## 163  wws 8/8/19        1       99 41       0
## 164  wws 8/8/19        1      100 50       0

We can use matrix notation to access data elements within an object: td[row,column]

# What is the value in the row 1 column 1
thaw[1,1]
## [1] "bfg"
# Or the values in rows 1-5 of column 5
thaw[1:5,5]
## [1] 42 50 51 43 36

Notice here that we have different types of data. Data in the first column is text-based, while the fifth column is numeric. It turns out our data stored as a special kind of matrix called a dataframe. We can get some more information about it using the str() function.

str(thaw)
## 'data.frame':    164 obs. of  6 variables:
##  $ site    : chr  "bfg" "bfg" "bfg" "bfg" ...
##  $ date    : chr  "8/4/19" "8/4/19" "8/4/19" "8/4/19" ...
##  $ transect: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ location: int  0 1 2 3 4 5 6 7 8 9 ...
##  $ td      : int  42 50 51 43 36 19 41 35 38 30 ...
##  $ tussock : int  0 0 0 0 0 0 0 0 0 0 ...

There are a couple of important things to note here. The first is that with a data frame we can have different types of data. For example try converting our td data frame to a matrix using the as.matrix() function. What happens when you do this? Why do you think that is?

as.matrix(thaw)

Second, you can access variables, or columns within a data frame using the following syntax.

thaw$td
##   [1] 42 50 51 43 36 19 41 35 38 30 45 28 30 37 58 47 25 33 36 54 54 33 50 53 50
##  [26] 48 42 36 41 57 32 42 60 66 51 45 44 38 21 41 60 64 41 44 69 57 32 27 43 45
##  [51] 30 53 60 33 52 30 31 61 64 33 50 65 22 65 55 49 56 39 47 56 58 48 42 45 42
##  [76] 33 57 40 48 57 32 47 56 41 48 53  6 57 35 53 49 31 50 40 40 52 57 39 39 50
## [101] 56 45 40 50 61 48 53 48 33 24 44 57 20 60 57 48 57 57 26 42 46 58 49 51 42
## [126] 54 60 40 56 48 48 57 41 41 69 37 39 49 61 38 55 36 39 42 50 57 40 47 44 36
## [151] 29 65 45 13 58 27 46 34 68 48 53 39 41 50

Another handy thing that R can do is evaluate logical statements. For example, what happens when we look for thaw depths that are less than 50?

thaw$td<50

What does this output mean?

What happens when we put this logical statement inside matrix notation?

thaw[thaw$td<50,]

Here is a list of logical operators in R.

  <   less than
  <=  less than or equal to
  >   greater than 
  >=  greater than or equal to
  ==  exactly equal to
  !=  not equal to
  !x  not x 
  x|y x or y
  x&y x and y
  


It is also possible to visualize data in R. This is another important part of the data analyses we will be learning this semester. Let’s plot thaw depth values along the transect at the wws site.

plot(thaw$location[thaw$site=="wws"],thaw$td[thaw$site=="wws"])

This is nice, but we need to clean it up a bit by adding axis labels, specifying units, and making things a bit easier to see.

Homework 2

For this homework we will begin to explore a weather data set that we will use in the coming weeks as we learn about probability. Please answer the following questions. For this homework, you should create a new ‘homework2’ script in your RStudio Cloud work space. Your script should be well commented, easy to understand, and not contain any extraneous or unnecessary code. For questions requiring a narrative response, record your answers in a separate document that you will submit to Moodle as an MS word or PDF document.


Before we go any further, we are going to want to change the date format from a factor so it is a little more useful.

#specify a column with a proper date format
#note the format here dataframe$column
datW$dateF <- as.Date(datW$DATE, "%Y-%m-%d")
#google date formatting in r to find more options and learn more


We also will find it helpful to have a year column that is numeric for working with the data. You can use the format function. This will output a date formatted to just include year in a character data type. Since this data is simply a numeric date, we will indicate that it should actually be numeric using the as.numeric function. Note how R lets you nest functions here.

#create a date column by reformatting the date to only include years
#and indicating that it should be treated as numeric data
datW$year <- as.numeric(format(datW$dateF,"%Y"))

Notice again that we used data frame notation to add a new variable, or column, to our data set. You may have figured out that our data set includes daily maximum and minimum temperatures. Average daily temperature is often more helpful to evaluate temperature, let’s calculate that before we move on. The average is always halfway between the minimum and maximum.


Now we can use the aggregate function to calculate means of TAVE for each site. To do this we use the site name as an indexing value.

#get the mean across all sites
#the by function is a list of one or more variables to index over.
#FUN indicates the function we want to use
#if you want to specify any function specific arguments use a comma and add them after the function
#here we want to use the na.rm arguments specific to 
averageTemp <- aggregate(datW$TAVE, by=list(datW$NAME), FUN="mean",na.rm=TRUE)
averageTemp
#change the automatic output of column names to be more meaningful
#note that MAAT is a common abbreviation for Mean Annual Air Temperature
colnames(averageTemp) <- c("NAME","MAAT")
averageTemp