GEOG 331: Environmental Data Science, Colgate University

Instructions

There are 3 questions in this activity. Save your answers in a Word document and hand it in on Moodle as a .pdf file. Keep your script file in your GitHub folder and make sure that all changes are pushed to GitHub. You will include a link to this file as part of your final question in this activity.

Learning objectives

1. Learn about file systems

2. Learn how to apply version control

3. Introduction to R

Files in Computers

An essential part of working with data is thinking about how it’s stored on the computer. Chapters 1 and 2 cover some of the fundamentals of data in a computing environment. Here we will focus on how you will interface with data in R. In this class, you will use files both from the internet and stored locally. A good first step is to understand how to tell R to find a file on your computer or the server.

Click on the folder icon in the start bar to open Windows Explorer and navigate around folders on the computer. Click on the Documents folder in the left navigation bar. Right click to make a new folder and title it GitHub.

We don’t always have to click to find things in folders. Computers can reference where files are located using what’s called a file path. A file path references where a file is located, starting from the drive on the computer and listing each successive folder all the way to the folder the file is located in. You can click on the drop down at the top of Windows Explorer to display the full file path as seen below.

Note the syntax for this is slightly different on a Mac, but the concept is similar.
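To make the idea of a file path concrete, here is a minimal sketch in R. It assumes a hypothetical folder layout (Documents, then GitHub, then GEOG331); the folders on your computer may be named or located differently, so treat the path as an illustration only.

```r
# Build a path from folder names (hypothetical layout, adjust to your machine)
# file.path() joins the pieces with "/" as the separator
project_path <- file.path("Documents", "GitHub", "GEOG331")
print(project_path)
## [1] "Documents/GitHub/GEOG331"

# Ask R which folder it is currently working in (the working directory)
getwd()

# Check whether the folder exists relative to the working directory
# (returns TRUE or FALSE depending on your computer)
dir.exists(project_path)
```

R accepts forward slashes in file paths on both Windows and Mac, which is one reason a helper like file.path() can be convenient.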




Intro to Git

Git and the user interface GitHub allow you to keep track of all changes to your code. You just have to indicate points where you want to save the state of your code and give each one a label. Here, we’ll go through the basics of Git and GitHub. A repository groups together code and any documentation files for a user’s project. You can keep track of any changes to files in each repository to document each version of your code. NEVER, NEVER, NEVER put username and password info in files on GitHub (can’t emphasize this enough). This is the easiest way to get hacked.

Go to github.com and sign in. Click on the Repositories tab and click the green New button. Create a repository called GEOG331 (yes, exactly that name with no spaces). Make it public and make sure the README box is checked as shown below.




Using GitHub with R and RStudio

There are a number of ways that people use GitHub and R. One is to write code and scripts independently in R or RStudio, and then use the terminal window on a computer to type commands that direct GitHub. This semester, we will be using tools built into RStudio that make it a little easier to integrate R and GitHub. Before we get started, it might be helpful to note that RStudio is something we call an Integrated Development Environment, or IDE for short. It is important to remember that R is the programming language and computing platform. You can use R by itself, but you cannot use RStudio without R. RStudio is simply a nicer interface for interacting with R that includes handy features for keeping your work space organized. One of these features is the ability to integrate GitHub version control into your work flow. RStudio also allows you to view data in cells that look similar to Excel.

Before you do anything else, you need to change some settings in RStudio. These settings can lead to confusing errors because they save and reload old workspaces. You should be writing good scripts that easily rerun, but these settings can prevent you from checking that until after you turn in your assignment! I estimate that about 10% of the time I help troubleshoot problems with students, these settings are the problem! Go to the menu bar and choose Tools>Global Options. In Workspace, uncheck the Restore .RData option and change the save workspace option to Never. In History, uncheck the Always save history option. You will have to restart your R session to load these changes. Your options should look like my screenshot below.



Before we get started, we also need to tell Git who we are. We can do this in the computer’s command line. On a Mac, open a Finder window and navigate to Applications>Utilities>Terminal. On a PC, navigate from the Start menu to Windows System>Command Prompt. Then type the following command.

git config --global user.name "Your Name"

Next, enter the following line, with the email address you used when you created your account on github.com:

git config --global user.email "yourEmail@emaildomain.com"

Note that these lines need to be run one at a time.

Finally, check to make sure everything looks correct by entering this line, which will return the options that you have set.

git config --global --list

Now that we have all of this information, let’s go ahead and get set up to use RStudio to begin coding with version control. Reopen RStudio and in the menu bar click File>New Project, which will open the New Project Wizard window.



Select the Version Control option, and then the Git option in the following window.



At this point, RStudio is going to want to know exactly which repository on GitHub to associate our project with. So now we need to go to the page for our GEOG331 repository on GitHub and click the green Code button, and copy the URL for our repository.



Now, go back to RStudio and paste that in the Clone Git Repository window. Here you also need to specify a project directory (folder) name, and specify where on your computer you want that project folder to sit. I’ve put mine in the GitHub directory in my Documents folder, and I recommend that you do the same.



Once you’ve entered all of this information click the Create Project button, and now you should see an RStudio window that looks something like this.



Next we will open a script. From the menu bar, navigate to File>New File>R Script. You should see a new pane in the upper left in RStudio. In your script type the following:

print("Hello World")

Now click the file icon on the script tab, and save your file as ‘Activity1’. Notice a couple of things here: 1) Nothing happened when we typed a command in our script, and 2) it automatically saved in our GitHub repository folder. This is our first step in version control: we have made a change to our repository by adding a script.

Remember, the script is a list of commands, and the console is where these commands are executed. In order to execute our commands, we can either copy and paste them into the console or put our cursor on the line where the code is and press CTRL Enter on PC or CMD Enter on Mac. Let’s try that.

print("Hello World")
## [1] "Hello World"

Great, now we have our first script. Let’s now update our GitHub repository to reflect these changes. The first thing we need to do is ‘commit’ our changes. In the Git tab in the upper right, we need to check the files that we are committing changes to, and then press the commit button. Notice that I have checked boxes for all files, because I want to add my RStudio project file to the repository as well.



This will open a new window that shows us the changes that have been made and gives us an option to add a message to the commit. These messages are an important record of changes that have been made to your code over time.

Once you have done this, click the Commit button, and then once your commit has finished click the up arrow that says Push. This will push all of your changes to the GitHub repository.



You should see a window like this once your updates have been pushed to GitHub.






Now we are ready to begin using R. As we discussed in class, R scripts allow you to save code in a text file to run in R. R scripts have a .R or .r extension. By itself, the script will do nothing. You need to actually tell R to run your code. The console runs your R code. It’s the calculator! You can type code into the console and it will run. However, you won’t be able to access that code later. That’s why we use scripts. Anything that has run in the console is a part of your working environment. The working environment saves your calculations in the computer’s temporary memory. When you close R, your working environment will disappear. However, when you write a good script, you can rerun the code in the script and all of the items in your working environment will be the same as when you last used R.
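The difference between a script and the working environment can be sketched in a few lines. This is only an illustration: the temporary file and the object names my_value and my_total are made up for the example, and source() is used here as one way of telling R to run every line of a script.

```r
# Write a two-line script to a temporary file (a stand-in for a saved script)
script_file <- tempfile(fileext = ".R")
writeLines(c("my_value <- 6^2", "my_total <- my_value + 4"), script_file)

# The script exists as a text file, but none of its code has run yet.
# In a fresh R session this returns FALSE
exists("my_value")

# source() runs each line of the script in order
source(script_file)

# Now the objects created by the script are in the working environment
my_total
## [1] 40

# ls() lists the names of everything in the working environment
ls()
```

Rerunning the script in a new R session would recreate my_value and my_total, which is the point of keeping code in scripts rather than only in the console.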

There are many different ways to interface with R. You can even run scripts from the command line! I will work interchangeably between different R interfaces. You can choose how you would like to work with R. The base R program has an option for writing scripts and sending lines of code. However, its script editor doesn’t have any formatting help or color coding of different types of code, so it is often not a user friendly option.

We’ll start working with R more next week and read in data. For now, let’s get more of a feel for how R works.

Since R is just like a sophisticated calculator, you can type in numerical operations and get the calculations as output. Type a few different operations like the ones below. Note I’ve included both the code and outputs here as an example.

# remember that this is a comment that R will ignore
# it's extremely important to use comments for documenting your code and writing notes
# let's calculate 6 raised to the 6 power
6^6
## [1] 46656
#234 + 8
234+8
## [1] 242
#15-23
15-23
## [1] -8
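A few more operations are worth trying. The sketch below is an optional extra rather than part of the required activity, and the object names are made up for the example. It shows that R follows the usual order of operations and that parentheses change the result.

```r
# multiplication happens before addition
no_parens <- 2 + 3 * 4
no_parens
## [1] 14

# parentheses force the addition to happen first
with_parens <- (2 + 3) * 4
with_parens
## [1] 20

# the remainder left over after dividing 17 by 5
remainder <- 17 %% 5
remainder
## [1] 2

# square root of 16 using a built-in function
root <- sqrt(16)
root
## [1] 4
```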

There are a few things to pay attention to in the console. The > symbol always indicates a new line of code. The + symbol means your line is continuing. Your results will have numbers in brackets to describe the output. [1] means I have a vector and my output starts on the first element of my vector (in this case it’s a vector of one).
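The bracketed numbers are easier to understand with a longer vector. In this sketch (the names many_numbers and total are made up for the example), printing 30 numbers usually wraps onto several lines in the console, and each line starts with the position of its first element. The exact line breaks depend on how wide your console is, so no printed output is shown for that step.

```r
# a vector of the whole numbers from 101 to 130
many_numbers <- 101:130

# print it: each line of output begins with a bracketed position
many_numbers

# how many elements are in the vector?
length(many_numbers)
## [1] 30

# a single number is a vector with one element
length(6^6)
## [1] 1

# a line that ends with an operator is incomplete, so R keeps reading.
# Typed into the console, the second line would start with + instead of >
total <- 5 +
  3
total
## [1] 8
```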

You can also give objects names. Below is an example where I know I will want to use the number 6234 many times, so I will give it a shorter name. R conducts all operations on that number. The <- symbol means you are assigning an object a variable name.

# name my number
a <- 6234
# multiply my number by 5
a*5
## [1] 31170
# divide my number by 3
a/3
## [1] 2078
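The results of a calculation can also be given a name so they can be reused later. This is a small sketch building on the example above. The name a_times_5 is made up for illustration.

```r
# name my number
a <- 6234

# save the result of a calculation under a new name
a_times_5 <- a * 5
a_times_5
## [1] 31170

# assigning to an existing name replaces the old value
a <- a / 3
a
## [1] 2078
```

Note that calculating a*5 without an assignment only prints the result. The value stored in a does not change unless you assign to it again.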

R automatically does vector operations (think about code differently, you Python and C coders!). You can assign several numbers to a variable name and do math operations on all of them at once. First you have to let R know you are making a vector of multiple numbers using c(,…,)

# make a vector of numbers
b <- c(2395,82,2947)
#divide all numbers by 2
b/2
## [1] 1197.5   41.0 1473.5
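Vector operations also work between two vectors of the same length, and many built-in functions summarize a whole vector. The sketch below reuses b from above. The second vector d is made up for the example.

```r
# make two vectors with the same number of elements
b <- c(2395, 82, 2947)
d <- c(1, 2, 3)

# add them together element by element
b + d
## [1] 2396   84 2950

# add up all of the numbers in b
sum(b)
## [1] 5424

# calculate the average of the numbers in b
mean(b)
## [1] 1808

# count how many numbers are in b
length(b)
## [1] 3
```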


Great job getting started with version control and R! Next week we’ll begin to analyze data.