Introduction

This refresher and self-assessment should take about 45-60 minutes of your time. We will start by brushing up on some R programming skills, and then we'll move on to some challenges that should let you self-assess your R skills.

A few notes:

  • You are not expected to know all the material in the overview section, nor to be able to answer all the questions. The goal is to trigger your curiosity and offer you the opportunity to do some self-learning before the training.
  • Our workshop will focus on the Tidyverse packages, which promote a specific way of handling data in R; do not worry if you are not familiar with some of the base R syntax in this tutorial.

Finally, this is not a test: your answers are not recorded and you can try things as many times as you want. If you have questions or concerns about your assessment, please contact the person organizing your training.



Acknowledgement: This version of the assessment has been adapted from Bryce Mecum's version of OSS 2017 self-assessment by Nathan Hwangbo, Robert Saldivar, Jeanette Clark and Julien Brun, NCEAS.

Learning Outcomes

  • Give everyone the opportunity to self-assess and refresh their R skills at their own pace before the training
  • Prepare for the training by reading some extra material on specific topics of interest

R overview

This overview will go over:

  • Basic R syntax
  • Variables & assignment
  • Control flow (if, else, for loops)
  • Some data manipulations with and without the tidyverse

The assignment operator, <-

One of the things we'll do all the time is save some value to a variable. Here, we save the word "apple" to a variable called fruit:

fruit <- "apple"
fruit
## [1] "apple"

Notice the last line with just fruit on it. Typing just the variable name prints its value to the Console.

R has a flexible syntax (read: blank spaces around the assignment operator do not matter). The following two lines of code are equivalent to the one above.

fruit<-"apple"
fruit    <-     "apple"

However, the syntax fruit <- "apple" is the recommended one. For more information about recommended R syntax, see https://style.tidyverse.org/syntax.html

R as a calculator: + - * / > >= %% %/% etc

2 + 2
## [1] 4
2 * 3
## [1] 6
2 ^ 3
## [1] 8
5 / 2
## [1] 2.5
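The heading above also lists %% and %/%, which the examples do not show. As a small sketch (the variable names remainder and quotient are only for illustration), %% gives the remainder of a division and %/% gives the whole-number part:

```r
# 7 divided by 3 is 2 with a remainder of 1
remainder <- 7 %% 3   # modulo: the remainder
quotient <- 7 %/% 3   # integer division: the whole-number part

remainder
## [1] 1
quotient
## [1] 2

# The two pieces can be put back together
quotient * 3 + remainder
## [1] 7
```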

Comparison:

2 == 1
## [1] FALSE
2 == 2
## [1] TRUE
3 > 2
## [1] TRUE
2 < 3 # Same as above
## [1] TRUE
"apple" == "apple"
## [1] TRUE
"apple" == "pear"
## [1] FALSE
"pear" == "apple" # Order doesn't matter for ==
## [1] FALSE

Types of variables

Vectors

When we run a line of code like this:

x <- 2

We're assigning 2 to a variable x. x is a variable but it is also a "numeric vector" of length 1.

class(x)
## [1] "numeric"
length(x)
## [1] 1

Above, we ran two functions, class and length, on our variable x. Running functions is a very common thing you'll do in R. Every function has a name, followed by a pair of parentheses () that usually have something inside.

We can make a numeric vector that is longer like so:

x <- c(1, 2, 3) # Use the `c` function to put things together

Notice we can also re-define a variable at a later point just like we did above.

class(x)
## [1] "numeric"
length(x)
## [1] 3

R can store much more than just numbers though. Let's start with strings of characters, which we've already seen:

fruit <- "apple"
class(fruit)
## [1] "character"
length(fruit)
## [1] 1

Depending on your background, you may be surprised that the result of running length(fruit) is 1 because "apple" is five characters long.

It turns out that fruit is a character vector of length one, just like our numeric vector from before. To find out the number of characters in "apple", we have to use another function:

nchar(fruit)
## [1] 5
nchar("apple")
## [1] 5

Let's make a character vector of more than length one and take a look at how it works:

fruits <- c("apple", "banana", "strawberry")
length(fruits)
## [1] 3
nchar(fruits)
## [1]  5  6 10
fruits[1]
## [1] "apple"

Smushing character vectors together can be done with paste:

paste("key", "lime", "pie")
## [1] "key lime pie"
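By default paste puts a space between the pieces. As a short sketch of two common variations, the sep argument changes the separator, and paste0 is a shortcut for no separator at all. paste is also vectorized over character vectors (the object names below are only for illustration):

```r
# Choose a different separator
dashed <- paste("key", "lime", "pie", sep = "-")
dashed
## [1] "key-lime-pie"

# paste0() joins the pieces with nothing in between
squashed <- paste0("key", "lime", "pie")
squashed
## [1] "keylimepie"

# Each element of the vector is pasted to "pie"
pies <- paste(c("apple", "banana"), "pie")
pies
## [1] "apple pie"  "banana pie"
```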

Vectors always have to be one class -- we have just seen examples of numeric and character vectors. What happens when we try to mash them together? Instead of throwing an error, R will first try to force the elements to all be the same class (this is called coercion). For instance, the numbers in the code below are forced to be characters:

x <- c(1, 2, "three")

class(x)
## [1] "character"

This tells us that the character class takes over the numeric class. In general, the coercion order is approximately logical < numeric < character.
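A quick way to see this order in action is to mix the classes two at a time and check the class of the result. This is only a sketch; the object names are made up for the example:

```r
# logical + numeric -> numeric (TRUE becomes 1)
logical_and_numeric <- c(TRUE, 2)
class(logical_and_numeric)
## [1] "numeric"
logical_and_numeric
## [1] 1 2

# logical + character -> character
logical_and_character <- c(TRUE, "apple")
class(logical_and_character)
## [1] "character"

# all three together -> character wins
all_three <- c(TRUE, 2, "apple")
class(all_three)
## [1] "character"
```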

What if we don't like this order? In particular, what if we wanted to force x to be a numeric vector instead of a character vector? We can try doing this using the as.numeric() function.

x <- c(1, 2, "three")
as.numeric(x)
## Warning: NAs introduced by coercion
## [1]  1  2 NA

The answer is that R will try its best to figure out what number "three" is referring to. In this case, R couldn't figure it out, so it changed "three" to NA. But now consider this example:

y <- c(1, 2, "3")

# y is a character vector
class(y)
## [1] "character"

# but we can turn it into a numeric vector without NAs
as.numeric(y)
## [1] 1 2 3

R's output can change based on what we're using as the input, so we have to be careful to keep track of what our inputs are and how R will handle them.

A defining feature of R is its vectorized functions: many functions in R work on a vector element by element. For example, the code below adds the vectors x and y by adding each component, so that the output is also a vector.

x <- c(1,2,3)
y <- c(4,5,6)

x + y
## [1] 5 7 9

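Not all functions are vectorized, however, so we have to check which are and which aren't. For instance, the logical "AND" operator has two forms: & is vectorized, but && is not. && is meant for single TRUE/FALSE values: in the R version used to produce the output below, it looks only at the first element of each vector and gives a warning; since R 4.3.0, passing it a vector longer than one is an error.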

# Returns a vector
c(TRUE, FALSE) & c(TRUE, TRUE)
## [1]  TRUE FALSE

# Returns a single value
c(TRUE, FALSE) && c(TRUE, TRUE)
## Warning in c(TRUE, FALSE) && c(TRUE, TRUE): 'length(x) = 2 > 1' in coercion to
## 'logical(1)'

## Warning in c(TRUE, FALSE) && c(TRUE, TRUE): 'length(x) = 2 > 1' in coercion to
## 'logical(1)'
## [1] TRUE
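If the goal is a single TRUE or FALSE from two longer logical vectors, one option that does not depend on how && treats long inputs is to combine the vectorized & with all() or any(). A minimal sketch, with made-up object names:

```r
a <- c(TRUE, FALSE)
b <- c(TRUE, TRUE)

# Element-by-element result
both <- a & b
both
## [1]  TRUE FALSE

# Collapse to a single value
every_pair_true <- all(both)   # are all elements TRUE?
some_pair_true <- any(both)    # is at least one element TRUE?

every_pair_true
## [1] FALSE
some_pair_true
## [1] TRUE
```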

Lists

Vectors and lists look similar in R sometimes but they have very different uses. Notice that while all elements of a vector must be the same class, elements of a list can be whatever class they want:

c(1, "apple", 3)
## [1] "1"     "apple" "3"
list(1, "apple", 3)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] "apple"
## 
## [[3]]
## [1] 3
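To confirm that each element of the list kept its own class, we can ask for the class of every element. The sketch below also shows the two ways of indexing a list: [[ ]] pulls out the element itself, while [ ] returns a shorter list. The name items is only for illustration:

```r
items <- list(1, "apple", 3)

# Class of each element: the numbers stayed numeric
element_classes <- sapply(items, class)
element_classes
## [1] "numeric"   "character" "numeric"

# Double brackets return the element itself
items[[2]]
## [1] "apple"

# Single brackets return a list of length one
class(items[2])
## [1] "list"
```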

data.frames

Most of the time when doing analysis in R you will be working with data.frames. data.frames are used to store tabular data, with column headings and rows of data, just like a CSV file.

We create new data.frames with an aptly named function:

mydata <- data.frame(site = c("A", "B", "C"),
                     temp = c(20, 30, 40))
mydata

FYI: The "tidyverse" suite of packages offers tibble() as an alternative to data.frame(); a tibble is often easier to look at in the console.

Or we can read in a CSV from the file system and turn it into a data.frame in order to work with it in R:

mydata <- read.csv("data.csv")
mydata

FYI: The tidyverse package readr offers the alternative read_csv("data.csv"), which is typically faster than read.csv() for large datasets.
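The data.csv file itself is not included with this text. To follow along, one option is to create a small stand-in file first. The sketch below assumes the file has two columns, type and name, and five rows, as the outputs in the rest of this section suggest. The fifth name ("banana") is a made-up placeholder, and the file is written to a temporary folder rather than your working directory:

```r
# Stand-in for the contents of data.csv (the last name is a placeholder)
example_data <- data.frame(
  type = c("fruit", "vegetable", "fruit", "vegetable", "fruit"),
  name = c("apple", "eggplant", "orange", "beet", "banana")
)

# Write it to a temporary CSV file, then read it back in
csv_path <- file.path(tempdir(), "data.csv")
write.csv(example_data, csv_path, row.names = FALSE)

mydata <- read.csv(csv_path)
names(mydata)
## [1] "type" "name"
```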

We can find out how many rows of data mydata has in it:

nrow(mydata)
## [1] 5

We can return just one of the columns:

mydata$type
## [1] "fruit"     "vegetable" "fruit"     "vegetable" "fruit"
unique(mydata$type)
## [1] "fruit"     "vegetable"

If we want to sort mydata, we use the order function (in kind of a weird way):

mydata[order(mydata$type),]

Let's break the above command down a bit. We can access the individual cells of a data.frame with a new syntax element: [ and ]:

mydata[1,] # First row
mydata[,1] # First column
## [1] "fruit"     "vegetable" "fruit"     "vegetable" "fruit"
mydata[1,1] # First row, first column
## [1] "fruit"
mydata[c(1,5),] # First and fifth row
mydata[,"type"] # Column named 'type'
## [1] "fruit"     "vegetable" "fruit"     "vegetable" "fruit"
mydata$type # we can also use '$' to achieve a similar result in a more compact syntax
## [1] "fruit"     "vegetable" "fruit"     "vegetable" "fruit"

So what does that order function do?

?order # How to get help in R!
order(c(1, 2, 3))
## [1] 1 2 3
order(c(3, 2, 1))
## [1] 3 2 1
order(mydata$type)
## [1] 1 3 5 2 4

So order(mydata$type) returns the row numbers of mydata in the order that sorts the type column. Finally, mydata[order(mydata$type),] rearranges the rows of mydata into this order.
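The same idea can be seen on a small numeric vector where the values and their positions differ. This is only an illustrative sketch; heights and idx are made-up names:

```r
heights <- c(30, 10, 20)

# Positions of the elements, from smallest to largest value
idx <- order(heights)
idx
## [1] 2 3 1

# Indexing with those positions gives the sorted vector
heights[idx]
## [1] 10 20 30

# decreasing = TRUE reverses the direction
order(heights, decreasing = TRUE)
## [1] 1 3 2
```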

FYI: The Tidyverse is a set of very handy R packages for data manipulation. Its solution to this problem uses the arrange() function in the dplyr package, via arrange(mydata, type).

mydata[order(mydata$type),]  # using the base R syntax

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
arrange(mydata, type)  # using the tidyverse syntax

We can also return just certain rows, based upon criteria:

mydata[mydata$type == "fruit",]
mydata$type == "fruit"
## [1]  TRUE FALSE  TRUE FALSE  TRUE

Similarly, we could again use the Tidyverse to perform this operation, using the filter() function from the dplyr package:

library(dplyr) # not necessary because we already attached this package previously
filter(mydata, type == "fruit")

In both cases, instead of indexing the rows by number, we're using the TRUEs and FALSEs that result from the condition "type is equal (note the ==) to fruit".

Exercise: Using the tidyverse syntax, subset mydata to the vegetables instead of the fruit

We can also use the subset function to keep not only certain rows, but also certain columns. Columns are chosen using the select argument of the function:
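As a hedged illustration of such a subset() call, here is a sketch on a small made-up data frame called produce (it is not the contents of data.csv). The second argument picks the rows and the select argument picks the columns:

```r
produce <- data.frame(
  type = c("fruit", "vegetable", "fruit"),
  name = c("apple", "beet", "orange")
)

# Keep the rows where type is "fruit", and only the name column
fruit_names <- subset(produce, type == "fruit", select = name)

names(fruit_names)
## [1] "name"
nrow(fruit_names)
## [1] 2
```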

FYI: A tidyverse solution to this problem uses the select() function in the "dplyr" package via select(mydata, type). Again, the primary difference between the two approaches is in readability, especially when selecting multiple columns.

If instead of filtering, we want to add more data to a data.frame, we can use the rbind() and cbind() functions, which allow us to add rows and columns, respectively.

For instance, we can add "brussels sprouts" to mydata using rbind()

rbind(mydata, c("vegetable", "brussels sprouts"))

In the same way, we can add a column to the data frame using cbind(). This time, our new column will be the number of characters in the name column, computed with the nchar() function introduced earlier:

cbind(mydata, num_letters = nchar(mydata$name))

FYI: A Tidyverse solution for adding the column num_letters uses the mutate() function in the dplyr package:

mutate(mydata, num_letters = nchar(name))

Notice that "brussels sprouts" didn't show up in this data frame! These functions do not change the original data frame, mydata. Instead, they return a modified copy of mydata. To save your modifications, overwrite the mydata object like so:

mydata <- rbind(mydata, c("vegetable", "brussels sprouts"))

mydata

There are a lot of useful functions to help us work with data.frames. When seeing a new data frame for the first time, it can be helpful to look at the structure of the data to get a quick idea of what you're working with:

# the dimensions of the dataframe as (# rows, # columns)
dim(mydata)
## [1] 6 2

# the structure of the dataframe
str(mydata)
## 'data.frame':    6 obs. of  2 variables:
##  $ type: chr  "fruit" "vegetable" "fruit" "vegetable" ...
##  $ name: chr  "apple" "eggplant" "orange" "beet" ...

# information about each column
summary(mydata)
##      type               name          
##  Length:6           Length:6          
##  Class :character   Class :character  
##  Mode  :character   Mode  :character

Control Statements

if, else

Sometimes you want to be able to run some code only if some condition is satisfied. if statements allow us to do exactly that:

# prints output when i is 2
i <- 2
if(i == 2){
  print("i is 2")
}
## [1] "i is 2"

# doesn't print anything when i isn't 2
i <- 3
if(i == 2){
  print("i is 2")
}

Notice that nothing is printed when i is 3. If we want to print a different output in this case, we can add an else statement:

i <- 3
if(i == 2){
  print("i is 2")
} else{
  print("i is NOT 2")
}
## [1] "i is NOT 2"

If we need more than 2 cases, we can add else if statements:

i <- 3
if(i == 2){
  print("i is 2")
} else if (i == 3){
  print("i is not 2, but i is 3")
} else{
  print("i is not 2 or 3")
}
## [1] "i is not 2, but i is 3"

for

If we wanted to make sure our if statements above worked correctly, it would be nice if we could easily try different values of i to cover all the cases (e.g. i = 2, i = 3, i = 4). for loops allow us to do exactly this! To get a sense of how it works, let's try printing the numbers 2, 3, and 4.

for(i in c(2,3,4)){
  print(i)
}
## [1] 2
## [1] 3
## [1] 4

Now, to test the if statement from the last section:

for(i in c(2,3,4)){
  
  if(i == 2){
    print("i is 2")
  } else if (i == 3){
    print("i is not 2, but i is 3")
  } else{
    print("i is not 2 or 3")
  }
  
}
## [1] "i is 2"
## [1] "i is not 2, but i is 3"
## [1] "i is not 2 or 3"
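Printing inside a loop is handy for checking, but often we want to keep the results. One common pattern, sketched below with made-up names (values, labels, k), is to create an empty vector first and fill in one position per iteration:

```r
values <- c(2, 3, 4)
labels <- character(length(values))  # empty character vector to fill in

# seq_along(values) gives the positions 1, 2, 3
for (k in seq_along(values)) {
  if (values[k] == 2) {
    labels[k] <- "is 2"
  } else {
    labels[k] <- "is not 2"
  }
}

labels  # one label per value: "is 2", "is not 2", "is not 2"
```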

Packages

One of the advantages to using R is the large open source developer community. This comes in the form of "packages", which contain libraries of code for us to use. Packages from CRAN (the most popular collection of packages) can be easily installed via the install.packages() function.

For example, we can install the tidyverse package "dplyr" using the following command:

install.packages("dplyr")

To tell R we are using a function from a particular package, we add the package name before the function. For example, to use the tibble function that dplyr makes available (it is re-exported from the tibble package), we can write:

dplyr::tibble(x = c(1,2,3))

To avoid having to write the additional <package name>:: every time we want to use a function from a package, we can use the library() function.

The code below tells R that every time we write tibble (or any other dplyr function, like group_by() or filter()), we are referring to the function in the dplyr package.

library(dplyr)

This allows us to simplify the code above as:

tibble(x = c(1,2,3))

WARNING: The order in which packages are loaded matters! Notice that when we loaded dplyr, we got a message saying that filter and lag were masked from the stats package. This means that both dplyr and stats contain functions called filter and lag, and that whenever we type the command filter(), R will now use dplyr's function instead of the one from stats. If both packages were loaded via the library() function, R will use the function from the more recent library() call. If we want to use the filter() function from the stats package, we now have to type stats::filter().
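If you are unsure which package a name currently resolves to, base R offers a couple of ways to check. The sketch below uses utils::find(), which lists the attached packages containing an object with that name, in search-path order. The result shown assumes a fresh session in which dplyr has not been loaded; after library(dplyr), "package:dplyr" would be listed first.

```r
# Which attached packages define something called "filter"?
filter_locations <- utils::find("filter")
filter_locations  # in a fresh session: "package:stats"

# The search path shows the order in which R looks through packages
search_path <- search()
head(search_path, 1)
## [1] ".GlobalEnv"
```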

Assessment

Everyone will arrive at this assessment with different experiences with R. Skill with R doesn't necessarily exist on a continuum and can instead be thought of as a set of tools. Each participant will therefore start our workshop with different tools, and we will all be able to learn from each other!

There are no expectations that participants will know all this material. Our goal is to refresh some concepts and offer you the opportunity to research some of these topics before joining the meeting. Feel free to reach out with any questions!

Instructions:

Answer the following 15 questions to the best of your knowledge and keep track of the topics you think will be good for you to further review. You will find at the end of this assessment a set of resources to help you to do so.

Question 1

What is the output of the following code chunk?

x <- 2
x ^ 2

Question 2

Which line of code can be used to read in the data.csv file? There are 2 correct answers, but only 1 is required:

Question 3

What does the following expression return?

max(abs(c(-5, 1, 5)))

Question 4

If x and y are both data.frames defined by:

x <- data.frame(z = 1:2)
y <- data.frame(z = 3)

which of the following expressions would be a correct way to combine them into one data.frame that looks like this:

z
-
1
2
3

(i.e. one column with the numbers 1, 2, and 3 in it)

Question 5

What is the output of the following code chunk?

x <- data.frame(x = 1:10, y = 1:10)
dim(x)

Question 6

What is the output of the following code chunk?

x <- data.frame(x = 1:3)
y <- data.frame(x = 1:7)
z <- cbind(x, y)
nrow(z)

Question 7

Use the following data.frame iris to answer the next question.

Which expression returns a data.frame with only the columns Sepal.Length and Species?

There are 2 correct answers, but only 1 is required:

Question 8

Still using the same data.frame iris, answer the next question.

Which expression returns a data.frame with only the rows where Species is "setosa"?

There are 2 correct answers, but only 1 is required:

For the questions below, unless otherwise specified, select the output of the following code chunks.

Question 9

x <- "hello"
y <- "world"
paste(x, y, sep = " ")

Question 10

x <- NA

if (is.na(x)) {
  print("conservation")
} else {
  print("nature")
}

Question 11

What will the following code print?

numbers <- seq(1, 3)
count <- 0
for (number in numbers) {
  count <- count + number
}
print(count)

Question 12

x <- c(1, "2", 3)
class(x)

Question 13

x <- c(1, "A", 3)
as.numeric(x)

Question 14

x <- c(1, 2, NA, 4, NA)
sum(is.na(x))

Question 15

Suppose all of these packages contain a function called select(). Which package will the function select() be called from when the packages are loaded in the following order:

library(tidyverse)
library(MASS)
library(Select)
library(dplyr)

Summary

By the end of this self-assessment, you should feel refreshed on your general R skills, and you should also have seen some of the trickier parts of R. Hopefully, having seen these trickier parts will help later on down the road and pique your curiosity to learn more using the resources on the next page.

Again, feel free to reach out with any questions!

Resources