This refresher and self-assessment should take about 45-60 minutes of your time. We will start by brushing up on some R programming skills, and then move on to some challenges that should let you self-assess your R skills.
This tutorial also uses the Tidyverse, a set of packages which promote a specific way of handling data in R; thus do not worry if you are not familiar with some of the base R syntax in this tutorial. Finally, this is not a test: your answers are not recorded and you can try things as many times as you want. If you have questions or concerns about your assessment, please contact the person organizing your training.
Acknowledgement: This version of the assessment has been adapted from Bryce Mecum's version of the OSS 2017 self-assessment by Nathan Hwangbo, Robert Saldivar, Jeanette Clark and Julien Brun, NCEAS.
This overview will go over:
The assignment operator: <-
One of the things we'll do all the time is save some value to a variable. Here, we save the word "apple" to a variable called fruit:
fruit <- "apple"
fruit
## [1] "apple"
Notice the last line with just fruit on it. Typing just the variable name prints its value to the Console.
R has a flexible syntax (read: blank spaces generally do not matter). The following two lines of code are equivalent to the one above.
fruit<-"apple"
fruit    <-    "apple"
However, the syntax fruit <- "apple" is the recommended one. **See here for more information about recommended R syntax: https://style.tidyverse.org/syntax.html**
Arithmetic and comparison operators: + - * / > >= %% %/% etc.
2 + 2
## [1] 4
2 * 3
## [1] 6
2 ^ 3
## [1] 8
5 / 2
## [1] 2.5
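The operator list above also mentions %% and %/%, which the examples do not exercise. As a minimal sketch (the variable names are illustrative, not from the original), %% gives the remainder of a division and %/% gives the integer quotient:

```r
# Remainder (modulo) and integer division
remainder <- 7 %% 2   # 7 = 3 * 2 + 1, so the remainder is 1
quotient <- 7 %/% 2   # integer part of 7 / 2, which is 3
remainder
quotient
```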
Comparison:
2 == 1
## [1] FALSE
2 == 2
## [1] TRUE
3 > 2
## [1] TRUE
2 < 3 # Same as above
## [1] TRUE
"apple" == "apple"
## [1] TRUE
"apple" == "pair"
## [1] FALSE
"pair" == "apple" # Order doesn't matter for ==
## [1] FALSE
When we run a line of code like this:
x <- 2
We're assigning 2 to a variable x. x is a variable but it is also a "numeric vector" of length 1.
class(x)
## [1] "numeric"
length(x)
## [1] 1
Above, we ran two functions, class and length, on our variable x. Running functions is a very common thing you'll do in R. Every function has a name, followed by a pair of () that usually has something inside.
We can make a numeric vector that is longer like so:
x <- c(1, 2, 3) # Use the `c` function to put things together
Notice we can also redefine a variable at a later point just like we did above.
class(x)
## [1] "numeric"
length(x)
## [1] 3
R can store much more than just numbers though. Let's start with strings of characters, which we've already seen:
fruit <- "apple"
class(fruit)
## [1] "character"
length(fruit)
## [1] 1
Depending on your background, you may be surprised that the result of running length(fruit) is 1, because "apple" is five characters long.
It turns out that fruit is a character vector of length one, just like our numeric vector from before. To find out the number of characters in "apple", we have to use another function:
nchar(fruit)
## [1] 5
nchar("apple")
## [1] 5
Let's make a character vector of more than length one and take a look at how it works:
fruits <- c("apple", "banana", "strawberry")
length(fruits)
## [1] 3
nchar(fruits)
## [1] 5 6 10
fruits[1]
## [1] "apple"
Smushing character vectors together can be done with paste:
paste("key", "lime", "pie")
## [1] "key lime pie"
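By default, paste separates the pieces with a space. A brief sketch of two common variations, the sep argument and paste0 (both part of base R); the example strings are only illustrative:

```r
# Choose a different separator with `sep`
dessert <- paste("key", "lime", "pie", sep = "-")
dessert
# paste0() uses no separator, and both functions are vectorized
labels <- paste0("site_", c("A", "B", "C"))
labels
```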
Vectors always have to be one class: we have just seen examples of numeric and character vectors. What happens when we try to mash them together? Instead of throwing an error, R will first try to force (a process called coercion) the elements to all be the same class. For instance, the numbers in the code below are forced to be characters:
x <- c(1, 2, "three")
class(x)
## [1] "character"
This tells us that the character class takes over the numeric class. In general, the coercion order is approximately logical < numeric < character.
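To make that ordering concrete, here is a small sketch (illustrative values only) showing a logical being coerced to numeric, and a logical being coerced to character:

```r
# logical + numeric -> numeric (TRUE becomes 1)
mixed_num <- c(TRUE, 2)
class(mixed_num)
# logical + character -> character (TRUE becomes "TRUE")
mixed_chr <- c(TRUE, "a")
class(mixed_chr)
```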
What if we don't like this order? In particular, what if we wanted to force x to be a numeric vector instead of a character vector? We can try doing this using the as.numeric() function.
x <- c(1, 2, "three")
as.numeric(x)
## Warning: NAs introduced by coercion
## [1] 1 2 NA
The answer is that R will try to interpret each element as a number. In this case, "three" is not something R can parse as a number, so it changed "three" to NA and issued a warning. But now consider this example:
y <- c(1, 2, "3")
# y is a character vector
class(y)
## [1] "character"
# but we can turn it into a numeric vector without NAs
as.numeric(y)
## [1] 1 2 3
R's output can change based on what we're using as the input, so we have to be careful to keep track of what our inputs are and how R will handle them.
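One way to keep track is to check which elements failed to convert. A minimal sketch, assuming we are happy to silence the warning with suppressWarnings() and inspect the NAs ourselves (the variable names are hypothetical):

```r
raw_values <- c("1", "2", "three")
# Convert, silencing the coercion warning because we check for NAs explicitly
converted <- suppressWarnings(as.numeric(raw_values))
# Elements that were not NA before conversion but are NA afterwards
failed <- raw_values[is.na(converted) & !is.na(raw_values)]
failed
```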
A defining feature of R is its vectorized functions: many functions in R work on a vector element by element. For example, the code below adds the vectors x and y by adding each component, so that the output is also a vector.
x <- c(1, 2, 3)
y <- c(4, 5, 6)
x + y
## [1] 5 7 9
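Not all functions are vectorized, however, so we have to make sure to check which are and which aren't. For instance, the logical "AND" operator has two forms: & is vectorized, but && is not. In R versions before 4.3.0, && uses only the first element of each vector (with a warning in R 4.2, as in the output below); since R 4.3.0, passing it vectors longer than one is an error.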
# Returns a vector
c(TRUE, FALSE) & c(TRUE, TRUE)
## [1] TRUE FALSE
# Returns a single value in R < 4.3.0; an error in R >= 4.3.0
c(TRUE, FALSE) && c(TRUE, TRUE)
## Warning in c(TRUE, FALSE) && c(TRUE, TRUE): 'length(x) = 2 > 1' in coercion to
## 'logical(1)'
## Warning in c(TRUE, FALSE) && c(TRUE, TRUE): 'length(x) = 2 > 1' in coercion to
## 'logical(1)'
## [1] TRUE
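When a single TRUE or FALSE is needed from a whole vector (for example inside an if statement), a common approach is to combine the vectorized & with all() or any() rather than relying on &&. A short sketch with illustrative vectors:

```r
a <- c(TRUE, FALSE)
b <- c(TRUE, TRUE)
all_true <- all(a & b)  # TRUE only if every element-wise result is TRUE
any_true <- any(a & b)  # TRUE if at least one element-wise result is TRUE
all_true
any_true
```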
Vectors and lists look similar in R sometimes but they have very different uses. Notice that while all elements of a vector must be the same class, elements of a list can be whatever class they want:
c(1, "apple", 3)
## [1] "1" "apple" "3"
list(1, "apple", 3)
## [[1]]
## [1] 1
##
## [[2]]
## [1] "apple"
##
## [[3]]
## [1] 3
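The double brackets in that printout hint at how list elements are accessed. As a small sketch (the name items is illustrative): [[ ]] extracts the element itself, while [ ] returns a shorter list.

```r
items <- list(1, "apple", 3)
second <- items[[2]]        # the element itself, a character vector
second_as_list <- items[2]  # still a list, of length 1
class(second)
class(second_as_list)
```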
Most of the time when doing analysis in R you will be working with data.frames. data.frames are used to store tabular data, with column headings and rows of data, just like a CSV file.
We create new data.frames with a relevantly-named function:
mydata <- data.frame(site = c("A", "B", "C"),
                     temp = c(20, 30, 40))
mydata
FYI: The "tidyverse" suite of packages offers tibble() as an alternative to data.frame(), which is often easier to look at in the console.
Or we can read in a CSV from the file system and turn it into a data.frame in order to work with it in R:
mydata <- read.csv("data.csv")
mydata
FYI: The tidyverse package readr offers the alternative read_csv("data.csv"), which is faster than read.csv() for large datasets.
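The file data.csv is not included with this text. To follow along, a stand-in can be built by hand and round-tripped through a temporary CSV file. This is only a sketch: the type column and the first four names match the output shown in this tutorial, but the fifth name ("banana") is a placeholder assumption.

```r
# Stand-in for the contents of data.csv (the fifth name is a placeholder)
mydata <- data.frame(
  type = c("fruit", "vegetable", "fruit", "vegetable", "fruit"),
  name = c("apple", "eggplant", "orange", "beet", "banana")
)
# Write to a temporary file, then read it back the same way as data.csv
csv_path <- tempfile(fileext = ".csv")
write.csv(mydata, csv_path, row.names = FALSE)
mydata <- read.csv(csv_path)
nrow(mydata)
```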
We can find out how many rows of data mydata has in it:
nrow(mydata)
## [1] 5
We can return just one of the columns:
mydata$type
## [1] "fruit" "vegetable" "fruit" "vegetable" "fruit"
unique(mydata$type)
## [1] "fruit" "vegetable"
If we want to sort mydata, we use the order function (in kind of a weird way):
mydata[order(mydata$type),]
Let's break the above command down a bit. We can access the individual cells of a data.frame with a new syntax element, [ and ]:
mydata[1,] # First row
mydata[,1] # First column
## [1] "fruit" "vegetable" "fruit" "vegetable" "fruit"
mydata[1,1] # First row, first column
## [1] "fruit"
mydata[c(1,5),] # First and fifth row
mydata[,"type"] # Column named 'type'
## [1] "fruit" "vegetable" "fruit" "vegetable" "fruit"
mydata$type # we can also use '$' to achieve a similar result with a more compact syntax
## [1] "fruit" "vegetable" "fruit" "vegetable" "fruit"
So what does that order function do?
?order # How to get help in R!
order(c(1, 2, 3))
## [1] 1 2 3
order(c(3, 2, 1))
## [1] 3 2 1
order(mydata$type)
## [1] 1 3 5 2 4
So order(mydata$type) is returning the row numbers of mydata in sorted order. Finally, mydata[order(mydata$type),] is rearranging the rows of mydata by this order.
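To see the same idea on a plain vector, here is a short sketch (values are illustrative). order() returns positions, and indexing with those positions gives the sorted vector; the decreasing argument reverses the direction.

```r
temps <- c(30, 10, 20)
positions <- order(temps)   # positions of the smallest to largest values
sorted <- temps[positions]  # same result as sort(temps)
positions_desc <- order(temps, decreasing = TRUE)
positions
sorted
positions_desc
```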
FYI: The Tidyverse, a set of very handy R packages for data manipulation, offers another solution to this problem: the arrange() function in the dplyr package, via arrange(mydata, type).
mydata[order(mydata$type),] # using the base R syntax
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
arrange(mydata, type) # using the tidyverse syntax
We can also return just certain rows, based upon criteria:
mydata[mydata$type == "fruit",]
mydata$type == "fruit"
## [1] TRUE FALSE TRUE FALSE TRUE
Similarly, we could again use the Tidyverse to perform this operation, using the filter() function from the dplyr package:
library(dplyr) # not necessary because we already attached this package previously
filter(mydata, type == "fruit")
In this case, instead of indexing the rows by number, we're using the TRUEs and FALSEs that result from the condition type is equal (note the ==) to "fruit".
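Because the condition is just a logical vector, it can be inspected on its own. A minimal sketch using a hypothetical types vector that mirrors the type column: sum() counts the TRUEs and which() gives their positions.

```r
types <- c("fruit", "vegetable", "fruit", "vegetable", "fruit")
is_fruit <- types == "fruit"
n_fruit <- sum(is_fruit)       # TRUE counts as 1, FALSE as 0
fruit_rows <- which(is_fruit)  # positions where the condition is TRUE
n_fruit
fruit_rows
```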
Exercise: Using the tidyverse syntax, subset mydata to the vegetables instead of the fruit.
We can also use the subset function to keep not only certain rows, but also certain columns. The columns are chosen using the select argument of the function:
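As a minimal sketch of such a call, assuming a small hypothetical data frame named produce (not the tutorial's data): the second argument filters rows and select picks columns.

```r
produce <- data.frame(
  type = c("fruit", "vegetable", "fruit"),
  name = c("apple", "beet", "orange")
)
# Keep only the fruit rows, and only the name column
fruit_names <- subset(produce, type == "fruit", select = name)
fruit_names
```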
FYI: A tidyverse solution to this problem uses the select() function in the "dplyr" package via select(mydata, type). Again, the primary difference between the two approaches is in readability, especially when selecting multiple columns.
If instead of filtering, we want to add more data to a data.frame, we can use the rbind() and cbind() functions, which allow us to add rows and columns, respectively.
For instance, we can add "brussels sprouts" to mydata using rbind():
rbind(mydata, c("vegetable", "brussels sprouts"))
In the same way, we can add a column to the data frame using cbind(). This time, our new column will be the number of characters in the name column, using the nchar() function introduced earlier:
cbind(mydata, num_letters = nchar(mydata$name))
FYI: A Tidyverse solution for adding the column num_letters uses the mutate() function in the dplyr package, via mutate(mydata, num_letters = nchar(name)):
mutate(mydata, num_letters = nchar(name))
Notice that "brussels sprouts" didn't show up in this data frame! These functions do not change the original data frame, mydata. Instead, they create a copy of mydata and then make the changes. To save your modifications, overwrite the mydata object like so:
mydata <- rbind(mydata, c("vegetable", "brussels sprouts"))
mydata
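Passing a plain character vector to rbind() is convenient here, where every column is a character. When columns have different classes, a common alternative is to pass the new row as a one-row data.frame so each column keeps its class. A sketch with a hypothetical data frame and a hypothetical count column:

```r
produce <- data.frame(type = "fruit", name = "apple", count = 3)
new_row <- data.frame(type = "vegetable", name = "brussels sprouts", count = 12)
# Both inputs are data.frames with matching column names
produce <- rbind(produce, new_row)
class(produce$count)  # the numeric column stays numeric
nrow(produce)
```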
There are a lot of useful functions to help us work with data.frames. When seeing a new data frame for the first time, it might be helpful to look at the structure of the data to get a quick idea of what you're working with:
# the dimensions of the dataframe as (# rows, # columns)
dim(mydata)
## [1] 6 2
# the structure of the dataframe
str(mydata)
## 'data.frame': 6 obs. of 2 variables:
## $ type: chr "fruit" "vegetable" "fruit" "vegetable" ...
## $ name: chr "apple" "eggplant" "orange" "beet" ...
# information about each column
summary(mydata)
## type name
## Length:6 Length:6
## Class :character Class :character
## Mode :character Mode :character
if, else
Sometimes you want to be able to run some code only if some condition is satisfied. if statements allow us to do exactly that:
# prints output when i is 2
i <- 2
if (i == 2) {
  print("i is 2")
}
## [1] "i is 2"
# doesn't print anything when i isn't 2
i <- 3
if (i == 2) {
  print("i is 2")
}
Notice that nothing is printed when i is 3. If we want to print a different output in this case, we can add an else statement:
i <- 3
if (i == 2) {
  print("i is 2")
} else {
  print("i is NOT 2")
}
## [1] "i is NOT 2"
If we need more than 2 cases, we can add else if statements:
i <- 3
if (i == 2) {
  print("i is 2")
} else if (i == 3) {
  print("i is not 2, but i is 3")
} else {
  print("i is not 2 or 3")
}
## [1] "i is not 2, but i is 3"
for
If we wanted to make sure our if statements above worked correctly, it would be nice if we could easily try different values of i to cover all the cases (e.g. i = 2, i = 3, i = 4). for loops allow us to do exactly this! To get a sense of how they work, let's try printing the numbers 2, 3, and 4.
for (i in c(2, 3, 4)) {
  print(i)
}
## [1] 2
## [1] 3
## [1] 4
Now, let's test the if statement from the last section:
for (i in c(2, 3, 4)) {
  if (i == 2) {
    print("i is 2")
  } else if (i == 3) {
    print("i is not 2, but i is 3")
  } else {
    print("i is not 2 or 3")
  }
}
## [1] "i is 2"
## [1] "i is not 2, but i is 3"
## [1] "i is not 2 or 3"
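Loops are also often used to fill in a results vector. A short sketch (names are illustrative) that squares each value, pre-allocating the output and using seq_along() to loop over positions; for a simple case like this, the vectorized values^2 gives the same answer.

```r
values <- c(2, 3, 4)
squares <- numeric(length(values))  # pre-allocate the output vector
for (k in seq_along(values)) {
  squares[k] <- values[k]^2
}
squares
```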
One of the advantages of using R is its large open source developer community. Its contributions come in the form of "packages", which contain libraries of code for us to use. Packages from CRAN (the most popular collection of packages) can be easily installed via the install.packages() function.
For example, we can install the tidyverse package "dplyr" using the following command:
install.packages("dplyr")
To tell R we are using a function from a particular package, we add the package name before the function. For example, to use the tibble function from dplyr, we can write:
dplyr::tibble(x = c(1,2,3))
To avoid having to write the additional <package name>:: every time we want to use a function from a package, we can use the library() function.
The code below tells R that every time we write tibble (or any other dplyr function, like group_by() or filter()), we are referring to the function in the dplyr package:
library(dplyr)
This allows us to simplify the code above as:
tibble(x = c(1,2,3))
WARNING: The order that packages are loaded in matters! Notice that when we loaded dplyr, we got a message saying that filter and lag were masked from the stats package. This means that both dplyr and stats contain functions called filter and lag, and that whenever we type the command filter(), R will now use dplyr's function instead of stats'. If both packages were loaded in via the library() function, R will use the function from the more recent library() call. If we want to use the filter() function from the stats package, we would now have to type stats::filter().
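To illustrate why this matters, stats::filter() does something quite different from dplyr's row filtering: it applies a linear filter to a series. A small sketch using only the stats package, which ships with R (the input values are illustrative):

```r
# A centered 3-point moving average via the stats version of filter()
smoothed <- stats::filter(1:5, rep(1 / 3, 3))
as.numeric(smoothed)  # the first and last values are NA (no full window)
```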
Everyone will arrive at this assessment with different experiences with R. Skill with R doesn't necessarily exist on a continuum and can instead be thought of as a set of tools. Thus each participant will start our workshop with different tools and we will all be able to learn from each other!
There are no expectations that participants will know all this material. Our goal is to refresh some concepts and offer you the opportunity to research some of these topics before joining the meeting. Feel free to reach out with any questions!
Instructions:
Answer the following 15 questions to the best of your knowledge and keep track of the topics you think will be good for you to further review. You will find at the end of this assessment a set of resources to help you to do so.
x <- 2
x ^ 2
Which line of code can be used to read in the data.csv file? There are 2 correct answers, but only 1 is required:
What does the following expression return?
max(abs(c(-5, 1, 5)))
If x and y are both data.frames defined by:
x <- data.frame(z = 1:2)
y <- data.frame(z = 3)
which of the following expressions would be a correct way to combine them into one data.frame that looks like this:
z

1
2
3
(i.e. one column with the numbers 1, 2, and 3 in it)
What is the output of the following code chunk?
x <- data.frame(x = 1:10, y = 1:10)
dim(x)
What is the output of the following code chunk?
x <- data.frame(x = 1:3)
y <- data.frame(x = 1:7)
z <- cbind(x, y)
nrow(z)
Use the following data.frame iris to answer the next question.
Which expression returns a data.frame with only the columns Sepal.Length and Species?
There are 2 correct answers, but only 1 is required:
Still using the same data.frame iris, answer the next question.
Which expression returns a data.frame with rows where Species is "setosa"?
There are 2 correct answers, but only 1 is required:
For the questions below, unless otherwise specified, select the output of the following code chunks.
x <- "hello"
y <- "world"
paste(x, y, sep = " ")
x <- NA
if (is.na(x)) {
print("conservation")
} else {
print("nature")
}
What will the following code print?
numbers <- seq(1, 3)
count <- 0
for (number in numbers) {
  count <- count + number
}
print(count)
x <- c(1, "2", 3)
class(x)
x <- c(1, "A", 3)
as.numeric(x)
x <- c(1, 2, NA, 4, NA)
sum(is.na(x))
Suppose all of these packages contain a function called select(). Which package will the function select() be called from when the packages are loaded in the following order:
library(tidyverse)
library(MASS)
library(Select)
library(dplyr)
By the end of this self-assessment, you should feel that you have brushed up on your general R skills, and you should also have seen some of the trickier parts of R. Hopefully, having seen these trickier parts will help later on down the road and pique your curiosity to learn more using the resources on the next page.
Again, feel free to reach out with any questions!