In this tutorial we will investigate the R programming language. Like any programming language, R is used to define a sequence of instructions that combine to solve a problem or produce a desired result. Computer programs range from very simple examples with only a few lines of source code to very complicated systems. For example, Photoshop CS 6 is estimated to contain around 4.5 million lines of code.
In general, programming involves the following steps:
This tutorial will provide an introduction to programming R, an interpreted programming language. R was conceived by Ross Ihaka and Robert Gentleman at the University of Auckland in 1992. R 1.0 was released in 2000, and today R is developed by the R Core Development Team. We will be using R version 4.4.x.
Since R is an interpreted language, individual lines of code are converted to machine language and executed as they are encountered (versus a compiled language, which converts an entire program to machine code in an explicit compile–link stage). One advantage of an interpreted language is the ability to enter individual commands on a command line prompt, and immediately see their results.
Unless we're only issuing a few commands once or twice, we normally store the commands in source code files. This allows us to load and execute the commands as often as we want. It also makes it easier to modify a program, or correct it when we discover errors. A version of the above code is available in the source code file tut-01-intro.R.
For this class, we'll be using RStudio Desktop, the RStudio IDE (integrated development environment), running on top of the current version of R. R and RStudio Desktop are installed separately, and the most recent version of each can be downloaded online. The base installation of R includes many of the standard packages we will use throughout this tutorial. We will also show you how to install and load any additional packages you might need in your R programs.
R is open source and used extensively in the academic and industrial world for statistical analysis. Its core architecture is maintained by the R Core Development Team, and various individuals have created packages to extend its capabilities to support many complex statistical algorithms. For example, the time series and Bayesian capabilities of R far exceed those found in other, more general languages like Python.
In the general pipeline of operations to convert raw data into a set of analytical results, we often assume the steps of collection & storage → preprocessing → analysis → presentation. R falls within the analysis step of this pipeline.
Each day one lab will be available for you to use to test your knowledge of R. The lab opens at the beginning of the given day, and closes at 11:59pm on the same day.
Your homework team will also be required to complete a project. The project will be submitted after the R instruction is complete, allowing you to work on it with full knowledge of the material covered in Introduction to R. Dates you can submit the project are listed on the Moodle web page. Pay careful attention to the description of the project. You MUST submit the project as an R Markdown file.
You are free to use whatever resources you want to complete the labs and project, but remember, these are meant to help you gain proficiency in R. During the R assessment, you will not have time to read notes or query the Internet and complete all the questions you are asked. Because of this, I strongly encourage you to learn the material in the labs and project independent of external resources. It is fine to look at things like specific function names or explanations of arguments to a function, but do not expect to be able to read how a topic works and still have time to complete the assessment.
Grades for the labs and the project will count, along with your assessment, towards your final grade for the R component of the Summer 2 module. For the Introduction to R labs and project, the following grade breakdown will be used.
R comes pre-installed with numerous basic packages. However, you will eventually need packages that are not included in the base installation. When this happens, you will need to install the package using install.packages( "package-name" ), where package-name is the name of the package you want to install. An example is tidyverse, which is not included with the base R installation. We can include it by issuing the command install.packages( "tidyverse" ) at the R command line.
To list all installed packages in your R installation, you can use the command installed.packages()[, c(1,3:4) ]. To determine if a package is installed, use the command "package-nm" %in% as.list( installed.packages()[, c(1)] ), which will return TRUE or FALSE depending on whether a package with the name "package-nm" is installed.
Finally, if you only want to see the packages you have attached with the library() command, or those that are automatically attached to every R session, use the command search().
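The package checks described above can be sketched as follows. The helper name is_installed is hypothetical, introduced only for illustration, and the install command is left commented out because it needs a network connection.

```r
# Names of all installed packages (row names of the installed.packages() matrix).
pkg_names <- rownames( installed.packages() )

# Hypothetical helper: TRUE if a package with this name is installed.
is_installed <- function( pkg ) {
  pkg %in% pkg_names
}

is_installed( "stats" )        # stats ships with every R installation
is_installed( "no-such-pkg" )  # hyphens are not allowed in package names

# Install only when missing (commented out: requires a network connection).
# if ( !is_installed( "tidyverse" ) ) install.packages( "tidyverse" )

# Packages currently attached to this session.
search()
```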
Every programming language provides a way to maintain values as a program runs. Usually, this is done by creating a named variable, then assigning a value to the variable. In R, variables are created interactively by specifying their name and assigning them an initial value.
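As a minimal sketch of variable creation, the statements below assign values to six variables. The variable names follow the discussion in this section; the values themselves are illustrative assumptions, not necessarily those used in the tutorial's own example.

```r
# Character variables (values are illustrative assumptions).
name <- "Abraham Lincoln"
birthplace <- "Hodgenville, Kentucky"
born <- "February 12, 1809"
deceased <- "April 15, 1865"

# Numeric variables (height in inches, age in years).
height <- 76
age <- 56

class( name )    # "character"
class( height )  # "numeric"

# A variable can later hold a value of a different type.
height <- "6 ft 4 in"
class( height )  # "character"
```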
Unlike languages like C++, R does not require you to specify a variable's type. This is inferred from the value it maintains. In the above example the variables name, birthplace, born, and deceased are inferred to be characters, and height and age are inferred to be numeric. One advantage of R's dynamically typed variables is that you can change them to hold different types of values whenever you want. You can also ask R what type of value a variable contains with the class() function.
For a variety of reasons, including the object-oriented abilities of R and its heritage as an extension of S and S-Plus, there are many different ways to ask about a variable's "type" from different perspectives. Some will provide the same answer and some will differ, depending on the variable being queried. Consider the following example.
class() returns the variable's "type" from an object-oriented point of view. typeof() returns the type from R's point of view. str() returns a compact representation of the structure of the variable. Finally, mode() returns the type based on the Becker–Chambers–Wilks reference, which normally reflects how the variable is stored in memory. If you find this confusing, there are even more ways to inspect a variable's internal structure. Many programmers recommend using str() as the default method to inspect a variable's type.
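To make the differences concrete, here is a small sketch that queries three variables with each function. The variable names x, y, and m are assumptions chosen for illustration.

```r
x <- 10L                      # integer
y <- 3.14                     # double
m <- matrix( 1:4, nrow=2 )    # 2 x 2 integer matrix

class( x )    # "integer"
typeof( x )   # "integer"
mode( x )     # "numeric"

class( y )    # "numeric"
typeof( y )   # "double"
mode( y )     # "numeric"

class( m )    # "matrix" "array" in R 4.0 and later
typeof( m )   # "integer"

str( y )      # prints a compact summary of y's structure
```

Notice that x and y share the same mode but differ in class and typeof, which is why the answers "depend on the variable being queried."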
Here is a quick list of some of R's basic variable types. More complicated types will be discussed later in the tutorial.
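- vector: a sequence of values created with c(), for example, c( 1, 2, 3 ).
- integer: a whole number, written with the suffix L, for example, var <- 10L.
- complex: a number with real and imaginary parts, for example, var <- 9 + 3i.
- character: a string, for example, var <- "Hello" or var <- 'Code'. Single quotes are normally used when the string contains double quotes, for example, var <- 'He said "No"'.
- logical: a boolean value, for example, var <- TRUE or var <- FALSE.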
A note on vectors. In many languages, a vector's components maintain their original type when they are added to the vector. For example, you might expect c( "Hello", 3.14, 3+9i ) to contain three types, respectively: a character (or string), a numeric (or double), and a complex. In R, however, all the entries are converted to a common type based on a type hierarchy, where R chooses the highest type among the values in the vector. The type hierarchy from highest to lowest is expression > list > character > complex > double > integer > logical > raw. Given this, our vector c( "Hello", 3.14, 3+9i ), made up of character, double, and complex values, has character as its highest type, so all entries are converted to characters.
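The coercion rule can be checked directly. This is a small sketch using typeof() and class() on vectors that mix types.

```r
# Mixed character, double, and complex: everything becomes character.
v <- c( "Hello", 3.14, 3+9i )
class( v )                   # "character"

# Lower down the hierarchy the same rule applies.
typeof( c( TRUE, 2L ) )      # "integer"  (logical promoted to integer)
typeof( c( 2L, 3.14 ) )      # "double"   (integer promoted to double)
typeof( c( 3.14, 3+9i ) )    # "complex"  (double promoted to complex)
```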
Write a set of R statements that assign the associated group names for the following animals to R variables: Beaver: colony; Crow: murder; Parrot: pandemonium; and Porcupine: prickle. Then print four lines listing each animal and its corresponding group name.
I recommend you write your program using RStudio, save it as an R source code file, and then test it, rather than writing the program directly in the R shell. This will let you write your code, run it to see what it does, edit it to fix problems, and run it again, without having to re-type the entire program at the command line.
You can download the solution file and run it on your machine, if you want.
Your choice of variable names is probably different than ours, and you might have printed the animal and group name with slightly different formatting. Regardless, the basic idea is to use eight separate variables to store the animals and group names, then print the contents of these variables in combinations that produce the correct output.
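One possible version of the eight-variable approach is sketched below. The variable names and the output format are assumptions; any consistent naming works equally well.

```r
# Four animals and their four group names, one variable each.
animal_1 <- "Beaver";     group_1 <- "colony"
animal_2 <- "Crow";       group_2 <- "murder"
animal_3 <- "Parrot";     group_3 <- "pandemonium"
animal_4 <- "Porcupine";  group_4 <- "prickle"

# Build one line per animal, then print the lines.
line_1 <- paste0( animal_1, ": ", group_1 )
line_2 <- paste0( animal_2, ": ", group_2 )
line_3 <- paste0( animal_3, ": ", group_3 )
line_4 <- paste0( animal_4, ": ", group_4 )

cat( line_1, line_2, line_3, line_4, sep="\n" )
```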
You might think, "This works, but it doesn't seem very efficient." That's true. Once you've learned more about R, it's unlikely you'd write this code to solve the problem. Here's a more elegant and flexible solution. When you've finished the tutorial, you'll be able to understand, and to implement, this type of code.
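As a rough sketch of what a more flexible version might look like, the code below stores the pairs in an R environment and loops over its names. This is an assumed reconstruction for illustration, not necessarily the tutorial's own solution.

```r
# Store animal -> group name pairs in an environment.
group_nm <- list2env( list(
  Beaver="colony", Crow="murder", Parrot="pandemonium", Porcupine="prickle"
) )

# ls() returns the stored names in sorted order.
for ( animal in ls( group_nm ) ) {
  cat( animal, ": ", get( animal, envir=group_nm ), "\n", sep="" )
}
```

Adding a fifth animal now requires one new entry rather than two new variables and a new print statement.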
R provides a set of built-in functions or operators to perform simple operations such as addition, subtraction, comparison, and boolean logic. An expression is a combination of variables, constants, and operators. Every expression has a result. Operators in R have precedence associated with them. This means expressions using operators are not evaluated strictly left to right. Results from the operators with the highest precedence are computed first. Consider the following simple R expression:
If this were evaluated left to right, the result would be 20. However, since multiplication and division have a higher precedence than addition in R, the result returned is 14, computed as follows.
6 + 3 * 4 / 2 + 2
6 + 12 / 2 + 2
6 + 6 + 2
12 + 2
14
Of course, we can use parentheses to force a result of 20, if that's what we wanted, with the following expression:
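Both results can be verified in the R shell. The first expression is the one traced step by step above; the parenthesized form is one way to force left-to-right evaluation.

```r
# Default precedence: multiplication and division before addition.
default_result <- 6 + 3 * 4 / 2 + 2
default_result    # 14

# Parentheses force strict left-to-right evaluation.
forced_result <- ( ( 6 + 3 ) * 4 ) / 2 + 2
forced_result     # 20
```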
Below is a list of the common operators in R, along with an explanation of what they do. The operators are grouped according to precedence, from highest to lowest.
Operator | Description |
---|---|
( ) | parentheses define the order in which groups of operators should be evaluated |
^ | exponentiation |
+x, -x | make positive, make negative |
%%, %/% | remainder (modulus), integer division (e.g., 5 %% 3 returns 2, 5 %/% 3 returns 1) |
*, / | multiplication, division |
+, - | addition, subtraction |
<, <=, >, >=, !=, == | less, less or equal, greater, greater or equal, not equal, equal |
&, && | vectorized logical AND, scalar (short-circuit) logical AND |
\|, \|\| | vectorized logical OR, scalar (short-circuit) logical OR |
->, ->> | rightward assignment |
<-, <<- | leftward assignment |
= | leftward assignment |
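A few short expressions illustrate how the precedence groups in the table interact. The variable names are assumptions used only to hold results.

```r
# ^ binds tighter than unary minus.
neg_sq <- -2 ^ 2        # -4, evaluated as -( 2 ^ 2 )
pos_sq <- ( -2 ) ^ 2    #  4

# %% and %/% bind tighter than * and /.
mod_first <- 2 * 5 %% 3   # 4, evaluated as 2 * ( 5 %% 3 )
int_div <- 5 %/% 3        # 1

# & works element by element; && works on single values.
vec_and <- c( TRUE, FALSE ) & c( TRUE, TRUE )   # TRUE FALSE
one_and <- TRUE && FALSE                        # FALSE

# Leftward and rightward assignment.
x <- 5
10 -> y
```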
In addition to R operators, a number of common math functions are included in the base environment.
Math Function | Description |
---|---|
abs( x ) | Absolute value of x |
sqrt( x ) | Square root of x |
ceiling( x ) | Smallest integer greater than or equal to x |
floor( x ) | Largest integer smaller than or equal to x |
trunc( x ) | Truncate x toward zero, dropping its fractional part |
round( x, digits=n ) | Round x to n digits of precision (e.g., round( 3.14159, digits=1 ) returns 3.1) |
signif( x, digits=n ) | Maintain n significant digits in x (e.g., signif( 3.14159, digits=3 ) returns 3.14, signif( 314.159, digits=1 ) returns 300) |
cos( x ), sin( x ), tan( x ) | Return cosine, sine, tangent of x (NB: x is specified in radians) |
log( x ) | Return natural log of x |
log10( x ) | Return log base 10 of x |
exp( x ) | Return e raised to the power x |
sum( x, na.rm=[TRUE \| FALSE] ) | Sum the values in vector x. By default NAs are not removed and sum will return NA if one or more exist in x. Specify na.rm=TRUE to ignore NAs |
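The following sketch exercises several of the functions in the table, including the NA behaviour of sum(). The vector vals is an assumption introduced for illustration.

```r
ceiling( 2.1 )                   # 3
floor( -2.1 )                    # -3
trunc( -2.7 )                    # -2 (truncates toward zero)
round( 3.14159, digits=1 )       # 3.1
signif( 314.159, digits=1 )      # 300
log( exp( 1 ) )                  # 1
log10( 1000 )                    # 3

# sum() returns NA when any value is NA, unless na.rm=TRUE.
vals <- c( 1, NA, 3 )
sum( vals )                      # NA
sum( vals, na.rm=TRUE )          # 4
```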
In addition to boolean and numeric variables, R provides a number of more complex types, including characters (or strings), lists, dictionaries, factors, matrices, and data frames. Using these types effectively will make you a much more efficient programmer.
Character variables are a sequence of one or more characters. Character values are denoted by double quotes, s <- "abraham lincoln", or single quotes, s <- 'abraham lincoln'. Because characters can be a sequence of multiple character values, they support more sophisticated operations. Here are some common operations you can perform on characters.
- nchar( s ), return the number of characters in s.
- paste( s, t, sep=" ", collapse=NULL ), concatenate s and t, returning a new character as the result. sep defines the character used to separate s and t. By default this is the space character. collapse is used when s and t are vectors. The default value of NULL concatenates each vector entry separately, returning a vector as a result. Specifying a collapse value concatenates the individual vector results into a single character representation separated by collapse.
- substring( s, i, i ), return the i-th character in s. The first character in s is at substring( s, 1, 1 ).
- substring( s, i, j ), slice s and return the substring from the i-th character up to and including the j-th character.
- substring( s, i ), slice s and return the substring from the i-th character up to and including the last character in s.
Here are some examples of string operations executed in an R shell.
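A minimal sketch of these operations, using the string from the text; the short vectors passed to paste() are assumptions chosen to show the effect of collapse.

```r
s <- "abraham lincoln"

nchar( s )                  # 15
substring( s, 1, 1 )        # "a"
substring( s, 1, 7 )        # "abraham"
substring( s, 9 )           # "lincoln"

# paste() on single values, then on vectors with and without collapse.
paste( "abraham", "lincoln" )                              # "abraham lincoln"
paste( c( "a", "b" ), c( "x", "y" ), sep="-" )             # "a-x" "b-y"
paste( c( "a", "b" ), c( "x", "y" ), sep="-", collapse="+" )  # "a-x+b-y"

toupper( s )                      # "ABRAHAM LINCOLN"
strsplit( s, split=" " )[[ 1 ]]   # "abraham" "lincoln"
```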
There are some additional operations you can perform on strings, for example, toupper( s ) to convert a string to upper case, or strsplit( s, split=" " ) to subdivide s based on the split character. The R documentation enumerates the available string operations. As you can probably tell, base R has limited string manipulation. For a more convenient and complete set of string functions, consider a package like the stringr package, which is included with the tidyverse package (install tidyverse with the command install.packages( "tidyverse" ), then load the package with the command library( tidyverse ) to access its functionality).
List variables are ordered sequences of values. Unlike a vector, different data types can be stored in a list, for example, numerics, characters, or even lists themselves.
Lists are known as recursive objects in R, whereas vectors are considered atomic. This affects how different functions process their arguments. Another interesting property of lists is that they are often made up of names and corresponding values (NB: vectors can use name–value association pairs as well, although their keys and values will use the type hierarchy to ensure a common type). This is similar to dictionaries, discussed below, which store key–value pairs in ways that make searching for specific keys and their values very efficient. Lists are not efficient, unfortunately, since they use name lookup to match names to their corresponding values.
Because lists can contain different types of values, the way you index into a list to get or set values is different than vectors. If we create a vector v <- c( 1, 2, 3 ) then we can retrieve the i-th element using v[ i ], for example, v[ 2 ] which returns 2. Below are examples of creating, accessing, and examining the structure of a list in different ways.
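The following sketch creates a list with mixed types and accesses it by name and by position. The entry names and values are assumptions for illustration only.

```r
v <- c( 1, 2, 3 )
v[ 2 ]                 # 2

# A list can mix characters, numerics, and vectors.
l <- list( name="Abraham Lincoln", age=56, nicknames=c( "Honest Abe", "The Rail-Splitter" ) )

l$name                 # access by name with $
l[[ "age" ]]           # access by name with double brackets
l[[ 3 ]][ 1 ]          # first nickname, accessed by position

names( l )             # "name" "age" "nicknames"
length( l )            # 3
str( l )               # compact view of the list's structure
```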
As with strings, there are many additional operations you can perform on lists. To add an entry to a list, you can specify either its name or its index: l["Greeting"]="Hello" or l[4]="Hello". Notice that in the second case with an index value, the new list entry will have no name associated with it. To remove an entry, set it to NULL: l[4] <- NULL. To query the names of each entry, use the names() function: names(l). Additional operations exist to modify list elements and convert them into different data types.
A note on lists. One important property of a list is the type of result returned during indexing with square brackets. l["greeting"] and l[["greeting"]] both appear to return "Hello", but if we look at their class, we see that class(l["greeting"]) is "list", whereas class(l[["greeting"]]) is "character". Single brackets return a sublist. Double brackets return the value itself. Consider the following code to see if you understand how this works.
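A short sketch of the difference, using a two-entry list; the count entry is an assumption added to show why the distinction matters in arithmetic.

```r
l <- list( greeting="Hello", count=3 )

class( l[ "greeting" ] )      # "list": single brackets return a sublist
class( l[[ "greeting" ]] )    # "character": double brackets return the value

# Arithmetic needs the value, so double brackets are required.
l[[ "count" ]] + 1            # 4
# l[ "count" ] + 1 would stop with a non-numeric argument error

# Single brackets can select several entries at once.
sub_l <- l[ c( "greeting", "count" ) ]
length( sub_l )               # 2
```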
Dictionary variables are a collection of key–value pairs. This is meant to be analogous to a real dictionary, where the key is a word, and the associated value is the word's definition. Dictionaries are designed to support efficient searching for elements in the dictionary based on key.
In R, a specific dictionary data type does not exist. However, it is possible to use R environments to mimic dictionaries, since environments are built using a standard dictionary data structure. The Variable Assignment Solution above shows how to use environments to do this. Semantically, however, this can be confusing, particularly for new R users. An alternative is to use a package that provides hash tables. Numerous packages exist, but we will use the r2r package, which provides both hashmap (hash tables) and hashset, an implementation of a mathematical set which can also be efficiently programmed using a hash table.
To install the r2r package, you will need to install the devtools package and use its GitHub interface to download and install r2r.
By design, dictionaries have one important requirement: every value you store in a dictionary must have its own, unique key. For example, we could not store a person's address using their last name as a key, because if two different people had the same last name, only one of their addresses could be saved in the dictionary.
To re-implement the animal–group name problem using r2r's hashmap, we would use the following R code.
The first statement creates a dictionary variable named group_nm and assigns the four animal–group name pairs. The next two lines ask for the keys and corresponding values stored in group_nm. Next, we query the value associated with the key Beaver, which is in group_nm, and Flamingo, which is not. Notice that the query for Flamingo returns NULL to indicate the key is not in group_nm. The final two lines show how to use the operator %in% to see if a given key is in the dictionary. The full documentation for r2r is available online.
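Because r2r may not be installed on your machine, the sketch below walks through the same sequence of steps (create, list keys and values, query a present and a missing key, test membership) using a base R environment as a stand-in. It deliberately does not show r2r's own function names; treat it as an analogy under that assumption rather than as the r2r code itself.

```r
# Create the dictionary and store the four key-value pairs.
group_nm <- new.env( hash=TRUE )
assign( "Beaver", "colony", envir=group_nm )
assign( "Crow", "murder", envir=group_nm )
assign( "Parrot", "pandemonium", envir=group_nm )
assign( "Porcupine", "prickle", envir=group_nm )

# Keys and their corresponding values.
key_list <- ls( group_nm )
val_list <- unlist( mget( key_list, envir=group_nm ) )

# Query a key that exists and one that does not (get0 returns NULL if missing).
beaver_grp <- get0( "Beaver", envir=group_nm, inherits=FALSE )
flamingo_grp <- get0( "Flamingo", envir=group_nm, inherits=FALSE )

# Membership tests with %in%.
"Beaver" %in% key_list      # TRUE
"Flamingo" %in% key_list    # FALSE
```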
Technical Aside: Dictionaries are a very powerful data structure. If you need to perform efficient search, if the ordering of the elements isn't critical, and if you can define a key for each of the elements you're storing, a hashmap might be a good candidate.
What does it mean to say "Dictionaries are fast?" In computing terms, we measure speed using order notation \(\textrm{O}\). Lookup, insertion, and deletion in a dictionary are \(\textrm{O}(1)\) on average, but lookup and deletion in a list or vector are \(\textrm{O}(n)\) for a set of \(n\) values. In simple terms, this means the time for an operation on a dictionary is constant no matter how big the dictionary is, but the time for an operation on a list or vector is proportional to the size of the list or vector. If you double a list's size, on average it takes about twice as long to find or delete a value.
A factor is a finite set of categorical values or levels. A factor variable can only be set to one of its allowable categories.
Notice that R has examined the vector we convert to a factor, and automatically inferred all the unique values to determine the levels of the factor. It is possible to define the levels directly, for example, in cases where levels exist that are not part of the initial vector being converted to a factor.
Internally, factors are stored as integers starting at 1. Each integer corresponds to one of the factor's category values.
If you want to name the initial values in the vector being converted to a factor, you can optionally specify labels for each category. For example, suppose we had four possible birth cities.
Consider the following code snippet, which first uses the city values directly, then maps a label to each city value.
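A sketch of the three ideas in this section follows: inferred levels, explicitly defined levels, and labels mapped onto coded values. The city spellings follow the text; the numeric codes 1 to 4 are an assumption used to illustrate labelling.

```r
# Levels inferred from the data (sorted alphabetically).
f <- factor( c( "Dublin", "London", "Dublin", "Sofia" ) )
levels( f )        # "Dublin" "London" "Sofia"
as.integer( f )    # 1 2 1 3, the internal integer codes

# Levels defined directly, including one that does not occur in the data.
cities <- c( "Dublin", "London", "Sofia", "Ponteverdra" )
f_lvl <- factor( c( "Dublin", "London" ), levels=cities )
levels( f_lvl )    # "Dublin" "London" "Sofia" "Ponteverdra"

# Labels mapped onto coded values 1..4.
codes <- c( 1, 3, 2, 1 )
f_lbl <- factor( codes, levels=1:4, labels=cities )
as.character( f_lbl )   # "Dublin" "Sofia" "London" "Dublin"
```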
Once a factor variable is created, it is possible to update or extend its levels, for example, by executing levels(f) <- c(levels(f), "Zurich"). This will change f's levels to Dublin London Sofia Ponteverdra Zurich.
You can also add values to a factor and automatically have its levels update using append(), similar to vectors and lists. All the arguments to append() must be factors for this to work, however. If you simply try to append() any new value to a factor, it converts to a vector.
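The sketch below extends a factor's levels and then appends to it in both ways. The factor-to-factor append relies on the behaviour of recent R versions (4.1 and later, an assumption consistent with the R 4.4.x used in this tutorial); older versions return integer codes instead.

```r
f <- factor( c( "Dublin", "London" ), levels=c( "Dublin", "London", "Sofia", "Ponteverdra" ) )

# Extend the set of levels without adding any values.
levels( f ) <- c( levels( f ), "Zurich" )
levels( f )    # "Dublin" "London" "Sofia" "Ponteverdra" "Zurich"

# Appending a factor to a factor keeps a factor (R 4.1 and later).
g <- append( f, factor( "Zurich" ) )

# Appending a plain character value does not give a factor back.
h <- append( f, "Zurich" )
is.factor( h )    # FALSE
```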
A matrix is a 2-dimensional data table with rows and columns. Like a 1-dimensional vector, all the values in a matrix must be of the same type. Since matrix data is normally provided as a vector, the same type conversion hierarchy for vectors will be used to convert all data to a common type if different types of data are provided. A matrix is created by optionally specifying its data and its number of rows and columns.
Matrix items are accessed via indexing, providing either a row, a column, or both. The row and column specifiers can be vectors, allowing you to retrieve multiple rows or columns at once.
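As a small sketch, the code below builds a 2 × 3 matrix and indexes it by row, by column, and by both. Note that matrix() fills values column by column unless told otherwise.

```r
# 2 rows, 3 columns, filled column by column:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
m <- matrix( 1:6, nrow=2, ncol=3 )

m[ 2, 3 ]           # 6, a single item
m[ 1, ]             # 1 3 5, the first row
m[ , 2 ]            # 3 4, the second column
m[ , c( 1, 3 ) ]    # columns 1 and 3, returned as a 2 x 2 matrix
```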
New rows and columns can be bound to an existing matrix using rbind() and cbind().
You can also remove rows and columns using -c( i ) where i identifies the row or column you want to remove.
The dim() function returns the number of rows and columns in a matrix, and the length() function returns the total number of values stored in a matrix.
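The binding, removal, and size operations can be sketched together. The values bound to the matrix are arbitrary assumptions.

```r
m <- matrix( 1:6, nrow=2, ncol=3 )

m <- rbind( m, c( 7, 8, 9 ) )        # add a third row: now 3 x 3
m <- cbind( m, c( 10, 11, 12 ) )     # add a fourth column: now 3 x 4
dim_grown <- dim( m )                # 3 4

m <- m[ -c( 1 ), ]                   # remove row 1: now 2 x 4
m <- m[ , -c( 2 ) ]                  # remove column 2: now 2 x 3

dim( m )                             # 2 3
length( m )                          # 6
```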
Finally, the most useful property of matrix variables is the ability to use them to perform linear algebra operations. These can be element-wise, where both matrices have the same number of rows and columns, or matrix-wise, where operations like matrix multiplication are applied. Transpose, inverse, and determinant functions are also available either directly within R or through the pracma package.
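The sketch below contrasts element-wise and matrix multiplication and shows the base R transpose, inverse, and determinant functions. Only base R is used here; the matrices a and b are assumptions chosen so the results are easy to check by hand.

```r
a <- matrix( c( 2, 0, 0, 2 ), nrow=2 )   # 2 times the identity matrix
b <- matrix( 1:4, nrow=2 )               # columns are (1, 2) and (3, 4)

elem_prod <- a * b      # element-wise: multiplies matching positions
mat_prod <- a %*% b     # matrix multiplication: here equal to 2 * b

t( b )                  # transpose
a_inv <- solve( a )     # inverse: 0.5 times the identity matrix
det( a )                # determinant: 4
```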
A data frame is a table, a two-dimensional structure where each column contains values for one attribute or property, and each row contains a sample with one value for every attribute (column). Data frames extend matrices in a way that is similar to how lists extend vectors. Perhaps most importantly, the data frame columns can contain different types of values from one another. Each row and column is named, either explicitly or implicitly (e.g., you may choose to allow R to number the rows sequentially starting at 1). The following guidelines apply to data frames.
The simplest way to create a data frame is to define its column names and values during data frame initialization. The row names can either be defined or left to the default \(1 \ldots n\) values.
Data in a data frame can be summarized using the summary() function. This returns the minimum, median, mean, maximum, and the 1st and 3rd quartile boundaries.
Individual columns can be extracted from a data frame by using their respective names. Rows are extracted using indexing based on row and column position when a subset of columns is requested. Columns are returned as vectors. Rows are returned in a data frame.
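A minimal sketch of creating, summarizing, and extracting from a data frame follows. The column names and values are hypothetical, invented purely for illustration.

```r
# Hypothetical data: three people with two numeric attributes.
df <- data.frame(
  name=c( "Ann", "Bob", "Cy" ),
  age=c( 31, 45, 27 ),
  height=c( 165, 180, 172 )
)

summary( df$age )          # min, quartiles, median, mean, max of one column

df$age                     # a column, returned as a vector
df[[ "name" ]]             # the same idea using double brackets

row_2 <- df[ 2, ]                        # a row, returned as a data frame
part <- df[ 2, c( "name", "age" ) ]      # row 2, only two of its columns
```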
If you want to add columns to a data frame, you can simply define a new column with a given name and assign a vector to it. The vector must be the same length as the existing columns in the data frame, that is, it must have a value for every row in the data frame. You can explicitly set the length of the vector to the number of rows in the data frame to guarantee this. When you do, a vector that is too short has its missing positions filled with NA (not available), and a vector that is too long is truncated to match the number of rows in the data frame.
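The column-adding steps can be sketched as follows, using a hypothetical data frame and a score vector that is one value too short.

```r
df <- data.frame( name=c( "Ann", "Bob", "Cy" ), age=c( 31, 45, 27 ) )

# Only two scores for three rows.
score <- c( 88, 92 )

# Setting the length pads the missing position with NA.
length( score ) <- nrow( df )
score                  # 88 92 NA

# Now the vector has one value per row and can be added as a column.
df$score <- score
```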
A similar approach can be used to add rows to a data frame. First, create a list with the value(s) you want in the new row. Set the length() of the list to be equal to the number of columns in the data frame that will hold the new row. Unfortunately, if the list is shorter than required, empty positions will be filled with NULL rather than NA, so we need to convert any NULL entries to NA. Once this is done, we set the names of the list's entries to match the column names of the existing data frame, then use rbind() to bind the data frame and the list together, producing a result with the new row added to the end of the existing data frame.
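These steps can be sketched as follows. The data frame and the new row's values are hypothetical; the new row supplies a name but no age.

```r
df <- data.frame( name=c( "Ann", "Bob" ), age=c( 31, 45 ), stringsAsFactors=FALSE )

# Step 1: a list holding the value(s) for the new row.
new_row <- list( "Cy" )

# Step 2: pad the list to one entry per column (padding entries are NULL).
length( new_row ) <- ncol( df )

# Step 3: replace NULL entries with NA.
new_row[ sapply( new_row, is.null ) ] <- NA

# Step 4: name the entries after the data frame's columns, then bind.
names( new_row ) <- names( df )
df <- rbind( df, new_row )
```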
One of the most useful operations on data frames is conditional indexing. Here, rows in a data frame are extracted based on conditions applied to a row's column values. Only rows whose values meet the conditions are returned. Consider the following example, where we use the chickwts dataset to extract chicks fed with sunflower seeds.
The conditional index cond_idx contains TRUE for rows that meet the condition of a chick fed sunflower seeds and FALSE otherwise. Notice that when we extract the subset of rows, we include a comma in the index operation, sub_df <- df[ cond_idx, ]. This is necessary to extract the entire row with all its columns. The result is twelve chicks from the original data frame. The second example shows how you can specify multiple conditions using the boolean and operator &.
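The conditional indexing described above can be sketched with the built-in chickwts dataset. The names df, cond_idx, and sub_df follow the text; the weight threshold of 300 in the second example is an assumption chosen for illustration.

```r
df <- chickwts

# TRUE for chicks fed sunflower seeds, FALSE otherwise.
cond_idx <- df$feed == "sunflower"

# The trailing comma keeps every column of the selected rows.
sub_df <- df[ cond_idx, ]
nrow( sub_df )    # 12

# Multiple conditions combined with &.
heavy_df <- df[ df$feed == "sunflower" & df$weight > 300, ]
```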
We've already seen that an R program runs by executing the first statement in the code and continuing with each successive statement until it reaches the end of the program. This doesn't allow for very complicated programs. What if we want to control the flow of execution, that is, what if we want one part of the program to be executed in some cases, but another part to be executed in different cases?
Conditional statements allow you to control how your program executes. For example, a conditional statement could apply a comparison operator to a variable, then execute a different block of statements depending on the result of the comparison. Or, it could cause a block of statements to be executed repeatedly until some condition is met.
Understanding conditional statements is necessary for writing even moderately complicated programs. We discuss some common R conditional operators below, and give details on how to structure your code within a conditional statement.
To start, we'll discuss the if-then-else conditional. Described in simple terms, this is used in a program to say, "if some condition is true, then do this, else do that."
As an example, suppose we have a variable grade that holds a student's numeric grade on the range 0–100. We want to define a new variable passed that's set to TRUE if the student's grade is 50 or higher, or FALSE if the grade is less than 50. The following R conditional will do this.
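A minimal sketch of such a conditional follows. The grade value of 72 is an assumption used so the code can be run as-is.

```r
grade <- 72

# Set passed based on whether the grade is 50 or higher.
if ( grade >= 50 ) {
  passed <- TRUE
} else {
  passed <- FALSE
}

passed    # TRUE
```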
Although this statement appears simple, there are a number of important details to discuss.
- The comparison grade >= 50 evaluates to TRUE if grade's value is 50 or greater, and FALSE if it isn't. The if conditional uses this boolean result to decide which part of the conditional statement to execute.
- The comparison is enclosed in parentheses ( ). This guarantees it is evaluated before the if statement uses its result.
- Braces { } are placed around the blocks of code following the if and else statements. This identifies which statement(s) make up each code block. Braces can be omitted in R only when a block consists of a single statement, for example, when your if statement has only a then block with a single statement.
Interestingly, the else part of the conditional is optional. The following code will produce the same result as the first example.
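One way to write the conditional without an else part is to set a default first. This is a sketch under the same assumed grade of 72, not necessarily the tutorial's own version.

```r
grade <- 72

# Assume failure, then overwrite only when the grade is high enough.
passed <- FALSE
if ( grade >= 50 ) {
  passed <- TRUE
}

passed    # TRUE
```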
Suppose we wanted to not only define pass or fail, but also assign a letter grade for the student. We could use a series of if-then statements, one for each possible letter grade. A better way is to use else if, which defines else-if code blocks. Now, we're telling a program, "if some condition is true, then do this, else if some other condition is true, then do this, else do that." You can include as many else-if statements as you want in an if-then-else conditional.
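An else-if chain for letter grades might look like the sketch below. The cutoffs (90, 80, 70, 60) and the grade value are assumptions; substitute whatever scale applies.

```r
grade <- 72

# Conditions are tested top to bottom; the first TRUE one wins.
if ( grade >= 90 ) {
  letter <- "A"
} else if ( grade >= 80 ) {
  letter <- "B"
} else if ( grade >= 70 ) {
  letter <- "C"
} else if ( grade >= 60 ) {
  letter <- "D"
} else {
  letter <- "F"
}

letter    # "C"
```

Because the tests run in order, each branch only needs a lower bound: reaching the third test already implies the grade is below 80.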
Another common situation is the need to execute a code block until some condition is met. This is done with a while conditional. Here, we're telling the program "while some condition is true, do this." For example, suppose we wanted to print the square roots of values on the range 1–15.
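A while loop for this task can be sketched as follows; the output format is an assumption.

```r
i <- 1
while ( i <= 15 ) {
  cat( "sqrt(", i, ") =", sqrt( i ), "\n" )
  i <- i + 1    # update the loop variable so the condition eventually fails
}
```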
Notice that the variable that's compared in the while conditional normally must be updated in the conditional's code block. If you don't update the conditional variable, a comparison that initially evaluates to TRUE will never evaluate to FALSE, which means the while loop will execute forever. For example, consider the following code block.
Without the i <- i + 1 statement to update i in the conditional's code block, the while conditional never fails, giving us the same output over and over. You can use Ctrl+C to halt your program if it's caught in an infinite loop like this.
A final conditional that is very common is a for loop. Here, we're telling a program "execute this code block for some list of values." A for loop can work on any list of values, but it's often applied to a numeric range. Numbers separated with a colon can be used to create an inclusive sequence of numerics in one-unit increments.
Specifying two values like 2:5 defines a starting value of 2 and an ending value of 5. This generates an integer list from the starting value, up to and including the ending value: 2 3 4 5. If you want to increment by a value other than one, you can use the seq function.
Once a list is produced with colon or seq, each value in the list is given to the for conditional's code block, in order. For example, suppose we wanted to print the same set of square roots from 1–15 using a for loop.
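The sketch below first shows the two ways of generating a sequence, then uses a colon sequence in a for loop. The seq() arguments are an assumed example of a non-unit increment.

```r
2:5                      # 2 3 4 5
seq( 2, 10, by=2 )       # 2 4 6 8 10

# Print the square roots of 1 through 15.
for ( i in 1:15 ) {
  cat( "sqrt(", i, ") =", sqrt( i ), "\n" )
}
```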
The for statement defines a variable to hold the "current" list value. In our case, this variable is called i. 1:15 generates the list 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15. The for conditional walks through this list and executes the code block 15 times, first with i set to 1, then with i set to 2, and so on up to the final value of 15. The statement inside the code block uses i to track the current list value, printing square roots from 1 to 15.
We don't need to use colon or seq to execute a for conditional. Any vector or list can be used in a for loop.
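For instance, a for loop can walk a character vector or a list with mixed types. The values below are assumptions for illustration.

```r
# Loop over a character vector.
for ( animal in c( "Beaver", "Crow", "Parrot" ) ) {
  cat( animal, "\n" )
}

# Loop over a list whose entries have different types.
seen <- character( 0 )
for ( item in list( 1, "two", TRUE ) ) {
  seen <- c( seen, class( item ) )
}
seen    # "numeric" "character" "logical"
```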
break
Sometimes we need to exit a for or while loop before its condition evaluates to FALSE. The break statement allows us to do this. For example, suppose we wanted to print the elements of a list of strings, but terminate examining the list if we see the string stop.
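A sketch of break in action follows. The strings other than stop are assumptions, and the printed vector exists only to record which values were reached.

```r
words <- c( "alpha", "beta", "stop", "gamma" )
printed <- character( 0 )

for ( w in words ) {
  if ( w == "stop" ) {
    break    # leave the loop immediately; "gamma" is never examined
  }
  cat( w, "\n" )
  printed <- c( printed, w )
}
```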
next
Other times, we want to stop executing a loop's code block, and instead return to check its condition. The next statement allows us to do this. For example, suppose we wanted to print only the odd numbers from 1 to 10.
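One way to do this with next is sketched below; the odd vector is an assumption added to record the printed values.

```r
odd <- integer( 0 )

for ( i in 1:10 ) {
  if ( i %% 2 == 0 ) {
    next    # even number: skip the rest of the block
  }
  cat( i, "\n" )
  odd <- c( odd, i )
}
```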
Write a set of R statements to compute the average of the following list of numbers.
I recommend you write your program using RStudio and then test it, rather than writing the program directly in the R shell. This will let you write your code, run it to see what it does, edit it to fix problems, and run it again, without having to re-type the entire program at the command line.
for loop
while loop
Notice that, unlike some other languages, R does not require us to convert the sum to a floating point value before dividing. The / operator always performs floating point division, so dividing the sum by length( num ) gives the proper average of 13.75. If we had used the integer division operator %/% instead, R would have returned an integer result of 13.
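Both loop-based approaches can be sketched as follows. The numbers in num are hypothetical stand-ins (the tutorial's own list is not reproduced here), chosen so that their average matches the 13.75 mentioned above.

```r
# Hypothetical values; not the tutorial's own list.
num <- c( 10, 15, 12, 18 )

# Average using a for loop.
total <- 0
for ( val in num ) {
  total <- total + val
}
avg_for <- total / length( num )

# Average using a while loop.
total <- 0
i <- 1
while ( i <= length( num ) ) {
  total <- total + num[ i ]
  i <- i + 1
}
avg_while <- total / length( num )

avg_for      # 13.75
avg_while    # 13.75
```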
You can download the solution file and run it on your machine, if you want.
Inevitably, you'll write some R code that either doesn't do what you expect it to do, or that generates an error message when you try to execute it. When that happens, you'll need to debug the program to locate and correct the error. Consider the following code.
If you hit return to close the for loop, R would respond with an error message similar to this.
So, that didn't work. The error message shows the snippet of code that caused the error, and what the error was. The important part of the error is the attempt to explain the problem R encountered. This explanation suggests that R doesn't know how to add (+) non-numeric arguments.
If you look at where the error was reported, it attempted to execute sum <- sum + val. R is claiming one of the variables sum or val is non-numeric. Indeed, sum is a numeric, but the second variable val is a character. val is a value from the vector l. And, when we look at l, we see that it contains three character values: "10", "20", and "30". This is the problem that R encountered.
There are various ways to fix this problem. One simple solution is to put numerics in the vector, l <- c( 10, 20, 30 ). If you wanted l to contain strings for some reason, you could convert val to a numeric in the add operation using as.numeric().
Now, R accepts the for loop's body because it understands how to add two numeric variables. The resulting sum is printed after the loop finishes.
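The sequence described above can be reproduced with the sketch below, which is reconstructed from the description rather than copied from the original listing. The tryCatch() wrapper and the msg variable are additions made here so the script keeps running after the error; at the interactive prompt R would simply stop and print the message.

```r
l <- c("10", "20", "30")   # character values, not numbers

# First attempt: adding a character to a numeric fails
sum <- 0
msg <- tryCatch({
  for (val in l) {
    sum <- sum + val
  }
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)   # R's explanation of what went wrong

# Fix: convert each value to a numeric before adding
sum <- 0
for (val in l) {
  sum <- sum + as.numeric(val)
}
print(sum)   # 60
```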
One important operation when using R is to read and write data to and from external files. R uses file input/output (file IO) operations to support this. The most common read operations import a text file formatted as a table, or import a text file stored in comma-separated value (csv) format.
Notice that in both cases the result is stored as a data.frame. By default, read.csv assumes the first line of the file contains column names. read.table assumes this only when the first line has one fewer entry than the remaining lines, in which case the first entry of each subsequent line is also treated as a row name. If the file has no header line, specify header=FALSE as an argument in read.table or read.csv. Columns will be given generic names V1, V2, …, Vn. If the rows are not labelled, specify row.names=NULL. Rows will be numbered starting at 1.
If you want to define the column or row names after the file is read you can use the names or rownames functions. Finally, if your file uses a separator character (a delimiter) other than comma, there is a read.delim function to allow you to read a delimited file. read.delim assumes a tab delimiter by default; define any other delimiter with the sep argument.
Writing data frames to an output file is similar. R provides the write.table and write.csv functions to write data frames as tabular or csv data. You provide the data frame to write and the path and name of the file to create, for example, write.table( aq_tbl, file="w:/msa/r/airquality-new.txt" ).
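The following sketch ties the read and write functions together as a round trip. It assumes R's built-in airquality data frame as the data source and writes to temporary files created with tempfile(), so the file paths and the variable names csv_path, txt_path, and aq_csv are illustrative rather than taken from the tutorial's own files.

```r
# Write a built-in data frame to a temporary csv file
csv_path <- tempfile(fileext = ".csv")
write.csv(airquality, file = csv_path, row.names = FALSE)

# Read it back; read.csv() treats the first line as column names
aq_csv <- read.csv(csv_path)
print(class(aq_csv))
print(head(aq_csv, 3))

# Write and read the same data as a tab-delimited table
txt_path <- tempfile(fileext = ".txt")
write.table(airquality, file = txt_path, sep = "\t", row.names = FALSE)
aq_tbl <- read.table(txt_path, header = TRUE, sep = "\t")

# Column names can be changed after the file is read
names(aq_tbl)[1] <- "ozone"
print(names(aq_tbl))
```

Passing row.names = FALSE when writing keeps the row numbers out of the file, so the data frame read back has the same columns as the one written.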
It's possible to write a program as a single, long sequence of statements in the main module. Even for small programs, however, this isn't efficient. First, writing a program this way makes it difficult to examine and understand. Second, if you're performing a common operation on different variables, you need to duplicate the code every time you perform that operation.
For example, suppose we wanted to report the maximum of two numeric lists l and m. One obvious way to do it is to write two for loops.
This has a number of problems, however. What if we had more than just two lists we wanted to query? We'd need to duplicate the for loop once for each list. What if we wanted to do something more complicated than calculating the maximum (e.g., what if we wanted variance instead)? The amount of code we'd need to duplicate would be much longer.
What we really want to do is to have some sort of max_val() operation that we can call whenever we want to calculate the maximum value of a numeric list.
In R we can define a function to create new operations like max_val(). A function is defined by a function name, the keyword function, an optional argument list in parentheses, and then a function code block that defines what the function does when it's called.
Functions can take zero or more arguments. A function with no arguments still needs open and close parentheses, func <- function(). A function with multiple arguments separates them with commas, func <- function( a, b ). Once a function is defined, it can be used anywhere, including in other functions. Suppose we now wanted to write a function max_val_list() to compute the maximum value from a list of numeric lists. We can use our max_val() function to help to do this.
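A minimal sketch of the two functions is shown below. The function names come from the text, but the bodies, the argument names v and lists, and the sample values in l and m are assumptions made for illustration. Each "numeric list" is represented as an R numeric vector, and the list of lists as an R list of vectors.

```r
# Return the largest value in a numeric vector
max_val <- function(v) {
  largest <- v[1]
  for (val in v) {
    if (val > largest) {
      largest <- val
    }
  }
  return(largest)
}

# Return the largest value found in a list of numeric vectors,
# reusing max_val() for each individual vector
max_val_list <- function(lists) {
  largest <- max_val(lists[[1]])
  for (v in lists) {
    cur <- max_val(v)
    if (cur > largest) {
      largest <- cur
    }
  }
  return(largest)
}

l <- c(3, 9, 4)    # hypothetical sample values
m <- c(12, 5, 7)

print(max_val(l))                  # 9
print(max_val(m))                  # 12
print(max_val_list(list(l, m)))    # 12
```

Because the loop lives inside max_val(), handling a third or fourth vector requires only another function call, not another copy of the loop.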
It's even possible for functions to call themselves. This is known as recursion. The classic example of recursion is the Fibonacci sequence. However, we'll demonstrate recursion by developing a recursive algorithm to solve Sudoku, a puzzle where a \(9 \times 9\) grid partially filled with numbers from 1 to 9 is completed based on the following rules.
Two common approaches to solve a Sudoku puzzle are brute force and backtracking, which differ in simplicity and efficiency.
We won't go over the backtracking solution to Sudoku, but you can look at one implementation below. Remember, the efficiency of the solution will depend on how many numbers you provide in the initial board. The fewer the known numbers, the longer it will take to locate the solution.
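For readers who want to see the recursive idea in code, here is a compact backtracking sketch. It is not the tutorial's own implementation. The function names is_valid() and solve_sudoku(), the use of 0 to mark an empty cell, and the sample puzzle are all assumptions made for this illustration.

```r
# Can value v be placed at row r, column k without breaking a rule?
is_valid <- function(board, r, k, v) {
  if (any(board[r, ] == v)) return(FALSE)     # already in the row
  if (any(board[, k] == v)) return(FALSE)     # already in the column
  br <- 3 * ((r - 1) %/% 3) + 1               # top row of the 3 x 3 box
  bk <- 3 * ((k - 1) %/% 3) + 1               # left column of the 3 x 3 box
  !any(board[br:(br + 2), bk:(bk + 2)] == v)  # already in the box?
}

# Recursive backtracking: fill the first empty cell, recurse on the rest,
# and undo the choice if it leads to a dead end
solve_sudoku <- function(board) {
  empty <- which(board == 0, arr.ind = TRUE)
  if (nrow(empty) == 0) return(board)         # no empty cells: solved
  r <- empty[1, "row"]
  k <- empty[1, "col"]
  for (v in 1:9) {
    if (is_valid(board, r, k, v)) {
      board[r, k] <- v
      result <- solve_sudoku(board)           # the function calls itself
      if (!is.null(result)) return(result)
      board[r, k] <- 0                        # backtrack
    }
  }
  NULL                                        # no value fits: dead end
}

# Sample puzzle (0 = empty cell)
puzzle <- matrix(c(
  5, 3, 0, 0, 7, 0, 0, 0, 0,
  6, 0, 0, 1, 9, 5, 0, 0, 0,
  0, 9, 8, 0, 0, 0, 0, 6, 0,
  8, 0, 0, 0, 6, 0, 0, 0, 3,
  4, 0, 0, 8, 0, 3, 0, 0, 1,
  7, 0, 0, 0, 2, 0, 0, 0, 6,
  0, 6, 0, 0, 0, 0, 2, 8, 0,
  0, 0, 0, 4, 1, 9, 0, 0, 5,
  0, 0, 0, 0, 8, 0, 0, 7, 9
), nrow = 9, byrow = TRUE)

solved <- solve_sudoku(puzzle)
print(solved)
```

Each recursive call handles exactly one empty cell, so the depth of the recursion is at most the number of empty cells in the initial board.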
The tidyverse is a relatively new set of packages designed for performing data science in R. Functionality provided by tidyverse and its constituent packages is purpose-built to perform common data science tasks in a consistent and efficient manner. tidyverse is made up of a number of core packages, each designed to address a common need in data analytics.
dplyr. A data manipulation package (cheat sheet).
forcats. A categorical variable management package (cheat sheet).
ggplot2. A visualization package (cheat sheet).
lubridate. A date–time management package (cheat sheet).
purrr. A package to manage functions and vectors (cheat sheet).
readr. A package to simplify reading rectangular data from delimited files (cheat sheet).
stringr. A set of operations for string manipulation (cheat sheet).
tibble. A modern reinterpretation of a data frame.
tidyr. A package of data wrangling operations to standardize data storage within tidyverse (cheat sheet).
The designers of tidyverse recommend using the book "R for data science, 2nd edition" as a tutorial and learning reference. The entire content of the book is also available online. Cheat sheets for each package that give an overview of the package's purpose and available functions are linked in the list above.
Since operations in tidyverse are performed on tibbles, it is important to understand what a tibble is and how it works. In simple terms, tibbles are extensions of data frames that are meant to make working with rows and columns of data easier.
The "R for data science" book highlights two main differences between tibbles and data frames: printing and subsetting. Tibbles redefine the print statement in the following ways.
str().
Subsetting works in a very similar manner to data frames, which is not surprising since tibbles are built on top of data frames. You can index individual columns using $, [[...]], or by a specific column position starting at 1.
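The three ways of indexing a column can be sketched as follows, assuming the tibble package is installed. The tibble tib and its columns x and y are hypothetical and exist only to show the syntax.

```r
library(tibble)

# A small hypothetical tibble
tib <- tibble(x = 1:3, y = c("a", "b", "c"))

by_dollar   <- tib$x       # column by name with $
by_name     <- tib[["x"]]  # column by name with [[...]]
by_position <- tib[[1]]    # column by position, starting at 1

print(by_dollar)
print(by_name)
print(by_position)
```

All three expressions return the same vector, the contents of column x.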
~ and . operators
The tilde (or twiddle) ~ operator separates or specifies arguments to a formula. For example, lm(wages ~ yearsed, data=df) would take a data frame df and perform linear regression using a predictor \(x\) of yearsed and a response \(y\) of wages (i.e., what is the linear relationship between years of education and wages?).
The dot . operator is often used as shorthand for "all columns". So aggregate(data=iris, . ~ Species, mean) would return the mean for every column in the iris data frame, grouped by iris species.
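Both operators can be tried with base R alone. Since the wages data frame mentioned above is not available here, this sketch substitutes two columns from the built-in iris data frame for the regression, which is an assumption made purely for illustration. The variable names fit and means are likewise illustrative.

```r
# Formula: response on the left of ~, predictor on the right
fit <- lm(Petal.Width ~ Petal.Length, data = iris)
print(coef(fit))   # intercept and slope of the fitted line

# Dot: every column not named on the right-hand side
means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(means)
```

The aggregate() result has one row per species and one mean for each of the four measurement columns.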
%>% Operator
The pipe operator %>% allows chaining sequences of operations and applying them to a variable in a single statement, rather than using multiple statements. This is analogous to composition in mathematics. Suppose we had a dataset \(x\) and two functions \(f()\) and \(g()\). If we wanted to apply \(f\) to \(x\), then apply \(g\) to the result, we could either create a temporary result variable: \(r = f(x), g(r)\) or we could compose the two operations into a single statement: \(g(f(x))\).
Pipes are used to perform a similar type of composition in R: print( x %>% f %>% g ). Here, we take the dataset x, pass it to function f(), then take the result and pass that to function g(), printing the result.
Suppose you wanted to take the iris dataset, extract rows with a Sepal.Length greater than 6, then aggregate the mean of Petal.Length and Petal.Width grouped by Species.
Notice the statement ".[ tib$Sepal.Length > 6, ]". For functions that do not expect explicit input data, we use dot to refer to data from the previous step in the pipeline. In this case, the index operator does not expect explicit data like aggregate() does, so we use . to tell R that we are indexing the previous data element in the pipeline, which is the tib tibble.
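One way the pipeline described above might be written is sketched below. It assumes the dplyr package is installed, which supplies both %>% and as_tibble(). The exact aggregate() call is a reconstruction from the description, not the original listing, and the name result is illustrative.

```r
library(dplyr)   # provides %>% and as_tibble()

tib <- as_tibble(iris)

result <- tib %>%
  .[tib$Sepal.Length > 6, ] %>%              # dot = data from the previous step
  aggregate(cbind(Petal.Length, Petal.Width) ~ Species,
            data = ., FUN = mean)            # dot passed explicitly as data
print(result)
```

No setosa flower in iris has a Sepal.Length above 6, so the result contains rows only for versicolor and virginica.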
In tidyverse the different packages are designed to expect tidy data. This has a specific meaning.
Making your data tidy has the added advantage of enforcing a common structure to each dataset. This makes them easier to understand and work with.
The tidyr package contains a variety of functions to take data and manipulate it into a tidy form. This is also known as data preprocessing or data wrangling. Once you start working with real data, you'll quickly discover it is rarely tidy, so a set of functions that make it easy to wrangle raw data into a tidy form is invaluable. tidyr provides functions to handle the following common cases.
NA or implicitly identified by simply not being present in the dataset.
To handle variables spread over multiple columns or observations stored in multiple rows, we pivot the data using pivot_longer() or pivot_wider(), respectively. Consider the following dataset.
Here, year should be one column, and the value for a (country,year) pair should be a second column. Instead, individual years have been stored as separate columns. To correct this, we use pivot_longer().
pivot_longer() requires you to define which columns are values and not variables (the years 1999 and 2000, in our example), the name of the new column to move the column names to, and the name of the new column to move corresponding column values to. Assuming tib holds the original data, the following pivot_longer() command re-organizes the tibble to have separate year and cases columns.
Notice that the original columns `1999` and `2000` are replaced with year and cases. This makes the tibble longer, which explains the name of the pivot command. pivot_longer has converted cases to a double, but year is a character. If you want it to be numeric, you can use the names_transform argument.
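A sketch of the pivot is given below, assuming the tidyr and tibble packages are installed. The country names and case counts are illustrative stand-ins for the dataset described above, and the name long is an assumption. names_transform is given a named list so that the new year column is converted to integer.

```r
library(tidyr)
library(tibble)

# Illustrative untidy data: one column per year
tib <- tibble(
  country = c("Afghanistan", "Brazil", "China"),
  `1999`  = c(745, 37737, 212258),
  `2000`  = c(2666, 80488, 213766)
)

long <- pivot_longer(
  tib,
  cols            = c(`1999`, `2000`),        # columns that are values, not variables
  names_to        = "year",                   # new column for the old column names
  values_to       = "cases",                  # new column for the old cell values
  names_transform = list(year = as.integer)   # convert year from character
)
print(long)
```

Three countries and two year columns give a tibble with six rows, one per (country,year) pair.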
The opposite situation occurs when data for a single observation or sample is spread over multiple rows. In this case, pivot_wider() is used to collect data for a common observation and store it in a single tibble row. Consider the following tibble, with data for individual countries spread over four separate rows.
Here, we need Cases and Population to be columns, rather than separate rows. To do this, we use pivot_wider() and provide the columns to take the new variable names from (type in the current tibble) and the column to take the corresponding values from (count).
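The call can be sketched as follows, assuming the tidyr and tibble packages are installed. The column names type and count come from the text, while the country, the year column, and the numbers are illustrative assumptions.

```r
library(tidyr)
library(tibble)

# Illustrative data: each (country, year) observation spans two rows
tib <- tibble(
  country = c("Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan"),
  year    = c(1999, 1999, 2000, 2000),
  type    = c("Cases", "Population", "Cases", "Population"),
  count   = c(745, 19987071, 2666, 20595360)
)

wide <- pivot_wider(
  tib,
  names_from  = type,    # values in type become the new column names
  values_from = count    # values in count fill the new columns
)
print(wide)
```

The four input rows collapse into two, one per year, each with its own Cases and Population column.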
Another extremely common issue with raw data is missing values. These can either be explicit, where NA or some other delimiter indicates a value is missing, or implicit, where the value is simply not present in any form in the dataset. This leads to two important questions. First, how can we convert implicit missing values into an explicit form so we know they exist? Second, how should we manage explicit missing values? Consider the following tibble with an explicit missing value for 2015, q4 (denoted as NA) and an implicit missing value for 2016, q1 (not present in the tibble).
To identify implicit missing values we can use pivot_wider() to put qtr in one column and individual years in additional columns. This will automatically insert NA into the tibble for any (qtr,year) pair without a value.
Now the first row shows the (previously implicit) NA for q1 2016. However, we know from previous discussion that this is not a tidy representation of the data. Year values should be in an individual column, not stored as separate columns. As before, we can correct this with a follow-on pivot_longer().
Notice we have also opted to remove NAs from the final tibble with the option values_drop_na = TRUE in the pivot_longer() function.
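The full round trip, widen to expose the implicit NA and then lengthen while dropping NAs, might look like the sketch below. It assumes tidyr and tibble are installed. The column name value and the numbers are hypothetical; only the pattern of missing entries (2015 q4 explicit, 2016 q1 implicit) follows the text.

```r
library(tidyr)
library(tibble)

tib <- tibble(
  year  = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr   = c(1, 2, 3, 4, 2, 3, 4),                      # no row for 2016, q1
  value = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)    # explicit NA for 2015, q4
)

# Widen: every (qtr, year) pair gets a cell, so the implicit gap becomes NA
wide <- pivot_wider(tib, names_from = year, values_from = value)
print(wide)

# Lengthen again into tidy form, this time dropping all NAs
long <- pivot_longer(
  wide,
  cols           = c(`2015`, `2016`),
  names_to       = "year",
  values_to      = "value",
  values_drop_na = TRUE
)
print(long)
```

Leaving out values_drop_na = TRUE would instead keep both missing values as explicit NA rows, which is often preferable when the gaps themselves are of interest.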
dplyr
Once you have your data in a "tidy" format, you can manipulate it with functions from the dplyr package. dplyr allows you to select, filter, modify, reorder, and summarize tibbles to focus and rearrange them, and to add new columns based on values in existing columns. A set of single table verbs is defined to manipulate rows, columns, or groups of rows.
filter(): choose rows based on column value(s).
slice(): choose rows based on location.
arrange(): change row order.
select(): choose a subset of columns.
rename(): rename a column.
mutate(): change the values of a column or create new column(s).
relocate(): change the order of columns.
summarize(): collapse a group of rows into a single row.
The code block below demonstrates each of these dplyr functions applied to the starwars tibble.
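The following is a short sketch of the eight verbs, assuming the dplyr package (which ships the starwars tibble) is installed. The particular columns, conditions, and result names are choices made for illustration and may differ from the original listing.

```r
library(dplyr)

# filter(): rows whose species is "Droid"
droids <- filter(starwars, species == "Droid")

# slice(): the first five rows by position
first_five <- slice(starwars, 1:5)

# arrange(): tallest characters first
by_height <- arrange(starwars, desc(height))

# select(): keep three columns
basics <- select(starwars, name, height, mass)

# rename(): new_name = old_name
renamed <- rename(starwars, home = homeworld)

# mutate(): add a body mass index column (height is in cm)
with_bmi <- mutate(basics, bmi = mass / (height / 100)^2)

# relocate(): move species to sit right after name
moved <- relocate(starwars, species, .after = name)

# summarize(): collapse all rows into a single row
avg <- summarize(starwars, mean_height = mean(height, na.rm = TRUE))

print(droids)
print(with_bmi)
print(avg)
```

Every verb takes a tibble as its first argument and returns a new tibble, which is what allows them to be chained with %>%.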
Although there are additional functions in the tidyverse foundation packages, we will not continue exploring them in this introduction. The basics of data wrangling, data loading, and data manipulation have all been covered through base R and the tidyr, dplyr, and stringr packages. Additional information is available either online or in the "R for data science" textbook if you need to make use of any of the other packages.
We have not explored ggplot2 for visualization. This will be done later in the fall semester during the visualization module. We discuss building dashboards with R+shiny, which starts with a detailed overview of ggplot2.
R Markdown is a notebook-like interface that allows you to combine plain text, R code, and results from executing the code in a single file. R Markdown files are edited in RStudio, and are meant to provide both analysis in R and explanations of the analysis in a corresponding text block.
An R Markdown file uses markdown syntax, a simple way of annotating a text file to provide basic formatting. For example, to "mark" text as a header, you precede text with hashtags, one for each heading level you want (e.g., # heading level 1, ## heading level 2, and so on). Bold text is surrounded by double asterisks, and italic text is surrounded by single underscores (e.g., **bold** and _italic_). R Markdown supports headers of different levels; bold and italic text; blockquotes; ordered and unordered lists; images; tables; URLs; and R code blocks.
pandoc
markdown-to-markdown converter
Once your R Markdown file is complete and running correctly, you knit the markdown file (Ctrl+Shift+K or the knit button at the top of the R Markdown editing pane). This converts the markdown file into an HTML document, a PDF document, or Microsoft Word output. The output is a formatted version of the markdown file with all code blocks executed and their output included. This is meant as a way to share the results of executing the R Markdown file.
When you create a new R Markdown file or load a markdown file (R Markdown files normally have a file extension of Rmd), the programming window in RStudio switches to a notebook format, showing code blocks with light grey backgrounds and explanatory text and code output with white backgrounds. Each code block has three icons in the upper-right corner: a gear to set options for the code block, a downward arrow and rectangle to execute all code blocks prior to the given code block, and a rightward arrow to execute the code block. This highlights an important property of R Markdown files: when a code block is executed, any variables or other results it generates are saved for use in subsequent code blocks. This means you often need to run previous code blocks to set up the environment to have the information needed to run the current code block. It also means the code blocks roughly correspond to an R program, divided into logical "code blocks" to divide the program into meaningful chunks of execution.
Every R Markdown file starts with a YAML (originally Yet Another Markup Language) header that normally defines the markdown file's title, author, and date of modification, although additional information can be included if needed. YAML headers start and end with three consecutive hyphens.
RStudio will automatically insert a default YAML header with Title, Author, and Date when you create a new R Markdown file. Given this introduction, use the following steps in RStudio to create a default R Markdown file.
echo option in the previous code block.
At this point you're ready to start building your project. Usually, this involves editing the first R code block to set any project global options you need, changing the text in the first markdown block to replace the default text with something appropriate to your project, then erasing everything after the first R code block. The first code block can then be modified to perform the operations you want to execute at the beginning of your project.
R code blocks are surrounded by triple backticks ``` and ```. Between these delimiters you first name the code block and specify any options specific to the block, then enter the code to execute within the block.
Executing this block would produce the following output block.
A tibble: 10 × 14 |
Like Jupyter Notebook, R Markdown code blocks maintain notebook state. This means you could load tidyverse in one code block, then use it in a subsequent block. However, this also means if you change a code block you need to re-run it for that change to take effect and propagate to follow-on code blocks. A common problem programmers have is remembering which code blocks they've changed, or the fact that the code block they're running isn't working because a previous code block was updated but not re-run. R will not tell you this. Consider, for example, an initial code block that loads data into a tibble from a file. This tibble is used in follow-on code blocks. If you change the file used to populate the tibble but forget to execute the code block, follow-on code blocks will continue to work, but they will be using old data because the tibble's contents have not been updated. Take care to make sure you're working in an up-to-date environment when using R Markdown.