Introduction

In this tutorial we will investigate the R programming language. Like any programming language, R is used to define a sequence of instructions that combine to solve a problem or produce a desired result. Computer programs range from very simple examples with only a few lines of source code to very complicated systems. For example, Photoshop CS 6 is estimated to contain around 4.5 million lines of code.

In general, programming involves the following steps:

  1. A program is written in some programming language (e.g., C++, Python, R, or JavaScript), usually stored in one or more source code files.
  2. For compiled languages like C or C++, the source code is converted into machine language by a compiler, then combined into an executable file by a linker. The executable file is run on a target machine and operating system.
  3. For interpreted languages like Python and R, the source code is processed by an interpreter. Individual lines are converted to machine language and executed one by one, as they are encountered within the source code.

This tutorial will provide an introduction to programming in R, an interpreted programming language. R was conceived by Ross Ihaka and Robert Gentleman at the University of Auckland in 1992. R 1.0 was released in 2000, and today R is developed by the R Core Development Team. We will be using R version 4.4.x.

Since R is an interpreted language, individual lines of code are converted to machine language and executed as they are encountered (versus a compiled language, which converts an entire program to machine code in an explicit compile–link stage). One advantage of an interpreted language is the ability to enter individual commands on a command line prompt, and immediately see their results.

% print( 7 + 3 )
[1] 10
% l <- c( "a", "b", "c" )
% print( l )
[1] "a" "b" "c"
% print( l[ 1 ] )
[1] "a"
% zip_code <- c( "healey"=27695 , "rappa"=27606 )
% print( zip_code )
healey  rappa
 27695  27606
% print( zip_code[ "healey" ] )
healey
 27695
% zoo <- c( "ocelot", "aardvark", "puma", "aardvark", "dalmatian", "ocelot" )
% animals <- unique( zoo )
% print( animals )
[1] "ocelot"    "aardvark"  "puma"      "dalmatian"
% print( "dalmatian" %in% animals )
[1] TRUE
% print( "parrot" %in% animals )
[1] FALSE

Unless we're only issuing a few commands once or twice, we normally store the commands in source code files. This allows us to load and execute the commands as often as we want. It also makes it easier to modify a program, or correct it when we discover errors. A version of the above code is available in the source code file tut-01-intro.R.
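As a minimal sketch of this workflow, the commands below write two statements to a temporary file and then run them with source(). The temporary file and its contents are illustrative assumptions, not the tut-01-intro.R file mentioned above.

```r
# Write a tiny two-line program to a temporary .R file (hypothetical contents)
script <- tempfile( fileext=".R" )
writeLines( c( 'zoo <- c( "ocelot", "puma", "ocelot" )',
               'animals <- unique( zoo )' ), script )

# source() reads the file and executes its statements in order
source( script )
print( animals )
```

Editing the file and calling source() again re-runs the whole program, which is usually easier than re-typing each command at the prompt.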

Running R

For this class, we'll be using RStudio Desktop, an IDE (integrated development environment) that works together with an installed copy of R. The most recent version of R can be downloaded online, as can RStudio Desktop. The base installation of R includes many of the standard packages we will use throughout this tutorial. We will also show you how to install and load any additional packages you might need in your R programs.

Why R?

R is open source and used extensively in the academic and industrial world for statistical analysis. Its core architecture is maintained by the R Core Development Team, and various individuals have created packages to extend its capabilities to support many complex statistical algorithms. For example, the time series and Bayesian capabilities of R far exceed those found in other, more general languages like Python.

In the general pipeline of operations to convert raw data into a set of analytical results, we often assume the steps of collection & storage → preprocessing → analysis → presentation. R falls within the analysis step of this pipeline.

Labs and Project

Each day one lab will be available for you to use to test your knowledge of R. The lab opens at the beginning of the given day, and closes at 11:59pm on the same day.

Your homework team will also be required to complete a project. The project will be submitted after the R instruction is complete, allowing you to work on it with full knowledge of the material covered in Introduction to R. Dates you can submit the project are listed on the Moodle web page. Pay careful attention to the description of the project. You MUST submit the project as an R Markdown file.

You are free to use whatever resources you want to complete the labs and project, but remember, these are meant to help you gain proficiency in R. During the R assessment, you will not have time to read notes or query the Internet and complete all the questions you are asked. Because of this, I strongly encourage you to learn the material in the labs and project independent of external resources. It is fine to look at things like specific function names or explanations of arguments to a function, but do not expect to be able to read how a topic works and still have time to complete the assessment.

Grades for the labs and the project will count, along with your assessment, towards your final grade for the R component of the Summer 2 module. For the Introduction to R labs and project, the following grade breakdown will be used.

Installing Packages

R comes pre-installed with numerous basic packages. However, you will eventually need packages that are not included in the base installation. When this happens, you will need to install the package using install.packages( "package-name" ) where package-name is the name of the package you want to install. An example is tidyverse, which is not included with the base R installation. We can include it by issuing the command install.packages( "tidyverse" ) at the R command line.

To list all installed packages in your R installation, you can use the command installed.packages()[, c(1,3:4) ]. To determine if a package is installed, use the command "package-nm" %in% installed.packages()[, c(1) ], which will return TRUE or FALSE depending on whether a package with the name "package-nm" is installed.

% installed.packages()[, c(1,3:4) ]
             Package        Version    Priority
AsioHeaders  "AsioHeaders"  "1.22.1-2" NA
askpass      "askpass"      "1.2.0"    NA
backports    "backports"    "1.5.0"    NA
base64enc    "base64enc"    "0.1-3"    NA
…
tcltk        "tcltk"        "4.4.0"    "base"
tools        "tools"        "4.4.0"    "base"
translations "translations" "4.4.0"    NA
utils        "utils"        "4.4.0"    "base"
% "tidyverse" %in% installed.packages()[, c(1) ]
…
[1] FALSE
% install.packages( "tidyverse" )
% "tidyverse" %in% installed.packages()[, c(1) ]
…
[1] TRUE
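The check above can be wrapped in a small helper so a package is installed only when it is missing. This is a sketch: is_installed() is a hypothetical helper name, not part of R, and "no.such.pkg" is a made-up package name assumed not to exist on your machine.

```r
# Hypothetical helper: TRUE if a package with this name is installed
is_installed <- function( pkg ) {
  pkg %in% rownames( installed.packages() )
}

print( is_installed( "stats" ) )        # stats ships with every R installation
# [1] TRUE
print( is_installed( "no.such.pkg" ) )  # made-up name, assumed not installed

# Install only when missing (commented out so the sketch has no side effects)
# if ( !is_installed( "tidyverse" ) ) install.packages( "tidyverse" )
```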

Finally, if you only want to see the packages you have attached with the library() command, or those that are automatically attached to every R session, use the command search().

% search()
 [1] ".GlobalEnv"        "package:lubridate" "package:forcats"
 [4] "package:stringr"   "package:dplyr"     "package:purrr"
 [7] "package:readr"     "package:tidyr"     "package:tibble"
[10] "package:ggplot2"   "package:tidyverse" "package:stats"
[13] "package:graphics"  "package:grDevices" "package:utils"
[16] "package:datasets"  "package:methods"   "Autoloads"
[19] "package:base"
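To see how library() changes the search path, the sketch below attaches the tools package, which is installed with R but normally not attached at startup, and compares search() before and after. If tools is already attached in your session, the difference will simply be empty.

```r
before <- search()
library( tools )        # installed with R, usually not attached by default
after <- search()

# Entries that library() added to the search path
print( setdiff( after, before ) )
print( "package:tools" %in% after )
# [1] TRUE
```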

Variables

Every programming language provides a way to maintain values as a program runs. Usually, this is done by creating a named variable, then assigning a value to the variable. In R, variables are created interactively by specifying their name and assigning them an initial value.

% name <- "Calvin Coolidge"
% height <- 5.10
% age <- 60
% birthplace <- "Plymouth Notch VT"
% born <- "July 4 1872"
% deceased <- "January 5 1933"

Unlike languages like C++, R does not require you to specify a variable's type. This is inferred from the value it maintains. In the above example the variables name, birthplace, born, and deceased are inferred to be characters, and height and age are inferred to be numeric. One advantage of R's dynamically typed variables is that you can change them to hold different types of values whenever you want. You can also ask R what type of value a variable contains with the class() function.

% name <- "Calvin Coolidge"
% print( class( name ) )
[1] "character"
% name <- 25
% print( class( name ) )
[1] "numeric"
% name <- 6.3
% print( class( name ) )
[1] "numeric"
% name <- 305127925769258727938193819283
% print( class( name ) )
[1] "numeric"
% name <- FALSE
% print( class( name ) )
[1] "logical"

For a variety of reasons, including the object-oriented abilities of R and its heritage as an extension of S and S-Plus, there are many different ways to ask about a variable's "type" from different perspectives. Some will provide the same answer and some will differ, depending on the variable being queried. Consider the following example.

% val <- 3.14
% print( class( val ) )
[1] "numeric"
% print( typeof( val ) )
[1] "double"
% str( val )    # Don't use print() b/c str() returns NULL, and
%               # that will be printed along with str()'s output
 num 3.14
% print( mode( val ) )
[1] "numeric"

class() returns the variable's "type" from an object-oriented point of view. typeof() returns the type from R's point of view. str() displays a compact representation of the structure of the variable. Finally, mode() returns the type as defined in the Becker–Chambers–Wilks reference, a coarser grouping in which, for example, integer and double values are both reported as numeric. If you find this confusing, there are even more ways to inspect a variable's internal structure. Most programmers recommend using str() as the default method to inspect a variable's type.
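To see where these functions agree and where they differ, the sketch below applies class(), typeof(), and mode() to a few sample values. The values are chosen only for illustration; 25L uses R's L suffix to request an integer rather than a double.

```r
vals <- list( 25, 25L, "abc", TRUE, 3+9i )

# One line of output per value, showing all three "type" answers
for ( v in vals ) {
  cat( deparse( v ), "-> class:", class( v ),
       " typeof:", typeof( v ), " mode:", mode( v ), "\n" )
}

# typeof() for every value, collected into a character vector
types <- sapply( vals, typeof )
print( types )
```

Notice that 25 and 25L share the mode numeric, but typeof() distinguishes them as double and integer.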

Here is a quick list of some of R's basic variable types. More complicated types will be discussed later in the tutorial.

A note on vectors. In many languages, a vector's components maintain their original type when they are added to the vector. For example, you might expect c( "Hello", 3.14, 3+9i ) to contain three types, respectively: a character (or string), a numeric (or double), and a complex. In R, however, all the entries are converted to a common type based on a type hierarchy where R chooses the highest type of variable in the vector. The type hierarchy from highest to lowest is expression > list > character > complex > double > integer > logical > raw. Given this, our vector c( "Hello", 3.14, 3+9i ) made up of character, double, and complex has the highest type hierarchy of character, so all entries are converted to characters.

% vec <- c( "Hello", 3.14, 3+9i )
% print( class( vec ) )
[1] "character"
% print( vec )
[1] "Hello" "3.14"  "3+9i"
% vec <- c( 3+9i, 6 )
% print( class( vec ) )
[1] "complex"
% print( vec )
[1] 3+9i 6+0i
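The same hierarchy applies lower down the chain. The following sketch, using values chosen only for illustration, mixes logical, integer, double, and character entries to show which type wins in each case.

```r
print( class( c( TRUE, 2L ) ) )   # logical + integer
# [1] "integer"
print( class( c( 2L, 3.5 ) ) )    # integer + double
# [1] "numeric"
print( c( TRUE, 3.5 ) )           # TRUE is converted to 1
# [1] 1.0 3.5

mixed <- c( 1, "a", TRUE )        # character is highest, so everything becomes text
print( class( mixed ) )
# [1] "character"
```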

Variable Practice Problem

Write a set of R statements that assign the associated group names for the following animals: Beaver: colony; Crow: murder; Parrot: pandemonium; and Porcupine: prickle to R variables, then print four lines listing each animal and corresponding group name.

I recommend you write your program using RStudio, save it as an R source code file, and then test it, rather than writing the program directly in the R shell. This will let you write your code, run it to see what it does, edit it to fix problems, and run it again, without having to re-type the entire program at the command line.

Variable Assignment Solution

% animal_0 <- "Beaver"
% group_0 <- "colony"
% animal_1 <- "Crow"
% group_1 <- "murder"
% animal_2 <- "Parrot"
% group_2 <- "pandemonium"
% animal_3 <- "Porcupine"
% group_3 <- "prickle"
%
% print( paste( animal_0, ":", group_0, sep=" " ) )
[1] "Beaver : colony"
%
% print( paste( animal_1, ":", group_1, sep=" " ) )
[1] "Crow : murder"
%
% print( paste( animal_2, ":", group_2, sep=" " ) )
[1] "Parrot : pandemonium"
%
% print( paste( animal_3, ":", group_3, sep=" " ) )
[1] "Porcupine : prickle"

You can download the solution file and run it on your machine, if you want.

Your choice of variable names is probably different than ours, and you might have printed the animal and group name with slightly different formatting. Regardless, the basic idea is to use eight separate variables to store the animals and group names, then print the contents of these variables in combinations that produce the correct output.

You might think, "This works, but it doesn't seem very efficient." That's true. Once you've learned more about R, it's unlikely you'd write this code to solve the problem. Here's a more elegant and flexible solution. When you've finished the tutorial, you'll be able to understand, and to implement, this type of code.

% db <- new.env()
% db[["colony"]] <- "Beaver"
% db[["murder"]] <- "Crow"
% db[["pandemonium"]] <- "Parrot"
% db[["prickle"]] <- "Porcupine"
% for ( nm in ls( db ) ) {    # ls() returns the keys in sorted order
%   print( paste( db[[nm]], ":", nm, sep=" " ) )
% }
[1] "Beaver : colony"
[1] "Crow : murder"
[1] "Parrot : pandemonium"
[1] "Porcupine : prickle"
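Another possible approach, sketched here with variable names of our own choosing, stores the pairs in a single named character vector, with the animal as the name and the group as the value. Because paste() is vectorized, one call builds all four output lines.

```r
group_nm <- c( "Beaver"="colony", "Crow"="murder",
               "Parrot"="pandemonium", "Porcupine"="prickle" )

# names() supplies the animals, the vector itself supplies the group names
out <- paste( names( group_nm ), ":", group_nm )
for ( ln in out ) {
  print( ln )
}
# [1] "Beaver : colony"
# [1] "Crow : murder"
# [1] "Parrot : pandemonium"
# [1] "Porcupine : prickle"
```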

Operators

R provides a set of built-in functions or operators to perform simple operations such as addition, subtraction, comparison, and boolean logic. An expression is a combination of variables, constants, and operators. Every expression has a result. Operators in R have precedence associated with them. This means expressions using operators are not evaluated strictly left to right. Results from the operators with the highest precedence are computed first. Consider the following simple R expression:

% 6 + 3 * 4 / 2 + 2
[1] 14

If this were evaluated left to right, the result would be 20. However, since multiplication and division have a higher precedence than addition in R, the result returned is 14, computed as follows: 3 * 4 = 12, then 12 / 2 = 6, then 6 + 6 = 12, and finally 12 + 2 = 14.

Of course, we can use parentheses to force a result of 20, if that's what we wanted, with the following expression:

% ((( 6 + 3 ) * 4 ) / 2 ) + 2
[1] 20

Below is a list of the common operators in R, along with an explanation of what they do. The operators are grouped according to precedence, from highest to lowest.

Operator                Description
( )                     parentheses define the order in which groups of operators should be evaluated
^                       exponentiation
+x, -x                  make positive, make negative
%%, %/%                 remainder (modulus), integer division (e.g., 5 %% 3 returns 2, 5 %/% 3 returns 1)
*, /                    multiplication, division
+, -                    addition, subtraction
<, <=, >, >=, !=, ==    less, less or equal, greater, greater or equal, not equal, equal
&, &&                   vectorized logical AND, single-value (scalar) logical AND
|, ||                   vectorized logical OR, single-value (scalar) logical OR
->, ->>                 rightward assignment
<-, <<-                 leftward assignment
=                       leftward assignment
Order of precedence for R operators
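A few short expressions make the precedence rules concrete. The examples below are our own illustrations: each pair shows the default grouping and then the result when parentheses force a different order.

```r
print( -2 ^ 2 )        # ^ binds tighter than unary minus: -( 2 ^ 2 )
# [1] -4
print( ( -2 ) ^ 2 )
# [1] 4
print( 7 %% 4 * 2 )    # %% binds tighter than *: ( 7 %% 4 ) * 2
# [1] 6
print( 7 %% ( 4 * 2 ) )
# [1] 7

# & compares vectors element by element, && combines two single values
vec_and <- c( TRUE, FALSE ) & c( TRUE, TRUE )
print( vec_and )
# [1]  TRUE FALSE
print( TRUE && FALSE )
# [1] FALSE
```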

In addition to R operators, a number of common math functions are included in the base environment.

Math Function                     Description
abs( x )                          Absolute value of x
sqrt( x )                         Square root of x
ceiling( x )                      Smallest integer greater than or equal to x
floor( x )                        Largest integer smaller than or equal to x
trunc( x )                        Truncate x toward zero
round( x, digits=n )              Round x to n digits of precision (e.g., round( 3.14159, digits=1 ) returns 3.1)
signif( x, digits=n )             Maintain n significant digits in x (e.g., signif( 3.14159, digits=3 ) returns 3.14, signif( 314.159, digits=1 ) returns 300)
cos( x ), sin( x ), tan( x )      Return cosine, sine, tangent of x (NB: x is specified in radians)
log( x )                          Return natural log of x
log10( x )                        Return log base 10 of x
exp( x )                          Return e raised to the power x
sum( x, na.rm=[TRUE | FALSE] )    Sum the values in vector x. By default NAs are not removed and sum will return NA if one or more exist in x. Specify na.rm=TRUE to ignore NAs
R Base math functions
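The rounding functions are easiest to tell apart on negative and fractional inputs. The sketch below uses sample values of our own to contrast them, and shows the effect of na.rm on sum().

```r
print( ceiling( 2.1 ) )              # rounds up
# [1] 3
print( floor( -2.1 ) )               # rounds down, away from zero for negatives
# [1] -3
print( trunc( -2.7 ) )               # drops the fractional part, toward zero
# [1] -2
print( round( 2.567, digits=2 ) )
# [1] 2.57
print( signif( 2567, digits=2 ) )
# [1] 2600

x <- c( 1, NA, 3 )
print( sum( x ) )                    # a single NA makes the whole sum NA
# [1] NA
print( sum( x, na.rm=TRUE ) )
# [1] 4
```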

Advanced Data Types

In addition to boolean and numeric variables, R provides a number of more complex types, including characters (or strings), lists, dictionaries, factors, matrices, and data frames. Using these types effectively will make you a much more efficient programmer.

character

Character variables are a sequence of one or more characters. Character values are denoted by double quotes, s <- "abraham lincoln", or single quotes, s <- 'abraham lincoln'. Because characters can be a sequence of multiple character values, they support more sophisticated operations. Here are some common operations you can perform on characters.

Here are some examples of string operations executed in an R shell.

% s <- "hello world!"
% print( nchar( s ) )
[1] 12
% print( substring( s, 7, 7 ) )
[1] "w"
% print( substring( s, 3, 8 ) )
[1] "llo wo"
% print( substring( s, 4 ) )
[1] "lo world!"
% print( substring( s, nchar( s ) - 1, nchar( s ) - 1 ) )
[1] "d"
% print( substring( s, nchar( s ) - 2 ) )
[1] "ld!"
% t <- "must.. try.. harder.."
% print( paste( s, t ) )
[1] "hello world! must.. try.. harder.."

There are some additional operations you can perform on strings, for example, toupper( s ) to capitalize a string, or strsplit( s, split=" " ) to subdivide s based on the split character. The R documentation enumerates the available string operations. As you can probably tell, base R has limited string manipulation. For a more convenient and complete set of string functions, consider a package like the stringr package, which is included with the tidyverse package (install tidyverse with the command install.packages( "tidyverse" ), then load the package with the command library( tidyverse ) to access its functionality).
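As a brief illustration of the two base R functions just mentioned, the sketch below applies toupper() and strsplit() to the same sample string used earlier. Note that strsplit() returns a list with one entry per input string, so the pieces are extracted with double brackets.

```r
s <- "hello world!"

print( toupper( s ) )
# [1] "HELLO WORLD!"

parts <- strsplit( s, split=" " )   # a list of length 1
words <- parts[[ 1 ]]               # character vector: "hello" and "world!"
print( length( words ) )
# [1] 2
```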

list

list variables are ordered sequences of values. Unlike a vector, different data types can be stored in a list, for example, numerics, characters, or even lists themselves.

Lists are known as recursive objects in R, whereas vectors are considered atomic. This affects how different functions process their arguments. Another interesting property of lists is that they are often made up of names and corresponding values (NB: vectors can use name–value association pairs as well, although their keys and values will use the type hierarchy to ensure a common type). This is similar to dictionaries, discussed below, which store key–value pairs in ways that make searching for specific keys and their values very efficient. Lists are less efficient, unfortunately, since matching a name to its corresponding value requires searching through the list's names.

Because lists can contain different types of values, the way you index into a list to get or set values is different than vectors. If we create a vector v <- c( 1, 2, 3 ) then we can retrieve the i-th element using v[ i ], for example, v[ 2 ] which returns 2. Below are examples of creating, accessing, and examining the structure of a list in different ways.

% l <- list("values"=sin(1:3), "ids"=letters[1:3], "sub"=list("foo"=42,"bar"=13), "greeting"="Hello")
% str( l )
List of 4
 $ values  : num [1:3] 0.841 0.909 0.141
 $ ids     : chr [1:3] "a" "b" "c"
 $ sub     :List of 2
  ..$ foo: num 42
  ..$ bar: num 13
 $ greeting: chr "Hello"
% print( l$values )
[1] 0.8414710 0.9092974 0.1411200
% print( l[["ids"]] )
[1] "a" "b" "c"
% print( l$sub$foo )
[1] 42
% print( l[[c(3,2)]] )
[1] 13

As with strings, there are many additional operations you can perform on lists. To add an entry to a list, you can specify either a new name or an index one past the end of the list: l["Greeting"] <- "Hello" or l[length(l) + 1] <- "Hello". Notice that in the second case with an index value, the new list entry will have no name associated with it. To remove an entry, set it to NULL: l[4] <- NULL. To query the names of each entry, use the names() function: names(l). Additional operations exist to modify list elements and convert them into different data types.
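The sketch below walks through adding and removing entries on a small list of our own (not the list l defined earlier), so you can see how names and length change at each step.

```r
l <- list( "values"=c( 1, 2 ), "greeting"="Hello" )

l[[ "farewell" ]] <- "Goodbye"    # add by name
l[[ length( l ) + 1 ]] <- 99      # add by position; the new entry's name is empty
nm_after_add <- names( l )        # "values", "greeting", "farewell", ""
print( nm_after_add )

l[[ 4 ]] <- NULL                  # remove the unnamed fourth entry
print( length( l ) )
# [1] 3
```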

A note on lists. One important property of a list is the type of result returned during indexing with square brackets. l["greeting"] and l[["greeting"]] both return "Hello", but if we look at their class, we see that class(l["greeting"]) is "list", whereas class(l[["greeting"]]) is "character". Single brackets return a sublist. Double brackets return a value. Consider the following code to see if you understand how this works.

% sublist <- l[ 1 ]    # Return first entry of l as list
% val <- l[[ 1 ]]      # Return first entry of l as value
% str(sublist)
List of 1
 $ values: num [1:3] 0.841 0.909 0.141
% str(val)
 num [1:3] 0.841 0.909 0.141
% sublist[1]
$values
[1] 0.8414710 0.9092974 0.1411200

% val[1]
[1] 0.841471

dict

Dictionary variables are a collection of key–value pairs. This is meant to be analogous to a real dictionary, where the key is a word, and the associated value is the word's definition. Dictionaries are designed to support efficient searching for elements in the dictionary based on key.

In R, a specific dictionary data type does not exist. However, it is possible to use R environments to mimic dictionaries, since environments are built using a standard dictionary data structure. The Variable Assignment Solution above shows how to use environments to do this. Semantically, however, this can be confusing, particularly for new R users. An alternative is to use a package that provides hash tables. Numerous packages exist, but we will use the r2r package, which provides both hashmap (hash tables) and hashset, an implementation of a mathematical set which can also be efficiently programmed using a hash table.

To install the r2r package, you will need to install the devtools package and use its GitHub interface to download and install r2r.

% install.packages( "devtools" )               # Install devtools package
%                                              # Messages as devtools installs
%
% devtools::install_github( "vgherard/r2r" )   # Install r2r using devtools
%                                              # Messages as r2r installs
%
% library( r2r )                               # Make r2r available

By design, dictionaries have one important requirement: every value you store in a dictionary must have its own, unique key. For example, we could not store a person's address using their last name as a key, because if two different people had the same last name, only one of their addresses could be saved in the dictionary.

To re-implement the animal–group name problem using r2r's hashmap, we would use the following R code.

% library(r2r)
% group_nm <- hashmap()
% insert( group_nm, "Beaver", "colony" )
% insert( group_nm, "Crow", "murder" )
% insert( group_nm, "Parrot", "pandemonium" )
% insert( group_nm, "Porcupine", "prickle" )
% print( keys( group_nm ) )
[[1]]
[1] "Beaver"

[[2]]
[1] "Parrot"

[[3]]
[1] "Porcupine"

[[4]]
[1] "Crow"

% print( values( group_nm ) )
[[1]]
[1] "colony"

[[2]]
[1] "pandemonium"

[[3]]
[1] "prickle"

[[4]]
[1] "murder"

% query( group_nm, "Beaver" )
[1] "colony"
% query( group_nm, "Flamingo" )    # It's flamboyance, but not in dict
NULL
% print( group_nm %has_key% "Beaver" )
[1] TRUE
% print( group_nm %has_key% "Flamingo" )
[1] FALSE

After loading r2r, the first statements create a dictionary variable named group_nm and insert the four animal–group name pairs. The next two statements ask for the keys and corresponding values stored in group_nm. Next, we query the value associated with the key Beaver, which is in group_nm, and Flamingo, which is not. Notice that the query for Flamingo returns NULL to indicate the key is not in group_nm. The final two lines show how to use the operator %has_key% to see if a given key is in the dictionary. The full documentation for r2r is available online.

Technical Aside: Dictionaries are a very powerful data structure. If you need to perform efficient search, if the ordering of the elements isn't critical, and if you can define a key for each of the elements you're storing, a hashmap might be a good candidate.

What does it mean to say "Dictionaries are fast?" In computing terms, we measure speed using order notation \(\textrm{O}\). Lookup, insertion, and deletion in a dictionary are \(\textrm{O}(1)\) on average, but lookup and deletion in a list or vector are \(\textrm{O}(n)\) for a set of \(n\) values. In simple terms, this means the time for an operation on a dictionary is constant no matter how big the dictionary is, but the time for an operation on a list or vector is proportional to the size of the list or vector. If you double a list's size, on average it takes about twice as long to find or delete a value.
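To make the contrast concrete, the sketch below stores the same 1,000 keys two ways: in an environment, which hashes its keys, and in a plain character vector that is searched one position at a time. The key names and the linear_lookup() helper are hypothetical, written only for this illustration, and no timings are claimed.

```r
n <- 1000
key_nm <- paste0( "key", 1:n )

# Dictionary-style storage: the environment hashes each key
dict <- new.env( hash=TRUE )
for ( i in 1:n ) {
  assign( key_nm[ i ], i, envir=dict )
}

# Linear search: compare against each name in turn until one matches
linear_lookup <- function( nms, vals, key ) {
  for ( i in seq_along( nms ) ) {
    if ( nms[ i ] == key ) {
      return( vals[ i ] )
    }
  }
  NULL    # key not found
}

print( get( "key1000", envir=dict ) )             # one hashed lookup
# [1] 1000
print( linear_lookup( key_nm, 1:n, "key1000" ) )  # 1,000 comparisons for the last key
# [1] 1000
```

The linear version must examine every name before reaching the last key, which is why its cost grows with \(n\).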

factor

A factor is a variable that takes its values from a finite set of categories, or levels. Factors are used when a variable must be set to one of a fixed set of allowable categories.

% v <- c("female", "male", "male", "female" )
% gender <- factor( v )
% print( gender )
[1] female male   male   female
Levels: female male

Notice that R has examined the vector we convert to a factor, and automatically inferred all the unique values to determine the levels of the factor. It is possible to define the levels directly, for example, in cases where levels exist that are not part of the initial vector being converted to a factor.

% v <- c("female", "male", "male", "female" )
% gender <- factor( v, levels=c("female", "male", "transgender") )
% print( gender )
[1] female male   male   female
Levels: female male transgender

Internally, factors are stored as integers starting at 1. Each integer corresponds to one of the factor's category values.

% v <- c("female", "male", "male", "female" )
% gender <- factor( v, levels=c("female", "male", "transgender") )
% str( gender )
 Factor w/ 3 levels "female","male",..: 1 2 2 1
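You can inspect the integer codes directly with as.integer(), and map them back to category names by indexing into levels(). The following sketch illustrates this with the same sample vector.

```r
v <- c( "female", "male", "male", "female" )
gender <- factor( v )

codes <- as.integer( gender )   # the stored integer codes
print( codes )
# [1] 1 2 2 1

# Indexing the levels by code recovers the original character values
recovered <- levels( gender )[ codes ]
print( identical( recovered, v ) )
# [1] TRUE
```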

If you want to name the initial values in the vector being converted to a factor, you can optionally specify labels for each category. For example, suppose we had four possible birth cities.

  1. Dublin
  2. London
  3. Sofia
  4. Ponteverdra

Consider the following code snippet, which first uses the city values directly, then maps a label to each city value.

% city <- c(3, 2, 1, 4, 3, 2)
% f <- factor( city )
% str( f )
 Factor w/ 4 levels "1","2","3","4": 3 2 1 4 3 2
% city_nm <- c( "Dublin", "London", "Sofia", "Ponteverdra" )
% f <- factor( f, labels=city_nm )
% str( f )
 Factor w/ 4 levels "Dublin","London",..: 3 2 1 4 3 2
% print( f )
[1] Sofia       London      Dublin      Ponteverdra Sofia       London
Levels: Dublin London Sofia Ponteverdra

Once a factor variable is created, it is possible to update or extend its levels, for example, by executing levels(f) <- c(levels(f), "Zurich"). This will change f's levels to Dublin London Sofia Ponteverdra Zurich.
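Extending the levels matters because a factor only accepts values that are already among its levels. The sketch below, using a shortened city factor of our own, adds Zurich as a level and then assigns it to an existing position.

```r
f <- factor( c( "Sofia", "London", "Dublin" ),
             levels=c( "Dublin", "London", "Sofia" ) )

levels( f ) <- c( levels( f ), "Zurich" )   # extend the allowable categories
f[ 2 ] <- "Zurich"                          # valid now that Zurich is a level

print( as.character( f ) )
print( nlevels( f ) )
# [1] 4
```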

You can also add values to a factor and automatically have its levels update using append(), similar to vectors and lists. All the arguments to append() must be factors for this to work, however. If you simply try to append() any new value to a factor, it converts to a vector.

% city <- c(3, 2, 1, 4, 3, 2)
% city_nm <- c( "Dublin", "London", "Sofia", "Ponteverdra" )
% f <- factor( city, labels=city_nm )
% print( f )
[1] Sofia       London      Dublin      Ponteverdra Sofia       London
Levels: Dublin London Sofia Ponteverdra
% new_f <- append( f, as.factor( c( "Zurich", "Berlin" ) ) )
% print( new_f )
[1] Sofia       London      Dublin      Ponteverdra Sofia       London
[7] Zurich      Berlin
Levels: Dublin London Sofia Ponteverdra Berlin Zurich
% new_f <- append( f, c( "Zurich", "Berlin" ) )
% print( new_f )
[1] "3"      "2"      "1"      "4"      "3"      "2"      "Zurich" "Berlin"

Matrix

A matrix is a 2-dimensional data table with rows and columns. Like a 1-dimensional vector, all the values in a matrix must be of the same type. Since matrix data is normally provided as a vector, the same type conversion hierarchy for vectors will be used to convert all data to a common type if different types of data are provided. A matrix is created by optionally specifying its data and its number of rows and columns.

% mat <- matrix( c( 1, 2, 3, 4, 5, 6 ), nrow=3, ncol=2 )
% print( mat )
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
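Notice that the data vector fills the matrix one column at a time. If your data is listed row by row instead, matrix() accepts a byrow argument. The short sketch below contrasts the two fill orders on the same input.

```r
by_col <- matrix( 1:6, nrow=3, ncol=2 )               # default: fill columns first
by_row <- matrix( 1:6, nrow=3, ncol=2, byrow=TRUE )   # fill rows first

print( by_col[ 1, ] )   # first row when filled by column
# [1] 1 4
print( by_row[ 1, ] )   # first row when filled by row
# [1] 1 2
```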

Matrix items are accessed via indexing, providing either a row, a column, or both. The row and column specifiers can be vectors, allowing you to retrieve multiple rows or columns at once.

% print( mat[ 1, 1 ] )
[1] 1
% print( mat[ 2, ] )
[1] 2 5
% print( mat[ ,1 ] )
[1] 1 2 3
% print( mat[ c(2,3), ] )
     [,1] [,2]
[1,]    2    5
[2,]    3    6

New rows and columns can be bound to an existing matrix using rbind() and cbind().

% mat <- cbind( mat, c( 7, 8, 9 ), c( 10, 11,12 ) )
% print( mat )
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
% mat <- rbind( mat, c( -1, -2, -3, -4 ) )
% print( mat )
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
[4,]   -1   -2   -3   -4

You can also remove rows and columns using -c( i ) where i identifies the row or column you want to remove.

% mat <- mat[ -c( 2 ), ]
% print( mat )
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    3    6    9   12
[3,]   -1   -2   -3   -4
% mat <- mat[ ,-c( 4 ) ]
% print( mat )
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    3    6    9
[3,]   -1   -2   -3

The dim() function returns the number of rows and columns in a matrix, and the length() function returns the total number of values stored in a matrix.

% dim( mat )
[1] 3 3
% length( mat )
[1] 9

Finally, the most useful property of matrix variables is the ability to use them to perform linear algebra operations. These can be element-wise, where both matrices have the same number of rows and columns, or matrix-wise, where operations like matrix multiplication are applied. Transpose, inverse, and determinant functions are also available either directly within R or through the pracma package.

% mat_a <- matrix( c( 3, 2, 5, 2, 3, 2, 5, 2, 4 ), nrow=3, ncol=3 )
% mat_b <- matrix( c( 3, 2, 8, 6, 3, 2, 5, 2, 4 ), nrow=3, ncol=3 )
%
% # element-wise operations
%
% print( mat_a + mat_b )
     [,1] [,2] [,3]
[1,]    6    8   10
[2,]    4    6    4
[3,]   13    4    8
% print( mat_a / mat_b )
      [,1]     [,2] [,3]
[1,] 1.000 0.333333    1
[2,] 1.000 1.000000    1
[3,] 0.625 1.000000    1
%
% # matrix multiplication
%
% print( mat_a %*% mat_b )
     [,1] [,2] [,3]
[1,]   53   34   39
[2,]   28   25   24
[3,]   51   44   45
%
% # transpose, determinant, solve (inverse with given args)
%
% print( t( mat_a ) )
     [,1] [,2] [,3]
[1,]    3    2    5
[2,]    2    3    2
[3,]    5    2    4
% print( det( mat_b ) )
[1] -28
% print( solve( mat_b ) )
           [,1] [,2]       [,3]
[1,] -0.2857143  0.5  0.1071429
[2,] -0.2857143  1.0 -0.1428571
[3,]  0.7142857 -1.5  0.1071429
%
% # inverse from pracma
%
% library( pracma )
% print( inv( mat_b ) )
           [,1] [,2]       [,3]
[1,] -0.2857143  0.5  0.1071429
[2,] -0.2857143  1.0 -0.1428571
[3,]  0.7142857 -1.5  0.1071429
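One way to sanity-check an inverse is to multiply it by the original matrix and compare the product to the identity matrix. The sketch below does this for mat_b using base R only. Because of floating-point rounding, the comparison uses all.equal() rather than ==. It also shows solve() with a second argument, which solves a linear system directly; the right-hand side rhs is an arbitrary vector chosen for illustration.

```r
mat_b <- matrix( c( 3, 2, 8, 6, 3, 2, 5, 2, 4 ), nrow=3, ncol=3 )
inv_b <- solve( mat_b )

# inverse times original should be (approximately) the 3 x 3 identity
is_identity <- isTRUE( all.equal( inv_b %*% mat_b, diag( 3 ) ) )
print( is_identity )
# [1] TRUE

# solve( A, b ) finds x such that A %*% x equals b
rhs <- c( 1, 2, 3 )
x <- solve( mat_b, rhs )
print( isTRUE( all.equal( as.vector( mat_b %*% x ), rhs ) ) )
# [1] TRUE
```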

Data Frame

A data frame is a table, a two-dimensional structure where each column contains values for one attribute or property, and each row contains a sample with one value for every attribute (column). Data frames extend matrices in a way that is similar to how lists extend vectors. Perhaps most importantly, the data frame columns can contain different types of values from one another. Each row and column is named, either explicitly or implicitly (e.g., you may choose to allow R to number the rows sequentially starting at 1). The following guidelines apply to data frames.

  1. Column names are non-empty.
  2. Row names are unique.
  3. Data in a data frame is typically numeric, character, or factor, although other types such as dates are also allowed.
  4. Each column has the same number of values.

The simplest way to create a data frame is to define its column names and values during data frame initialization. The row names can either be defined or left to the default \(1 \ldots n\) values.

% df <- data.frame(
%   emp_id = c(1:5),
%   emp_nm = c( "rick", "dan", "michelle", "ryan", "gary" ),
%   salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
%   start_dt = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11", "2015-03-27"))
% )
% print(df)
  emp_id   emp_nm salary   start_dt
1      1     rick 623.30 2012-01-01
2      2      dan 515.20 2013-09-23
3      3 michelle 611.00 2014-11-15
4      4     ryan 729.00 2014-05-11
5      5     gary 843.25 2015-03-27
% str(df)
'data.frame': 5 obs. of 4 variables:
 $ emp_id  : int 1 2 3 4 5
 $ emp_nm  : chr "rick" "dan" "michelle" "ryan" ...
 $ salary  : num 623 515 611 729 843
 $ start_dt: Date, format: "2012-01-01" "2013-09-23" ...

Data in a data frame can be summarized using the summary() function. This returns the minimum, median, mean, maximum, and the 1st and 3rd quartile boundaries.

% print( summary( df ) )
     emp_id     emp_nm              salary         start_dt
 Min.   :1   Length:5           Min.   :515.2   Min.   :2012-01-01
 1st Qu.:2   Class :character   1st Qu.:611.0   1st Qu.:2013-09-23
 Median :3   Mode  :character   Median :623.3   Median :2014-05-11
 Mean   :3                      Mean   :664.4   Mean   :2014-01-14
 3rd Qu.:4                      3rd Qu.:729.0   3rd Qu.:2014-11-15
 Max.   :5                      Max.   :843.2   Max.   :2015-03-27

Individual columns can be extracted from a data frame by using their respective names. Rows are extracted by indexing on row position, optionally combined with column positions when only a subset of columns is requested. Columns are returned as vectors. Rows are returned in a data frame.

% print( df$emp_nm )
[1] "rick"     "dan"      "michelle" "ryan"     "gary"
% str( df$emp_nm )
 chr [1:5] "rick" "dan" "michelle" "ryan" "gary"
% print( df[1:2,] )
  emp_id emp_nm salary   start_dt
1      1   rick  623.3 2012-01-01
2      2    dan  515.2 2013-09-23
% print( df[c(3,5), c(2,4)] )
    emp_nm   start_dt
3 michelle 2014-11-15
5     gary 2015-03-27
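The difference in return types can be checked with is.data.frame(). The sketch below uses a smaller data frame of our own with three employees. It also shows the drop=FALSE argument, which keeps a single column as a data frame instead of simplifying it to a vector.

```r
df <- data.frame( emp_id=1:3,
                  emp_nm=c( "rick", "dan", "michelle" ),
                  salary=c( 623.3, 515.2, 611.0 ) )

print( is.data.frame( df$salary ) )                       # a column is a vector
# [1] FALSE
print( is.data.frame( df[ 1, ] ) )                        # a row stays a data frame
# [1] TRUE
print( is.data.frame( df[ , "salary" ] ) )                # single column drops to a vector
# [1] FALSE
print( is.data.frame( df[ , "salary", drop=FALSE ] ) )    # drop=FALSE keeps the data frame
# [1] TRUE
```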

If you want to add columns to a data frame, you can simply define a new column with a given name and assign a vector to it. The vector must be the same length as the existing columns in the data frame, that is, it must have a value for every row in the data frame. You can explicitly set the length of the vector to the number of rows in the data frame to guarantee this. If the vector is too short, setting its length fills the missing positions with NA (not available). If the vector is too long, setting its length truncates it to match the number of rows in the data frame.

% dept_c <- c( "it", "finance", "hr" )
% vacn_c <- c( "jun", "jan", "may", "dec", "oct", "mar" )
% length( dept_c ) <- nrow( df )
% length( vacn_c ) <- nrow( df )
% df$dept <- dept_c
% df$vacation <- vacn_c
% print(df)
  emp_id   emp_nm salary   start_dt    dept vacation
1      1     rick 623.30 2012-01-01      it      jun
2      2      dan 515.20 2013-09-23 finance      jan
3      3 michelle 611.00 2014-11-15      hr      may
4      4     ryan 729.00 2014-05-11    <NA>      dec
5      5     gary 843.25 2015-03-27    <NA>      oct
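
To see why the length is set explicitly, the following sketch tries the assignment without it. The data frame emp is a hypothetical stand-in, and tryCatch() is used here only so the error can be reported without halting the script.

```r
emp <- data.frame( emp_id = 1:5 )       # hypothetical 5-row data frame
dept_c <- c( "it", "finance", "hr" )    # only 3 values

# Assigning a 3-element vector to a 5-row data frame raises an error
res <- tryCatch(
  {
    emp$dept <- dept_c
    "assigned"
  },
  error = function( e ) "error"
)
print( res )

# Padding the vector to nrow( emp ) first makes the assignment safe
length( dept_c ) <- nrow( emp )
emp$dept <- dept_c
print( emp )
```

One caveat: a vector whose length evenly divides the number of rows is recycled rather than rejected, so padding explicitly is the more predictable habit.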

A similar approach can be used to add rows to a data frame. First, create a list with the value(s) you want in the new row. Set the length() of the list to be equal to the number of columns in the data frame that will hold the new row. Unfortunately, if the list is shorter than required, empty positions will be filled with NULL rather than NA, so we need to convert any NULL entries (which compare equal to the string "NULL") to NA. Once this is done, we set the names of the list's entries to match the column names of the existing data frame, then use rbind() to bind the data frame and the list together, producing a result with the new row added to the end of the existing data frame.

% new_r <- list( 6, "bill", 725.5, as.Date( "2015-01-15" ) )
% length( new_r ) <- ncol( df )
% new_r[ new_r == "NULL" ] <- NA    # convert NULL entries on short list to NA
% names( new_r ) <- names( df )     # match names on list and data frame
% df <- rbind( df, new_r )          # append new row to data frame
% print( df )
  emp_id   emp_nm salary   start_dt    dept vacation
1      1     rick 623.30 2012-01-01      it      jun
2      2      dan 515.20 2013-09-23 finance      jan
3      3 michelle 611.00 2014-11-15      hr      may
4      4     ryan 729.00 2014-05-11    <NA>      dec
5      5     gary 843.25 2015-03-27    <NA>      oct
6      6     bill 725.50 2015-01-15    <NA>     <NA>
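
An alternative that avoids the NULL conversion step is to build the new row as a one-row data frame with the same column names. This is a sketch on a hypothetical data frame emp, not the approach used above.

```r
# Hypothetical data frame used only for this illustration
emp <- data.frame(
  emp_id = c( 1, 2 ),
  emp_nm = c( "rick", "dan" ),
  dept   = c( "it", "finance" )
)

# New row as a one-row data frame; the unknown dept is given as NA directly
new_r <- data.frame( emp_id = 3, emp_nm = "michelle", dept = NA )

emp <- rbind( emp, new_r )    # column names must match for rbind()
print( emp )
```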

One of the most useful operations on data frames is conditional indexing. Here, rows in a data frame are extracted based on conditions applied to a row's column values. Only rows whose values meet the conditions are returned. Consider the following example, where we use the chickwts dataset to extract chicks fed with sunflower seeds.

% chick_df <- data.frame( chickwts )
% summary( chick_df )
     weight             feed
 Min.   :108.0   casein   :12
 1st Qu.:204.5   horsebean:10
 Median :258.0   linseed  :12
 Mean   :261.3   meatmeal :11
 3rd Qu.:323.5   soybean  :14
 Max.   :423.0   sunflower:12
%
% cond_idx <- ( chick_df[ "feed" ] == "sunflower" )
% print( c( cond_idx ) )
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
%
% sub_df <- chick_df[ cond_idx, ]
% print( sub_df )
   weight      feed
37    423 sunflower
38    340 sunflower
39    392 sunflower
40    339 sunflower
41    341 sunflower
42    226 sunflower
43    320 sunflower
44    295 sunflower
45    334 sunflower
46    322 sunflower
47    297 sunflower
48    318 sunflower
%
% cond_idx <- ( chick_df[ "feed" ] == "sunflower" ) & ( chick_df[ "weight" ] > 350 )
% sub_df <- chick_df[ cond_idx, ]
% print( sub_df )
   weight      feed
37    423 sunflower
39    392 sunflower

The conditional index cond_idx contains TRUE for rows that meet the condition of a chick fed sunflower seeds and FALSE otherwise. Notice that when we extract the subset of rows, we include a comma in the index operation, sub_df <- chick_df[ cond_idx, ]. This is necessary to extract the entire row with all its columns. The result is twelve chicks from the original data frame. The second example shows how you can specify multiple conditions using the Boolean AND operator &.
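
The same kind of conditional index can be written with column vectors extracted by $, combined with the element-wise OR operator |, or expressed with the base R subset() function. The sketch below is one possible variation on the built-in chickwts dataset; the variable names are chosen for illustration.

```r
# Conditions built from column vectors rather than one-column data frames
heavy_sun <- chickwts[ chickwts$feed == "sunflower" & chickwts$weight > 350, ]
print( heavy_sun )

# | is the element-wise OR operator: chicks fed casein or horsebean
cas_or_horse <- chickwts[ chickwts$feed == "casein" | chickwts$feed == "horsebean", ]
print( nrow( cas_or_horse ) )

# subset() applies the same row filter without repeating the data frame name
heavy_sun_2 <- subset( chickwts, feed == "sunflower" & weight > 350 )
print( heavy_sun_2 )
```

Note that & and | operate element by element on vectors, which is what indexing needs, whereas && and || are meant for single TRUE or FALSE values, as in an if condition.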

Conditionals

We've already seen that an R program runs by executing the first statement in the code and continuing with each successive statement until it reaches the end of the program. This doesn't allow for very complicated programs. What if we want to control the flow of execution, that is, what if we want one part of the program to be executed in some cases, but another part to be executed in different cases?

Conditional statements allow you to control how your program executes. For example, a conditional statement could apply a comparison operator to a variable, then execute a different block of statements depending on the result of the comparison. Or, it could cause a block of statements to be executed repeatedly until some condition is met.

Understanding conditional statements is necessary for writing even moderately complicated programs. We discuss some common R conditional operators below, and give details on how to structure your code within a conditional statement.

if-then-else

To start, we'll discuss the if-then-else conditional. Described in simple terms, this is used in a program to say, "if some condition is true, then do this, else do that."

As an example, suppose we have a variable grade that holds a student's numeric grade on the range 0–100. We want to define a new variable passed that's set to TRUE if the student's grade is 50 or higher, or FALSE if the grade is less than 50. The following R conditional will do this.

% grade <- 75
% if ( grade >= 50 ) {
%   passed <- TRUE
% } else {
%   passed <- FALSE
% }
%
% print( passed )
[1] TRUE

Although this statement appears simple, there are a number of important details to discuss.

Interestingly, the else part of the conditional is optional. The following code will produce the same result as the first example.

% passed <- FALSE
% grade <- 75
% if ( grade >= 50 )
%   passed <- TRUE
% print( passed )
[1] TRUE

Suppose we wanted to not only define pass or fail, but also assign a letter grade for the student. We could use a series of if-then statements, one for each possible letter grade. A better way is to use else if, which defines else-if code blocks. Now, we're telling a program, "if some condition is true, then do this, else if some other condition is true, then do this, else do that." You can include as many else-if statements as you want in an if-then-else conditional.

% grade <- 75
% if ( grade >= 90 ) {
%   passed <- TRUE
%   letter <- 'A'
% } else if ( grade >= 80 ) {
%   passed <- TRUE
%   letter <- 'B'
% } else if ( grade >= 65 ) {
%   passed <- TRUE
%   letter <- 'C'
% } else if ( grade >= 50 ) {
%   passed <- TRUE
%   letter <- 'D'
% } else {
%   passed <- FALSE
%   letter <- 'F'
% }
% print( passed )
[1] TRUE
% print( letter )
[1] "C"
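
The order of the conditions in an else-if chain matters, because they are tested from top to bottom and only the first block whose condition is TRUE is executed. A minimal sketch using a boundary grade of 80 (a value chosen here for illustration):

```r
grade <- 80
if ( grade >= 90 ) {
  letter <- "A"
} else if ( grade >= 80 ) {    # first condition that is TRUE, so this block runs
  letter <- "B"
} else if ( grade >= 65 ) {    # also TRUE for 80, but never tested
  letter <- "C"
} else {
  letter <- "F"
}
print( letter )
```

Also note the layout: when typed at the prompt or at the top level of a script, else must appear on the same line as the closing brace of the preceding block, otherwise R treats the if statement as complete and reports a syntax error at the else.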

while

Another common situation is the need to execute a code block until some condition is met. This is done with a while conditional. Here, we're telling the program "while some condition is true, do this." For example, suppose we wanted to print the square roots of values on the range 1–15.

% i <- 1
% while ( i <= 15 ) {
%   print( paste( "The square root of", i, "is", sqrt( i ) ) )
%   i <- i + 1
% }
[1] "The square root of 1 is 1"
[1] "The square root of 2 is 1.4142135623731"
…
[1] "The square root of 15 is 3.87298334620742"

Notice that the variable that's compared in the while conditional normally must be updated in the conditional's code block. If you don't update the conditional variable, a comparison that initially evaluates to TRUE will never evaluate to FALSE, which means the while loop will execute forever. For example, consider the following code block.

% i <- 1
% while ( i <= 15 ) {
%   print( paste( "The square root of", i, "is", sqrt( i ) ) )
% }
[1] "The square root of 1 is 1"
[1] "The square root of 1 is 1"
[1] "The square root of 1 is 1"
[1] "The square root of 1 is 1"
[1] "The square root of 1 is 1"
[1] "The square root of 1 is 1"
[1] "The square root of 1 is 1"
…

Without the i <- i + 1 statement to update i in the conditional's code block, the while conditional never fails, giving us the same output over and over. You can use Ctrl+C to halt your program if it's caught in an infinite loop like this.
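
One defensive habit, sketched below, is to add an iteration cap to the while condition so that a mistake in the update step cannot loop forever. The variable max_iter and its value of 1000 are arbitrary assumptions made for this example.

```r
x <- 100
iter <- 0
max_iter <- 1000    # hypothetical safety cap, chosen arbitrarily

# Halve x until it drops to 1 or below, but never loop more than max_iter times
while ( x > 1 && iter < max_iter ) {
  x <- x / 2
  iter <- iter + 1
}
print( iter )
print( x )
```

If the loop ever stops with iter equal to max_iter, that is a sign the intended exit condition was never reached.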

for

A final conditional that is very common is a for loop. Here, we're telling a program "execute this code block for some list of values." A for loop can work on any list of values, but it's often applied to a numeric range. Numbers separated with a colon can be used to create an inclusive sequence of numerics in one-unit increments.

% print( 5:10 )
[1]  5  6  7  8  9 10
% print( -15:-10 )
[1] -15 -14 -13 -12 -11 -10
% print( 1.5:3.5 )
[1] 1.5 2.5 3.5

Specifying two values like 2:5 defines a starting value of 2 and an ending value of 5. This generates an integer list from the starting value, up to and including the ending value: 2 3 4 5. If you want to increment by a value other than one, you can use the seq function.

% print( seq( 0, 10, 2 ) )
[1]  0  2  4  6  8 10
% print( seq( 6, 8, 0.25 ) )
[1] 6.00 6.25 6.50 6.75 7.00 7.25 7.50 7.75 8.00
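
A few further ways to build sequences are sketched below, using the named seq() arguments by and length.out plus the helper seq_len(). The variable names are chosen for illustration.

```r
down     <- 10:5                          # the colon operator also counts down
evenly   <- seq( 0, 1, length.out = 5 )   # 5 evenly spaced values from 0 to 1
rev_step <- seq( 10, 0, by = -2.5 )       # a negative increment counts down
first_4  <- seq_len( 4 )                  # the integers 1 to 4

print( down )
print( evenly )
print( rev_step )
print( first_4 )
```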

Once a list is produced with colon or seq, each value in the list is given to the for conditional's code block, in order. For example, suppose we wanted to print the same set of square roots from 1–15 using a for loop.

% for ( i in 1:15 ) {
%   print( paste( "The square root of", i, "is", sqrt( i ) ) )
% }
[1] "The square root of 1 is 1"
[1] "The square root of 2 is 1.4142135623731"
…
[1] "The square root of 15 is 3.87298334620742"

The for statement defines a variable to hold the "current" list value. In our case, this variable is called i. 1:15 generates the list 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15. The for conditional walks through this list and executes the code block 15 times, first with i set to 1, then with i set to 2, and so on up to the final value of 15. The statement inside the code block uses i to track the current list value, printing square roots from 1 to 15.

We don't need to use colon or seq to execute a for conditional. Any vector or list can be used in a for loop.

% fruit <- list( "apple", "banana", "cherry" )
% for ( nm in fruit ) {
%   print( paste( nm, '(', nchar( nm ), ')' ) )
% }
[1] "apple ( 5 )"
[1] "banana ( 6 )"
[1] "cherry ( 6 )"
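
When the position of each element is needed as well as its value, seq_along() generates the index sequence. The sketch below also shows why it is usually a safer choice than 1:length() when a vector might be empty.

```r
fruit <- c( "apple", "banana", "cherry" )
for ( i in seq_along( fruit ) ) {
  print( paste( i, fruit[ i ] ) )
}

empty <- character( 0 )
n_iter <- 0
for ( i in seq_along( empty ) ) {    # body never runs for an empty vector
  n_iter <- n_iter + 1
}
print( n_iter )

# 1:length( empty ) is 1:0, the two values 1 and 0, so a loop over it
# would run twice with invalid indices
print( 1:length( empty ) )
```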

break and next

break

Sometimes we need to exit a for or while loop before its condition evaluates to FALSE. The break statement allows us to do this. For example, suppose we wanted to print the elements of a list of strings, but terminate examining the list if we see the string stop.

% fruit <- list( "apple", "banana", "cherry" )
% for ( i in 1:length( fruit ) ) {
%   if ( fruit[[ i ]] == "stop" ) {
%     break
%   }
%   print( fruit[[ i ]] )
% }
[1] "apple"
[1] "banana"
[1] "cherry"
% fruit <- append( fruit, "stop", 1 )
% str( fruit )
List of 4
 $ : chr "apple"
 $ : chr "stop"
 $ : chr "banana"
 $ : chr "cherry"
% for ( i in 1:length( fruit ) ) {
%   if ( fruit[[ i ]] == "stop" ) {
%     break
%   }
%   print( fruit[[ i ]] )
% }
[1] "apple"

next

Other times, we want to stop executing a loop's code block, and instead return to check its condition. The next statement allows us to do this. For example, suppose we wanted to print only the odd numbers from 1 to 10.

% for ( i in 1:10 ) {
%   if ( i %% 2 == 0 ) {
%     next
%   }
%   print( paste( i, "is odd" ) )
% }
[1] "1 is odd"
[1] "3 is odd"
[1] "5 is odd"
[1] "7 is odd"
[1] "9 is odd"
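
break and next can be combined in a single loop. In the sketch below, the vector vals and the convention that a negative number marks the end of the data are assumptions made for this example.

```r
vals <- c( 4, NA, 9, 16, -1, 25 )    # hypothetical data; -1 acts as a sentinel
roots <- c()

for ( v in vals ) {
  if ( is.na( v ) ) {
    next     # skip missing values and move to the next element
  }
  if ( v < 0 ) {
    break    # negative sentinel: leave the loop entirely
  }
  roots <- c( roots, sqrt( v ) )
}
print( roots )
```

The value 25 is never processed because the loop exits when it reaches -1.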

Loop Practice Problem

Write a set of R statements to compute the average of the following list of numbers: 6, 12, -7, 29, 14, 38, 11, 7.

I recommend you write your program using RStudio and then test it, rather than writing the program directly in the R shell. This will let you write your code, run it to see what it does, edit it to fix problems, and run it again, without having to re-type the entire program at the command line.

List Average Solution

for loop

% num <- c( 6, 12, -7, 29, 14, 38, 11, 7 )
% sum <- 0
% for ( n in num ) {
%   sum <- sum + n
% }
% print( sum / length( num ) )
[1] 13.75

while loop

% num <- c( 6, 12, -7, 29, 14, 38, 11, 7 )
% i <- 1
% sum <- 0
% while ( i <= length( num ) ) {
%   sum <- sum + num[ i ]
%   i <- i + 1
% }
% print( sum / length( num ) )
[1] 13.75

Notice that we do not need to convert the sum to a floating point value to get the proper average of 13.75. R stores numeric values as double-precision floating point by default, and the / operator performs floating point division even when both operands are integers. Integer division, which would return 13 here, must be requested explicitly with the %/% operator.
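
As a quick check on either loop, the built-in mean() and sum() functions give the same results directly. A minimal sketch, with variable names chosen for illustration:

```r
num <- c( 6, 12, -7, 29, 14, 38, 11, 7 )

avg <- mean( num )                          # built-in average
int_avg <- sum( num ) %/% length( num )     # integer division, for comparison

print( avg )
print( int_avg )
```

This sketch calls the built-in sum(), whereas the solutions above store their running total in a variable that is also named sum. R still finds the function when sum( ... ) is called, but a different variable name such as total avoids confusion.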

You can download the solution file and run it on your machine, if you want.

Debugging

Inevitably, you'll write some R code that either doesn't do what you expect it to do, or that generates an error message when you try to execute it. When that happens, you'll need to debug the program to locate and correct the error. Consider the following code.

% l <- c( "10", "20", "30" )
% sum <- 0
% for ( val in l ) {
%   sum <- sum + val
% }

If you hit return to close the for loop, R would respond with an error message similar to this.

Error in sum + val: non-numeric argument to binary operator

So, that didn't work. The error message shows the snippet of code that caused the error, and what the error was. The important part of the error is the attempt to explain the problem R encountered. This explanation suggests that R doesn't know how to add (+) non-numeric arguments.

If you look at where the error was reported, it attempted to execute sum <- sum + val. R is claiming one of the variables sum or val is non-numeric. Indeed, sum is numeric, but the second variable val is a character. val is a value from the vector l. And, when we look at l, we see that it contains three character values: "10", "20", and "30". This is the problem that R encountered.

There are various ways to fix this problem. One simple solution is to put numerics in the vector, l <- c( 10, 20, 30 ). If you wanted l to contain strings for some reason, you could convert val to a numeric in the add operation using as.numeric().

% l <- c( "10", "20", "30" )
% sum <- 0
% for ( val in l ) {
%   sum <- sum + as.numeric( val )
% }
% print( sum )
[1] 60

Now, R accepts the for loop's body because it understands how to add two numeric variables. The resulting sum is printed after the loop finishes.
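
When tracking down a type problem like this one, class() and str() are convenient for inspecting what a variable actually holds. The sketch below also converts the whole vector at once instead of converting one value per loop iteration; the string "ten" is a hypothetical bad input.

```r
l <- c( "10", "20", "30" )

print( class( l ) )    # reports the type of the vector's values
str( l )               # compact summary of type, length, and contents

# as.numeric() converts every element of the vector in one call
total <- sum( as.numeric( l ) )
print( total )

# A string that is not a number converts to NA (with a warning)
bad <- suppressWarnings( as.numeric( "ten" ) )
print( is.na( bad ) )
```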

Files

One important operation when using R is to read and write data to and from external files. R uses file input/output (file IO) operations to support this. The most common read operations import a text file formatted as a table, or import a text file stored in comma-separated value (CSV) format.

airquality.txt:
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5

% aq_tbl <- read.table( "w:/msa/R/airquality.txt" )
% str( aq_tbl )
'data.frame': 5 obs. of 6 variables:
 $ Ozone  : int 41 36 12 18 NA
 $ Solar.R: int 190 118 149 313 NA
 $ Wind   : num 7.4 8 12.6 11.5 14.3
 $ Temp   : int 67 72 74 62 56
 $ Month  : int 5 5 5 5 5
 $ Day    : int 1 2 3 4 5

airquality.csv:
Ozone,Solar.R,Wind,Temp,Month,Day
1,41,190,7.4,67,5,1
2,36,118,8.0,72,5,2
3,12,149,12.6,74,5,3
4,18,313,11.5,62,5,4
5,NA,NA,14.3,56,5,5

% aq_tbl <- read.csv( "w:/msa/R/airquality.csv" )
% str( aq_tbl )
'data.frame': 5 obs. of 6 variables:
 $ Ozone  : int 41 36 12 18 NA
 $ Solar.R: int 190 118 149 313 NA
 $ Wind   : num 7.4 8 12.6 11.5 14.3
 $ Temp   : int 67 72 74 62 56
 $ Month  : int 5 5 5 5 5
 $ Day    : int 1 2 3 4 5

Notice that in both cases the result is stored as a data.frame. Because the first line of each file has one fewer field than the lines that follow, R treats it as a header of column names and uses the first entry of each subsequent line as a row name. If the file has no header line, specify header=FALSE as an argument in read.table or read.csv. Columns will be given generic names V1 V2 … Vn. If the rows are not labelled, specify row.names=NULL. Rows will be numbered starting at 1.

airquality-no-nm.txt:
41 190  7.4 67 5 1
36 118  8.0 72 5 2
12 149 12.6 74 5 3
18 313 11.5 62 5 4
NA  NA 14.3 56 5 5

% aq_tbl <- read.table( "w:/msa/R/airquality-no-nm.txt", header=FALSE, row.names=NULL )
% names( aq_tbl )
[1] "V1" "V2" "V3" "V4" "V5" "V6"
%
% rownames( aq_tbl )
[1] "1" "2" "3" "4" "5"
%
% str( aq_tbl )
'data.frame': 5 obs. of 6 variables:
 $ V1: int 41 36 12 18 NA
 $ V2: int 190 118 149 313 NA
 $ V3: num 7.4 8 12.6 11.5 14.3
 $ V4: int 67 72 74 62 56
 $ V5: int 5 5 5 5 5
 $ V6: int 1 2 3 4 5

If you want to define the column or row names after the file is read you can use the names or rownames functions. Finally, if your file uses a separator character (a delimiter) other than comma, there is a read.delim function to allow you to read a delimited file. Define the delimiter with the sep argument; if it is omitted, read.delim assumes tab-separated values.

airquality.sep:
Ozone;Solar.R;Wind;Temp;Month;Day
1;41;190;7.4;67;5;1
2;36;118;8.0;72;5;2
3;12;149;12.6;74;5;3
4;18;313;11.5;62;5;4
5;NA;NA;14.3;56;5;5

% aq_tbl <- read.delim( "w:/msa/R/airquality.sep", sep=";" )
% str( aq_tbl )
'data.frame': 5 obs. of 6 variables:
 $ Ozone  : int 41 36 12 18 NA
 $ Solar.R: int 190 118 149 313 NA
 $ Wind   : num 7.4 8 12.6 11.5 14.3
 $ Temp   : int 67 72 74 62 56
 $ Month  : int 5 5 5 5 5
 $ Day    : int 1 2 3 4 5

Writing data frames to an output file is similar. R provides the write.table and write.csv functions to write data frames as tabular or CSV data. You provide the data frame to write and the path and name of the file to create, for example, write.table( aq_tbl, file="w:/msa/R/airquality-new.txt" ).
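
A write followed by a read can be tried without touching any real directory by using a temporary file. The sketch below uses the first five rows of R's built-in airquality dataset; the temporary path from tempfile() and the choice of row.names=FALSE are assumptions made for this example.

```r
aq <- head( airquality, 5 )    # first five rows of the built-in dataset

# Write to a temporary CSV file, omitting the row names column
csv_path <- tempfile( fileext = ".csv" )
write.csv( aq, file = csv_path, row.names = FALSE )

# Read the file back and compare it with the original
aq_back <- read.csv( csv_path )
str( aq_back )
print( identical( names( aq ), names( aq_back ) ) )

unlink( csv_path )             # remove the temporary file
```

With row.names = FALSE every line of the file, including the header, has the same number of fields, so read.csv does not treat the first column as row names.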

Functions

It's possible to write a program as a single, long sequence of statements in the main module. Even for small programs, however, this isn't efficient. First, writing a program this way makes it difficult to examine and understand. Second, if you're performing a common operation on different variables, you need to duplicate the code every time you perform that operation.

For example, suppose we wanted to report the maximum of two numeric lists l and m. One obvious way to do it is to write two for loops.

% l <- list( 1, 2, 3 )
% m <- list( 7, 8, 14 )
% max_v <- l[[ 1 ]]
% for ( elem in l ) {
%   if ( elem > max_v ) {
%     max_v <- elem
%   }
% }
% print( max_v )
[1] 3
% max_v <- m[[ 1 ]]
% for ( elem in m ) {
%   if ( elem > max_v ) {
%     max_v <- elem
%   }
% }
% print( max_v )
[1] 14

This has a number of problems, however. What if we had more than just two lists we wanted to query? We'd need to duplicate the for loop once for each list. What if we wanted to do something more complicated than calculating the maximum (e.g., what if we wanted variance instead)? The amount of code we'd need to duplicate would be much longer.

What we really want to do is to have some sort of max_val() operation that we can call whenever we want to calculate the maximum value of a numeric list.

% l <- list( 1, 2, 3 )
% m <- list( 7, 8, 14 )
% print( max_val( l ) )
[1] 3
% print( max_val( m ) )
[1] 14

In R we can define a function to create new operations like max_val(). A function is defined by a function name, the keyword function, an optional argument list in parentheses, and then a function code block that defines what the function does when it's called.

% max_val <- function( num_l ) {
%   if ( length( num_l ) <= 0 ) {
%     return( NULL )
%   }
%   max_v <- num_l[[1]]
%   for ( elem in num_l ) {
%     if ( elem > max_v ) {
%       max_v <- elem
%     }
%   }
%   return( max_v )
% }

Functions can take zero or more arguments. A function with no arguments still needs open and close parentheses, func <- function(). A function with multiple arguments separates them with commas, func <- function( a, b ). Once a function is defined, it can be used anywhere, including in other functions. Suppose we now wanted to write a function max_val_list() to compute the maximum value from a list of numeric lists. We can use our max_val() function to help to do this.
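
As a brief illustration of argument lists, the sketch below defines one function with no arguments and one with two. The function names greet and rect_area are hypothetical, and the default value h = 1 goes slightly beyond the text above: it shows that an argument can be given a value to use when the caller omits it.

```r
# A function with no arguments still needs parentheses
greet <- function() {
  return( "hello" )
}

# A function with two arguments; h defaults to 1 if it is not supplied
rect_area <- function( w, h = 1 ) {
  return( w * h )
}

print( greet() )
print( rect_area( 3, 4 ) )            # arguments matched by position
print( rect_area( 3 ) )               # h takes its default value
print( rect_area( h = 2, w = 5 ) )    # arguments matched by name, in any order
```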

% max_val_list <- function( list_of_list ) {
%   list_n <- tryCatch(  # get number of lists, catch if variable is undefined
%     {
%       length( list_of_list )
%     },
%     error = function( cond ) {
%       print( "max_val_list(): list of lists is undefined or invalid" )
%       return( NULL )
%     }
%   )
%   if ( is.null( list_n ) ) {  # error handler ran, nothing to examine
%     return( NULL )
%   }
%   if ( !is.list( list_of_list ) ) {  # ensure argument is a list
%     print( "max_val_list(): argument is not a list")
%     return( NULL )
%   }
%   if ( list_n <= 0 ) {  # ensure at least one list to examine
%     print( "max_val_list(): list of lists is empty")
%     return( NULL )
%   }
%   max_v <- list_of_list[[1]][[1]]
%   for ( l in list_of_list ) {
%     max_v_list <- max_val( l )
%     if ( max_v_list > max_v ) {
%       max_v <- max_v_list
%     }
%   }
%   return( max_v )
% }
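
Since tryCatch() appears above without much explanation, here is a minimal sketch of how it behaves. The function safe_sqrt is hypothetical. The key point is that the value produced by the expression, or by whichever handler runs, becomes the value of the tryCatch() call itself; a return() inside a handler leaves only the handler, not the enclosing function.

```r
safe_sqrt <- function( x ) {
  result <- tryCatch(
    {
      sqrt( x )                     # expression to attempt
    },
    warning = function( cond ) {    # e.g. sqrt( -1 ) raises a warning
      return( NA )
    },
    error = function( cond ) {      # e.g. sqrt( "a" ) raises an error
      return( NULL )
    }
  )
  return( result )
}

print( safe_sqrt( 16 ) )
print( safe_sqrt( -1 ) )
print( is.null( safe_sqrt( "a" ) ) )
```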

It's even possible for functions to call themselves. This is known as recursion. The classic example of recursion is the Fibonacci sequence. However, we'll demonstrate recursion by developing a recursive algorithm to solve Sudoku, a puzzle where a \(9 \times 9\) grid partially filled with numbers from 1 to 9 is completed based on the following rules.

  1. Every row contains one occurrence of the numbers 1 to 9.
  2. Every column contains one occurrence of the numbers 1 to 9.
  3. Every \(3 \times 3\) panel contains one occurrence of the numbers 1 to 9.

An example Sudoku puzzle starting configuration (left) and completed solution (right)
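
Before turning to Sudoku solvers, the Fibonacci sequence mentioned earlier gives a compact picture of a function calling itself. This is a minimal sketch that assumes the convention fib( 1 ) = fib( 2 ) = 1; the function name fib is chosen for illustration.

```r
fib <- function( n ) {
  if ( n <= 2 ) {    # base cases stop the recursion
    return( 1 )
  }
  # Recursive step: the function calls itself on two smaller problems
  return( fib( n - 1 ) + fib( n - 2 ) )
}

fib_10 <- sapply( 1:10, fib )
print( fib_10 )
```

Every recursive function needs at least one base case like the n <= 2 test; without it, the calls would continue until R reports that the nesting is too deep.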

Two common approaches to solve a Sudoku puzzle are brute force and backtracking, which differ in simplicity and efficiency.

  1. Brute force. Generate all possible configurations of numbers from 1 to 9 to fill the empty cells. Try every configuration one-by-one until the correct configuration is found. Although easy to understand, the time required for this solution (in order notation) is \(\textrm{O}(9^{n^2})\) for an \(n \times n\) board, which is inefficient. Essentially, there are 9 choices for each of the 81 cells on the board, generating a state space of up to \(9^{81} \approx 2 \times 10^{77}\) boards that must be generated and checked to see if they satisfy the three required criteria for a correct Sudoku solution.
  2. Backtracking. Similar to brute force, but before you try a number, check to ensure it satisfies the Sudoku requirements. If it does, assign it and move to the next cell and do the same thing. If it doesn't, move on to the next number and check again. If no number satisfies the constraints for a cell, backtrack to the previous cell and try its next number; if every choice is exhausted in this way, no solution exists. In this approach, checking for a number to safely assign is the recursive step, since it is repeated over and over until a solution is found or no solution exists. This still has worst case performance of \(\textrm{O}(9^{n^2})\), but backtracking prunes invalid partial boards early, which significantly increases efficiency in practice.

We won't go over the backtracking solution to Sudoku, but you can look at one implementation below. Remember, the efficiency of the solution will depend on how many numbers you provide in the initial board. The fewer the known numbers, the longer it will take to locate the solution.

Sudoku Backtracking Solution

% # sudoku.r
% #  Recursive backtracking algorithm to solve a 9x9 Sudoku puzzle
% #
% #  Modification history:
% #  When:      Who:                    Comments:
% #
% #  16-Jul-24  Christopher G. Healey   Initial implementation
%
% check_location_safe <- function( board, i, j, n )
%   # Check if a given board location satisfies Sudoku constraints for a given
%   # number
%   #
%   # board:  current Sudoku board
%   # i:      row to validate
%   # j:      column to validate
%   # n:      number to validate
%   #
%   # return: TRUE if constraints satisfied, FALSE otherwise
% {
%   row_safe <- row_valid( board, i, n )
%   col_safe <- col_valid( board, j, n )
%   panel_safe <- panel_valid( board, i, j, n )
%
%   return ( row_safe && col_safe && panel_safe )
% }  # End function check_location_safe
%
%
% col_valid <- function( board, j, n )
%   # Check if column has duplicates of a given number
%   #
%   # board:  current Sudoku board
%   # j:      column to validate
%   # n:      number to validate
%   #
%   # return: FALSE if n already in column j, TRUE otherwise
% {
%   for( i in 1:9 ) {
%     # Is col position not NA (has number) but we've seen that number before?
%
%     if ( !is.na( board[ i, j ] ) && board[ i, j ] == n ) {
%       return( FALSE )
%     }
%   }
%
%   return( TRUE )
% }  # End function col_valid
%
%
% find_empty_loc <- function( board )
%   # Find an empty location on the board, if none exists the solution has been
%   # found
%   #
%   # board:  Sudoku board to query
%   #
%   # return: An empty board location as a 2-elem vec, or (NA,NA) if no empty
%   #         location
% {
%   for( i in 1:9 ) {
%     for( j in 1:9 ) {
%       if ( is.na( board[ i, j ] ) ) {
%         return( c( i, j ) )
%       }
%     }
%   }
%
%   return( c( NA, NA ) )
% }  # End function find_empty_loc
%
%
% init <- function( init_board )
%   # Initialize board based on an initial Sudoku init_board configuration
%   #
%   # init_board:  initial board configuration as strings "-" for empty or "1"-"9"
% {
%   board <- as.data.frame( matrix( NA, nrow=9, ncol=9 ) )
%
%   for ( i in 1:9 ) {
%     for ( j in 1:9 ) {
%       if ( init_board[ i, j ] == "-" ) {
%         board[ i, j ] <- NA
%       } else {
%         board[ i, j ] <- as.numeric( init_board[ i, j ] )
%       }
%     }  # End for all columns in initial init_board
%   }  # End for all rows in initial init_board
%
%   return( board )
% }  # End function init
%
%
% panel_valid <- function( board, i, j, n )
%   # Check if 3x3 panel has duplicates of a given number
%   #
%   # board:  current Sudoku board
%   # i:      row to validate
%   # j:      column to validate
%   # n:      number to validate
%   #
%   # return: FALSE if n already in the panel containing (i,j), TRUE otherwise
% {
%   ul_i <- i - ( ( i - 1 ) %% 3 )  # Find upper-left corner of panel
%   ul_j <- j - ( ( j - 1 ) %% 3 )
%
%   for( i in ul_i: ( ul_i + 2 ) ) {
%     for( j in ul_j: ( ul_j + 2 ) ) {
%       if ( !is.na( board[ i, j ] ) && board[ i, j ] == n ) {
%         return( FALSE )
%       }
%     }
%   }
%
%   return( TRUE )
% }  # End function panel_valid
%
%
% print_board <- function( board )
%   # Print the board with borders around grid and each panel
%   #
%   # board:  board to print
% {
%   for( i in 1:9 ) {
%     if ( ( i - 1 ) %% 3 == 0 ) {  # Panel horizontal border?
%       cat( "+---------+---------+---------+\n" )
%     }
%
%     for( j in 1:9 ) {
%       if ( ( j - 1 ) %% 3 == 0 ) {  # Panel vertical border?
%         cat( "|" )
%       }
%
%       if ( is.na( board[ i, j ] ) ) {
%         v <- "-"
%       } else {
%         v <- as.character( board[ i, j ] )
%       }
%       cat( paste( " ", v, " ", sep="" ) )
%     }
%     cat( "|\n" )
%   }
%   cat( "+---------+---------+---------+\n" )
% }  # End function print_board
%
%
% recursive_solve <- function( board )
%   # Recursively try values and backtrack if value violates
%   # any of the Sudoku constraints
%   #
%   # board:  current Sudoku board
%   #
%   # return: (TRUE,board) if sol'n found, (FALSE,board) otherwise
% {
%   pos <- find_empty_loc( board )
%
%   # If no empty positions, board is solved
%
%   if ( is.na( pos[ 1 ] ) && is.na( pos[ 2 ] ) ) {
%     return( list( TRUE, board ) )
%   }
%
%   i <- pos[ 1 ]
%   j <- pos[ 2 ]
%
%   for ( n in 1:9 ) {
%     if ( check_location_safe( board, i, j, n ) ) {
%       board[ i, j ] <- n  # Tentative assign n to position (i,j)
%
%       # Try to recursively solve remaining board
%
%       res <- recursive_solve( board )
%       if ( res[[ 1 ]] == TRUE ) {  # Sol'n found?
%         board <- res[[ 2 ]]
%         return( list( TRUE, board ) )
%       }
%
%       # Recursive solve attempt failed, reset tentative (i,j) to NA and
%       # try next number
%
%       board[ i, j ] <- NA
%     }
%   }
%
%   # If none of the numbers produced a solution, board can't be solved
%
%   return( list( FALSE, board ) )
% }  # End function recursive_solve
%
%
% row_valid <- function( board, i, n )
%   # Check if row has duplicates of a given number
%   #
%   # board:  current Sudoku board
%   # i:      row to validate
%   # n:      number to validate
%   #
%   # return: FALSE if n already in row i, TRUE otherwise
% {
%   for( j in 1:9 ) {
%     # Is row position not NA (has number) but we've seen that number before?
%
%     if ( !is.na( board[ i, j ] ) && board[ i, j ] == n ) {
%       return( FALSE )
%     }
%   }
%
%   return( TRUE )
% }  # End function row_valid
%
%
% # Mainline
%
% init_board <- data.frame(
%   c1 = c( "3", "5", "-", "-", "9", "-", "1", "-", "-" ),
%   c2 = c( "-", "2", "8", "-", "-", "5", "3", "-", "-" ),
%   c3 = c( "6", "-", "7", "3", "-", "-", "-", "-", "5" ),
%   c4 = c( "5", "-", "-", "-", "8", "-", "-", "-", "2" ),
%   c5 = c( "-", "-", "-", "1", "6", "9", "-", "-", "-" ),
%   c6 = c( "8", "-", "-", "-", "3", "-", "-", "-", "6" ),
%   c7 = c( "4", "-", "-", "-", "-", "6", "2", "-", "3" ),
%   c8 = c( "-", "-", "3", "8", "-", "-", "5", "7", "-" ),
%   c9 = c( "-", "-", "1", "-", "5", "-", "-", "4", "-" )
% )
%
% board <- init( init_board )
% cat( "initial board...\n" )
% print_board( board )
% cat( "\n" )
%
% colnames( board ) <- c( "c1","c2","c3","c4","c5","c6","c7","c8","c9" )
% rownames( board ) <- c( "r1","r2","r3","r4","r5","r6","r7","r8","r9" )
%
% res <- recursive_solve( board )
% if ( res[[ 1 ]] == TRUE ) {  # Sol'n found?
%   cat( "solution...\n" )
%   board <- res[[ 2 ]]
%   print_board( board )
% } else {
%   cat( "no solution found\n" )
% }

tidyverse

The tidyverse is a relatively new set of packages designed for performing data science in R. Functionality provided by tidyverse and its constituent packages is purpose-built to perform common data science tasks in a consistent and efficient manner. tidyverse is made up of a number of core packages, each designed to address a common need in data analytics.

  1. dplyr. A data manipulation package (cheat sheet).
  2. forcats. A categorical variable management package (cheat sheet).
  3. ggplot2. A visualization package (cheat sheet).
  4. lubridate. A date–time management package (cheat sheet).
  5. purrr. A package to manage functions and vectors (cheat sheet).
  6. readr. A package to simplify reading rectangular data from delimited files (cheat sheet).
  7. stringr. A set of operations for string manipulation (cheat sheet).
  8. tibble. A modern reinterpretation of a data frame.
  9. tidyr. A package of data wrangling operations to standardize data storage within tidyverse (cheat sheet).

The designers of tidyverse recommend using the book "R for Data Science, 2nd edition" as a tutorial and learning reference. The entire content of the book is also available online. Cheat sheets for each package that give an overview of the package's purpose and available functions are linked in the list above.

Tibbles

Since operations in tidyverse are performed on tibbles, it is important to understand what a tibble is and how it works. In simple terms, tibbles are extensions of data frames that are meant to make working with rows and columns of data easier.

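The "R for Data Science" book highlights two main differences between tibbles and data frames: printing and subsetting. Tibbles redefine the print statement so that, by default, only the first 10 rows and the columns that fit on screen are shown, and each column's type is reported beneath its name.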

% library(tidyverse)
%
% tib <- as_tibble( iris )  # load iris dataset into a tibble
% print( tib )
# A tibble: 150 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>
 1          5.1         3.5          1.4         0.2 setosa
 2          4.9         3            1.4         0.2 setosa
 3          4.7         3.2          1.3         0.2 setosa
 4          4.6         3.1          1.5         0.2 setosa
 5          5           3.6          1.4         0.2 setosa
 6          5.4         3.9          1.7         0.4 setosa
 7          4.6         3.4          1.4         0.3 setosa
 8          5           3.4          1.5         0.2 setosa
 9          4.4         2.9          1.4         0.2 setosa
10          4.9         3.1          1.5         0.1 setosa
# i 140 more rows

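Subsetting works in a very similar manner to data frames, which is not surprising since tibbles are built on top of data frames. You can index individual columns using $, [[...]], or by a specific column position starting at 1. Note that $ and [[...]] return the column as a vector, while indexing by position with single brackets returns a one-column tibble.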

% library(tidyverse)
%
% tib <- as_tibble( iris )  # load iris dataset into a tibble
% print( tib$Sepal.Width )
  [1] 3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 3.7 3.4 3.0 3.0 4.0 4.4 3.9 3.5
 [19] 3.8 3.8 3.4 3.7 3.6 3.3 3.4 3.0 3.4 3.5 3.4 3.2 3.1 3.4 4.1 4.2 3.1 3.2
 [37] 3.5 3.6 3.0 3.4 3.5 2.3 3.2 3.5 3.8 3.0 3.8 3.2 3.7 3.3 3.2 3.2 3.1 2.3
 [55] 2.8 2.8 3.3 2.4 2.9 2.7 2.0 3.0 2.2 2.9 2.9 3.1 3.0 2.7 2.2 2.5 3.2 2.8
 [73] 2.5 2.8 2.9 3.0 2.8 3.0 2.9 2.6 2.4 2.4 2.7 2.7 3.0 3.4 3.1 2.3 3.0 2.5
 [91] 2.6 3.0 2.6 2.3 2.7 3.0 2.9 2.9 2.5 2.8 3.3 2.7 3.0 2.9 3.0 3.0 2.5 2.9
[109] 2.5 3.6 3.2 2.7 3.0 2.5 2.8 3.2 3.0 3.8 2.6 2.2 3.2 2.8 2.8 2.7 3.3 3.2
[127] 2.8 3.0 2.8 3.0 2.8 3.8 2.8 2.8 2.6 3.0 3.4 3.1 3.0 3.1 3.1 3.1 2.7 3.2
[145] 3.3 3.0 2.5 3.0 3.4 3.0
%
% print( tib[["Sepal.Width"]] )
  [1] 3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 3.7 3.4 3.0 3.0 4.0 4.4 3.9 3.5
 [19] 3.8 3.8 3.4 3.7 3.6 3.3 3.4 3.0 3.4 3.5 3.4 3.2 3.1 3.4 4.1 4.2 3.1 3.2
 [37] 3.5 3.6 3.0 3.4 3.5 2.3 3.2 3.5 3.8 3.0 3.8 3.2 3.7 3.3 3.2 3.2 3.1 2.3
 [55] 2.8 2.8 3.3 2.4 2.9 2.7 2.0 3.0 2.2 2.9 2.9 3.1 3.0 2.7 2.2 2.5 3.2 2.8
 [73] 2.5 2.8 2.9 3.0 2.8 3.0 2.9 2.6 2.4 2.4 2.7 2.7 3.0 3.4 3.1 2.3 3.0 2.5
 [91] 2.6 3.0 2.6 2.3 2.7 3.0 2.9 2.9 2.5 2.8 3.3 2.7 3.0 2.9 3.0 3.0 2.5 2.9
[109] 2.5 3.6 3.2 2.7 3.0 2.5 2.8 3.2 3.0 3.8 2.6 2.2 3.2 2.8 2.8 2.7 3.3 3.2
[127] 2.8 3.0 2.8 3.0 2.8 3.8 2.8 2.8 2.6 3.0 3.4 3.1 3.0 3.1 3.1 3.1 2.7 3.2
[145] 3.3 3.0 2.5 3.0 3.4 3.0
%
% print( tib[2] )
# A tibble: 150 × 1
   Sepal.Width
         <dbl>
 1         3.5
 2         3
 3         3.2
 4         3.1
 5         3.6
 6         3.9
 7         3.4
 8         3.4
 9         2.9
10         3.1
# i 140 more rows
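
One subsetting difference worth seeing directly is what happens when a single column is selected with a [ row, column ] index. The sketch below assumes the tibble package is installed (it is part of tidyverse); the variable names are chosen for illustration.

```r
library( tibble )

tib <- as_tibble( iris )

vec <- tib[[ 2 ]]      # [[ ]] extracts the column as a vector
tib_col <- tib[ , 2 ]  # a tibble stays a tibble, even with one column
df_col <- iris[ , 2 ]  # a data frame drops to a vector by default

print( is.numeric( vec ) )
print( is_tibble( tib_col ) )
print( is.numeric( df_col ) )
```

Because a tibble always returns a tibble from single-bracket indexing, code written against tibbles does not need to guard against a result that unexpectedly becomes a vector.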

~ and . operators

The tilde (or twiddle) ~ operator separates the two sides of a formula: the response goes on the left and the predictors go on the right. For example, lm(wages ~ yearsed, data=df) would take a data frame df and perform linear regression using a predictor \(x\) of yearsed and a response \(y\) of wages (i.e., what is the linear relationship between years of education and wages?).
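As a minimal sketch of the formula interface, the block below fits linear models on R's built-in mtcars dataset rather than the wages data frame mentioned above. The choice of mtcars and of the mpg, wt, and hp columns is an assumption made only so the example can be run as-is.

```r
# Response on the left of ~, predictor on the right.
# mtcars ships with base R: mpg is miles per gallon, wt is weight.
fit <- lm( mpg ~ wt, data=mtcars )

# coef() returns the fitted intercept and slope as a named numeric vector
print( coef( fit ) )

# Additional predictors are added to the right-hand side with +
fit2 <- lm( mpg ~ wt + hp, data=mtcars )
print( names( coef( fit2 ) ) )
```

The formula itself is never evaluated as arithmetic. lm() only uses it to look up the named columns in the data argument.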

Inside a formula, the dot . operator is often used as shorthand for "all other columns". So aggregate(data=iris, . ~ Species, mean) would return the mean for every remaining column in the iris data frame, grouped by iris species.

% library(tidyverse)
%
% aggregate(data=iris, . ~ Species, mean )
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026

Pipe %>% Operator

The pipe operator %>% allows chaining sequences of operations and applying them to a variable in a single statement, rather than using multiple statements. This is analogous to composition in mathematics. Suppose we had a dataset \(x\) and two functions \(f()\) and \(g()\). If we wanted to apply \(f\) to \(x\), then apply \(g\) to the result, we could either create a temporary result variable, \(r = f(x)\) followed by \(g(r)\), or we could compose the two operations into a single statement: \(g(f(x))\). Pipes are used to perform a similar type of composition in R: print( x %>% f %>% g ). Here, we take the dataset x, pass it to function f(), then take the result and pass that to function g(), printing the result.
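The three styles described above can be compared side by side. This is a small sketch that uses sqrt() and sum() as stand-ins for \(f\) and \(g\); the vector x and the variable names are made up for illustration.

```r
library( magrittr )   # provides %>%; the tidyverse packages re-export it

x <- c( 1, 4, 9, 16 )

# Nested call: g( f( x ) ), read from the inside out
nested <- sum( sqrt( x ) )

# Temporary variable: r = f( x ), then g( r )
r <- sqrt( x )
temp <- sum( r )

# Pipe: read left to right, no temporary variable needed
piped <- x %>% sqrt %>% sum

print( c( nested, temp, piped ) )
```

All three forms compute the same value. The pipe version tends to pay off as the chain grows longer.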

Suppose you wanted to take the iris dataset, extract rows with a Sepal.Length greater than 6, then aggregate the mean of Petal.Length and Petal.Width grouped by Species.

% library(tidyverse)
%
% tib <- as_tibble( iris )
% sub_tib <- tib[ tib$Sepal.Length > 6, ]
% aggregate( data=sub_tib, cbind( Petal.Length, Petal.Width ) ~ Species, mean )
     Species Petal.Length Petal.Width
1 versicolor     4.585000    1.420000
2  virginica     5.682927    2.056098
%
% tib %>% .[ tib$Sepal.Length > 6, ] %>% aggregate( cbind( Petal.Length, Petal.Width ) ~ Species, mean )
     Species Petal.Length Petal.Width
1 versicolor     4.585000    1.420000
2  virginica     5.682927    2.056098

Notice the statement ".[ tib$Sepal.Length > 6, ]". By default, the pipe passes the result of the previous step as the first argument of the next function, which is how aggregate() receives its data here. The index operator is not written as an ordinary function call, so there is no first argument for the pipe to fill in. Instead, we use . to tell R that we are indexing the previous data element in the pipeline, which is the tib tibble.
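The dot placeholder is also useful when a function does take data, but not as its first argument. The sketch below pipes a filtered iris data frame into lm(), whose first argument is the formula; the particular model is an arbitrary choice for illustration.

```r
library( dplyr )   # part of tidyverse; provides filter() and re-exports %>%

# lm() expects the formula first and the data second, so the piped
# data is routed to the data argument explicitly with the dot
fit <- iris %>%
  filter( Sepal.Length > 6 ) %>%
  lm( Petal.Width ~ Petal.Length, data=. )

print( coef( fit ) )
```

Because the dot appears as a direct argument of lm(), the pipe does not also insert the data as the first argument.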

Tidy Data

In tidyverse the different packages are designed to expect tidy data. This has a specific meaning.

  1. Every variable is stored in its own column.
  2. Every observation or sample is stored in its own row.
  3. Every value is stored in its own cell in a tibble.

Making your data tidy has the added advantage of enforcing a common structure to each dataset. This makes them easier to understand and work with.

The tidyr package contains a variety of functions to take data and manipulate it into a tidy form. This is also known as data preprocessing or data wrangling. Once you start working with real data, you'll quickly discover it is rarely tidy, so a set of functions that make it easy to wrangle raw data into a tidy form is invaluable. tidyr provides functions to handle the following common cases.

Pivot

To handle variables spread over multiple columns or observations stored in multiple rows, we pivot the data using pivot_longer() or pivot_wider(), respectively. Consider the following dataset.

% library(tidyverse)
%
% tib <- as_tibble( matrix( nrow=3, ncol=3 ) )
% colnames(tib) <- c( "Country", "1999", "2000" )
% tib$Country <- c( "Afghanistan", "Brazil", "China" )
% tib$"1999" <- c(745, 37737, 212258)
% tib$"2000" <- c(2666, 80488, 213766)
% print( tib )
# A tibble: 3 × 3
  Country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

Here, year should be one column, and the value for a (country,year) pair should be a second column. Instead, individual years have been stored as separate columns. To correct this, we use pivot_longer(). pivot_longer() requires you to define which columns are values and not variables (the years 1999 and 2000, in our example), the name of the new column to move the column names to, and the name of the new column to move corresponding column values to. Assuming tib holds the original data, the following pivot_longer() command re-organizes the tibble to have separate year and cases columns.

% tib %>% pivot_longer(c("1999", "2000"), names_to = "year", values_to = "cases")
# A tibble: 6 × 3
  Country     year   cases
  <chr>       <chr>  <dbl>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766

Notice that the original columns `1999` and `2000` are replaced with year and cases. This makes the tibble longer, which explains the name of the pivot command. pivot_longer() stores cases as a double, but year is a character column because its values come from column names. If you want it to be numeric, you can use the names_transform argument.

% tib %>% pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases", names_transform = list(year=as.numeric) )
# A tibble: 6 × 3
  Country      year  cases
  <chr>       <dbl>  <dbl>
1 Afghanistan  1999    745
2 Afghanistan  2000   2666
3 Brazil       1999  37737
4 Brazil       2000  80488
5 China        1999 212258
6 China        2000 213766

The opposite situation occurs when data for a single observation or sample is spread over multiple rows. In this case, pivot_wider() is used to collect data for a common observation and store it in a single tibble row. Consider the following tibble, where each country occupies four rows and each (country, year) observation is split across two of them.

% library(vctrs)
%
% tib <- as_tibble( matrix( nrow=12, ncol=4 ) )
% colnames(tib) <- c( "Country", "Year", "Type", "Count" )
% tib$Country <- vec_rep_each( c( "Afghanistan", "Brazil", "China" ), times=4 )
% tib$Year <- vec_rep( c( 1999, 1999, 2000, 2000 ), times=3 )
% tib$Type <- vec_rep( c( "cases", "population" ), times=6 )
% tib$Count <- c( 745, 19987071, 2666, 20595360, 37737, 172006362, 80488, 174504898, 212258, 127295272, 213766, 1280428583 )
%
% print(tib)
# A tibble: 12 × 4
   Country      Year Type            Count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population  127295272
11 China        2000 cases          213766
12 China        2000 population 1280428583

Here, we need cases and population to be columns, rather than values in separate rows. To do this, we use pivot_wider() and provide the column to take the new variable names from (Type in the current tibble) and the column to take the corresponding values from (Count).

% tib %>% pivot_wider( names_from = "Type", values_from = "Count")
# A tibble: 6 × 4
  Country      Year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258  127295272
6 China        2000 213766 1280428583
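Since pivot_wider() handles the opposite situation to pivot_longer(), one way to check your understanding is a round trip: lengthen a tibble, widen it again, and compare with the original. The sketch below rebuilds the cases-by-year data with tibble() directly; the variable names wide, long, and back are made up for this example.

```r
library( dplyr )    # re-exports %>%
library( tidyr )    # pivot_longer(), pivot_wider()
library( tibble )   # tibble()

wide <- tibble(
  Country = c( "Afghanistan", "Brazil", "China" ),
  `1999` = c( 745, 37737, 212258 ),
  `2000` = c( 2666, 80488, 213766 )
)

# Move the year columns into (year, cases) pairs: 3 rows become 6
long <- wide %>%
  pivot_longer( c( `1999`, `2000` ), names_to="year", values_to="cases" )

# Spread the (year, cases) pairs back out into one column per year
back <- long %>%
  pivot_wider( names_from=year, values_from=cases )

# Compare as plain data frames to ignore tibble-specific attributes
print( all.equal( as.data.frame( wide ), as.data.frame( back ) ) )
```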

Missing Values

Another extremely common issue with raw data is missing values. These can either be explicit, where NA or some other marker indicates a value is missing, or implicit, where the value is simply not present in any form in the dataset. This leads to two important questions. First, how can we convert implicit missing values into an explicit form so we know they exist? Second, how should we manage explicit missing values? Consider the following tibble with an explicit missing value for 2015, Q4 (denoted as NA) and an implicit missing value for 2016, Q1 (not present in the tibble).

% library(tidyverse)
%
% tib <- tibble(
%   year = c( 2015, 2015, 2015, 2015, 2016, 2016, 2016 ),
%   qtr = c( 1, 2, 3, 4, 2, 3, 4 ),
%   return = c( 1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66 )
% )
%
% print( tib )
# A tibble: 7 × 3
   year   qtr return
  <dbl> <dbl>  <dbl>
1  2015     1   1.88
2  2015     2   0.59
3  2015     3   0.35
4  2015     4  NA
5  2016     2   0.92
6  2016     3   0.17
7  2016     4   2.66

To identify implicit missing values we can use pivot_wider() to put qtr in one column and individual years in additional columns. This will automatically insert NA into the tibble for any (qtr,year) pair without a value.

% tib_imp <- tib %>% pivot_wider( names_from=year, values_from=return)
%
% print( tib_imp )
# A tibble: 4 × 3
    qtr `2015` `2016`
  <dbl>  <dbl>  <dbl>
1     1   1.88  NA
2     2   0.59   0.92
3     3   0.35   0.17
4     4  NA      2.66

Now the first row shows the (previously implicit) NA for Q1 2016. However, we know from previous discussion that this is not a tidy representation of the data. Year values should be in an individual column, not stored as separate columns. As before, we can correct this with a follow-on pivot_longer().

% tib_exp <- tib %>%
%   pivot_wider( names_from=year, values_from=return) %>%
%   pivot_longer(
%     cols=c( "2015", "2016" ),
%     names_to="year",
%     values_to="return",
%     values_drop_na=TRUE
%   )
%
% print( tib_exp )
# A tibble: 6 × 3
    qtr year  return
  <dbl> <chr>  <dbl>
1     1 2015    1.88
2     2 2015    0.59
3     2 2016    0.92
4     3 2015    0.35
5     3 2016    0.17
6     4 2016    2.66

Notice we have also opted to remove NAs from the final tibble with the option values_drop_na = TRUE in the pivot_longer() function.
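The pivot round trip is the approach used in this tutorial. As an alternative sketch, tidyr also offers complete() to make implicit missing values explicit in one step, and drop_na() and replace_na() to manage explicit ones. Replacing a missing return with 0 below is purely illustrative; whether any substitute value is appropriate depends on your data.

```r
library( dplyr )    # re-exports %>%
library( tidyr )    # complete(), drop_na(), replace_na()
library( tibble )   # tibble()

tib <- tibble(
  year = c( 2015, 2015, 2015, 2015, 2016, 2016, 2016 ),
  qtr = c( 1, 2, 3, 4, 2, 3, 4 ),
  return = c( 1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66 )
)

# complete() adds a row for every (year, qtr) combination that is absent,
# filling the remaining columns with NA, so 2016 Q1 becomes explicit
tib_full <- tib %>% complete( year, qtr )
print( tib_full )

# Option 1: drop rows whose return is missing
tib_drop <- tib_full %>% drop_na( return )
print( tib_drop )

# Option 2: substitute a value for missing returns (0 is an arbitrary choice)
tib_fill <- tib_full %>% replace_na( list( return=0 ) )
print( tib_fill )
```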

Data Manipulation With dplyr

Once you have your data in a "tidy" format, you can manipulate it with functions from the dplyr package. dplyr allows you to select, filter, modify, reorder, and summarize tibbles to focus and rearrange them, and to add new columns based on values in existing columns. A set of single-table verbs is defined to manipulate rows, columns, or groups of rows.

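  1. Rows: filter(), slice(), and arrange().
  2. Columns: select(), rename(), mutate(), and relocate().
  3. Groups of rows: group_by() combined with summarize().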

The code block below demonstrates each of these dplyr functions applied to the starwars tibble.

% library( tidyverse ) % % print( starwars ) # A tibble: 87 × 14 name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <list> <list> <list> 1 Luke Skywalker 172 77 blond fair blue 19 male masculine Tatooine Human <chr [5]> <chr [2]> <chr [2]> 2 C-3PO 167 75 <NA> gold yellow 112 none masculine Tatooine Droid <chr [6]> <chr [0]> <chr [0]> 3 R2-D2 96 32 <NA> white, blue red 33 none masculine Naboo Droid <chr [7]> <chr [0]> <chr [0]> 4 Darth Vader 202 136 none white yellow 41.9 male masculine Tatooine Human <chr [4]> <chr [0]> <chr [1]> 5 Leia Organa 150 49 brown light brown 19 female feminine Alderaan Human <chr [5]> <chr [1]> <chr [0]> 6 Owen Lars 178 120 brown, grey light blue 52 male masculine Tatooine Human <chr [3]> <chr [0]> <chr [0]> 7 Beru Whitesun Lars 165 75 brown light blue 47 female feminine Tatooine Human <chr [3]> <chr [0]> <chr [0]> 8 R5-D4 97 32 <NA> white, red red NA none masculine Tatooine Droid <chr [1]> <chr [0]> <chr [0]> 9 Biggs Darklighter 183 84 black light brown 24 male masculine Tatooine Human <chr [1]> <chr [0]> <chr [1]> 10 Obi-Wan Kenobi 182 77 auburn, white fair blue-gray 57 male masculine Stewjon Human <chr [6]> <chr [1]> <chr [5]> # ℹ 77 more rows # ℹ Use `print(n = ...)` to see more rows % % # filter() % print( "row filter:" ) [1] "row filter:" % % print( starwars %>% filter( skin_color=="light", eye_color=="blue" ) ) # A tibble: 3 × 14 name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <list> <list> <list> 1 Owen Lars 178 120 brown, grey light blue 52 male masculine Tatooine Human <chr [3]> <chr [0]> <chr [0]> 2 Beru Whitesun Lars 165 75 brown light blue 47 female feminine Tatooine Human <chr [3]> <chr [0]> <chr [0]> 3 Lobot 175 79 none light blue 37 male masculine Bespin Human <chr 
[1]> <chr [0]> <chr [0]> % % # slice() % print( "row slice:" ) [1] "row slice:" % % print( starwars %>% slice(5:10) ) # return rows 5-10 inclusive # A tibble: 6 × 14 name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <list> <list> <list> 1 Leia Organa 150 49 brown light brown 19 female feminine Alderaan Human <chr [5]> <chr [1]> <chr [0]> 2 Owen Lars 178 120 brown, grey light blue 52 male masculine Tatooine Human <chr [3]> <chr [0]> <chr [0]> 3 Beru Whitesun Lars 165 75 brown light blue 47 female feminine Tatooine Human <chr [3]> <chr [0]> <chr [0]> 4 R5-D4 97 32 <NA> white, red red NA none masculine Tatooine Droid <chr [1]> <chr [0]> <chr [0]> 5 Biggs Darklighter 183 84 black light brown 24 male masculine Tatooine Human <chr [1]> <chr [0]> <chr [1]> 6 Obi-Wan Kenobi 182 77 auburn, white fair blue-gray 57 male masculine Stewjon Human <chr [6]> <chr [1]> <chr [5]> % % print( starwars %>% slice_sample( n=5 ) ) # randomly sample 5 rows # A tibble: 5 × 14 name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <list> <list> <list> 1 Cliegg Lars 183 NA brown fair blue 82 male masculine Tatooine Human <chr [1]> <chr [0]> <chr [0]> 2 Yoda 66 17 white green brown 896 male masculine <NA> Yoda's species <chr [5]> <chr [0]> <chr [0]> 3 Ki-Adi-Mundi 198 82 white pale yellow 92 male masculine Cerea Cerean <chr [3]> <chr [0]> <chr [0]> 4 Watto 137 NA black blue, grey yellow NA male masculine Toydaria Toydarian <chr [2]> <chr [0]> <chr [0]> 5 Darth Vader 202 136 none white yellow 41.9 male masculine Tatooine Human <chr [4]> <chr [0]> <chr [1]> % % print( starwars %>% slice_sample( prop=0.05 ) ) # randomly sample 5% of rows # A tibble: 4 × 14 name height mass hair_color skin_color eye_color birth_year sex gender homeworld species 
films vehicles starships <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <list> <list> <list> 1 R2-D2 96 32 <NA> white, blue red 33 none masculine Naboo Droid <chr [7]> <chr [0]> <chr [0]> 2 Raymus Antilles 188 79 brown light brown NA male masculine Alderaan Human <chr [2]> <chr [0]> <chr [0]> 3 IG-88 200 140 none metal red 15 none masculine <NA> Droid <chr [1]> <chr [0]> <chr [0]> 4 Wedge Antilles 170 77 brown fair hazel 21 male masculine Corellia Human <chr [3]> <chr [1]> <chr [1]> % % # arrange() % print( "row arrange:" ) [1] "row arrange:" % % print( starwars %>% arrange( height, mass ) ) # sort rows by height, within height by mass # A tibble: 87 × 14 name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <list> <list> <list> 1 Yoda 66 17 white green brown 896 male masculine <NA> Yoda's species <chr [5]> <chr [0]> <chr [0]> 2 Ratts Tyerel 79 15 none grey, blue unknown NA male masculine Aleen Minor Aleena <chr [1]> <chr [0]> <chr [0]> 3 Wicket Systri Warrick 88 20 brown brown brown 8 male masculine Endor Ewok <chr [1]> <chr [0]> <chr [0]> 4 Dud Bolt 94 45 none blue, grey yellow NA male masculine Vulpter Vulptereen <chr [1]> <chr [0]> <chr [0]> 5 R2-D2 96 32 <NA> white, blue red 33 none masculine Naboo Droid <chr [7]> <chr [0]> <chr [0]> 6 R4-P17 96 NA none silver, red red, blue NA none feminine <NA> Droid <chr [2]> <chr [0]> <chr [0]> 7 R5-D4 97 32 <NA> white, red red NA none masculine Tatooine Droid <chr [1]> <chr [0]> <chr [0]> 8 Sebulba 112 40 none grey, red orange NA male masculine Malastare Dug <chr [1]> <chr [0]> <chr [0]> 9 Gasgano 122 NA none white, blue black NA male masculine Troiken Xexto <chr [1]> <chr [0]> <chr [0]> 10 Watto 137 NA black blue, grey yellow NA male masculine Toydaria Toydarian <chr [2]> <chr [0]> <chr [0]> # ℹ 77 more rows # ℹ Use `print(n = ...)` to see more rows % % print( starwars %>% 
arrange( desc( name ) ) ) # sort rows descending by name # A tibble: 87 × 14 name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <list> <list> <list> 1 Zam Wesell 168 55 blonde fair, green, yellow yellow NA female feminine Zolan Clawdite <chr [1]> <chr [1]> <chr [0]> 2 Yoda 66 17 white green brown 896 male masculine <NA> Yoda's species <chr [5]> <chr [0]> <chr [0]> 3 Yarael Poof 264 NA none white yellow NA male masculine Quermia Quermian <chr [1]> <chr [0]> <chr [0]> 4 Wilhuff Tarkin 180 NA auburn, grey fair blue 64 male masculine Eriadu Human <chr [2]> <chr [0]> <chr [0]> 5 Wicket Systri Warrick 88 20 brown brown brown 8 male masculine Endor Ewok <chr [1]> <chr [0]> <chr [0]> 6 Wedge Antilles 170 77 brown fair hazel 21 male masculine Corellia Human <chr [3]> <chr [1]> <chr [1]> 7 Watto 137 NA black blue, grey yellow NA male masculine Toydaria Toydarian <chr [2]> <chr [0]> <chr [0]> 8 Wat Tambor 193 48 none green, grey unknown NA male masculine Skako Skakoan <chr [1]> <chr [0]> <chr [0]> 9 Tion Medon 206 80 none grey black NA male masculine Utapau Pau'an <chr [1]> <chr [0]> <chr [0]> 10 Taun We 213 NA none grey black NA female feminine Kamino Kaminoan <chr [1]> <chr [0]> <chr [0]> # ℹ 77 more rows # ℹ Use `print(n = ...)` to see more rows % % # select() % print( "column select:" ) [1] "column select:" % % print( starwars %>% select( name, hair_color, eye_color ) ) # select name, hair colour, eye colour columns # A tibble: 87 × 3 name hair_color eye_color <chr> <chr> <chr> 1 Luke Skywalker blond blue 2 C-3PO <NA> yellow 3 R2-D2 <NA> red 4 Darth Vader none yellow 5 Leia Organa brown brown 6 Owen Lars brown, grey blue 7 Beru Whitesun Lars brown blue 8 R5-D4 <NA> red 9 Biggs Darklighter black brown 10 Obi-Wan Kenobi auburn, white blue-gray # ℹ 77 more rows # ℹ Use `print(n = ...)` to see more rows % % print( starwars %>% select( name:eye_color 
) ) # select columns from name to eye colour # A tibble: 87 × 6 name height mass hair_color skin_color eye_color <chr> <int> <dbl> <chr> <chr> <chr> 1 Luke Skywalker 172 77 blond fair blue 2 C-3PO 167 75 <NA> gold yellow 3 R2-D2 96 32 <NA> white, blue red 4 Darth Vader 202 136 none white yellow 5 Leia Organa 150 49 brown light brown 6 Owen Lars 178 120 brown, grey light blue 7 Beru Whitesun Lars 165 75 brown light blue 8 R5-D4 97 32 <NA> white, red red 9 Biggs Darklighter 183 84 black light brown 10 Obi-Wan Kenobi 182 77 auburn, white fair blue-gray # ℹ 77 more rows # ℹ Use `print(n = ...)` to see more rows % % print( starwars %>% select( ends_with( "color" ) )) # selects columns whose name ends w/color # A tibble: 87 × 3 hair_color skin_color eye_color <chr> <chr> <chr> 1 blond fair blue 2 <NA> gold yellow 3 <NA> white, blue red 4 none white yellow 5 brown light brown 6 brown, grey light blue 7 brown light blue 8 <NA> white, red red 9 black light brown 10 auburn, white fair blue-gray # ℹ 77 more rows # ℹ Use `print(n = ...)` to see more rows % % # rename() % print( "column rename:" ) [1] "column rename:" % % print( starwars %>% rename("home.world"="homeworld" ) ) # rename homeworld column to home.world # A tibble: 87 × 14 name height mass hair_color skin_color eye_color birth_year sex gender home.world species films vehicles starships <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <list> <list> <list> 1 Luke Skywalker 172 77 blond fair blue 19 male masculine Tatooine Human <chr [5]> <chr [2]> <chr [2]> 2 C-3PO 167 75 <NA> gold yellow 112 none masculine Tatooine Droid <chr [6]> <chr [0]> <chr [0]> 3 R2-D2 96 32 <NA> white, blue red 33 none masculine Naboo Droid <chr [7]> <chr [0]> <chr [0]> 4 Darth Vader 202 136 none white yellow 41.9 male masculine Tatooine Human <chr [4]> <chr [0]> <chr [1]> 5 Leia Organa 150 49 brown light brown 19 female feminine Alderaan Human <chr [5]> <chr [1]> <chr [0]> 6 Owen Lars 178 120 brown, grey light blue 52 male 
masculine Tatooine Human <chr [3]> <chr [0]> <chr [0]> 7 Beru Whitesun Lars 165 75 brown light blue 47 female feminine Tatooine Human <chr [3]> <chr [0]> <chr [0]> 8 R5-D4 97 32 <NA> white, red red NA none masculine Tatooine Droid <chr [1]> <chr [0]> <chr [0]> 9 Biggs Darklighter 183 84 black light brown 24 male masculine Tatooine Human <chr [1]> <chr [0]> <chr [1]> 10 Obi-Wan Kenobi 182 77 auburn, white fair blue-gray 57 male masculine Stewjon Human <chr [6]> <chr [1]> <chr [5]> # ℹ 77 more rows # ℹ Use `print(n = ...)` to see more rows % % # mutate() % % print( "column mutate (create height_m, bmi columns):" ) [1] "column mutate (create height_m, bmi columns):" % % print( % starwars %>% % mutate( height_m = height / 100, bmi = mass / ( height_m ^ 2 ) ) %>% % select( bmi, height_m, everything() ) % ) # A tibble: 87 × 16 bmi height_m name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships <dbl> <dbl> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <list> <list> <list> 1 26.0 1.72 Luke Skywalker 172 77 blond fair blue 19 male masculine Tatooine Human <chr [5]> <chr [2]> <chr [2]> 2 26.9 1.67 C-3PO 167 75 <NA> gold yellow 112 none masculine Tatooine Droid <chr [6]> <chr [0]> <chr [0]> 3 34.7 0.96 R2-D2 96 32 <NA> white, blue red 33 none masculine Naboo Droid <chr [7]> <chr [0]> <chr [0]> 4 33.3 2.02 Darth Vader 202 136 none white yellow 41.9 male masculine Tatooine Human <chr [4]> <chr [0]> <chr [1]> 5 21.8 1.5 Leia Organa 150 49 brown light brown 19 female feminine Alderaan Human <chr [5]> <chr [1]> <chr [0]> 6 37.9 1.78 Owen Lars 178 120 brown, grey light blue 52 male masculine Tatooine Human <chr [3]> <chr [0]> <chr [0]> 7 27.5 1.65 Beru Whitesun Lars 165 75 brown light blue 47 female feminine Tatooine Human <chr [3]> <chr [0]> <chr [0]> 8 34.0 0.97 R5-D4 97 32 <NA> white, red red NA none masculine Tatooine Droid <chr [1]> <chr [0]> <chr [0]> 9 25.1 1.83 Biggs Darklighter 183 84 black 
light brown 24 male masculine Tatooine Human <chr [1]> <chr [0]> <chr [1]> 10 23.2 1.82 Obi-Wan Kenobi 182 77 auburn, white fair blue-gray 57 male masculine Stewjon Human <chr [6]> <chr [1]> <chr [5]> # ℹ 77 more rows # ℹ Use `print(n = ...)` to see more rows % % # relocate() % % print( "column relocate:" ) [1] "column relocate:" % % print( % starwars %>% % relocate( sex:homeworld, .before=height ) % ) # A tibble: 87 × 14 name sex gender homeworld height mass hair_color skin_color eye_color birth_year species films vehicles starships <chr> <chr> <chr> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <list> <list> <list> 1 Luke Skywalker male masculine Tatooine 172 77 blond fair blue 19 Human <chr [5]> <chr [2]> <chr [2]> 2 C-3PO none masculine Tatooine 167 75 <NA> gold yellow 112 Droid <chr [6]> <chr [0]> <chr [0]> 3 R2-D2 none masculine Naboo 96 32 <NA> white, blue red 33 Droid <chr [7]> <chr [0]> <chr [0]> 4 Darth Vader male masculine Tatooine 202 136 none white yellow 41.9 Human <chr [4]> <chr [0]> <chr [1]> 5 Leia Organa female feminine Alderaan 150 49 brown light brown 19 Human <chr [5]> <chr [1]> <chr [0]> 6 Owen Lars male masculine Tatooine 178 120 brown, grey light blue 52 Human <chr [3]> <chr [0]> <chr [0]> 7 Beru Whitesun Lars female feminine Tatooine 165 75 brown light blue 47 Human <chr [3]> <chr [0]> <chr [0]> 8 R5-D4 none masculine Tatooine 97 32 <NA> white, red red NA Droid <chr [1]> <chr [0]> <chr [0]> 9 Biggs Darklighter male masculine Tatooine 183 84 black light brown 24 Human <chr [1]> <chr [0]> <chr [1]> 10 Obi-Wan Kenobi male masculine Stewjon 182 77 auburn, white fair blue-gray 57 Human <chr [6]> <chr [1]> <chr [5]> # ℹ 77 more rows # ℹ Use `print(n = ...)` to see more rows % % print( % starwars %>% % group_by( species, sex ) %>% % summarize( height=mean(height, na.rm=TRUE ), mass=mean(mass, na.rm=TRUE) ) %>% % select( species, height, mass ) % ) # A tibble: 41 × 3 # Groups: species [38] species height mass <chr> <dbl> <dbl> 1 Aleena 79 15 2 
Besalisk 198 102 3 Cerean 198 82 4 Chagrian 196 NaN 5 Clawdite 168 55 6 Droid 131. 69.8 7 Dug 112 40 8 Ewok 88 20 9 Geonosian 183 80 10 Gungan 209. 74 # ℹ 31 more rows # ℹ Use `print(n = ...)` to see more rows
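The final statement in the block above groups rows before summarizing them. To make that step easier to verify by eye, here is a minimal sketch on a small made-up tibble; the data values and the n and mean_height column names are assumptions for illustration only.

```r
library( dplyr )    # group_by(), summarize(), n()
library( tibble )   # tibble()

# Five made-up rows, one with a missing height
tib <- tibble(
  species = c( "Human", "Human", "Droid", "Droid", "Droid" ),
  height = c( 172, 150, 167, 96, NA )
)

by_species <- tib %>%
  group_by( species ) %>%
  summarize(
    n = n(),                                    # number of rows in each group
    mean_height = mean( height, na.rm=TRUE ),   # mean that ignores missing heights
    .groups = "drop"                            # return an ungrouped tibble
  )

print( by_species )
```

summarize() returns one row per group. Setting .groups explicitly states whether the result should stay grouped, which matters when you group by more than one column as the starwars example does.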

Although there are additional functions in the tidyverse foundation packages, we will not continue exploring them in this introduction. The basics of data wrangling, data loading, and data manipulation have all been covered through base R and the tidyr, dplyr, and stringr packages. Additional information is available either online or in the "R for Data Science" textbook if you need to make use of any of the other packages.

We have not explored ggplot2 for visualization. This will be done later in the fall semester during the visualization module, where we discuss building dashboards with R+shiny, starting with a detailed overview of ggplot2.

R Markdown

R Markdown is a notebook-like interface that allows you to combine plain text, R code, and results from executing the code in a single file. R Markdown files are edited in RStudio, and are meant to provide both analysis in R and explanations of the analysis in a corresponding text block.

An R Markdown file uses Markdown syntax, a simple way of annotating a text file to provide basic formatting. For example, to "mark" text as a header, you precede text with hash marks, one for each heading level you want (e.g., # heading level 1, ## heading level 2, and so on). Bold text is surrounded by double asterisks, and italic text is surrounded by single underscores (e.g., **bold** and _italic_). R Markdown supports headers of different levels; bold and italic text; blockquotes; ordered and unordered lists; images; tables; URLs; and R code blocks.

The R Markdown pipeline: an R Markdown file is knit by knitr into a plain Markdown file, then converted into the requested output format by the pandoc document converter

Once your R Markdown file is complete and running correctly, you knit the markdown file (Ctrl+Shift+K or the knit button at the top of the R Markdown editing pane). This converts the markdown file into an HTML document, a PDF document, or Microsoft Word output. The output is a formatted version of the markdown file with all code blocks executed and their output included. This is meant as a way to share the results of executing the R Markdown file.
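Knitting can also be started from R code instead of the knit button. The sketch below assumes the rmarkdown package and pandoc are installed, and skips the rendering step if they are not. The file name example.Rmd and its contents are hypothetical.

```r
# Three backticks, built with strrep() so this example stays easy to embed
fence <- strrep( "`", 3 )

# Write a tiny R Markdown file to a temporary directory
rmd_file <- file.path( tempdir(), "example.Rmd" )
writeLines( c(
  "---",
  "title: \"Example\"",
  "output: html_document",
  "---",
  "",
  "Some explanatory text.",
  "",
  paste0( fence, "{r}" ),
  "print( 7 + 3 )",
  fence
), rmd_file )

# Knit only if the rmarkdown package and pandoc are both available
if ( requireNamespace( "rmarkdown", quietly=TRUE ) && rmarkdown::pandoc_available() ) {
  out_file <- rmarkdown::render( rmd_file, output_format="html_document", quiet=TRUE )
  print( out_file )   # path to the generated HTML file
}
```

Changing output_format to "pdf_document" or "word_document" requests the other output types mentioned above, though PDF output additionally needs a LaTeX installation.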

When you create a new R Markdown file or load an existing one (R Markdown files normally have a file extension of Rmd), the programming window in RStudio switches to a notebook format, showing code blocks with light grey backgrounds and explanatory text and code output with white backgrounds. Each code block has three icons in the upper-right corner: a gear to set options for the code block, a downward arrow and rectangle to execute all code blocks prior to the given code block, and a rightward arrow to execute the code block. This highlights an important property of R Markdown files: when a code block is executed, any variables or other results it generates are saved for use in subsequent code blocks. This means you often need to run previous code blocks to "set up" the environment so it has the information needed to run the current code block. It also means the code blocks, taken together, roughly correspond to an R program that has been divided into logical, meaningful chunks of execution.

An R Markdown file in RStudio, showing code blocks with gear, downward arrow + rectangle, and rightward arrow icons (settings, execute all previous code blocks, execute this code block) and corresponding text (in blue) and output generated by each code block.

Every R Markdown file starts with a YAML header (YAML originally stood for "Yet Another Markup Language" and is now defined as "YAML Ain't Markup Language") that normally defines the markdown file's title, author, and date of modification, although additional information can be included if needed. YAML headers start and end with three consecutive hyphens.

---
title: "Title"
author: "Author"
date: 27-Jun-2024
output: html_document
---

RStudio will automatically insert a default YAML header with Title, Author, and Date when you create a new R Markdown file. Given this introduction, use the following steps in RStudio to create a default R Markdown file.

  1. Choose File → New File → R Markdown...
  2. In the dialog, select Document and enter an appropriate Title, Author, and Date. Choose the Default Output Format, usually HTML or PDF.
  3. Choose OK, and a default R Markdown project will be created. The project contains a YAML header, a code block to set global options for the project, a markdown block with default explanatory text, a default R code block, a second default markdown and code block, and a final markdown block explaining the echo option in the previous code block.

At this point you're ready to start building your project. Usually, this involves editing the first R code block to set any project global options you need, changing the text in the first markdown block to replace the default text with something appropriate to your project, then erasing everything after the first R code block. The first code block can then be modified to perform the operations you want to execute at the beginning of your project.
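Global options for a project are typically set through knitr's opts_chunk object inside that first code block. The following is a minimal sketch, assuming the knitr package is installed; the three options shown are illustrative choices, not required settings.

```r
# Typical contents of the first ("setup") code block in an R Markdown file
knitr::opts_chunk$set(
  echo = TRUE,       # show the R source of each block in the knitted output
  warning = FALSE,   # hide warnings generated by the code
  message = FALSE    # hide messages such as package start-up text
)

# Options can be read back to confirm what later blocks will inherit
print( knitr::opts_chunk$get( "echo" ) )
```

Any option set here becomes the default for every later code block, and an individual block can still override it in its own header.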

R Code Blocks

R code blocks are surrounded by triple backticks, ``` and ```. Between these delimiters you first name the code block and specify any options specific to the block, then enter the code to execute within the block.

```{r code-block-name, options}
code
```
...
```{r starwars}
library(tidyverse)
tib <- starwars %>% filter( mass > 100 )
print( tib )
```

Executing this block would produce the following output block.

A tibble: 10 × 14
name                   height  mass   hair_color   skin_color        eye_color      birth_year
<chr>                  <int>   <dbl>  <chr>        <chr>             <chr>          <dbl>
Darth Vader            202     136    none         white             yellow         41.9
Owen Lars              178     120    brown, grey  light             blue           52.0
Chewbacca              228     112    brown        unknown           blue           200.0
Jabba Desilijic Tiure  175     1358   NA           green-tan, brown  orange         600.0
Jek Tono Porkins       180     110    brown        fair              blue           NA
IG-88                  200     140    none         metal             red            15.0
Bossk                  190     113    none         green             red            53.0
Dexter Jettster        198     102    none         brown             yellow         NA
Grievous               216     159    none         brown, white      green, yellow  NA
Tarfful                234     136    brown        brown             blue           NA
1-10 of 10 rows | 1-7 of 14 columns

Like Jupyter Notebook, R Markdown code blocks maintain notebook state. This means you could load tidyverse in one code block, then use it in a subsequent block. However, this also means if you change a code block you need to re-run it for that change to take effect and propagate to follow-on code blocks. A common problem programmers have is remembering which code blocks they've changed, or realizing that the code block they're running isn't working because a previous code block was updated but not re-run. R will not tell you this. Consider, for example, an initial code block that loads data into a tibble from a file. This tibble is used in follow-on code blocks. If you change the file used to populate the tibble but forget to execute the code block, follow-on code blocks will continue to work, but they will be using old data because the tibble's contents have not been updated. Take care to make sure you're working in an up-to-date environment when using R Markdown.
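The stale-state problem can be reproduced in a few lines of plain R. In this sketch the comments mark which hypothetical code block each statement belongs to, and a literal vector stands in for data read from a file.

```r
# "Block 1": load the data
scores <- c( 10, 20, 30 )

# "Block 2": derive a result from the data
avg <- mean( scores )

# "Block 1" is edited and re-run with new data ...
scores <- c( 100, 200, 300 )

# ... but "Block 2" is not re-run, so avg still reflects the old data
print( avg )              # stale value computed from the original scores
print( mean( scores ) )   # value you would get after re-running "Block 2"
```

No error or warning is raised at any point, which is why the problem is easy to miss. Knitting the document is one way to check, since it re-executes every code block from top to bottom.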