---
title: "Introduction to R"
output:
pdf_document: default
html_document: default
---
---
### Getting help
R provides help with function and commands. On-line help gives useful information as well. Getting used to R help is key to successful statistical modelling. The online help can be accessed in HTML format by typing:
```{r}
help.start()
```
A keyword search is possible using the Search Engine and Keywords link. You can also use the help() or ? functions. For example, if we want to know how to use the matrix() function, the following two commands are equivalents:
```{r}
help(matrix)
? matrix
```
The `str(object.name)` command is used to display the internal structure of an R object. The `summary(object.name)` command gives a summary of an object, usually a statistical summary but it is generic meaning it has different operations for different classes of object.
```{r}
dir() # show files in the current directory
ls.str() # str() for each variable in the search path
getwd() # is asking for the current working directory
```
When you quit, R will ask you if you want to save the workspace (that is, all of the variables you have defined in this session). Say “no” in order to avoid clutter.
Should an R command seem to be stuck or take longer than you’re willing to wait, type Control-C
---
### Calling Linux shell scripting commands
`system("...")` is used to call any linux scripting commands within the R environment
```{r}
system("pwd")
# is equivalent to
getwd()
```
---
### Inputs and outputs
Once you have opened an R session and eventually loaded the library you need, you can start exploring your data
*Loading data*
`load(file.name)` function, loads R type datasets written with the save function
*Saving data*
`save(object.name.1, object.name.2, ... )` function save the specified object in XDR platform independent binary format
*Reading tables*
`read.table("filename")` Reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields within the file. Some useful arguments when reading in data are:
* The default separator `sep = ""` is any whitespace. You might need `sep = ","` or `";"` and so on.
* Use `header = TRUE` to read the first line as a header of column names.
* The `as.is = TRUE` specification is used to prevent character vectors from being converted to factors.
* The `comment.char = ""` specification is used to prevent "#" from being interpreted as a comment and use `skip = n` to skip n lines before reading data.
For more details:
```{r}
?read.table
```
Example:
```{r, eval = FALSE}
landuse04 <- read.csv("~/ost4sem/exercise/basic_adv_r/inputs/2004_landuse.csv", header=TRUE, sep=",", dec=".", na.string=":")
```
* `read.csv("filename")` : Used to read comma separated value (CSV) files.
* `read.delim("filename")` : Used for reading tab-delimited files
* `read.fwf()` : reads a table of fixed width formatted data into a ’data.frame’; widths is an integer vector, giving the widths of the fixed-width fields
```{r, eval = FALSE}
read.csv(file.name, header = TRUE, sep = ",", quote="\"", dec=".", fill =TRUE, comment.char="", ...)
```
*Show variables and data in your workspace*
The list function ls() outputs a list of existing R objects
```{r}
ls()
```
The structure function `str(object.name)` informs you of the structure of a specific object.
The summary function `summary(object.name)` informs you of basic statistics of a specific object.
*Save and remove data or R objects*
`save(file, ...)` saves the specified objects (...) in the XDR platform-independent binary format.
Example:
```{r, eval = FALSE}
save(landuse04, file="~/ost4sem/exercise/basic_adv_r/outputs/landuse2004")
```
`save.image(file)` saves all objects.
Example:
```{r, eval = FALSE}
save(file="~/ost4sem/exercise/basic_adv_r/outputs/landuse2004_and_more")
```
`rm(file, ...)` removes the object you created or data you uploaded.
```{r, eval = FALSE}
rm(landuse04)
```
No objects are present in memory now, use the `ls()` function to check it.
```{r}
ls()
```
But since you saved the landuse2004 data you can reload it using the `load()` function and check its structrure using the `str()` function.
```{r, eval = FALSE}
load("~/ost4sem/exercise/basic_adv_r/outputs/landuse2004")
str(landuse04) # as a result you will see the following information:
```
---
### Variables and calculations
R has an interactive calculations function. The command is executed and results are displayed.
R uses: `+`, `-`, `/`, and `^` for addition, subtraction, multiplication, division and exponentiation, respectively.
```{r}
2+2
```
The [1] at the beginning of the line is just R printing an index of element numbers. If you print a result that appears on multiple lines, R will put an index at the beginning of each line.
```{r}
2*5
```
```{r}
10/2
```
```{r}
2*5
```
```{r}
2^3
```
*Variable settings*
You can simply create a variable by typing: `variable name = function, constant or calculation`.
```{r}
x = 3*2
```
The results of 3*2 is not displayed. In fact, the `x` variable value is stored in the memory without printing it. To display the `x` value you can use:
```{r}
print(x)
# or
x
```
Most users apply a similar syntax using the `<-` character string instead of the `=` character.
```{r}
x <- 3
x
```
Also remember that R is case sensitive, print(X) or X is different from x. For instance
```{r, eval = FALSE}
a <- 3
a
A
```
Variable names in R must begin with a letter, followed by alphanumeric characters.
```{r, eval = FALSE}
3e <- 2
```
In long names you can use `.` or `_` as in:
`very.long.variable.name.X` or `very_long_variable_name_Y`.
But you can’t use blank spaces in variable names.
Some schools of R users frown upon use of `.` in variable names, as the `.` has a special meaning in some functions. Using an underscore is a safer bet.
Avoid single letter names such us: c, l, q, t, C, D, F, I, and T, which are either built-in R functions or hard to tell apart.
```{r}
very_long_variable_name_X3 <- 3
very_long_variable_name_X3
```
*Interactive calculations*
Once defined, you can use variables in interactive calculations :
```{r}
b = 2*2
a = 2*3
a*b
```
And you can use variables in formulas:
```{r}
c = 60 /(a+b)
c
```
Typing `a;b` you can display a and b variables at the same time:
```{r}
a;b
```
If you forget to close a parentheses (in an R script or the console), R will display a `+` sign. In R Markdown, it will give an error.
For example, try running `c = 60 /(a+b` in the console.
In this case you can either close the parenthesis in the next line or type ctrl + c to go back to a new starting prompt.
*Order of operations*
When using more complex formulas be aware of the importance of the order of operators. Parenthesis have priority over exponentiation, or powers, then comes multiplication and division, finally addition and subtraction. The following command:
```{r}
C = ((a + 2 * sqrt(b))/(a + 8 * sqrt(b)))/2
C
# Is different from:
C = a + 2 * sqrt(b) / a + 8 * sqrt(b) / 2
C
```
As well as
```{r}
100-40/2^4
# versus
(100-40)/2^4
```
And:
```{r}
-2^4
# is different from:
(-2)^4
```
---
### Logical values
R can perform conditional tests and generate True or False values as results. The logical operators are `<`, `<=`, `>`, `>=`, `==` for exact equality and `!=` for inequality.
```{r}
x = 60
x > 100
x == 70
x > 3
x != 100
```
Logical values can be stored as variables:
```{r}
x <- 60
logical_value <- x > 3
logical_value
```
In addition if `c1` and `c2` are logical expressions, then `c1 & c2` is their intersection (“and”), `c1 | c2` is their union (“or”), and `!c1` is the negation of c1.
---
### R objects
The entities R operates on are technically known as objects. Examples are:
* __vectors of numeric__ (real) or __complex values__,
* __vectors of logical__ values, and
* __vectors of character strings__.
These are known as “atomic” structures since their components are all of the same type, or mode, namely numeric, complex, logical, character and raw.
R also operates on objects called __lists__, which are of mode list. These are ordered sequences of objects which individually can be of any mode. Lists are known as “recursive” rather than atomic structures since their components can themselves be lists in their own right.
The other recursive structures are those of mode __function__ and __expression__. Functions are the objects that form part of the R system along with similar user written functions, which we will discuss in some detail later. Expressions as objects form an advanced part of R which will not be discussed in this guide, except indirectly when we discuss formulae used with modeling in R.
By the __mode__ of an object we mean the basic type of its fundamental constituents. This is a special case of a “property” of an object. Another property of every object is its __length__. The functions `mode(object)` and `length(object)` can be used to find out the mode and length of any defined structure 10.
Further properties of an object are usually provided by `attributes(object)`, (see 'Getting and setting attributes'). Because of this, mode and length are also called “intrinsic attributes” of an object. For example, if z is a complex vector of length 100, then in an expression mode(z) is the character string "complex" and length(z) is 100.
---
### Vectors
Vectors are combinations of scalars in a string structure. Vectors must have all values of the same mode. Thus any given vector must be unambiguously either logical, numeric, complex, character or raw. (The only apparent exception to this rule is the special “value” listed as `NA` for quantities not available, but in fact there are several types of NA). Note that a vector can be empty and still have a mode. For example the empty character string vector is listed as character(0) and the empty numeric vector as numeric(0).
`c(...)` is the generic function to combine arguments with the default forming a vector. Using the `RECURSIVE=TRUE` option allows one to descend through lists combining all elements into one vector.
To see details for the generic function c(...) and combine arguments forming a vector:
```{r}
? c
```
As an example we can create a simple vector of seven values typing:
```{r}
c(2,3,4,5,10,5,8)
```
We can generate a sequence using the syntax `:`
```{r}
1:10
```
We can generate the same sequence of 1:10 command using the `seq()` function. The syntax will be :
```{r}
seq(1,10)
```
The `seq()` function `seq(from = number, to = number, by = number)` creates a vector starting from a value to another by a defined increment:
```{r}
seq(1, 10, 0.25)
seq(from = 1, to = 10, by = 0.25)
```
The replicate function `rep(x, n)` enables you to replicate a vector several times in a more complex vector.
Calculations can be included to form vectors as well and functions can be combined in the same command:
```{r}
oneTwoThree <- 1:3
rep(oneTwoThree, 10)
c(10*0:10)
rep(c (5*40:1, 5*1:40, 5, 6, 7, 8, 3, 2001:2014), 2)
rep(seq(1, 3, 0.5), 3)
```
---
### Missing Values
In some cases the components of a vector or of an R object in general may not be completely known. When an element or value is “not available” or is a “missing value” in the statistical sense, a place within a vector may be reserved for it by assigning it the special value `NA`.
Any operation on an `NA` becomes an `NA`.
The function `is.na(x)` gives a logical vector of the same size as x with value `TRUE` if and only if the corresponding element in x is `NA`.
```{r}
z <- c(1:3, NA)
ind <- is.na(z)
ind
```
There is a second kind of “missing” values which are produced by numerical computation, the so-called "Not a Number", `NaN`, values. Examples are `0/0` or `Inf - Inf` which both give `NaN` since the result cannot be defined sensibly.
```{r}
Inf-Inf
0/0
```
In summary, `is.na(xx)` is TRUE both for `NA` and `NaN` values. To differentiate these, `is.nan(xx)` is only TRUE for NaNs. Missing values are sometimes printed as when character vectors are printed without quotes.
```{r}
z <- c(1:3,NA)
is_not_available <- is.na(z)
is_not_a_number <- is.nan(z)
is_not_a_number
is_not_available
```
---
### Matrices
Matrices, or more generally arrays, are multi-dimensional generalizations of vectors. In fact, they are vectors that can be indexed by two or more indices and will be printed in a special way. See Arrays and matrices.
* Factors provide compact ways to handle categorical data. See Factors.
* Lists are a general form of vector in which the various elements need not be of the same type, and are often themselves vectors or lists. Lists provide a convenient way to return the results of a statistical computation. See Lists.
The `matrix()` function creates a matrix from the given set of values. We use the `matrix(x, nrow=, ncol=)` function to set the matrix cell values, the number of rows and the number of columns. We can use the `colnames()` and `rownames()` functions to set the column and row names of the matrix-like object.
```{r}
matrix(data = NA, nrow = 2, ncol = 3)
example_matrix <- matrix(0,2,3)
example_matrix
example_matrix[1,]
example_matrix[,2]
example_matrix[1,] <- 1:3
example_matrix[2,] <- c(5,10,4)
example_matrix
matrix_head <- c("col a","col b","column c")
matrix_side <- c("first raw","second raw")
str(matrix_side)
```
When using `" "` we create and refer to a character type `chr` input
```{r}
numeric_vector <- c(rep(c (5*10:1, 5, 6), 2))
character_vector <- as.character(numeric_vector)
str(character_vector)
colnames(example_matrix) <- matrix_head
rownames(example_matrix) <- matrix_side
example_matrix
str(example_matrix)
```
---
### Arrays
An array can be considered a multiple subscripted collection of data entries, for example numeric. R allows simple facilities for creating and handling arrays, and in particular the special case of matrices.
As well as giving a vector structure a `dim` attribute, arrays can be constructed from vectors by the array function, which has the form `array(data_vector, dim_vector)`
```{r}
Z <- array(1:24, c(3,4,2))
Z
```
---
### Data Frames
Data frames are matrix-like structures, in which the columns can be of different types. Think of data frames as data matrices with one row per observational unit but with (possibly) both numerical and categorical variables. Many experiments are best described by data frames: the treatments are categorical but the response is numeric.
As a result R dataframes are tightly coupled collections of variables which share many of the properties of matrices and of lists. Data frames are used as the fundamental data structure by most of R's modeling software.
A data frame is a list with class "data.frame". There are restrictions on lists that may be made into data frames, namely:
* The components must be vectors (numeric, character, or logical), factors, numeric matrices, lists, or other data frames.
* Matrices, lists, and data frames provide as many variables to the new data frame as they have columns, elements, or variables, respectively.
* Numeric vectors, logicals and factors are included, and character vectors are coerced to be factors, whose levels are the unique values appearing in the vector.
* Vector structures appearing as variables of the data frame must all have the same length, and matrix structures must all have the same row size.
See:
```{r}
? data.frame
```
To construct a dataframe:
```{r}
my_data_frame = data.frame(v=1:4,ch=c("a","b","c","d"),n=10)
my_data_frame
```
Or:
```{r}
my_data_frame = data.frame(vector = 1:4,
character = c("a","b","c","d"),
const.vector = 10,
row.names = c("data1","data2","data3","data4"))
my_data_frame
```
---
### Data selection and manipulation
You can extract data from dataframes using the `[ [ ] ]` and `$` sign:
```{r}
my_data_frame[["character"]]
```
```{r}
my_data_frame[[2]]
```
Call the 3rd value of the character vector :
```{r}
my_data_frame[[2]][3]
```
Or using the `$` syntax:
```{r}
my_data_frame$vector
```
```{r}
my_data_frame$character[2:3]
```
You can add single arguments to a data frame, query information, select and manipulate arguments or single values from a dataframe
```{r}
my_data_frame$new
```
```{r}
my_data_frame$new = c(10,11,20,40)
my_data_frame
```
```{r}
str(my_data_frame)
```
`length(object.name)` will return the number of elements in an object such as a matrix vector or dataframes:
```{r}
length(my_data_frame$new)
```
`which(object.name)` and `which.max(object.name)` return the index of a specific or of the greatest element of an object.
```{r}
which.max(my_data_frame$new)
```
```{r}
which(my_data_frame$new==20)
```
`max(object.name)` will return the value of the greatest element.
```{r}
max(my_data_frame$new)
```
`sort(object.name)`` will sort from small to big:
```{r}
sort(my_data_frame$new)
```
`rev(object.name)` from big to small:
```{r}
rev(sort(my_data_frame$new))
```
`subset(object.name, ...)` -- the subset function returns a selection of an R-object with respect to criteria (typically comparisons: x$V1 < 10). If the R-object is a data frame, the option `select` gives the variables to be kept or dropped using a minus sign.
```{r}
subset(my_data_frame, my_data_frame$new == 20)
```
We can use the `sample()` function to sample from a vector or set of values
```{r}
sample(my_data_frame$new, 3)
```
```{r}
sample(my_data_frame$new, 3)
```
```{r}
sample(my_data_frame$new, 3)
```
---
### More examples
The following R commands give an example of the simple procedure of:
1. Importing data,
2. Cleaning a table by extracting relevant information, and
3. Checking the presence of missing data.
```{r, eval = FALSE}
landuse04 <- read.csv("~/ost4sem/studycase/Lab_scripts/inputs/2004_landuse.csv",
header = TRUE, sep = ",", dec = ".", na.string = ":")
forests04 <- subset(landuse04, landuse04$forest.Wooded.area >= 0 )
forests04$landuse <- NULL
forests04_check <- na.fail(forests04)
forests04$total.Total.area[1] <- NA
forests04_check <- na.fail(forests04)
```
You will find that the last line fails given missing values in the dataset.
We can resolve the situation at the beginning with no NA
```{r, eval = FALSE}
forests04 <- subset(landuse04, landuse04$forest.Wooded.area >=0 )
forests04$landuse <- NULL
forests04_check <- na.fail(forests04)
str(forests04)
```
Do you see something strange? look at the `forests04$geographic.Unit` level of factors and the dataframe number of variables!
Let's fix it now!
```{r, eval = FALSE}
library(gdata)
forests04 <- drop.levels(forests04)
str(forests04)
```
---
### Functions
Functions are themselves objects in R which can be stored in the project's workspace. This provides a simple and convenient way to extend R.
Usage: in writing your own function you provide one or more arguments or names for the function, an expression (or body of the function) and a value is produced equal to the output function result
`function(arglist) expr function definition`
`return(value)`
Example:
```{r}
myfunction <- function(x) x^5
myfunction(3)
```
```{r}
body(myfunction) <- quote(5^x)
```
Or equivalently
```{r}
body(myfunction) <- expression(5^x)
myfunction(3)
```
```{r}
body(myfunction)
5^x
myfunction
```