A. Atomic vectors

We have already introduced the atomic vector data structure. R offers several ways to access vector elements, all of which are useful for data manipulation.

Vector indices

We already know how to access a single element by its index (indexing starts at 1 in R):

ages <- c(23,24,45,21,34,65,43,77,12,14,24)
ages[3]
## [1] 45

We can use a vector of integer values to provide multiple indices at once, i.e., to extract a sub-vector:

ages[c(2,5,7)]
## [1] 24 34 43
indices <- c(2,5,7)
ages[indices]
## [1] 24 34 43
2:6
## [1] 2 3 4 5 6
ages[2:6]
## [1] 24 45 21 34 65
4:1
## [1] 4 3 2 1
ages[4:1]
## [1] 21 45 24 23

The returned sub-vectors can be used like any vector:

ages[4:7]
## [1] 21 34 65 43
mean(ages[4:7])
## [1] 40.75

We can also extract a sub-vector by specifying which elements we do not want to keep. This is done with negative indices:

ages[-4]
##  [1] 23 24 45 34 65 43 77 12 14 24
ages[-(5:7)]
## [1] 23 24 45 21 77 12 14 24
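
The same index notations can also appear on the left-hand side of an assignment to modify elements. The following is a minimal sketch; the vector v is a hypothetical example and not part of the data used in this section:

```r
# Hypothetical example vector (not the ages vector used in the text)
v <- c(23, 24, 45, 21, 34)

v[3] <- 46                 # replace a single element
v[c(1, 2)] <- c(0, 1)      # replace several elements at once
v                          # 0 1 46 21 34

last <- v[length(v)]       # length() gives the last valid index
beyond <- v[10]            # indexing past the end yields NA, not an error

# Positive and negative indices cannot be mixed in a single call:
# v[c(-1, 2)] would raise an error
```

Note that reading beyond the end of a vector silently returns NA, which can hide mistakes in computed indices.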

Named vectors

The next option is to name each vector element and to use these names to retrieve elements (as in a hash table). The names associated with a vector are read and assigned through the function names():

names(ages) # no names by default
## NULL
names(ages) <- c("ada","eric","andre","paul","denis","claire","marc","mike","emily","hank","frank")
names(ages)
##  [1] "ada"    "eric"   "andre"  "paul"   "denis"  "claire" "marc"   "mike"  
##  [9] "emily"  "hank"   "frank"
ages # we note that named vectors are displayed in a specific manner
##    ada   eric  andre   paul  denis claire   marc   mike  emily   hank  frank 
##     23     24     45     21     34     65     43     77     12     14     24
ages["eric"]
## eric 
##   24
ages[2] # still available!
## eric 
##   24
ages[c("paul","claire")]
##   paul claire 
##     21     65
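
A few further behaviors of named vectors are worth knowing. The sketch below uses a hypothetical vector called scores (an assumption for illustration only) to show what happens with an unknown name, how to assign by name, and how to get a value without its name:

```r
# Hypothetical named vector; names can be given directly in c()
scores <- c(ada = 12, eric = 15, paul = 9)

missing_one <- scores["zoe"]       # unknown name: the result is NA
has_zoe <- "zoe" %in% names(scores) # FALSE: check a name before using it

scores["ada"] <- 13                # assignment by name
bare1 <- unname(scores["eric"])    # 15, with the name removed
bare2 <- scores[["eric"]]          # 15, [[ ]] also drops the name
```

Looking up a name that does not exist returns NA rather than an error, so testing names(scores) first is a reasonable precaution.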

Logical vectors as indices

A very convenient way of selecting elements is to apply a logical condition and to keep only the elements that fulfill it. For instance:

ages>20 & ages<45
##    ada   eric  andre   paul  denis claire   marc   mike  emily   hank  frank 
##   TRUE   TRUE  FALSE   TRUE   TRUE  FALSE   TRUE  FALSE  FALSE  FALSE   TRUE
ages[ages>20 & ages<45]
##   ada  eric  paul denis  marc frank 
##    23    24    21    34    43    24
r <- rnorm(length(ages)) # random values: the selection below changes from run to run
ind <- r>0
ages[ind]
##   eric  andre   paul claire   marc   mike   hank 
##     24     45     21     65     43     77     14
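
Logical vectors can also be counted or converted to positions. The sketch below uses a hypothetical vector temps, and shows set.seed() as one way to make a random selection reproducible:

```r
# Hypothetical data for illustration
temps <- c(12, 25, 31, 18, 27)

hot <- temps > 20      # logical vector, one value per element
temps[hot]             # 25 31 27
which(hot)             # 2 3 5: positions of the TRUE values
sum(hot)               # 3: TRUE counts as 1, FALSE as 0
mean(hot)              # 0.6: proportion of elements fulfilling the condition

# Fixing the seed makes random draws repeatable
set.seed(42)
draw1 <- rnorm(5)
set.seed(42)
draw2 <- rnorm(5)
identical(draw1, draw2) # TRUE
```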

NA values

In statistics, it is common to have missing values or observations. R provides a special value, NA, available for every data type, to represent such missing data points. NA can be assigned to a scalar variable, but it is more frequently found in structures such as vectors, which typically contain many observations. Some functions behave in a special way in the presence of NA, and it is also possible to identify NAs in order to remove them:

ages
##    ada   eric  andre   paul  denis claire   marc   mike  emily   hank  frank 
##     23     24     45     21     34     65     43     77     12     14     24
mean(ages)
## [1] 34.72727
ages <- c(ages,c(NA,34,NA))
ages
##    ada   eric  andre   paul  denis claire   marc   mike  emily   hank  frank 
##     23     24     45     21     34     65     43     77     12     14     24 
##                      
##     NA     34     NA
hist(ages) # hist ignores the NAs

mean(ages) # returns NA because it cannot compute with missing data
## [1] NA
mean(ages, na.rm=TRUE) # unless it is said to ignore them explicitly
## [1] 34.66667
is.na(ages)
##    ada   eric  andre   paul  denis claire   marc   mike  emily   hank  frank 
##  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE 
##                      
##   TRUE  FALSE   TRUE
!is.na(ages)
##    ada   eric  andre   paul  denis claire   marc   mike  emily   hank  frank 
##   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE 
##                      
##  FALSE   TRUE  FALSE
ages[!is.na(ages)]
##    ada   eric  andre   paul  denis claire   marc   mike  emily   hank  frank 
##     23     24     45     21     34     65     43     77     12     14     24 
##        
##     34
ages <- ages[!is.na(ages)] # get rid of the NAs
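
A few related helpers make it easy to detect and count missing values. This is a minimal sketch on a hypothetical vector obs; note in particular that comparing with NA using == does not work as a test:

```r
# Hypothetical observations with two missing values
obs <- c(4.2, NA, 3.8, NA, 5.0)

anyNA(obs)               # TRUE: at least one value is missing
sum(is.na(obs))          # 2: number of missing values
which(is.na(obs))        # 2 4: their positions

NA == NA                 # NA: comparisons with NA are themselves missing,
                         # hence the need for is.na()

clean <- obs[!is.na(obs)]      # same technique as in the text
m <- mean(obs, na.rm = TRUE)   # mean of the three known values
```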

Sets

R does not offer a dedicated data type or structure to represent sets. Instead, sets are represented by vectors, together with a few functions and the membership operator %in%. Examples:

A <- c(3,5,23,12,15,6,9,0)
B <- c(5,2,7,8,-1,0)
union(A,B) # A U B
##  [1]  3  5 23 12 15  6  9  0  2  7  8 -1
intersect(A,B) # A ∩ B
## [1] 5 0
setdiff(A,B) # A \ B
## [1]  3 23 12 15  6  9
27 %in% A
## [1] FALSE
12 %in% A
## [1] TRUE
c(1,2,3,4) %in% B
## [1] FALSE  TRUE FALSE FALSE

Since regular atomic vectors are used to represent sets, nothing prevents you from repeating a value several times (in which case the vector is no longer a set…). To compensate for this, union(), intersect(), and setdiff() always return each value only once. If needed, you can eliminate repeated values yourself using unique().

C <- c(1,1,2,2,3,3)
C
## [1] 1 1 2 2 3 3
intersect(A,C)
## [1] 3
unique(C)
## [1] 1 2 3
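
Because sets are plain vectors, comparing them requires some care: element order and repetitions matter for vectors but not for sets. The sketch below uses two hypothetical vectors P and Q to illustrate set equality, subset tests, and the detection of repeated values:

```r
# Hypothetical vectors holding the same values in a different order,
# with one repetition in Q
P <- c(1, 2, 3)
Q <- c(3, 2, 1, 1)

identical(P, Q)        # FALSE: as vectors they differ
setequal(P, Q)         # TRUE: as sets they are equal

all(c(1, 2) %in% Q)    # TRUE: {1, 2} is a subset of Q
duplicated(Q)          # FALSE FALSE FALSE TRUE: flags repeated values
length(union(Q, P))    # 3: the union contains each value once
```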


B. Lists

The list data structure enables us to combine any number of values of any type. It is frequently used by functions that must return multiple values: they simply return a list.

ml <- list(23,"abc",-45.67)
ml # three slots, one for each value
## [[1]]
## [1] 23
## 
## [[2]]
## [1] "abc"
## 
## [[3]]
## [1] -45.67
names(ml) <- c("premier","second","troisieme") # slots can get names
ml
## $premier
## [1] 23
## 
## $second
## [1] "abc"
## 
## $troisieme
## [1] -45.67
ml2 <- list(un=1, deux="deux", trois=5:8) # names can be given at construction and list elements can be vectors
ml2
## $un
## [1] 1
## 
## $deux
## [1] "deux"
## 
## $trois
## [1] 5 6 7 8
h <- hist(ages, plot=FALSE)
h
## $breaks
## [1] 10 20 30 40 50 60 70 80
## 
## $counts
## [1] 2 4 2 2 0 1 1
## 
## $density
## [1] 0.016666667 0.033333333 0.016666667 0.016666667 0.000000000 0.008333333
## [7] 0.008333333
## 
## $mids
## [1] 15 25 35 45 55 65 75
## 
## $xname
## [1] "ages"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

To access data stored in lists, we must distinguish between the slots and their content. Single square brackets enable us to extract sub-lists (we stay at the slot level):

ml2[3]
## $trois
## [1] 5 6 7 8
ml2["trois"]
## $trois
## [1] 5 6 7 8
ml2[1:2] # the options we have seen to access vector elements apply here as well
## $un
## [1] 1
## 
## $deux
## [1] "deux"
ml2[c("un","trois")]
## $un
## [1] 1
## 
## $trois
## [1] 5 6 7 8

To access the content of a slot, we use double square brackets or the dollar sign:

ml2["deux"]
## $deux
## [1] "deux"
ml2[["deux"]]
## [1] "deux"
ml2[[3]]
## [1] 5 6 7 8
ml2[["trois"]][2] # [[]] accesses the slot content, which turns out to be a vector, and [] accesses a specific element of this vector
## [1] 6
ml2$un
## [1] 1
ml2$trois
## [1] 5 6 7 8
ml2$trois[2]
## [1] 6
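
These notations are what we use to read the results of functions that return several values as a list. As a sketch, the function describe() below is a hypothetical helper written for illustration, not an existing R function:

```r
# Hypothetical helper: returns several summary values in one list
describe <- function(x) {
  list(n = length(x), mean = mean(x), range = range(x))
}

res <- describe(c(2, 4, 9))
res$n            # 3
res$mean         # 5
res[["range"]]   # 2 9
res$range[2]     # 9: [] applied to the vector stored in the slot
```

The histogram object h shown earlier can be read in the same way, e.g., h$counts or h$breaks.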

We can have lists of lists, e.g., to represent trees or dendrograms, or other complex data.
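
As an illustration, a small tree can be sketched with nested lists. The structure below (slots label and children) and the function count_nodes() are assumptions chosen for this example; other representations are possible:

```r
# Hypothetical tree: each node is a list with a label and a list of children
tree <- list(
  label = "root",
  children = list(
    list(label = "left", children = list()),
    list(label = "right", children = list(
      list(label = "right.leaf", children = list())
    ))
  )
)

tree$children[[2]]$label                 # "right"
tree$children[[2]]$children[[1]]$label   # "right.leaf"

# Recursive count: one for the node itself plus the nodes below it
count_nodes <- function(node) {
  1 + sum(vapply(node$children, count_nodes, numeric(1)))
}
count_nodes(tree)                        # 4
```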


C. Data frames

Data frames are the most commonly used data structure in R. Their purpose is to represent data tables.

Data frames are not really new data structures: they are lists of atomic vectors. Each vector represents a column of the data table, meaning that columns can have different types, but all the values in the same column must share the same data type.

Accessing elements with list and vector notations

Since data frames are nothing but lists of atomic vectors, we already know the syntax of element access operations:

df <- data.frame(name=c("Pierre","Amandine","Gudrun","Dagobert"),salary=c(3000,4000,8000,-500),code=c("A","A","D","E"),tech=c(TRUE,FALSE,FALSE,FALSE))
df
##       name salary code  tech
## 1   Pierre   3000    A  TRUE
## 2 Amandine   4000    A FALSE
## 3   Gudrun   8000    D FALSE
## 4 Dagobert   -500    E FALSE
typeof(df)
## [1] "list"
df[1] # data frame (sub-data frame) made of column 1 only
##       name
## 1   Pierre
## 2 Amandine
## 3   Gudrun
## 4 Dagobert
df[c(1,3)] # or of columns 1 and 3
##       name code
## 1   Pierre    A
## 2 Amandine    A
## 3   Gudrun    D
## 4 Dagobert    E
df["tech"]
##    tech
## 1  TRUE
## 2 FALSE
## 3 FALSE
## 4 FALSE
df[[1]] # the content of column 1, i.e., a vector of type character
## [1] "Pierre"   "Amandine" "Gudrun"   "Dagobert"
df[["tech"]]
## [1]  TRUE FALSE FALSE FALSE
df[[1]][2] # access to elements
## [1] "Amandine"
df[[3]][[1]]
## [1] "A"
typeof(df[[1]])
## [1] "character"
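
Because a data frame is a list, the usual list tools apply to it. The following sketch uses a hypothetical data frame staff (stringsAsFactors = FALSE is set explicitly so that text columns stay of type character in older R versions too):

```r
# Hypothetical data frame for illustration
staff <- data.frame(name = c("Ann", "Bob"),
                    salary = c(3000, 4000),
                    tech = c(TRUE, FALSE),
                    stringsAsFactors = FALSE)

is.list(staff)          # TRUE
length(staff)           # 3: the number of slots, i.e., of columns
names(staff)            # "name" "salary" "tech"
sapply(staff, typeof)   # type of each column: character, double, logical
staff$salary            # 3000 4000: the dollar sign works as for lists
```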

Additional syntax to access data

The list notation df[[1]][2] to access the element at row 2, column 1 is cumbersome. A simpler alternative is

df[2,1] # first index is the row and second index is the column (inverted compared to the list notation)
## [1] "Amandine"

There is also an alternative to access a whole column as a vector:

df[[2]]
## [1] 3000 4000 8000 -500
df[,2]
## [1] 3000 4000 8000 -500

We can also access a whole row, but it remains a (1-row) data frame since, in general, data frames can have columns of different types (which cannot be stored in a single atomic vector):

df[3,]
##     name salary code  tech
## 3 Gudrun   8000    D FALSE
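
The row index also accepts logical vectors, which gives a compact way to filter rows on a condition. The sketch below uses a hypothetical data frame staff; it also shows drop = FALSE, which keeps a single column as a data frame, and what happens when a row is forced into an atomic vector:

```r
# Hypothetical data frame for illustration
staff <- data.frame(name = c("Ann", "Bob", "Cid"),
                    salary = c(3000, 4000, 8000),
                    tech = c(TRUE, FALSE, FALSE),
                    stringsAsFactors = FALSE)

well_paid <- staff[staff$salary > 3500, ]   # rows 2 and 3, all columns
staff[staff$salary > 3500, "name"]          # "Bob" "Cid"
staff[!staff$tech, c("name", "salary")]     # non-tech rows, two columns

one_col <- staff[, 2, drop = FALSE]         # still a data frame
row_vec <- unlist(staff[3, ])               # everything becomes character
```

The last line illustrates why a row is kept as a data frame: turning it into an atomic vector converts all values to a common type.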

Row and column names of data frames

Data frames have column names that are accessed with names(), since they are just the slot names of the underlying list. Row names can be accessed and set through rownames():

rownames(df) # default values are given as 1,2,3,... (in characters because a row name must be of type character)
## [1] "1" "2" "3" "4"
rownames(df) <- c("lyon","geneve","paris","linz")
df
##            name salary code  tech
## lyon     Pierre   3000    A  TRUE
## geneve Amandine   4000    A FALSE
## paris    Gudrun   8000    D FALSE
## linz   Dagobert   -500    E FALSE
df["linz","code"]
## [1] "E"


D. Matrices

Two-dimensional tables of values can also be stored in a matrix. A matrix differs from a data frame in that it is atomic (like vectors): all the data must be of the same type. Matrices are stored and accessed much more efficiently than data frames, and hence should be preferred for computations whenever all the data have a single type. The list notations used to select data frame columns ([[ ]] and $) do not select columns of a matrix; use the [.,.] notation instead:

mm <- matrix(rnorm(25), nrow=5)
mm
##             [,1]       [,2]        [,3]       [,4]       [,5]
## [1,] -1.33874571  1.0507434  0.13215481  0.2020520 -0.3207511
## [2,]  0.10118142  1.3046875  0.05364647  0.1800103 -0.8700890
## [3,] -1.56624292  0.8947381  0.21988705 -0.2027143  0.1694511
## [4,]  1.08318449 -3.0320564 -0.26242785  0.9221311  0.2783416
## [5,] -0.03169732 -0.0165579 -2.12568722 -1.3443351 -1.3209520
colnames(mm) <- LETTERS[1:5] # colnames() also works with data frames, but names() does not work with matrices
rownames(mm) <- c("suisse","inde","chine","vietnam","autriche")
mm
##                    A          B           C          D          E
## suisse   -1.33874571  1.0507434  0.13215481  0.2020520 -0.3207511
## inde      0.10118142  1.3046875  0.05364647  0.1800103 -0.8700890
## chine    -1.56624292  0.8947381  0.21988705 -0.2027143  0.1694511
## vietnam   1.08318449 -3.0320564 -0.26242785  0.9221311  0.2783416
## autriche -0.03169732 -0.0165579 -2.12568722 -1.3443351 -1.3209520
mm[,2]
##     suisse       inde      chine    vietnam   autriche 
##  1.0507434  1.3046875  0.8947381 -3.0320564 -0.0165579
mm[4,] # this time a row is returned as a vector because it is atomic (not like data frames)
##          A          B          C          D          E 
##  1.0831845 -3.0320564 -0.2624278  0.9221311  0.2783416
mm["chine",c("B","E")]
##         B         E 
## 0.8947381 0.1694511
mm[c("chine","suisse"),c("B","E")]
##                B          E
## chine  0.8947381  0.1694511
## suisse 1.0507434 -0.3207511
mm[mm[,3]>0,]
##                 A         B          C          D          E
## suisse -1.3387457 1.0507434 0.13215481  0.2020520 -0.3207511
## inde    0.1011814 1.3046875 0.05364647  0.1800103 -0.8700890
## chine  -1.5662429 0.8947381 0.21988705 -0.2027143  0.1694511
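
The atomic nature of matrices has a few practical consequences, sketched below on a small hypothetical matrix m (the names r1, r2, a, b, c are chosen for illustration):

```r
# Hypothetical 2 x 3 integer matrix, filled column by column
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))

m["r1", ]                 # a row comes back as a named vector: 1 3 5
m["r1", , drop = FALSE]   # drop = FALSE keeps it as a 1 x 3 matrix
m[5]                      # 5: a single index counts elements column by column
colMeans(m)               # 1.5 3.5 5.5

# A data frame mixing numbers and text becomes a character matrix
mixed <- as.matrix(data.frame(x = 1:2, y = c("u", "v")))
typeof(mixed)             # "character"
```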