We have already introduced the atomic vector data structure. In R, accessing vector elements can be done in several ways that are all useful for data manipulation.
We already know how to access a single element by its index (indices start at 1 in R):
ages <- c(23,24,45,21,34,65,43,77,12,14,24)
ages[3]
## [1] 45
We can use a vector of integer values to provide multiple indices at once, i.e., to extract a sub-vector:
ages[c(2,5,7)]
## [1] 24 34 43
indices <- c(2,5,7)
ages[indices]
## [1] 24 34 43
2:6
## [1] 2 3 4 5 6
ages[2:6]
## [1] 24 45 21 34 65
4:1
## [1] 4 3 2 1
ages[4:1]
## [1] 21 45 24 23
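A few more properties of integer indexing can be sketched on a hypothetical vector x (not part of the data above): an index may be repeated, any integer sequence can serve as index, and an index beyond the end of the vector yields NA rather than an error.

```r
x <- c(10, 20, 30, 40, 50)   # hypothetical example vector

x[c(1, 1, 3)]                # 10 10 30: an index may appear several times
x[seq(1, 5, by = 2)]         # 10 30 50: every second element
x[7]                         # NA: there is no 7th element
x[length(x):1]               # 50 40 30 20 10: same result as rev(x)
```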
The returned sub-vectors can be used like any vector:
ages[4:7]
## [1] 21 34 65 43
mean(ages[4:7])
## [1] 40.75
We can also extract a sub-vector by specifying which elements we do not want to keep. This is done with negative indices:
ages[-4]
## [1] 23 24 45 34 65 43 77 12 14 24
ages[-(5:7)]
## [1] 23 24 45 21 77 12 14 24
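The parentheses in -(5:7) matter. The sketch below, on a hypothetical vector x, illustrates why: without them, the minus sign applies to the first number only, which produces a mix of negative and positive indices that R rejects.

```r
x <- c(10, 20, 30, 40, 50)   # hypothetical example vector

x[-c(1, length(x))]          # 20 30 40: drop the first and the last element
x[-(2:4)]                    # 10 50: drop positions 2 to 4

# -2:4 is the sequence -2, -1, 0, 1, 2, 3, 4, so x[-2:4] is an error;
# try() is used here only to capture the error instead of stopping
res <- try(x[-2:4], silent = TRUE)
inherits(res, "try-error")   # TRUE
```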
The next option is to name each vector element and to use these names to retrieve elements (like in a hash table). The names associated with a vector are read and assigned through the function names():
names(ages) # no names by default
## NULL
names(ages) <- c("ada","eric","andre","paul","denis","claire","marc","mike","emily","hank","frank")
names(ages)
## [1] "ada" "eric" "andre" "paul" "denis" "claire" "marc" "mike"
## [9] "emily" "hank" "frank"
ages # we note that named vectors are displayed in a specific manner
## ada eric andre paul denis claire marc mike emily hank frank
## 23 24 45 21 34 65 43 77 12 14 24
ages["eric"]
## eric
## 24
ages[2] # still available!
## eric
## 24
ages[c("paul","claire")]
## paul claire
## 21 65
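The hash table analogy can be made concrete with a small lookup table. The vector countries and its content are hypothetical and only serve as an illustration: a vector of keys translates into the corresponding values in one indexing operation.

```r
# hypothetical lookup table: country code -> country name
countries <- c(ch = "Switzerland", fr = "France", at = "Austria")

codes <- c("fr", "ch", "fr", "at")   # keys to translate, repetitions allowed
countries[codes]                     # named vector of the four translations
unname(countries[codes])             # "France" "Switzerland" "France" "Austria"

countries["de"]                      # an unknown name yields NA, not an error
```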
A very convenient way of selecting elements is to apply a logical condition and to keep only the elements that fulfill it. For instance:
ages>20 & ages<45
## ada eric andre paul denis claire marc mike emily hank frank
## TRUE TRUE FALSE TRUE TRUE FALSE TRUE FALSE FALSE FALSE TRUE
ages[ages>20 & ages<45]
## ada eric paul denis marc frank
## 23 24 21 34 43 24
r <- rnorm(length(ages))
ind <- r>0
ages[ind]
## eric andre paul claire marc mike hank
## 24 45 21 65 43 77 14
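As a complement, here is a minimal sketch on a hypothetical named vector scores. It shows how which() converts a logical vector into positions and how sum() counts the matches. Since rnorm() draws random numbers, a selection based on it changes at every run unless the seed is fixed with set.seed(); the seed value used here is arbitrary.

```r
scores <- c(a = 12, b = 7, c = 15, d = 9)   # hypothetical named vector

keep <- scores >= 9      # logical vector, one TRUE/FALSE per element
which(keep)              # positions of the TRUE values: 1, 3 and 4
sum(keep)                # 3: TRUE counts as 1, so this is the number of matches
scores[which(keep)]      # same selection as scores[keep]

set.seed(42)             # arbitrary seed, makes the random selection reproducible
pick <- rnorm(length(scores)) > 0
scores[pick]
```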
In statistics, it is common to have missing values or observations. There is a special value denoted NA, available for any data type, that is meant to represent such missing data points. NA can be assigned to a scalar variable, but it is more frequent to use it in structures such as vectors that typically contain a number of observations. Some functions behave in a special fashion in the presence of NA, and it is also possible to identify NAs to remove them:
ages
## ada eric andre paul denis claire marc mike emily hank frank
## 23 24 45 21 34 65 43 77 12 14 24
mean(ages)
## [1] 34.72727
ages <- c(ages,c(NA,34,NA))
ages
## ada eric andre paul denis claire marc mike emily hank frank
## 23 24 45 21 34 65 43 77 12 14 24
##
## NA 34 NA
hist(ages) # hist ignores the NAs
mean(ages) # returns NA because it cannot compute with missing data
## [1] NA
mean(ages, na.rm=TRUE) # unless it is said to ignore them explicitly
## [1] 34.66667
is.na(ages)
## ada eric andre paul denis claire marc mike emily hank frank
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##
## TRUE FALSE TRUE
!is.na(ages)
## ada eric andre paul denis claire marc mike emily hank frank
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##
## FALSE TRUE FALSE
ages[!is.na(ages)]
## ada eric andre paul denis claire marc mike emily hank frank
## 23 24 45 21 34 65 43 77 12 14 24
##
## 34
ages <- ages[!is.na(ages)] # get rid of the NAs
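Missing values also propagate through comparisons, which matters when we select elements with a logical condition. The sketch below uses a hypothetical vector x to illustrate this, along with two ways of detecting NAs.

```r
x <- c(4, NA, 7, NA, 1)  # hypothetical vector with missing values

sum(is.na(x))            # 2: number of missing values
anyNA(x)                 # TRUE as soon as one value is missing

x == NA                  # NA everywhere: == cannot be used to test for NA
x > 3                    # TRUE NA TRUE NA FALSE: NA propagates
x[x > 3]                 # 4 NA 7 NA: NA positions are returned as NA
x[which(x > 3)]          # 4 7: which() keeps the TRUE positions only
```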
R does not offer a proper data type or structure to represent sets. Sets are instead represented by vectors, handled with a couple of functions and the membership operator %in%. Examples:
A <- c(3,5,23,12,15,6,9,0)
B <- c(5,2,7,8,-1,0)
union(A,B) # A U B
## [1] 3 5 23 12 15 6 9 0 2 7 8 -1
intersect(A,B) # A ∩ B
## [1] 5 0
setdiff(A,B) # A \ B
## [1] 3 23 12 15 6 9
27 %in% A
## [1] FALSE
12 %in% A
## [1] TRUE
c(1,2,3,4) %in% B
## [1] FALSE TRUE FALSE FALSE
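Because %in% returns one logical value per element of its left operand, it combines naturally with logical indexing. As a sketch, and assuming that A contains no repeated value, the following selections give the same results as intersect() and setdiff():

```r
A <- c(3, 5, 23, 12, 15, 6, 9, 0)
B <- c(5, 2, 7, 8, -1, 0)

A %in% B          # one TRUE/FALSE per element of A
A[A %in% B]       # 5 0: elements of A that are also in B
A[!(A %in% B)]    # 3 23 12 15 6 9: elements of A that are not in B
```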
Since regular atomic vectors are used to represent sets, nothing prevents you from repeating a value multiple times in a set (which is then no longer a set). To correct for this, every time union(), intersect(), or setdiff() is used, all multiple occurrences of a single value are replaced by a single occurrence in the result. If needed, you can eliminate multiple occurrences yourself using unique().
C <- c(1,1,2,2,3,3)
C
## [1] 1 1 2 2 3 3
intersect(A,C)
## [1] 3
unique(C)
## [1] 1 2 3
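To check whether a vector is a proper set before using it as one, base R provides duplicated() and anyDuplicated(). The sketch below uses a hypothetical second vector D.

```r
C <- c(1, 1, 2, 2, 3, 3)
D <- c(3, 3, 4)            # hypothetical second vector

union(C, D)                # 1 2 3 4: repeated values appear once
setdiff(C, D)              # 1 2
duplicated(C)              # FALSE TRUE FALSE TRUE FALSE TRUE
anyDuplicated(C) > 0       # TRUE: C contains repeated values
setequal(C, unique(C))     # TRUE: same elements once repetitions are ignored
```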
The list data structure enables us to combine any number of values of any type. It is frequently used by functions that must return multiple values: they simply return a list.
ml <- list(23,"abc",-45.67)
ml # three slots, one for each value
## [[1]]
## [1] 23
##
## [[2]]
## [1] "abc"
##
## [[3]]
## [1] -45.67
names(ml) <- c("premier","second","troisieme") # slots can get names
ml
## $premier
## [1] 23
##
## $second
## [1] "abc"
##
## $troisieme
## [1] -45.67
ml2 <- list(un=1, deux="deux", trois=5:8) # names can be given at construction and list elements can be vectors
ml2
## $un
## [1] 1
##
## $deux
## [1] "deux"
##
## $trois
## [1] 5 6 7 8
h <- hist(ages, plot=FALSE)
h
## $breaks
## [1] 10 20 30 40 50 60 70 80
##
## $counts
## [1] 2 4 2 2 0 1 1
##
## $density
## [1] 0.016666667 0.033333333 0.016666667 0.016666667 0.000000000 0.008333333
## [7] 0.008333333
##
## $mids
## [1] 15 25 35 45 55 65 75
##
## $xname
## [1] "ages"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
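We can write our own functions in the same spirit as hist(). The function describe() below is a hypothetical helper, not part of base R; it is a minimal sketch of how several results can be returned at once as a named list.

```r
# hypothetical helper returning several results in one named list
describe <- function(x) {
  list(n = length(x), mean = mean(x), range = range(x))
}

res <- describe(c(23, 24, 45, 21))
res$n        # 4
res$mean     # 28.25
res$range    # 21 45
```

The components of the list returned by hist() are retrieved in the same way, e.g., h$counts.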
To access data stored in lists, we must distinguish between the slots and their content. Single square brackets enable us to extract sub-lists (we stay at the slot level):
ml2[3]
## $trois
## [1] 5 6 7 8
ml2["trois"]
## $trois
## [1] 5 6 7 8
ml2[1:2] # the options we have seen to access vector elements apply here as well
## $un
## [1] 1
##
## $deux
## [1] "deux"
ml2[c("un","trois")]
## $un
## [1] 1
##
## $trois
## [1] 5 6 7 8
To access the content of a slot, we use double square brackets or the dollar sign:
ml2["deux"]
## $deux
## [1] "deux"
ml2[["deux"]]
## [1] "deux"
ml2[[3]]
## [1] 5 6 7 8
ml2[["trois"]][2] # [[]] accesses the slot content, which turns out to be a vector, and [] accesses a specific element of this vector
## [1] 6
ml2$un
## [1] 1
ml2$trois
## [1] 5 6 7 8
ml2$trois[2]
## [1] 6
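The difference between the two bracket types becomes visible when we inspect what they return. The sketch below rebuilds ml2 to be self-contained; try() is used only to capture the error instead of stopping.

```r
ml2 <- list(un = 1, deux = "deux", trois = 5:8)

class(ml2["trois"])       # "list": still a list, with a single slot
class(ml2[["trois"]])     # "integer": the vector stored in the slot
length(ml2["trois"])      # 1: one slot
length(ml2[["trois"]])    # 4: four values

sum(ml2[["trois"]])       # 26: computations need the content
res <- try(sum(ml2["trois"]), silent = TRUE)
inherits(res, "try-error")   # TRUE: a list cannot be summed
```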
We can have lists of lists, e.g., to represent trees or dendrograms, or other complex data.
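As an illustration, here is one possible way to encode a small tree with nested lists. The layout (a label plus a list of children per node) and the function count_nodes() are assumptions made for this sketch, not a standard R representation.

```r
# hypothetical tree: each node is a list with a label and a list of children
tree <- list(
  label = "root",
  children = list(
    list(label = "left", children = list()),
    list(label = "right", children = list(
      list(label = "leaf", children = list())
    ))
  )
)

tree$children[[2]]$label                 # "right"
tree$children[[2]]$children[[1]]$label   # "leaf"

# recursive count: the node itself plus the nodes below each child
count_nodes <- function(node) {
  1 + sum(vapply(node$children, count_nodes, numeric(1)))
}
count_nodes(tree)                        # 4
```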
Data frames are the most commonly used data structure in R. Their purpose is to represent data tables.
Data frames are not really new data structures: they are lists of atomic vectors. Each vector represents a column of the data table, meaning that columns can have different types, but all the values in the same column must share the same data type.
Since data frames are nothing but lists of atomic vectors, we already know the syntax of element access operations:
df <- data.frame(name=c("Pierre","Amandine","Gudrun","Dagobert"),salary=c(3000,4000,8000,-500),code=c("A","A","D","E"),tech=c(TRUE,FALSE,FALSE,FALSE))
df
## name salary code tech
## 1 Pierre 3000 A TRUE
## 2 Amandine 4000 A FALSE
## 3 Gudrun 8000 D FALSE
## 4 Dagobert -500 E FALSE
typeof(df)
## [1] "list"
df[1] # data frame (sub-data frame) made of column 1 only
## name
## 1 Pierre
## 2 Amandine
## 3 Gudrun
## 4 Dagobert
df[c(1,3)] # or 1 and 3
## name code
## 1 Pierre A
## 2 Amandine A
## 3 Gudrun D
## 4 Dagobert E
df["tech"]
## tech
## 1 TRUE
## 2 FALSE
## 3 FALSE
## 4 FALSE
df[[1]] # the content of column 1, i.e., a vector of type character
## [1] "Pierre" "Amandine" "Gudrun" "Dagobert"
df[["tech"]]
## [1] TRUE FALSE FALSE FALSE
df[[1]][2] # access to elements
## [1] "Amandine"
df[[3]][[1]]
## [1] "A"
typeof(df[[1]])
## [1] "character"
The list notation df[[1]][2] to access the element at row 2, column 1 is rather cumbersome. A simpler alternative is
df[2,1] # first index is the row and second index is the column (inverted compared to the list notation)
## [1] "Amandine"
There is also an alternative to access a whole column as a vector:
df[[2]]
## [1] 3000 4000 8000 -500
df[,2]
## [1] 3000 4000 8000 -500
We can also access a whole row, but it remains a (1-row) data frame since, in general, data frames can have columns of different types (impossible to store as an atomic vector):
df[3,]
## name salary code tech
## 3 Gudrun 8000 D FALSE
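The [row, column] notation also accepts logical conditions in the row position, which gives a compact way to filter a table. The sketch below uses a reduced copy of the data frame to stay self-contained, and relies on the drop argument of the bracket operator to keep a single column as a data frame.

```r
df <- data.frame(name = c("Pierre", "Amandine", "Gudrun", "Dagobert"),
                 salary = c(3000, 4000, 8000, -500),
                 tech = c(TRUE, FALSE, FALSE, FALSE))

df[df$salary > 3500, ]                  # the rows of Amandine and Gudrun
df[df$salary > 3500, "name"]            # "Amandine" "Gudrun"
df[!df$tech & df$salary > 0, c("name", "salary")]   # two conditions, two columns

class(df[3, ])                          # "data.frame"
class(df[, 2])                          # "numeric"
class(df[, 2, drop = FALSE])            # "data.frame": the table shape is kept
```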
Data frames have column names that are accessed with names() since they are just the slot names of the underlying list. Row names can be accessed and set through rownames():
rownames(df) # default values are given as 1,2,3,... (in characters because a row name must be of type character)
## [1] "1" "2" "3" "4"
rownames(df) <- c("lyon","geneve","paris","linz")
df
## name salary code tech
## lyon Pierre 3000 A TRUE
## geneve Amandine 4000 A FALSE
## paris Gudrun 8000 D FALSE
## linz Dagobert -500 E FALSE
df["linz","code"]
## [1] "E"
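Row names can also be given at construction through the row.names argument of data.frame(), and several rows can be requested by name at once. A minimal sketch on a hypothetical reduced table:

```r
df <- data.frame(salary = c(3000, 4000, 8000),
                 code = c("A", "A", "D"),
                 row.names = c("lyon", "geneve", "paris"))

df[c("paris", "lyon"), ]       # rows are returned in the requested order
df["geneve", "salary"]         # 4000
names(df)                      # "salary" "code"
colnames(df)                   # same as names() for a data frame
"paris" %in% rownames(df)      # TRUE
```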
Two-dimensional tables of values can also be stored in a matrix. A matrix is a different data structure from a data frame since it is atomic (like vectors): all the data must be of the same type. Matrices are stored and accessed much more efficiently than data frames, and should hence be preferred for computations as soon as all the data have a single type. The list notations used to access data frame columns do not carry over to matrices ($ raises an error and a single index treats the matrix as one long vector); rows and columns are selected with the [i,j] notation only:
mm <- matrix(rnorm(25), nrow=5)
mm
## [,1] [,2] [,3] [,4] [,5]
## [1,] -1.33874571 1.0507434 0.13215481 0.2020520 -0.3207511
## [2,] 0.10118142 1.3046875 0.05364647 0.1800103 -0.8700890
## [3,] -1.56624292 0.8947381 0.21988705 -0.2027143 0.1694511
## [4,] 1.08318449 -3.0320564 -0.26242785 0.9221311 0.2783416
## [5,] -0.03169732 -0.0165579 -2.12568722 -1.3443351 -1.3209520
colnames(mm) <- LETTERS[1:5] # colnames() also works with data frames but names() does not with matrices
rownames(mm) <- c("suisse","inde","chine","vietnam","autriche")
mm
## A B C D E
## suisse -1.33874571 1.0507434 0.13215481 0.2020520 -0.3207511
## inde 0.10118142 1.3046875 0.05364647 0.1800103 -0.8700890
## chine -1.56624292 0.8947381 0.21988705 -0.2027143 0.1694511
## vietnam 1.08318449 -3.0320564 -0.26242785 0.9221311 0.2783416
## autriche -0.03169732 -0.0165579 -2.12568722 -1.3443351 -1.3209520
mm[,2]
## suisse inde chine vietnam autriche
## 1.0507434 1.3046875 0.8947381 -3.0320564 -0.0165579
mm[4,] # this time a row is returned as a vector because it is atomic (not like data frames)
## A B C D E
## 1.0831845 -3.0320564 -0.2624278 0.9221311 0.2783416
mm["chine",c("B","E")]
## B E
## 0.8947381 0.1694511
mm[c("chine","suisse"),c("B","E")]
## B E
## chine 0.8947381 0.1694511
## suisse 1.0507434 -0.3207511
mm[mm[,3]>0,]
## A B C D E
## suisse -1.3387457 1.0507434 0.13215481 0.2020520 -0.3207511
## inde 0.1011814 1.3046875 0.05364647 0.1800103 -0.8700890
## chine -1.5662429 0.8947381 0.21988705 -0.2027143 0.1694511
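The sketch below gathers a few additional points on matrix indexing using a small hypothetical matrix m with fixed values, so that the results do not depend on random numbers. It illustrates single-index access, the drop argument that keeps the matrix shape, and the failure of the $ notation.

```r
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("A", "B", "C")))

m[2, "C"]                      # 6
m[5]                           # 5: a single index runs down the columns
m[, "B"]                       # named vector: r1 = 3, r2 = 4
dim(m[, "B", drop = FALSE])    # 2 1: still a matrix, with one column

res <- try(m$A, silent = TRUE) # try() only captures the error
inherits(res, "try-error")     # TRUE: $ does not work on a matrix

m[m[, "A"] > 1, ]              # logical row filter: only r2 passes
```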