---
title: "Vectors, lists and data frames"
author: "Jacques Colinge"
date: "11/29/2021"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

\

## A. Atomic vectors

We have already introduced the atomic vector data structure. In R, accessing vector elements can be done in several ways that are all useful for data manipulation.

### Vector indices

We already know how to access a single element by its index (indices start at 1 in R):
```{r}
ages <- c(23,24,45,21,34,65,43,77,12,14,24)
ages[3]
```

We can use a vector of integer values to provide multiple indices at once, *i.e.*, to extract a sub-vector:
```{r}
ages[c(2,5,7)]
indices <- c(2,5,7)
ages[indices]
2:6
ages[2:6]
4:1
ages[4:1]
```
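
Two further behaviours of integer indexing are worth knowing. The short sketch below uses a made-up vector `temps` (an assumption for illustration, not part of the data above): an index may be repeated, and an index beyond the end of the vector yields `NA` rather than an error.
```{r}
temps <- c(12.5, 14.0, 9.8, 11.2) # hypothetical values, for illustration only
temps[c(1,1,4)] # an index can be repeated
temps[6] # beyond the end of the vector: NA, not an error
temps[length(temps)] # the last element
```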

The returned sub-vectors can be used like any vector:
```{r}
ages[4:7]
mean(ages[4:7])
```

We can also extract a sub-vector by specifying which elements we do not want to keep. This is done with negative indices:
```{r}
ages[-4]
ages[-(5:7)]
```
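
As a small illustration of negative indices, the sketch below drops the first and/or last element of a made-up vector `v` (a hypothetical name). It also shows that the original vector is left untouched unless the result is assigned back.
```{r}
v <- c(10, 20, 30, 40, 50) # hypothetical vector
v[-1] # everything but the first element
v[-length(v)] # everything but the last element
v[-c(1, length(v))] # drop both ends
v # the original vector is unchanged
```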

### Named vectors

The next option is to name each vector element and to use these names to retrieve elements (as in a hash table). The names associated with a vector are read and assigned through the function ```names()```:
```{r}
names(ages) # no names by default
names(ages) <- c("ada","eric","andre","paul","denis","claire","marc","mike","emily","hank","frank")
names(ages)
ages # we note that named vectors are displayed in a specific manner
ages["eric"]
ages[2] # still available!
ages[c("paul","claire")]
```
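
Names can also be given when the vector is built, and they can be used for assignment as well as for retrieval. The sketch below relies on a made-up vector `scores` (hypothetical data) and shows what happens when a name does not exist.
```{r}
scores <- c(anna=12, bob=17, carl=9) # hypothetical named vector, names given at construction
scores["bob"] <- 18 # assignment by name
scores
scores["zoe"] # unknown name: NA is returned, not an error
unname(scores) # the same values without the names
```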

### Logical vectors as indices

A very convenient way of selecting elements is to apply a logical condition and to take those elements that fulfill this condition only. For instance:
```{r}
ages>20 & ages<45
ages[ages>20 & ages<45]
r <- rnorm(length(ages))
ind <- r>0
ages[ind]
```
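
A logical vector is useful beyond indexing. As a minimal sketch on made-up data (`heights` and `tall` are hypothetical names), ```which()``` turns it into integer indices, and ```sum()``` or ```mean()``` count the matches, since `TRUE` is treated as 1 and `FALSE` as 0.
```{r}
heights <- c(172, 158, 181, 165, 190) # hypothetical values in cm
tall <- heights > 170 # logical vector
which(tall) # positions of the TRUE values
sum(tall) # number of elements fulfilling the condition
mean(tall) # proportion of elements fulfilling the condition
heights[!tall] # negation selects the complementary elements
```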

### NA values

In statistics, it is common to have missing values or observations. There is a special value, denoted ```NA``` and available for every data type, that is meant to represent such missing data points. ```NA``` can be assigned to a scalar variable, but it is more frequent to use it in structures such as vectors that typically contain a number of observations. Some functions behave in a special fashion in the presence of ```NA```, and it is also possible to identify ```NA```s to remove them:
```{r}
ages
mean(ages)
ages <- c(ages,c(NA,34,NA))
ages
hist(ages) # hist ignores the NAs
mean(ages) # returns NA because it cannot compute with missing data
mean(ages, na.rm=TRUE) # unless it is told explicitly to ignore them
is.na(ages)
!is.na(ages)
ages[!is.na(ages)]
ages <- ages[!is.na(ages)] # get rid of the NAs
```
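
A common pitfall is to test for missing values with `==`. The sketch below, on a made-up vector `obs` (hypothetical data), shows that any comparison involving ```NA``` returns ```NA```, which is why ```is.na()``` is needed, and how ```NA```s propagate through a logical index.
```{r}
obs <- c(4.1, NA, 3.7, NA, 5.0) # hypothetical measurements
NA > 3 # a comparison with NA gives NA
obs == NA # all NA, hence useless to locate missing values
sum(is.na(obs)) # number of missing values
which(is.na(obs)) # their positions
obs[obs > 4] # NAs in a logical index produce NAs in the result
obs[which(obs > 4)] # which() keeps the TRUE positions only
```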

### Sets

R does not offer a dedicated data type or structure to represent sets. Sets are instead represented by vectors, together with a few functions and the *belongs to* operator ```%in%```. Examples:
```{r}
A <- c(3,5,23,12,15,6,9,0)
B <- c(5,2,7,8,-1,0)
union(A,B) # A ∪ B
intersect(A,B) # A ∩ B
setdiff(A,B) # A \ B
27 %in% A
12 %in% A
c(1,2,3,4) %in% B
```

Since regular atomic vectors are used to represent sets, nothing prevents you from repeating a value multiple times in a set (which is no longer a set then...). To correct for this, every time ```union()```, ```intersect()```, or ```setdiff()``` is used, all multiple occurrences of a value are replaced by a single occurrence in the result. If needed, you can eliminate multiple occurrences yourself using ```unique()```.
```{r}
C <- c(1,1,2,2,3,3)
C
intersect(A,C)
unique(C)
```
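
A few more set-related operations, sketched on two made-up vectors `s1` and `s2` (hypothetical names): ```setequal()``` compares two vectors as sets, ```all()``` combined with ```%in%``` gives a subset test, and ```duplicated()``` flags the repeated occurrences that ```unique()``` would remove.
```{r}
s1 <- c(1, 2, 3) # hypothetical sets
s2 <- c(3, 1, 2, 2)
setequal(s1, s2) # TRUE: same elements, order and repetitions are ignored
identical(s1, s2) # FALSE: as vectors they differ
all(c(1, 3) %in% s1) # subset test: is {1,3} included in s1?
s2[!(s2 %in% 2)] # %in% returns a logical vector, usable as an index
duplicated(s2) # TRUE for each repeated occurrence
```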

\

## B. Lists

The list data structure enables us to combine any number of values of any type. It is frequently used by functions that must return multiple values: they simply return a list.
```{r}
ml <- list(23,"abc",-45.67)
ml # three slots, one for each value
names(ml) <- c("premier","second","troisieme") # slots can get names
ml
ml2 <- list(un=1, deux="deux", trois=5:8) # names can be given at construction and list elements can be vectors
ml2
h <- hist(ages, plot=FALSE)
h
```
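
To illustrate how a function can return several values at once, here is a minimal sketch. The function `describe()` is a hypothetical helper written for this example (it is not part of base R); it packs three results of different lengths into a single named list.
```{r}
# hypothetical helper, not part of base R
describe <- function(x) {
  list(n = length(x), mean = mean(x), range = range(x))
}
res <- describe(c(4, 8, 15, 16, 23, 42))
res # one slot per returned value
names(res)
```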

To access data stored in lists, we must distinguish between the slots and their content. Single square brackets enable us to extract sub-lists (we stay at the slot level):
```{r}
ml2[3]
ml2["trois"]
ml2[1:2] # the options we have seen to access vector elements apply here as well
ml2[c("un","trois")]
```

To access the content of a slot, we use double square brackets or the dollar sign:
```{r}
ml2["deux"]
ml2[["deux"]]
ml2[[3]]
ml2[["trois"]][2] # [[]] accesses the slot content, which turns out to be a vector, and [] accesses a specific element of this vector
ml2$un
ml2$trois
ml2$trois[2]
```
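
The difference between the two kinds of brackets can be checked with ```class()```, and the same notations serve to modify a list. The sketch below uses a made-up list `box` (a hypothetical name) to replace, add, and remove slots.
```{r}
box <- list(id = 7, tags = c("a", "b")) # hypothetical list
class(box["id"]) # "list": single brackets return a sub-list
class(box[["id"]]) # "numeric": double brackets return the content
box$id <- 8 # the content of a slot can be replaced
box$note <- "new" # assigning to a new name adds a slot
box[["tags"]][2] <- "z" # modify one element of a vector stored in a slot
str(box) # compact display of the structure
box$tags <- NULL # assigning NULL removes a slot
names(box)
```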

We can have lists of lists, *e.g.*, to represent trees or dendrograms, or other complex data.
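
As a sketch of such a nested structure, the made-up list `tree` below (the slot names `label` and `children` are assumptions chosen for this example) represents a small tree in which every node is itself a list. Access operators are simply chained to walk down the tree.
```{r}
# hypothetical tree: each node is a list with a label and a list of children
tree <- list(
  label = "root",
  children = list(
    list(label = "left", children = list()),
    list(
      label = "right",
      children = list(
        list(label = "leaf", children = list())
      )
    )
  )
)
tree$label
length(tree$children) # number of children of the root
tree$children[[2]]$label # second child of the root
tree$children[[2]]$children[[1]]$label # first child of that node
```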

\

## C. Data frames

Data frames are the most commonly used data structure in R. Their purpose is to represent data tables.

Data frames are not really a new data structure: they are lists of atomic vectors of equal length. Each vector represents a column of the data table, meaning that columns can have different types, but all the values in the same column must share the same data type.

### Accessing elements with list and vector notations

Since data frames are nothing but lists of atomic vectors, we already know the syntax of element access operations:
```{r}
df <- data.frame(name=c("Pierre","Amandine","Gudrun","Dagobert"),
                 salary=c(3000,4000,8000,-500),
                 code=c("A","A","D","E"),
                 tech=c(TRUE,FALSE,FALSE,FALSE))
df
typeof(df)
df[1] # data frame (sub-data frame) made of column 1 only
df[c(1,3)] # or 1 and 3
df["tech"]
df[[1]] # the content of column 1, i.e., a vector of type character
df[["tech"]]
df[[1]][2] # access to elements
df[[3]][[1]]
typeof(df[[1]])
```
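
The list nature of a data frame can be verified directly. The sketch below uses a made-up table `staff` (hypothetical data); ```sapply()``` is used here only to apply ```typeof()``` to each column in turn.
```{r}
staff <- data.frame(id = 1:3, dept = c("x", "y", "x")) # hypothetical table
is.list(staff) # TRUE
class(staff) # "data.frame": the class distinguishes it from a plain list
length(staff) # number of slots, i.e., of columns
sapply(staff, typeof) # type of each column
dim(staff) # number of rows and of columns
str(staff) # compact display, one line per column
```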

### Additional syntax to access data

The list notation ```df[[1]][2]``` to access the element at row 2, column 1 is cumbersome. A simpler alternative is
```{r}
df[2,1] # first index is the row and second index is the column (inverted compared to the list notation)
```

There is also an alternative to access a whole column as a vector:
```{r}
df[[2]]
df[,2]
```

We can also access a whole row, but the result remains a (1-row) data frame since, in general, data frames can have columns of different types (which cannot be stored in a single atomic vector):
```{r}
df[3,]
```
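
The row index accepts the same kinds of indices as vectors do, including logical conditions. As a minimal sketch on a made-up table `pay` (hypothetical data), rows can be filtered on the values of a column, the column itself being obtained with the dollar sign:
```{r}
pay <- data.frame(name = c("a", "b", "c"), salary = c(1000, 2500, 4000)) # hypothetical table
pay$salary # a column is a list slot, so $ works
pay[pay$salary > 2000, ] # rows fulfilling a condition, all the columns
pay[pay$salary > 2000, "name"] # same rows, a single column (returned as a vector)
pay[-1, ] # negative indices remove rows
pay[2:3, c("salary", "name")] # several rows, columns in a chosen order
```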

### Row and column names of data frames

Data frames have column names that are accessed with ```names()``` since they are just the slot names of the underlying list. Row names can be accessed and set through ```rownames()```:
```{r}
rownames(df) # default values are given as 1,2,3,... (in characters because a row name must be of type character)
rownames(df) <- c("lyon","geneve","paris","linz")
df
df["linz","code"]
```
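
Row names can also be supplied at construction through the `row.names` argument of ```data.frame()```, and a single column can be renamed by assigning to one element of ```names()```. A short sketch on a made-up table `sites` (hypothetical data):
```{r}
sites <- data.frame(pop = c(5, 2), area = c(10, 4), row.names = c("north", "south")) # hypothetical table
names(sites)[2] <- "surface" # rename the second column
sites
sites["south", ] # a row selected by its name
rownames(sites)[sites$pop > 3] # names of the rows fulfilling a condition
```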

\

## D. Matrices

Two-dimensional tables of values can also be stored in a matrix. A matrix is a different data structure from a data frame since it is atomic (like vectors): all the data must be of the same type. Matrices are stored and accessed much more efficiently than data frames, and hence should be preferred for computations whenever all the data have a single type. The list notations used to access data frame columns do not carry over to matrices (```$``` raises an error and ```[[ ]]``` addresses a single element, not a column); rows and columns are accessed with ```[.,.]``` only:
```{r}
mm <- matrix(rnorm(25), nrow=5)
mm
colnames(mm) <- LETTERS[1:5] # colnames() also works with data frames but names() does not with matrices
rownames(mm) <- c("suisse","inde","chine","vietnam","autriche")
mm
mm[,2]
mm[4,] # this time a row is returned as a vector because a matrix is atomic (unlike a data frame)
mm["chine",c("B","E")]
mm[c("chine","suisse"),c("B","E")]
mm[mm[,3]>0,]
```
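
Two consequences of the atomic nature of matrices are sketched below on a small made-up matrix `m` (a hypothetical name). Extracting a single row or column drops the matrix structure unless `drop=FALSE` is given, and adding a value of another type converts every element of the result.
```{r}
m <- matrix(1:6, nrow = 2) # hypothetical 2 x 3 matrix, filled column by column
m
typeof(m)
dim(m)
m[1, ] # a plain vector: the matrix structure is dropped
m[1, , drop = FALSE] # drop=FALSE keeps a 1-row matrix
t(m) # transpose
m * 2 # arithmetic is applied element-wise
cbind(m, c("a", "b")) # adding text turns every value into a character string
```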