---
title: "Operations on tables"
author: "Jacques Colinge"
date: "11/29/2021"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

\

## A. Sizes

```{r}
v <- 12:27
v
length(v)
mm <- matrix(rnorm(35, mean=4, sd=2), nrow=5)
mm
dim(mm)
dim(mm)[1]
nrow(mm)
ncol(mm)
df <- data.frame(name=c("Pierre","Amandine","Gudrun","Dagobert"),
                 salary=c(3000,4000,8000,-500),
                 code=c("A","A","D","E"),
                 tech=c(TRUE,FALSE,FALSE,FALSE))
dim(df)
ncol(df)
```

\

## B. Global numerical operations

These functions work on both data frames and matrices, provided they contain only numerical values.
```{r}
sum(v) # vector version
sum(mm) # matrix version (full total)
rowSums(mm) # matrix version by row
colSums(mm)
rowMeans(mm)
colMeans(mm)
colMeans(mm, na.rm=TRUE) # na.rm=TRUE ignores missing values (NA); also available for global functions such as sum()
```
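
Since ```mm``` contains no missing values, ```na.rm=TRUE``` makes no visible difference above. As a minimal sketch, a small hypothetical matrix ```m.na``` (not part of the data used in this document) illustrates its effect:
```{r}
# hypothetical 2 x 3 matrix with one missing value (columns: 1 2 / NA 4 / 5 6)
m.na <- matrix(c(1, 2, NA, 4, 5, 6), nrow=2)
m.na
colMeans(m.na)             # the column containing NA yields NA
colMeans(m.na, na.rm=TRUE) # the NA is ignored in that column
sum(m.na, na.rm=TRUE)      # na.rm is also accepted by global functions such as sum()
```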

Sums of logical values count the number of ```TRUE``` values:
```{r}
v>15
sum(v>15)
mm>0.5
sum(mm>0.5)
rowSums(mm>0.5)
```
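
Because ```TRUE``` counts as 1 and ```FALSE``` as 0, the mean of a logical vector gives the proportion of ```TRUE``` values. A small illustration on a hypothetical vector ```w``` (introduced here only for this example):
```{r}
w <- c(3, 8, 12, 20) # hypothetical values, for illustration only
w > 10               # logical vector: FALSE FALSE TRUE TRUE
sum(w > 10)          # number of values above 10
mean(w > 10)         # proportion of values above 10
```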

We may need to transpose a matrix, *i.e.*, we exchange rows and columns (first row becomes first column, second row becomes second column, etc.):
```{r}
t(mm)
```

\

## C. Non-predefined global operations

Although there are predefined functions for sums or means by rows and columns, we often need to perform other operations by rows or columns. ```apply()``` enables us to submit each row or column of a data frame or a matrix to a predefined function or even a function of our own:
```{r}
apply(mm,1,sd) # we apply the standard deviation function to each row (1)
apply(mm,2,sd) # we apply the standard deviation function to each column (2)
CV <- apply(mm,1,sd)/rowMeans(mm) # coefficient of variation of each row
CV
```
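
To illustrate the "function of our own" case, here is a minimal sketch. The matrix ```m.toy``` and the function ```spread()``` are hypothetical names chosen for this example only:
```{r}
m.toy <- matrix(1:6, nrow=2)           # rows are 1 3 5 and 2 4 6
spread <- function(x) max(x) - min(x)  # our own function: range width of a vector
apply(m.toy, 1, spread)                # applied to each row
apply(m.toy, 2, spread)                # applied to each column
apply(m.toy, 2, function(x) max(x) - min(x)) # same, with a function defined on-the-fly
```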

To normalize gene transcript expression data, we have seen that we may want to divide each column of a matrix by a different value, *e.g.*, its total. That is, a vector of values, one per column, is provided and those values must be used for the corresponding columns to perform the division. This is conveniently achieved with ```sweep()```:
```{r}
tot <- colSums(mm)
n.mm <- sweep(mm,2,tot,"/")
```

Here, we work by column (2) and apply a function of two arguments, the division ```"/"```: the first argument is the matrix column and the second is the corresponding value of the provided vector (```tot```). We can also work by row (1) and replace the division by another function, including a user-defined one.
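
A minimal sketch of these two variations, on a hypothetical matrix ```m.sw``` (not part of the data above). Note that a user-defined function passed to ```sweep()``` receives the whole matrix together with the values expanded to the same shape, so it should be vectorized:
```{r}
m.sw <- matrix(c(1, 2, 3, 4, 5, 6), nrow=2) # rows are 1 3 5 and 2 4 6
# by row (1): subtract from each row its own mean
centered <- sweep(m.sw, 1, rowMeans(m.sw), "-")
centered
rowMeans(centered) # every row is now centered on 0
# user-defined function: relative deviation from the row mean
rel.dev <- sweep(m.sw, 1, rowMeans(m.sw), function(x, s) (x - s)/s)
rel.dev
```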

\

## D. Miscellaneous

A few additional and useful functions to work with vectors:
```{r}
df$code
unique(df$code) # returns the distinct values in order of first occurrence
ru <- runif(500) # 500 pseudo-random numbers uniformly distributed over [0, 1]
quantile(ru, probs=0.95) # 95th percentile
median(ru)
quantile(ru, probs=0.5) # same value as the median
summary(ru)
quantile(ru, probs=0.25) # 1st quartile, also reported by summary()
```


Create a named vector in one operation:
```{r}
# Elementary solution in two steps
employee.colors <- rainbow(nrow(df))
names(employee.colors) <- df$name
employee.colors
# The same in one step
employee.colors <- setNames(rainbow(nrow(df)), df$name)
employee.colors
```


And a nice little example linking our new abilities to play with matrices: upper quartile normalization. Instead of normalizing count matrices by their mean or total, or even their median (more robust to outliers), it is customary to normalize with respect to the upper quartile (3rd quartile or 75th percentile) of the non-zero values, to find a better balance between domination by the very high values and by the many very low ones. In practice, it usually works better than sums or medians, but it is outperformed by more advanced procedures such as TMM.
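
Why restrict to the non-zero values? A minimal sketch on a hypothetical count vector ```x.toy```, half of which is zero, shows how the zeros pull the upper quartile down:
```{r}
x.toy <- c(0, 0, 0, 0, 2, 4, 6, 8)     # hypothetical counts, for illustration only
quantile(x.toy, probs=0.75)            # zeros included: 4.5
quantile(x.toy[x.toy>0], probs=0.75)   # non-zero values only: 6.5
```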

From exercise series 1, we have:
```{r}
scdat <- read.csv("GSE77288_molecules-raw-single-per-sample.txt",sep="\t",stringsAsFactors=FALSE)
counts <- t(data.matrix(scdat[,-(1:3)])) # genes in rows, cells in columns
colnames(counts) <- paste(scdat$individual,scdat$replicate,scdat$well,sep=".")
counts <- counts[!grepl("ERCC",rownames(counts)),] # remove the ERCC rows (safe even if none match)
counts[1:5,1:10]

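umi.tot <- colSums(counts) # total UMI count of each cell
bad.high <- umi.tot>150000
gene.tot <- colSums(counts>1) # number of genes with UMI count > 1 in each cell
bad.low <- gene.tot<3500

counts <- counts[,!(bad.high|bad.low)] # keep only the cells that are neither bad.high nor bad.low
good <- rowSums(counts>1)>=10 # UMI count > 1 in at least 10 cells is required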
sum(good)
counts <- counts[good,]
dim(counts)
```

Now, we perform the upper quartile normalization:
```{r}
# we need to compute a quantile for each column
q <- apply(counts, 2, quantile, probs=0.75) # extra parameters to apply() are passed to the function
# but done like this, we keep all the zeros, which would not be a true upper quartile normalization
q <- apply(counts, 2, function(x) quantile(x[x>0], probs=0.75)) # we define a function on-the-fly, x takes the value of each column successively
ncounts <- sweep(counts, 2, q/median(q), "/") # the scaling factors q/median(q) have median 1
```
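
As a sanity check, the same steps can be run on a small hypothetical matrix (```toy.counts```, ```uq()``` and the other names below are introduced only for this sketch). After normalization, the upper quartile of the non-zero values should be the same in every column and equal to the median of the original upper quartiles:
```{r}
# hypothetical count matrix: 4 genes (rows) x 3 cells (columns)
toy.counts <- matrix(c(0, 2, 4, 8,
                       0, 4, 8, 16,
                       1, 0, 3, 5), nrow=4)
uq <- function(x) quantile(x[x>0], probs=0.75) # upper quartile of the non-zero values
toy.q <- apply(toy.counts, 2, uq)
toy.q
toy.norm <- sweep(toy.counts, 2, toy.q/median(toy.q), "/")
apply(toy.norm, 2, uq) # identical across columns after normalization
median(toy.q)          # and equal to the median of the original upper quartiles
```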