---
title: "Operations on tables"
author: "Jacques Colinge"
date: "11/29/2021"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

\

## A. Sizes

```{r}
v <- 12:27
v
length(v)
mm <- matrix(rnorm(35, mean=4, sd=2), nrow=5)
mm
dim(mm)
dim(mm)[1]
nrow(mm)
ncol(mm)
df <- data.frame(name=c("Pierre","Amandine","Gudrun","Dagobert"),
                 salary=c(3000,4000,8000,-500),
                 code=c("A","A","D","E"),
                 tech=c(TRUE,FALSE,FALSE,FALSE))
dim(df)
ncol(df)
```

\

## B. Global numerical operations

These functions work with both data frames and matrices, provided they contain numerical values.

```{r}
sum(v) # vector version
sum(mm) # matrix version (full total)
rowSums(mm) # matrix version by row
colSums(mm)
rowMeans(mm)
colMeans(mm)
colMeans(mm, na.rm=TRUE) # na.rm=TRUE ignores missing values; also available for global operations such as sum() if needed
```

Sums of logical values count the number of ```TRUE``` values:

```{r}
v>15
sum(v>15)
mm>0.5
sum(mm>0.5)
rowSums(mm>0.5)
```

We may need to transpose a matrix, *i.e.*, exchange its rows and columns (the first row becomes the first column, the second row becomes the second column, etc.):

```{r}
t(mm)
```

\

## C. Non-predefined global operations

Although there are predefined functions for sums or means by rows and columns, we often need to perform other operations by rows or columns. ```apply()``` enables us to submit each row or column of a data frame or a matrix to a predefined function or even a function of our own:

```{r}
apply(mm,1,sd) # we apply the standard deviation function to each row (1)
apply(mm,2,sd) # we apply the standard deviation function to each column (2)
CV <- apply(mm,1,sd)/rowMeans(mm) # coefficient of variation of each row
CV
```

To normalize gene transcript expression data, we have seen that we may want to divide each column of a matrix by a different value, *e.g.*, its total. That is, a vector of values, one per column, is provided and each value must be used to divide the corresponding column.
This is conveniently achieved with ```sweep()```:

```{r}
tot <- colSums(mm)
n.mm <- sweep(mm,2,tot,"/")
```

Here, we work by column (2) and we apply a function of two parameters (the division, ```"/"```) to each of them: the first parameter is the matrix column and the second parameter is the matching value from the provided vector (```tot```). We can also work by row (1) and use a different function, including a user-defined one.

\

## D. Miscellaneous

A few additional, useful functions to work with vectors:

```{r}
df$code
unique(df$code) # returns the distinct values in order of first occurrence
ru <- runif(500) # 500 pseudo-random numbers uniformly distributed over [0;1]
quantile(ru, probs=.95) # 95th percentile
median(ru)
quantile(ru, probs=0.5)
summary(ru)
quantile(ru, probs=0.25)
```

Create named vectors in one operation:

```{r}
# Elementary solution in two steps
employee.colors <- rainbow(nrow(df))
names(employee.colors) <- df$name
employee.colors
# The same in one step
employee.colors <- setNames(rainbow(nrow(df)), df$name)
employee.colors
```

And a nice little example combining our new abilities to work with matrices: upper quartile normalization. Instead of normalizing count matrices by their mean or total, or even their median (more robust to outliers), it is customary to normalize with respect to the upper quartile (3rd quartile or 75th percentile) of the non-zero values. This strikes a better balance between domination by the very high values and domination by the many very low ones. In practice, it usually works better than sums or medians, but it is outperformed by more advanced procedures such as TMM.
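As a small illustration of why the non-zero upper quartile can be a reasonable compromise, here is a sketch on a made-up vector of counts for a single cell (the names ```toy.cell```, ```toy.cell2```, and ```toy.q3``` and all values are hypothetical, not taken from any data set):

```{r}
# Hypothetical counts for one cell: many zeros, a few low values, one very high value
toy.cell <- c(0, 0, 0, 0, 1, 2, 3, 4, 5, 500)
sum(toy.cell) # total, dominated by the single high value
median(toy.cell) # median, pulled down by the zeros
quantile(toy.cell, probs=0.75) # upper quartile including the zeros
toy.q3 <- quantile(toy.cell[toy.cell>0], probs=0.75) # upper quartile of the non-zero values only
toy.q3
# Making the outlier ten times larger changes the total but not the upper quartile
toy.cell2 <- replace(toy.cell, 10, 5000)
sum(toy.cell2)
quantile(toy.cell2[toy.cell2>0], probs=0.75)
```

In this toy case, the total follows the outlier, whereas the upper quartile of the non-zero values stays where it was.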
From exercise series 1, we have:

```{r}
scdat <- read.csv("GSE77288_molecules-raw-single-per-sample.txt",sep="\t",stringsAsFactors=FALSE)
counts <- t(data.matrix(scdat[,-(1:3)]))
colnames(counts) <- paste(scdat$individual,scdat$replicate,scdat$well,sep=".")
counts <- counts[-grep("ERCC",rownames(counts)),]
counts[1:5,1:10]
umi.tot <- colSums(counts)
bad.high <- umi.tot>150000
gene.tot <- colSums(counts>1)
bad.low <- gene.tot<3500
counts <- counts[,!(bad.high|bad.low)] # keep only the cells flagged neither bad.high nor bad.low
good <- rowSums(counts>1)>=10 # UMI count > 1 in at least 10 cells is required
sum(good)
counts <- counts[good,]
dim(counts)
```

Now, we perform the upper quartile normalization:

```{r}
# we need to compute a quantile on each column
q <- apply(counts, 2, quantile, probs=0.75) # extra parameters to apply() are passed to the function
# but done like this, we keep all the 0's, which would not be a true upper quartile normalization
q <- apply(counts, 2, function (x) quantile(x[x>0], probs=0.75)) # we define a function on-the-fly, x takes the value of each column successively
ncounts <- sweep(counts, 2, q/median(q), "/")
```
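
One way to check the result is sketched here on a made-up matrix rather than on ```counts``` (the names ```toy```, ```toy.q```, ```toy.n```, and ```toy.check``` and all values are hypothetical). After the normalization, the upper quartile of the non-zero values should be the same in every column, namely the median of the original upper quartiles:

```{r}
# Hypothetical count matrix: 8 genes (rows) x 3 cells (columns), filled column by column
toy <- matrix(c(0, 0, 1, 2, 3, 4,  5, 500,
                0, 0, 1, 2, 3, 4,  5,   6,
                0, 0, 2, 4, 6, 8, 10,  12),
              nrow=8, dimnames=list(paste0("g", 1:8), paste0("c", 1:3)))
# upper quartile of the non-zero values of each column
toy.q <- apply(toy, 2, function (x) quantile(x[x>0], probs=0.75))
toy.q
# divide each column by its upper quartile, rescaled by the median upper quartile
toy.n <- sweep(toy, 2, toy.q/median(toy.q), "/")
# after normalization, all the columns share the same non-zero upper quartile
toy.check <- apply(toy.n, 2, function (x) quantile(x[x>0], probs=0.75))
toy.check
median(toy.q)
```

The same comparison applies to ```ncounts``` and ```median(q)```, provided every column contains at least one non-zero value.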