v <- 12:27
v
## [1] 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
length(v)
## [1] 16
mm <- matrix(rnorm(35, mean=4, sd=2), nrow=5)
mm
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 4.422645 2.164912 3.720575 7.907344 0.8867708 -2.0308274 3.805649
## [2,] 2.338808 6.000361 5.912488 1.865291 1.2232926 1.4228895 2.821196
## [3,] 4.870854 1.867368 3.962214 6.570583 3.9505622 -0.1339156 3.735714
## [4,] 6.186425 5.222798 5.245248 7.295909 2.7836256 4.1177085 4.933500
## [5,] 5.419159 2.661895 4.390685 2.872210 1.2248497 4.6288974 3.120099
dim(mm)
## [1] 5 7
dim(mm)[1]
## [1] 5
nrow(mm)
## [1] 5
ncol(mm)
## [1] 7
df <- data.frame(name=c("Pierre","Amandine","Gudrun","Dagobert"),salary=c(3000,4000,8000,-500),code=c("A","A","D","E"),tech=c(TRUE,FALSE,FALSE,FALSE))
dim(df)
## [1] 4 4
ncol(df)
## [1] 4
The following functions work with both data frames and matrices, provided they contain numerical values.
sum(v) # vector version
## [1] 312
sum(mm) # matrix version (full total)
## [1] 127.3878
rowSums(mm) # matrix version by row
## [1] 20.87707 21.58433 24.82338 35.78521 24.31779
colSums(mm)
## [1] 23.237891 17.917334 23.231210 26.511337 10.069101 8.004752 18.416158
rowMeans(mm)
## [1] 2.982438 3.083475 3.546197 5.112173 3.473971
colMeans(mm)
## [1] 4.647578 3.583467 4.646242 5.302267 2.013820 1.600950 3.683232
colMeans(mm, na.rm=TRUE) # na.rm=TRUE ignores missing values; also available for the global functions
## [1] 4.647578 3.583467 4.646242 5.302267 2.013820 1.600950 3.683232
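A small sketch of how na.rm behaves with the global functions (values made up for illustration):

```r
x <- c(1, 2, NA, 4)
sum(x)              # any NA makes the result NA
## [1] NA
sum(x, na.rm=TRUE)  # drop the NA values first
## [1] 7
```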
Sums of logical values count the number of TRUE values:
v>15
## [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [13] TRUE TRUE TRUE TRUE
sum(v>15)
## [1] 12
mm>0.5
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] TRUE TRUE TRUE TRUE TRUE FALSE TRUE
## [2,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [3,] TRUE TRUE TRUE TRUE TRUE FALSE TRUE
## [4,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [5,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
sum(mm>0.5)
## [1] 33
rowSums(mm>0.5)
## [1] 6 7 6 7 7
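In the same spirit, since TRUE counts as 1 and FALSE as 0, mean() of a logical vector gives a proportion:

```r
v <- 12:27
mean(v > 15)   # proportion of elements greater than 15 (12 out of 16)
## [1] 0.75
```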
We may need to transpose a matrix, i.e., we exchange rows and columns (first row becomes first column, second row becomes second column, etc.):
t(mm)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 4.4226450 2.338808 4.8708541 6.186425 5.419159
## [2,] 2.1649118 6.000361 1.8673676 5.222798 2.661895
## [3,] 3.7205751 5.912488 3.9622143 5.245248 4.390685
## [4,] 7.9073443 1.865291 6.5705833 7.295909 2.872210
## [5,] 0.8867708 1.223293 3.9505622 2.783626 1.224850
## [6,] -2.0308274 1.422889 -0.1339156 4.117709 4.628897
## [7,] 3.8056487 2.821196 3.7357142 4.933500 3.120099
Although there are predefined functions for sums or means by rows and columns, we often need to perform other operations by rows or columns. apply() enables us to submit each row or column of a data frame or a matrix to a predefined function, or even a function of our own:
apply(mm,1,sd) # we apply the standard deviation function to each row (1)
## [1] 3.102269 2.034562 2.146153 1.437340 1.423805
apply(mm,2,sd) # we apply the standard deviation function to each column (2)
## [1] 1.4486417 1.8931043 0.9154575 2.7426102 1.3096964 2.8166385 0.8125784
CV <- apply(mm,1,sd)/rowMeans(mm) # coefficient of variation of each row
CV
## [1] 1.0401786 0.6598276 0.6051983 0.2811603 0.4098496
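As an example of "a function of our own", an anonymous function works directly inside apply() (illustrative sketch; mm is regenerated here so the snippet is self-contained, so the values differ from those above):

```r
mm <- matrix(rnorm(35, mean=4, sd=2), nrow=5)
# anonymous function: the range (max minus min) of each row
apply(mm, 1, function(x) max(x) - min(x))
```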
To normalize gene transcript expression data, we have seen that we
may want to divide each column of a matrix by a different value,
e.g., its total. That is, a vector of values, one per column,
is provided and those values must be used for the corresponding columns
to perform the division. This is conveniently achieved with
sweep()
:
tot <- colSums(mm)
n.mm <- sweep(mm,2,tot,"/")
Here, we work by column (2) and apply a two-parameter function (the division, "/") to each of them: the first parameter is the matrix column and the second is the corresponding value from the provided vector (tot). We can also work by row, and change the function, including to a user-defined one.
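A sketch of sweep() by row with another function (again with a regenerated mm for self-containment): subtracting each row's mean centers the rows at zero.

```r
mm <- matrix(rnorm(35, mean=4, sd=2), nrow=5)
# work by row (1) and subtract (function "-") each row's mean
centered <- sweep(mm, 1, rowMeans(mm), "-")
rowMeans(centered)   # essentially zero for every row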
A few additional useful functions for working with vectors:
df$code
## [1] "A" "A" "D" "E"
unique(df$code) # returns the distinct values in order of first occurrence
## [1] "A" "D" "E"
ru <- runif(500) # 500 pseudo-random numbers uniformly distributed over [0;1]
quantile(ru, prob=.95) # 95th percentile
## 95%
## 0.9447667
median(ru)
## [1] 0.533415
quantile(ru, prob=0.5)
## 50%
## 0.533415
summary(ru)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00207 0.27772 0.53341 0.50853 0.74799 0.99040
quantile(ru, prob=0.25)
## 25%
## 0.2777235
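Note that quantile() also accepts a vector of probabilities, returning all the requested percentiles at once:

```r
ru <- runif(500)
quantile(ru, prob=c(0.25, 0.5, 0.75))   # the three quartiles in one call
```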
Named vectors can also be set up in one operation:
# Elementary solution in two steps
employee.colors <- rainbow(nrow(df))
names(employee.colors) <- df$name
employee.colors
## Pierre Amandine Gudrun Dagobert
## "#FF0000" "#80FF00" "#00FFFF" "#8000FF"
# The same in one step
employee.colors <- setNames(rainbow(nrow(df)), df$name)
employee.colors
## Pierre Amandine Gudrun Dagobert
## "#FF0000" "#80FF00" "#00FFFF" "#8000FF"
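The point of the names is that values can then be looked up by name (a small sketch reusing the df defined above):

```r
df <- data.frame(name=c("Pierre","Amandine","Gudrun","Dagobert"),
                 salary=c(3000,4000,8000,-500),
                 code=c("A","A","D","E"),
                 tech=c(TRUE,FALSE,FALSE,FALSE))
employee.colors <- setNames(rainbow(nrow(df)), df$name)
employee.colors["Gudrun"]                 # look one value up by name
## Gudrun
## "#00FFFF"
employee.colors[c("Dagobert","Pierre")]   # or several, in any order
```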
And a nice little example linking our new abilities to play with matrices: upper quartile normalization. Instead of normalizing count matrices by their mean or total, or even their median (more robust to outliers), it is customary to normalize with respect to the upper quartile (3rd quartile or 75th percentile) of the non-zero values, which strikes a better balance between domination by a few very high values and by the many very low ones. In practice, it usually works better than sums or medians, but it is outperformed by more advanced procedures such as TMM.
From exercise series 1, we have:
scdat <- read.csv("GSE77288_molecules-raw-single-per-sample.txt",sep="\t",stringsAsFactors=F)
counts <- t(data.matrix(scdat[,-(1:3)]))
colnames(counts) <- paste(scdat$individual,scdat$replicate,scdat$well,sep=".")
counts <- counts[-grep("ERCC",rownames(counts)),]
counts[1:5,1:10]
## NA19098.r1.A01 NA19098.r1.A02 NA19098.r1.A03 NA19098.r1.A04
## ENSG00000186092 0 0 0 0
## ENSG00000237683 0 0 0 1
## ENSG00000235249 0 0 0 0
## ENSG00000185097 0 0 0 0
## ENSG00000269831 0 0 0 0
## NA19098.r1.A05 NA19098.r1.A06 NA19098.r1.A07 NA19098.r1.A08
## ENSG00000186092 0 0 0 0
## ENSG00000237683 0 0 0 0
## ENSG00000235249 0 0 0 0
## ENSG00000185097 0 0 0 0
## ENSG00000269831 0 0 0 0
## NA19098.r1.A09 NA19098.r1.A10
## ENSG00000186092 0 0
## ENSG00000237683 0 0
## ENSG00000235249 0 0
## ENSG00000185097 0 0
## ENSG00000269831 0 0
umi.tot <- colSums(counts)
bad.high <- umi.tot>150000
gene.tot <- colSums(counts>1)
bad.low <- gene.tot<3500
counts <- counts[,!(bad.high|bad.low)] # discard the flagged cells
good <- rowSums(counts>1)>=10 # UMI count > 1 in at least 10 cells is required
sum(good)
## [1] 7873
counts <- counts[good,]
dim(counts)
## [1] 7873 66
Now, we perform the upper quartile normalization:
# we need to compute a quantile on each column
q <- apply(counts, 2, quantile, prob=0.75) # extra parameters to apply() are passed on to the function
# but done like this, we keep all the 0's; this would not be true upper quartile normalization
q <- apply(counts, 2, function(x) quantile(x[x>0], prob=0.75)) # a function defined on-the-fly, x takes each column in turn
ncounts <- sweep(counts, 2, q/median(q), "/")
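A quick sanity check of what sweep() achieved, run on a small simulated count matrix (illustrative data, not the GSE77288 counts): after normalization, the upper quartile of the non-zero values is the same in every column, namely median(q).

```r
toy <- matrix(rpois(200, lambda=2), nrow=20)   # simulated counts
q <- apply(toy, 2, function(x) quantile(x[x>0], prob=0.75))
ntoy <- sweep(toy, 2, q/median(q), "/")
# every column now has the same non-zero upper quartile, median(q)
apply(ntoy, 2, function(x) quantile(x[x>0], prob=0.75))
```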