v <- 12:27
v
## [1] 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
length(v)
## [1] 16
mm <- matrix(rnorm(35, mean=4, sd=2), nrow=5)
mm
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 4.422645 2.164912 3.720575 7.907344 0.8867708 -2.0308274 3.805649
## [2,] 2.338808 6.000361 5.912488 1.865291 1.2232926 1.4228895 2.821196
## [3,] 4.870854 1.867368 3.962214 6.570583 3.9505622 -0.1339156 3.735714
## [4,] 6.186425 5.222798 5.245248 7.295909 2.7836256 4.1177085 4.933500
## [5,] 5.419159 2.661895 4.390685 2.872210 1.2248497 4.6288974 3.120099
dim(mm)
## [1] 5 7
dim(mm)[1]
## [1] 5
nrow(mm)
## [1] 5
ncol(mm)
## [1] 7
df <- data.frame(name=c("Pierre","Amandine","Gudrun","Dagobert"),salary=c(3000,4000,8000,-500),code=c("A","A","D","E"),tech=c(TRUE,FALSE,FALSE,FALSE))
dim(df)
## [1] 4 4
ncol(df)
## [1] 4
The following functions work with both data frames and matrices, provided they contain numerical values.
sum(v) # vector version
## [1] 312
sum(mm) # matrix version (full total)
## [1] 127.3878
rowSums(mm) # matrix version by row
## [1] 20.87707 21.58433 24.82338 35.78521 24.31779
colSums(mm)
## [1] 23.237891 17.917334 23.231210 26.511337 10.069101 8.004752 18.416158
rowMeans(mm)
## [1] 2.982438 3.083475 3.546197 5.112173 3.473971
colMeans(mm)
## [1] 4.647578 3.583467 4.646242 5.302267 2.013820 1.600950 3.683232
colMeans(mm, na.rm=TRUE) # na.rm=TRUE ignores missing values; also available for the global functions
## [1] 4.647578 3.583467 4.646242 5.302267 2.013820 1.600950 3.683232
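A small sketch of how na.rm behaves with the global functions (values made up for illustration):

```r
x <- c(1, 2, NA, 4)
sum(x)              # any NA makes the result NA
## [1] NA
sum(x, na.rm=TRUE)  # drop the NA values first
## [1] 7
```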
Sums of logical values count the number of TRUE values:
v>15
## [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [13] TRUE TRUE TRUE TRUE
sum(v>15)
## [1] 12
mm>0.5
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] TRUE TRUE TRUE TRUE TRUE FALSE TRUE
## [2,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [3,] TRUE TRUE TRUE TRUE TRUE FALSE TRUE
## [4,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [5,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
sum(mm>0.5)
## [1] 33
rowSums(mm>0.5)
## [1] 6 7 6 7 7
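In the same spirit, since TRUE counts as 1 and FALSE as 0, mean() of a logical vector gives a proportion:

```r
v <- 12:27
mean(v > 15)   # proportion of elements greater than 15 (12 out of 16)
## [1] 0.75
```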
We may need to transpose a matrix, i.e., we exchange rows and columns (first row becomes first column, second row becomes second column, etc.):
t(mm)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 4.4226450 2.338808 4.8708541 6.186425 5.419159
## [2,] 2.1649118 6.000361 1.8673676 5.222798 2.661895
## [3,] 3.7205751 5.912488 3.9622143 5.245248 4.390685
## [4,] 7.9073443 1.865291 6.5705833 7.295909 2.872210
## [5,] 0.8867708 1.223293 3.9505622 2.783626 1.224850
## [6,] -2.0308274 1.422889 -0.1339156 4.117709 4.628897
## [7,] 3.8056487 2.821196 3.7357142 4.933500 3.120099
Although there are predefined functions for sums or means by rows and columns, we often need to perform other operations by rows or columns. apply() enables us to submit each row or column of a data frame or a matrix to a predefined function, or even a function of our own:
apply(mm,1,sd) # we apply the standard deviation function to each row (1)
## [1] 3.102269 2.034562 2.146153 1.437340 1.423805
apply(mm,2,sd) # we apply the standard deviation function to each column (2)
## [1] 1.4486417 1.8931043 0.9154575 2.7426102 1.3096964 2.8166385 0.8125784
CV <- apply(mm,1,sd)/rowMeans(mm) # coefficient of variation of each row
CV
## [1] 1.0401786 0.6598276 0.6051983 0.2811603 0.4098496
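As an example of "a function of our own", an anonymous function works directly inside apply() (illustrative sketch; mm is regenerated here so the snippet is self-contained, so the values differ from those above):

```r
mm <- matrix(rnorm(35, mean=4, sd=2), nrow=5)
# anonymous function: the range (max minus min) of each row
apply(mm, 1, function(x) max(x) - min(x))
```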
To normalize gene transcript expression data, we have seen that we
may want to divide each column of a matrix by a different value,
e.g., its total. That is, a vector of values, one per column,
is provided and those values must be used for the corresponding columns
to perform the division. This is conveniently achieved with
sweep()
:
tot <- colSums(mm)
n.mm <- sweep(mm,2,tot,"/")
Here, we work by column (2) and apply a two-parameter function (the division, "/") to each of them: the first parameter is the matrix column and the second is the corresponding value from the provided vector (tot). We can also work by row, and change the function, including to a user-defined one.
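A sketch of sweep() by row with another function (again with a regenerated mm for self-containment): subtracting each row's mean centers the rows at zero.

```r
mm <- matrix(rnorm(35, mean=4, sd=2), nrow=5)
# work by row (1) and subtract (function "-") each row's mean
centered <- sweep(mm, 1, rowMeans(mm), "-")
rowMeans(centered)   # essentially zero for every row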
A few additional useful functions for working with vectors:
df$code
## [1] "A" "A" "D" "E"
unique(df$code) # returns the distinct values in order of first occurrence
## [1] "A" "D" "E"
ru <- runif(500) # 500 pseudo-random numbers uniformly distributed over [0;1]
quantile(ru, prob=.95) # 95th percentile
## 95%
## 0.9447667
median(ru)
## [1] 0.533415
quantile(ru, prob=0.5)
## 50%
## 0.533415
summary(ru)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00207 0.27772 0.53341 0.50853 0.74799 0.99040
quantile(ru, prob=0.25)
## 25%
## 0.2777235
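Note that quantile() also accepts a vector of probabilities, returning all the requested percentiles at once:

```r
ru <- runif(500)
quantile(ru, prob=c(0.25, 0.5, 0.75))   # the three quartiles in one call
```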
Named vectors can also be set up in one operation:
# Elementary solution in two steps
employee.colors <- rainbow(nrow(df))
names(employee.colors) <- df$name
employee.colors
## Pierre Amandine Gudrun Dagobert
## "#FF0000" "#80FF00" "#00FFFF" "#8000FF"
# The same in one step
employee.colors <- setNames(rainbow(nrow(df)), df$name)
employee.colors
## Pierre Amandine Gudrun Dagobert
## "#FF0000" "#80FF00" "#00FFFF" "#8000FF"
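The point of the names is that values can then be looked up by name (a small sketch reusing the df defined above):

```r
df <- data.frame(name=c("Pierre","Amandine","Gudrun","Dagobert"),
                 salary=c(3000,4000,8000,-500),
                 code=c("A","A","D","E"),
                 tech=c(TRUE,FALSE,FALSE,FALSE))
employee.colors <- setNames(rainbow(nrow(df)), df$name)
employee.colors["Gudrun"]                 # look one value up by name
## Gudrun
## "#00FFFF"
employee.colors[c("Dagobert","Pierre")]   # or several, in any order
```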
And a nice little example linking our new abilities to play with matrices: upper quartile normalization. Instead of normalizing count matrices by their mean or total, or even their median (more robust to outliers), it is customary to normalize with respect to the upper quartile (3rd quartile or 75th percentile) of the non-zero values, which strikes a better balance between domination by a few very high values and by the many very low ones. In practice, it usually works better than sums or medians, but it is outperformed by more advanced procedures such as TMM.
From exercise series 1, we have:
scdat <- read.csv("GSE77288_molecules-raw-single-per-sample.txt",sep="\t",stringsAsFactors=F)
counts <- t(data.matrix(scdat[,-(1:3)]))
colnames(counts) <- paste(scdat$individual,scdat$replicate,scdat$well,sep=".")
counts <- counts[-grep("ERCC",rownames(counts)),]
counts[1:5,1:10]
## NA19098.r1.A01 NA19098.r1.A02 NA19098.r1.A03 NA19098.r1.A04
## ENSG00000186092 0 0 0 0
## ENSG00000237683 0 0 0 1
## ENSG00000235249 0 0 0 0
## ENSG00000185097 0 0 0 0
## ENSG00000269831 0 0 0 0
## NA19098.r1.A05 NA19098.r1.A06 NA19098.r1.A07 NA19098.r1.A08
## ENSG00000186092 0 0 0 0
## ENSG00000237683 0 0 0 0
## ENSG00000235249 0 0 0 0
## ENSG00000185097 0 0 0 0
## ENSG00000269831 0 0 0 0
## NA19098.r1.A09 NA19098.r1.A10
## ENSG00000186092 0 0
## ENSG00000237683 0 0
## ENSG00000235249 0 0
## ENSG00000185097 0 0
## ENSG00000269831 0 0
umi.tot <- colSums(counts)
bad.high <- umi.tot>150000
gene.tot <- colSums(counts>1)
bad.low <- gene.tot<3500
counts <- counts[,!(bad.high|bad.low)] # discard the flagged cells
good <- rowSums(counts>1)>=10 # UMI count > 1 in at least 10 cells is required
sum(good)
## [1] 7873
counts <- counts[good,]
dim(counts)
## [1] 7873 66
Now, we perform the upper quartile normalization:
# we need to compute a quantile on each column
q <- apply(counts, 2, quantile, prob=0.75) # extra parameters to apply() are passed on to the function
# but done like this, we keep all the 0's; this would not be true upper quartile normalization
q <- apply(counts, 2, function(x) quantile(x[x>0], prob=0.75)) # a function defined on-the-fly, x takes each column in turn
ncounts <- sweep(counts, 2, q/median(q), "/")
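A quick sanity check of what sweep() achieved, run on a small simulated count matrix (illustrative data, not the GSE77288 counts): after normalization, the upper quartile of the non-zero values is the same in every column, namely median(q).

```r
toy <- matrix(rpois(200, lambda=2), nrow=20)   # simulated counts
q <- apply(toy, 2, function(x) quantile(x[x>0], prob=0.75))
ntoy <- sweep(toy, 2, q/median(q), "/")
# every column now has the same non-zero upper quartile, median(q)
apply(ntoy, 2, function(x) quantile(x[x>0], prob=0.75))
```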