library(dplyr)
##
## Attachement du package : 'dplyr'
## Les objets suivants sont masqués depuis 'package:stats':
##
## filter, lag
## Les objets suivants sont masqués depuis 'package:base':
##
## intersect, setdiff, setequal, union
library(Seurat)
## Warning: le package 'Seurat' a été compilé avec la version R 4.2.2
## Attaching SeuratObject
library(patchwork)
## Warning: le package 'patchwork' a été compilé avec la version R 4.2.2
# Load the PBMC dataset
pbmc.data <- Read10X(data.dir="PBMC-10x/")
# Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(counts=pbmc.data, project="pbmc3k", min.cells=3, min.features=200)
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
pbmc
## An object of class Seurat
## 13714 features across 2700 samples within 1 assay
## Active assay: RNA (13714 features, 0 variable features)
names(pbmc@meta.data)
## [1] "orig.ident" "nCount_RNA" "nFeature_RNA"
# The [[ operator can add columns to object metadata. This is a great place to stash QC stats
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern="^MT-")
# Visualize QC metrics as a violin plot
VlnPlot(pbmc, features=c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol=3)
plot1 <- FeatureScatter(pbmc, feature1="nCount_RNA", feature2="percent.mt")
plot1
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot2
plot1 + plot2
pbmc[["reliable"]] <- pbmc[["nCount_RNA"]]<=7500 & pbmc[["percent.mt"]]<=7 # number of genes was imposed at loading
plot3 <- FeatureScatter(pbmc, feature1="nFeature_RNA", feature2="percent.mt", group.by="reliable")
plot3
pbmc <- subset(pbmc, subset=reliable)
pbmc <- NormalizeData(pbmc)
Variable genes are likely to be those that change their expression across cell populations. They are thus potential markers of the cell types present in the sample.
pbmc <- FindVariableFeatures(pbmc, nfeatures=2000)
# Display the 15 most variable genes
top15 <- head(VariableFeatures(pbmc), 15)
top15
## [1] "PPBP" "LYZ" "S100A9" "IGLL5" "GNLY" "FTL" "FTH1"
## [8] "PF4" "FCER1A" "S100A8" "GNG11" "CLU" "HLA-DRA" "NKG7"
## [15] "GZMB"
# Plot variable features with labels
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot=plot1, points=top15, repel=TRUE)
## When using repel, set xnudge and ynudge to 0 for optimal results
plot2
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 1 rows containing missing values (geom_point).
# Scaling followed by PCA
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features=all.genes)
## Centering and scaling data matrix
# PCA is computed on the 2000 most variable features
pbmc <- RunPCA(pbmc, features=VariableFeatures(pbmc), ndims.print=1:3, nfeatures.print=10)
## PC_ 1
## Positive: MALAT1, LTB, IL32, IL7R, CD2, B2M, ACAP1, CD27, CTSW, GZMM
## Negative: CST3, TYROBP, LST1, AIF1, FTL, FCN1, LYZ, S100A9, FTH1, TYMP
## PC_ 2
## Positive: NKG7, CST7, PRF1, GZMA, GZMB, FGFBP2, CTSW, GNLY, B2M, SPON2
## Negative: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1, HLA-DRA, LINC00926, CD79B, HLA-DRB1, CD74
## PC_ 3
## Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, MS4A1, CD74, HLA-DPB1, HLA-DPA1, HLA-DRB1, TCL1A
## Negative: IL7R, TMSB4X, VIM, S100A8, IL32, S100A6, FYB, S100A4, S100A9, GIMAP7
DimPlot(pbmc, reduction="pca")
We see that the first 10 components are sufficient to explain most of the variance.
ElbowPlot(pbmc)
The default clustering algorithm in Seurat is based on the
construction of a graph that connects close data points, and then on
searching for communities in this adjacency graph. The
distance between data points is computed in the reduced space if
available, e.g., after PCA reduction. In case, several
reduction metohds were applied, a parameter reduction
enables selecting which one is to be used.
pbmc <- FindNeighbors(pbmc, dims=1:10)
## Computing nearest neighbor graph
## Computing SNN
pbmc <- FindClusters(pbmc, resolution=0.5) # resolution controls the size of the communities
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2679
## Number of edges: 97421
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8735
## Number of communities: 9
## Elapsed time: 0 seconds
We first apply UMAP to obtain the projection and then we color code clusters.
pbmc <- RunUMAP(pbmc, dims=1:10)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 11:42:47 UMAP embedding parameters a = 0.9922 b = 1.112
## 11:42:47 Read 2679 rows and found 10 numeric columns
## 11:42:47 Using Annoy for neighbor search, n_neighbors = 30
## 11:42:47 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 11:42:48 Writing NN index file to temp file C:\Users\JACQUE~1.IRC\AppData\Local\Temp\RtmpsRtcA4\file1998327593d
## 11:42:48 Searching Annoy index using 1 thread, search_k = 3000
## 11:42:48 Annoy recall = 100%
## 11:42:49 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 11:42:49 Initializing from normalized Laplacian + noise (using irlba)
## 11:42:49 Commencing optimization for 500 epochs, with 106984 positive edges
## 11:42:56 Optimization finished
DimPlot(pbmc, reduction="umap")
Alternatively, we can compute the t-SNE 2D projection.
pbmc <- RunTSNE(pbmc, dims=1:10)
DimPlot(pbmc, reductio="tsne")
Now that we have (candidate) cell population after clustering, a natural next step is to find genes that are significantly more expressed in at least one cluster. Those genes should characterize the specific transcriptional program of each cluster and thus enable us to call the actual cell types.
The function FindMarkers
implements several statistical
tests for this purpose (see its documentation). Here, we simply use the
default (Wilcoxon test), which should be robust thanks to its
nonparametric nature, but reduce the statistical power compared to
dedicated tests such as the negative binomial-based one provided by
DESeq2 (also available).
In the example below, we compare cluster 2 with all the other cells and require that a gene is expressed in at least 25% of the cells of at least one of the two populations to avoid barely exressed genes and spedd up computations.
cluster2.markers <- FindMarkers(pbmc, ident.1=2, min.pct=0.25)
## For a more efficient implementation of the Wilcoxon Rank Sum Test,
## (default method for FindMarkers) please install the limma package
## --------------------------------------------
## install.packages('BiocManager')
## BiocManager::install('limma')
## --------------------------------------------
## After installation of limma, Seurat will automatically use the more
## efficient implementation (no further action necessary).
## This message will be shown once per session
head(cluster2.markers, n=5)
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## IL32 1.474737e-80 1.1248862 0.953 0.477 2.022454e-76
## LTB 7.684094e-80 1.2650731 0.984 0.648 1.053797e-75
## LDHB 1.023887e-62 0.9523536 0.963 0.616 1.404158e-58
## CD3D 3.232823e-61 0.8684266 0.921 0.446 4.433494e-57
## IL7R 1.627196e-58 1.1491488 0.747 0.337 2.231536e-54
Next step, is to find the marker genes of all the populations.
pbmc.markers <- FindAllMarkers(pbmc, only.pos=TRUE, min.pct=0.25, logfc.threshold=0.25)
## Calculating cluster 0
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
## Calculating cluster 5
## Calculating cluster 6
## Calculating cluster 7
## Calculating cluster 8
pbmc.markers %>%
group_by(cluster) %>%
slice_max(n=2, order_by=avg_log2FC)
## # A tibble: 18 × 7
## # Groups: cluster [9]
## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
## 1 1.50e- 84 1.31 0.426 0.104 2.06e- 80 0 CCR7
## 2 1.93e-112 1.04 0.889 0.586 2.65e-108 0 LDHB
## 3 0 5.55 0.994 0.214 0 1 S100A9
## 4 0 5.48 0.975 0.12 0 1 S100A8
## 5 1.85e- 57 1.28 0.433 0.116 2.53e- 53 2 AQP3
## 6 7.68e- 80 1.27 0.984 0.648 1.05e- 75 2 LTB
## 7 0 4.31 0.936 0.041 0 3 CD79A
## 8 2.29e-274 3.59 0.624 0.022 3.14e-270 3 TCL1A
## 9 1.52e-202 3.05 0.983 0.236 2.09e-198 4 CCL5
## 10 1.17e-169 2.96 0.595 0.058 1.60e-165 4 GZMK
## 11 4.71e-188 3.32 0.975 0.132 6.46e-184 5 FCGR3A
## 12 3.59e-126 3.07 1 0.313 4.92e-122 5 LST1
## 13 1.03e-191 5.35 0.961 0.131 1.42e-187 6 GNLY
## 14 1.14e-267 4.79 0.961 0.069 1.56e-263 6 GZMB
## 15 1.35e-246 3.94 0.844 0.01 1.85e-242 7 FCER1A
## 16 3.44e- 21 2.75 1 0.507 4.72e- 17 7 HLA-DPB1
## 17 5.26e-104 8.65 1 0.024 7.22e-100 8 PPBP
## 18 3.10e-184 7.31 1 0.011 4.25e-180 8 PF4
Chosen marker gene expression can be displayed on the 2D projection
coordinates to assess their specificity for clusters. UMAP reduction is
taken by default but the reduction
parameter allows
selecting another projection.
FeaturePlot(pbmc, features=c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP",
"CD8A"))
We can also get an heatmap to represent cluster/marker gene associations.
top10 <- pbmc.markers %>%
group_by(cluster) %>%
top_n(n=10, wt=avg_log2FC)
DoHeatmap(pbmc, features=top10$gene) + NoLegend()
Based on the above markers and literature knowledge, an expert should be able to recognize each cluster identity, i.e., which cell type corresponds to each cluster. This step is also typically the occasion to iterate on the clustering parameters or the clustering algorithm, but also the normalization or the number of principal components to keep, to achieve convincing clusters with a clear biological meaning.
Finally, we can assign names to the clusters for display purposes.
new.cluster.ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B", "CD8 T", "FCGR3A+ Mono",
"NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction="umap", label=TRUE, pt.size=0.5) + NoLegend()
DimPlot(pbmc, reduction="tsne", label=TRUE, pt.size=0.5) + NoLegend()
Seurat comes with different graphical representations to relate the expression of chosen genes to clusters. Here are the main ones.
Ridgeplots to follow gene expression distributions across clusters.
features <- c("LYZ", "CD74", "CST3", "CD3D", "FCGR3A", "GZMA")
RidgePlot(pbmc, features=features, ncol=2)
## Picking joint bandwidth of 0.324
## Picking joint bandwidth of 0.267
## Picking joint bandwidth of 0.155
## Picking joint bandwidth of 0.0666
## Picking joint bandwidth of 0.0586
## Picking joint bandwidth of 0.0499
Violin plots.
VlnPlot(pbmc, features=features[c(1,2,4,6)])
Dot plots.
DotPlot(pbmc, features=features) + RotatedAxis()
Targeted heatmaps.
DoHeatmap(subset(pbmc, downsample=100), features=features, size=3)
Feature mapping with robust color scales (extreme values do not influence the color scale)
FeaturePlot(pbmc, features=c("MS4A1","PTPRCAP"), min.cutoff="q10", max.cutoff="q90")
FeaturePlot(pbmc, features=c("MS4A1","CD79A"), blend=TRUE)
To avoid re-executing all the steps necessary to the preparation of a certain Seurat object (can be slow), you can serialize it into a file.
saveRDS(pbmc, file="pbmc_final.rds")