Template code for basic Seurat analysis

Loading libraries and data (in 10x Genomics format)

library(dplyr)

## 
## Attachement du package : 'dplyr'

## Les objets suivants sont masqués depuis 'package:stats':
## 
##     filter, lag

## Les objets suivants sont masqués depuis 'package:base':
## 
##     intersect, setdiff, setequal, union

library(Seurat)

## Warning: le package 'Seurat' a été compilé avec la version R 4.2.2

## Attaching SeuratObject

library(patchwork)

## Warning: le package 'patchwork' a été compilé avec la version R 4.2.2

# Load the PBMC dataset
pbmc.data <- Read10X(data.dir="PBMC-10x/")
# Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(counts=pbmc.data, project="pbmc3k", min.cells=3, min.features=200)

## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')

pbmc

## An object of class Seurat 
## 13714 features across 2700 samples within 1 assay 
## Active assay: RNA (13714 features, 0 variable features)

A Seurat object contains features for each cell, some of which are precomputed

names(pbmc@meta.data)

## [1] "orig.ident"   "nCount_RNA"   "nFeature_RNA"

It is possible to add user-computed features

# The [[ operator can add columns to object metadata. This is a great place to stash QC stats
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern="^MT-")

Violin plots can be used to display features

# Visualize QC metrics as a violin plot
VlnPlot(pbmc, features=c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol=3)

Scatter plots are available as well

plot1 <- FeatureScatter(pbmc, feature1="nCount_RNA", feature2="percent.mt")
plot1

plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot2

plot1 + plot2

Basic filtering

pbmc[["reliable"]] <- pbmc[["nCount_RNA"]]<=7500 & pbmc[["percent.mt"]]<=7 # number of genes was imposed at loading
plot3 <- FeatureScatter(pbmc, feature1="nFeature_RNA", feature2="percent.mt", group.by="reliable")
plot3

pbmc <- subset(pbmc, subset=reliable)

Elementary data normalization

pbmc <- NormalizeData(pbmc)

Search for the most variable genes

Variable genes are likely to be those that change their expression across cell populations. They are thus potential markers of the cell types present in the sample.

pbmc <- FindVariableFeatures(pbmc, nfeatures=2000)

# Display the 15 most variable genes
top15 <- head(VariableFeatures(pbmc), 15)
top15

##  [1] "PPBP"    "LYZ"     "S100A9"  "IGLL5"   "GNLY"    "FTL"     "FTH1"   
##  [8] "PF4"     "FCER1A"  "S100A8"  "GNG11"   "CLU"     "HLA-DRA" "NKG7"   
## [15] "GZMB"

# Plot variable features with labels
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot=plot1, points=top15, repel=TRUE)

## When using repel, set xnudge and ynudge to 0 for optimal results

plot2

## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Removed 1 rows containing missing values (geom_point).

Principal component analysis

# Scaling followed by PCA
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features=all.genes)

## Centering and scaling data matrix

# PCA is computed on the 2000 most variable features
pbmc <- RunPCA(pbmc, features=VariableFeatures(pbmc), ndims.print=1:3, nfeatures.print=10)

## PC_ 1 
## Positive:  MALAT1, LTB, IL32, IL7R, CD2, B2M, ACAP1, CD27, CTSW, GZMM 
## Negative:  CST3, TYROBP, LST1, AIF1, FTL, FCN1, LYZ, S100A9, FTH1, TYMP 
## PC_ 2 
## Positive:  NKG7, CST7, PRF1, GZMA, GZMB, FGFBP2, CTSW, GNLY, B2M, SPON2 
## Negative:  CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1, HLA-DRA, LINC00926, CD79B, HLA-DRB1, CD74 
## PC_ 3 
## Positive:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, MS4A1, CD74, HLA-DPB1, HLA-DPA1, HLA-DRB1, TCL1A 
## Negative:  IL7R, TMSB4X, VIM, S100A8, IL32, S100A6, FYB, S100A4, S100A9, GIMAP7

2D projection after PCA 2 first principal components

DimPlot(pbmc, reduction="pca")

Variance explained by the first 10 principal components

We see that the first 10 components are sufficient to explain most of the variance.

ElbowPlot(pbmc)

Clustering to identify cell populations

The default clustering algorithm in Seurat is based on the construction of a graph that connects close data points, and then on searching for communities in this adjacency graph. The distance between data points is computed in the reduced space if available, e.g., after PCA reduction. In case, several reduction metohds were applied, a parameter reduction enables selecting which one is to be used.

pbmc <- FindNeighbors(pbmc, dims=1:10)

## Computing nearest neighbor graph

## Computing SNN

pbmc <- FindClusters(pbmc, resolution=0.5) # resolution controls the size of the communities

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2679
## Number of edges: 97421
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8735
## Number of communities: 9
## Elapsed time: 0 seconds

Visualizing the clusters on a 2D projection

We first apply UMAP to obtain the projection and then we color code clusters.

pbmc <- RunUMAP(pbmc, dims=1:10)

## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session

## 11:42:47 UMAP embedding parameters a = 0.9922 b = 1.112

## 11:42:47 Read 2679 rows and found 10 numeric columns

## 11:42:47 Using Annoy for neighbor search, n_neighbors = 30

## 11:42:47 Building Annoy index with metric = cosine, n_trees = 50

## 0%   10   20   30   40   50   60   70   80   90   100%

## [----|----|----|----|----|----|----|----|----|----|

## **************************************************|
## 11:42:48 Writing NN index file to temp file C:\Users\JACQUE~1.IRC\AppData\Local\Temp\RtmpsRtcA4\file1998327593d
## 11:42:48 Searching Annoy index using 1 thread, search_k = 3000
## 11:42:48 Annoy recall = 100%
## 11:42:49 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 11:42:49 Initializing from normalized Laplacian + noise (using irlba)
## 11:42:49 Commencing optimization for 500 epochs, with 106984 positive edges
## 11:42:56 Optimization finished

DimPlot(pbmc, reduction="umap")

Alternatively, we can compute the t-SNE 2D projection.

pbmc <- RunTSNE(pbmc, dims=1:10)
DimPlot(pbmc, reductio="tsne")

Find population markers

Now that we have (candidate) cell population after clustering, a natural next step is to find genes that are significantly more expressed in at least one cluster. Those genes should characterize the specific transcriptional program of each cluster and thus enable us to call the actual cell types.

The function FindMarkers implements several statistical tests for this purpose (see its documentation). Here, we simply use the default (Wilcoxon test), which should be robust thanks to its nonparametric nature, but reduce the statistical power compared to dedicated tests such as the negative binomial-based one provided by DESeq2 (also available).

In the example below, we compare cluster 2 with all the other cells and require that a gene is expressed in at least 25% of the cells of at least one of the two populations to avoid barely exressed genes and spedd up computations.

cluster2.markers <- FindMarkers(pbmc, ident.1=2, min.pct=0.25)

## For a more efficient implementation of the Wilcoxon Rank Sum Test,
## (default method for FindMarkers) please install the limma package
## --------------------------------------------
## install.packages('BiocManager')
## BiocManager::install('limma')
## --------------------------------------------
## After installation of limma, Seurat will automatically use the more 
## efficient implementation (no further action necessary).
## This message will be shown once per session

head(cluster2.markers, n=5)

##             p_val avg_log2FC pct.1 pct.2    p_val_adj
## IL32 1.474737e-80  1.1248862 0.953 0.477 2.022454e-76
## LTB  7.684094e-80  1.2650731 0.984 0.648 1.053797e-75
## LDHB 1.023887e-62  0.9523536 0.963 0.616 1.404158e-58
## CD3D 3.232823e-61  0.8684266 0.921 0.446 4.433494e-57
## IL7R 1.627196e-58  1.1491488 0.747 0.337 2.231536e-54

Next step, is to find the marker genes of all the populations.

pbmc.markers <- FindAllMarkers(pbmc, only.pos=TRUE, min.pct=0.25, logfc.threshold=0.25)

## Calculating cluster 0

## Calculating cluster 1

## Calculating cluster 2

## Calculating cluster 3

## Calculating cluster 4

## Calculating cluster 5

## Calculating cluster 6

## Calculating cluster 7

## Calculating cluster 8

pbmc.markers %>%
  group_by(cluster) %>%
  slice_max(n=2, order_by=avg_log2FC)

## # A tibble: 18 × 7
## # Groups:   cluster [9]
##        p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene    
##        <dbl>      <dbl> <dbl> <dbl>     <dbl> <fct>   <chr>   
##  1 1.50e- 84       1.31 0.426 0.104 2.06e- 80 0       CCR7    
##  2 1.93e-112       1.04 0.889 0.586 2.65e-108 0       LDHB    
##  3 0               5.55 0.994 0.214 0         1       S100A9  
##  4 0               5.48 0.975 0.12  0         1       S100A8  
##  5 1.85e- 57       1.28 0.433 0.116 2.53e- 53 2       AQP3    
##  6 7.68e- 80       1.27 0.984 0.648 1.05e- 75 2       LTB     
##  7 0               4.31 0.936 0.041 0         3       CD79A   
##  8 2.29e-274       3.59 0.624 0.022 3.14e-270 3       TCL1A   
##  9 1.52e-202       3.05 0.983 0.236 2.09e-198 4       CCL5    
## 10 1.17e-169       2.96 0.595 0.058 1.60e-165 4       GZMK    
## 11 4.71e-188       3.32 0.975 0.132 6.46e-184 5       FCGR3A  
## 12 3.59e-126       3.07 1     0.313 4.92e-122 5       LST1    
## 13 1.03e-191       5.35 0.961 0.131 1.42e-187 6       GNLY    
## 14 1.14e-267       4.79 0.961 0.069 1.56e-263 6       GZMB    
## 15 1.35e-246       3.94 0.844 0.01  1.85e-242 7       FCER1A  
## 16 3.44e- 21       2.75 1     0.507 4.72e- 17 7       HLA-DPB1
## 17 5.26e-104       8.65 1     0.024 7.22e-100 8       PPBP    
## 18 3.10e-184       7.31 1     0.011 4.25e-180 8       PF4

Chosen marker gene expression can be displayed on the 2D projection coordinates to assess their specificity for clusters. UMAP reduction is taken by default but the reduction parameter allows selecting another projection.

FeaturePlot(pbmc, features=c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP",
                             "CD8A"))

We can also get an heatmap to represent cluster/marker gene associations.

top10 <- pbmc.markers %>%
  group_by(cluster) %>%
  top_n(n=10, wt=avg_log2FC)
DoHeatmap(pbmc, features=top10$gene) + NoLegend()

Cell type calling

Based on the above markers and literature knowledge, an expert should be able to recognize each cluster identity, i.e., which cell type corresponds to each cluster. This step is also typically the occasion to iterate on the clustering parameters or the clustering algorithm, but also the normalization or the number of principal components to keep, to achieve convincing clusters with a clear biological meaning.

Finally, we can assign names to the clusters for display purposes.

new.cluster.ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B", "CD8 T", "FCGR3A+ Mono",
                     "NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction="umap", label=TRUE, pt.size=0.5) + NoLegend()

DimPlot(pbmc, reduction="tsne", label=TRUE, pt.size=0.5) + NoLegend()

Additional plotting capabilities

Seurat comes with different graphical representations to relate the expression of chosen genes to clusters. Here are the main ones.

Ridgeplots to follow gene expression distributions across clusters.

features <- c("LYZ", "CD74", "CST3", "CD3D", "FCGR3A", "GZMA")
RidgePlot(pbmc, features=features, ncol=2)

## Picking joint bandwidth of 0.324

## Picking joint bandwidth of 0.267

## Picking joint bandwidth of 0.155

## Picking joint bandwidth of 0.0666

## Picking joint bandwidth of 0.0586

## Picking joint bandwidth of 0.0499

Violin plots.

VlnPlot(pbmc, features=features[c(1,2,4,6)])

Dot plots.

DotPlot(pbmc, features=features) + RotatedAxis()

Targeted heatmaps.

DoHeatmap(subset(pbmc, downsample=100), features=features, size=3)

Feature mapping with robust color scales (extreme values do not influence the color scale)

FeaturePlot(pbmc, features=c("MS4A1","PTPRCAP"), min.cutoff="q10", max.cutoff="q90")

FeaturePlot(pbmc, features=c("MS4A1","CD79A"), blend=TRUE)

Saving (serializing) a Seurat object

To avoid re-executing all the steps necessary to the preparation of a certain Seurat object (can be slow), you can serialize it into a file.

saveRDS(pbmc, file="pbmc_final.rds")