# Template code for basic Seurat analysis ### Loading libraries and data (in 10x Genomics format) ```{r} library(dplyr) library(Seurat) library(patchwork) # Load the PBMC dataset pbmc.data <- Read10X(data.dir="PBMC-10x/") # Initialize the Seurat object with the raw (non-normalized data). pbmc <- CreateSeuratObject(counts=pbmc.data, project="pbmc3k", min.cells=3, min.features=200) pbmc ``` ### A Seurat object contains features for each cell, some of which are precomputed ```{r} names(pbmc@meta.data) ``` ### It is possible to add user-computed features ```{r} # The [[ operator can add columns to object metadata. This is a great place to stash QC stats pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern="^MT-") ``` ### Violin plots can be used to display features ```{r} # Visualize QC metrics as a violin plot VlnPlot(pbmc, features=c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol=3) ``` ### Scatter plots are available as well ```{r} plot1 <- FeatureScatter(pbmc, feature1="nCount_RNA", feature2="percent.mt") plot1 ``` ```{r} plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") plot2 ``` ```{r} plot1 + plot2 ``` ### Basic filtering ```{r} pbmc[["reliable"]] <- pbmc[["nCount_RNA"]]<=7500 & pbmc[["percent.mt"]]<=7 # number of genes was imposed at loading plot3 <- FeatureScatter(pbmc, feature1="nFeature_RNA", feature2="percent.mt", group.by="reliable") plot3 pbmc <- subset(pbmc, subset=reliable) ``` ### Elementary data normalization ```{r} pbmc <- NormalizeData(pbmc) ``` ### Search for the most variable genes Variable genes are likely to be those that change their expression across cell populations. They are thus potential markers of the cell types present in the sample. ```{r} pbmc <- FindVariableFeatures(pbmc, nfeatures=2000) # Display the 15 most variable genes top15 <- head(VariableFeatures(pbmc), 15) top15 # Plot variable features with labels plot1 <- VariableFeaturePlot(pbmc) plot2 <- LabelPoints(plot=plot1, points=top15, repel=TRUE) plot2 ``` ### Principal component analysis ```{r} # Scaling followed by PCA all.genes <- rownames(pbmc) pbmc <- ScaleData(pbmc, features=all.genes) # PCA is computed on the 2000 most variable features pbmc <- RunPCA(pbmc, features=VariableFeatures(pbmc), ndims.print=1:3, nfeatures.print=10) ``` ### 2D projection after PCA 2 first principal components ```{r} DimPlot(pbmc, reduction="pca") ``` ### Variance explained by the first 10 principal components We see that the first 10 components are sufficient to explain most of the variance. ```{r} ElbowPlot(pbmc) ``` ### Clustering to identify cell populations The default clustering algorithm in Seurat is based on the construction of a graph that connects close data points, and then on searching for communities in this **adjacency** graph. The distance between data points is computed in the reduced space if available, *e.g.*, after PCA reduction. In case, several reduction metohds were applied, a parameter ```reduction``` enables selecting which one is to be used. ```{r} pbmc <- FindNeighbors(pbmc, dims=1:10) pbmc <- FindClusters(pbmc, resolution=0.5) # resolution controls the size of the communities ``` ### Visualizing the clusters on a 2D projection We first apply UMAP to obtain the projection and then we color code clusters. ```{r} pbmc <- RunUMAP(pbmc, dims=1:10) DimPlot(pbmc, reduction="umap") ``` Alternatively, we can compute the t-SNE 2D projection. ```{r} pbmc <- RunTSNE(pbmc, dims=1:10) DimPlot(pbmc, reductio="tsne") ``` ### Find population markers Now that we have (candidate) cell population after clustering, a natural next step is to find genes that are significantly more expressed in at least one cluster. Those genes should characterize the specific transcriptional program of each cluster and thus enable us to call the actual cell types. The function ```FindMarkers``` implements several statistical tests for this purpose (see its documentation). Here, we simply use the default (Wilcoxon test), which should be robust thanks to its nonparametric nature, but reduce the statistical power compared to dedicated tests such as the negative binomial-based one provided by DESeq2 (also available). In the example below, we compare cluster 2 with all the other cells and require that a gene is expressed in at least 25% of the cells of at least one of the two populations to avoid barely exressed genes and spedd up computations. ```{r} cluster2.markers <- FindMarkers(pbmc, ident.1=2, min.pct=0.25) head(cluster2.markers, n=5) ``` Next step, is to find the marker genes of all the populations. ```{r} pbmc.markers <- FindAllMarkers(pbmc, only.pos=TRUE, min.pct=0.25, logfc.threshold=0.25) pbmc.markers %>% group_by(cluster) %>% slice_max(n=2, order_by=avg_log2FC) ``` Chosen marker gene expression can be displayed on the 2D projection coordinates to assess their specificity for clusters. UMAP reduction is taken by default but the ```reduction``` parameter allows selecting another projection. ```{r} FeaturePlot(pbmc, features=c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A")) ``` We can also get an heatmap to represent cluster/marker gene associations. ```{r} top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n=10, wt=avg_log2FC) DoHeatmap(pbmc, features=top10$gene) + NoLegend() ``` ### Cell type calling Based on the above markers and literature knowledge, an **expert** should be able to recognize each cluster identity, *i.e.*, which cell type corresponds to each cluster. This step is also typically the occasion to iterate on the clustering parameters or the clustering algorithm, but also the normalization or the number of principal components to keep, to achieve convincing clusters with a clear biological meaning. Finally, we can assign names to the clusters for display purposes. ```{r} new.cluster.ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B", "CD8 T", "FCGR3A+ Mono", "NK", "DC", "Platelet") names(new.cluster.ids) <- levels(pbmc) pbmc <- RenameIdents(pbmc, new.cluster.ids) DimPlot(pbmc, reduction="umap", label=TRUE, pt.size=0.5) + NoLegend() ``` ```{r} DimPlot(pbmc, reduction="tsne", label=TRUE, pt.size=0.5) + NoLegend() ``` ### Additional plotting capabilities Seurat comes with different graphical representations to relate the expression of chosen genes to clusters. Here are the main ones. Ridgeplots to follow gene expression distributions across clusters. ```{r} features <- c("LYZ", "CD74", "CST3", "CD3D", "FCGR3A", "GZMA") RidgePlot(pbmc, features=features, ncol=2) ``` Violin plots. ```{r} VlnPlot(pbmc, features=features[c(1,2,4,6)]) ``` Dot plots. ```{r} DotPlot(pbmc, features=features) + RotatedAxis() ``` Targeted heatmaps. ```{r} DoHeatmap(subset(pbmc, downsample=100), features=features, size=3) ``` Feature mapping with robust color scales (extreme values do not influence the color scale) ```{r} FeaturePlot(pbmc, features=c("MS4A1","PTPRCAP"), min.cutoff="q10", max.cutoff="q90") ``` ```{r} FeaturePlot(pbmc, features=c("MS4A1","CD79A"), blend=TRUE) ``` ### Saving (serializing) a Seurat object To avoid re-executing all the steps necessary to the preparation of a certain Seurat object (can be slow), you can serialize it into a file. ```{r} saveRDS(pbmc, file="pbmc_final.rds") ```