10 Clustering and cell annotation

10.1 Clustering Methods

Once we have normalized the data and removed confounders we can carry out analyses that are relevant to the biological questions at hand. The exact nature of the analysis depends on the dataset. Nevertheless, there are a few aspects that are useful in a wide range of contexts and we will be discussing some of them in the next few chapters. We will start with the clustering of scRNA-seq data.

10.1.1 Introduction

One of the most promising applications of scRNA-seq is de novo discovery and annotation of cell-types based on transcription profiles. Computationally, this is a hard problem as it amounts to unsupervised clustering. That is, we need to identify groups of cells based on the similarities of the transcriptomes without any prior knowledge of the labels. Moreover, in most situations we do not even know the number of clusters a priori. The problem is made even more challenging due to the high level of noise (both technical and biological) and the large number of dimensions (i.e. genes).

When working with large datasets, it can often be beneficial to apply some sort of dimensionality reduction method. By projecting the data onto a lower-dimensional sub-space, one is often able to significantly reduce the amount of noise. An additional benefit is that it is typically much easier to visualize the data in a 2 or 3-dimensional subspace. We have already discussed PCA (chapter 6.6.2) and t-SNE (chapter 6.6.2).

Challenges in clustering

What is the number of clusters k?
What defines a good clustering?
What is a cell type?
Scalability: in the last few years the number of cells in scRNA-seq experiments has grown by several orders of magnitude from ~\(10^2\) to ~\(10^6\)

10.1.2 Unsupervised clustering methods

Three main ingredients of a complete clustering method:

Measure of similarity: how do we quantify how close two data points are?
Quality function: how do we decide how “good” is a clustering/partition?
Algorithm: how to find the clustering whose quality function is optimized?

10.1.2.1 Hierarchical clustering

Hierarchical clustering is basically the only type of clustering algorithm that does not seek to optimize a quality function, because it builds a hierarchy of clusters, instead of one single clustering result as the output.

There are two types of strategies:
- Agglomerative (bottom-up): each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
- Divisive (top-down): all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

10.1.2.2 k-means clustering

Measure of similarity: Euclidean distance

Quality function: Within cluster distance

Algorithm:

Advantage: Fast

Drawbacks:
- Sensitive to initial clustering
- Sensitive to outliers
- Need to specify K
- Tend to find clusters of similar sizes

Tools related to K-means: SC3

10.1.2.3 Graph-based methods

Real world networks usualy display big inhomogeneities or community structure. Communities or clusters or modules are groups of vertices which probably share common properties and/or play similar roles whithin the graph. In recent years there has been a lot of interest in detecting communities in networks in various domains.

Some of these community detection methods can be applied to scRNA-seq data by building a graph where each vertice represents a cell and (weight of) the edge measures similarity between two cells.
Actually, graph-based clustering is the most popular clustering algorithm in scRNA-seq data analysis, and has been reported to have outperformed other clustering methods in many situations (Freytag et al. 2018).

10.1.2.3.1 Why do we want to represent the data as a graph?

Memory effectiveness: A (complete) graph can be thought as an alternative expression of similarity matrix. Current methods (discuss later) aim to build sparse graphs, which ease the memory burden.
Curse of dimensionality: All data become sparse in high-dimensional space and therefore similarities measured by Euclidean distances etc are generally low between all objects.

10.1.2.3.2 Building a graph

Step1: Build an unweighted K-nearest neighbour (KNN) graph

Step2: Add weights, and obtain a shared nearest neighbour (SNN) graph

There are two ways of adding weights: number and rank.
- number: The number of shared nodes between \(u\) and \(v\), in this case, 3.
- rank: A measurement of the closeness to their common nearest neighbours. ((???))

Details of rank :
Main idea: The closeness of two people is defined by their closest common friend.
For each node, say \(u\), we can rank its 5 neighbours according to their closeness to \(u\), and we can do the same with \(v\). Denote the three shared neighbours as \(x_1\), \(x_2\) and \(x_3\), so rank(\(x_1, u\)) = 1 means \(x_1\) is the closest neighbour of \(u\). The idea is, if \(x_1\) is also the closest to \(v\), then \(u\) and \(v\) should have a larger similarity, or weight. So we summarize the overall closeness of \(x_1\) with both \(u\) and \(v\) by taking an average: \(\dfrac{1}{2}(\text{rank}(x_1, u), \text{rank}(x_1, v))\). Then we find the one with the largest closeness, \(s(u, v) = \min \left[ \dfrac{1}{2}(\text{rank}(x_i, u), \text{rank}(x_i, v)) \vert i = 1, 2, 3\right]\). The final expression of weight: \[ w(u, v) = K - s(u, v).\]

10.1.2.3.3 Quality function (Modularity)

Modularity (Newman and Girvan 2004) is not the only quality function for graph-based clustering, but it is one of the first attempts to embed in a compact form many questions including the definition of quality function and null model etc.

The idea of modularity: A random graph should not have a cluster structure.
The more “quality” a partition has compared to a random graph, the “better” the partition is.
Specifically, it is defined by:

the quality of a partition on the actual graph \(-\) the quality of the same partition on a random graph

quality : Sum of the weights within clusters
random graph : a copy of the original graph, with some of its properties, but without community structure. The random graph defined by modularity is: each node has the same degree as the original graph.

\[ Q \propto \sum_{i, j} A_{i, j} \delta(i, j) - \sum_{i, j} \dfrac{k_i k_j}{2m} \delta(i, j)\]

\(A_{i, j}\): weight between node \(i\) and \(j\);
\(\delta(i, j)\): indicator of whether \(i\) and \(j\) are in the same cluster;
\(k_i\): the degree of node \(i\) (the sum of weights of all edges connected to \(i\));
\(m\): the total weight in the all graph.

Higher modularity implies better partition:

Limits of modularity: (Good, De Montjoye, and Clauset 2010)
1. Resolution limit.
Short version: Modularity maximization forces small communities into larger ones.
Longer version: For two clusters \(A\) and \(B\), if \(k_A k_B < 2m\) then modularity increases by merging A and B into a single cluster, even if A and B are distinct clusters.
2. Bad, even random partitions may have a high modularity.

Networks lack a clear modularity maxima.

10.1.2.3.4 Algorithms :

Modularity-based clustering methods implemented in single cell analysis are mostly greedy algorithms, that are very fast, although not the most accurate approaches.

Louvain: (Blondel et al. 2008)

Leiden:(Traag, Waltman, and Eck 2019)
Improved Louvain, hybrid of greedy algorithm and sampling technique

10.1.2.3.5 Advantages:

-Fast
-No need to specify \(k\)

10.1.2.3.6 Tools for graph-based clustering:

Seurat: Louvain, Leiden, SLM
igraph: fast greedy, Louvain, optimal, walktrap, spinglass, infomap

10.1.2.4 Consensus clustering (more robustness, less computational speed)

10.1.2.4.1 Motivation (Two problems of \(K\)-means):

Problem1: sensitive to initial partitions
Solution: Run multiple iterations of \(K\)-means on different subsamples of the original dataset, with different initail partitions.
Problem2: the selection of \(K\).
Solution: Run \(K\)-means with a range of \(K\)’s.

10.1.2.4.2 Algorithm of consensus clustering (simpliest version):

for(k in the range of K){
  for(each subsample of the data){
    for(iteration in 1:1000){
      kmeans(subsample, k) # each iteration means a different initial partition
      save partition
    }
  }
  return consensus clustering result of k
}

10.1.2.4.3 Subsample obtained by dimensional reduction:

steps of PCA:
i) tranformation of the similarity matrix.
ii) ranking eigen vectors according to their accoring eigen values in decreasing order.
iii) need to decide how many (\(d\)) PC’s or eigenvalues we wants to reduce to.

In SC3,
i) considers two types of transformation: the one with traditional PCA and the associated graph Laplacian.
iii) User may specify a range of \(d\), or use the default range suggested by the authors according to their experience with empirical results.

10.1.2.4.4 Consensus clustering (combining multiple clustering results):

Step1: Represent each partition as a matrix:
Say we partitioned four data points into 2 clusters.

Step2: Consensus matrix:
Average of all the partitions

10.1.2.4.5 Tools for consensus clustering: SC3

10.2 Clustering example

library(pcaMethods)
library(SC3)
library(scater)
library(SingleCellExperiment)
library(pheatmap)
library(mclust)
library(igraph)
library(scran)

10.2.1 Example 1. Graph-based clustering (deng dataset)

To illustrate clustering of scRNA-seq data, we consider the Deng dataset of cells from developing mouse embryo (Deng et al. 2014). We have preprocessed the dataset and created a SingleCellExperiment object in advance. We have also annotated the cells with the cell types identified in the original publication (it is the cell_type2 column in the colData slot).

deng <- readRDS("data/deng/deng-reads.rds")

First, we build a \(K\)-NN graph with a package function from scran. The most important decision of building a graph is the choice of \(K\), of which there is no standard rule. In general, we can think of it as an indication of the desired cluster size. If \(K\) is too small, a genuine cluster might be split into parts, while if \(K\) is too large, clusters might not thoroughly separated.

deng5 <- buildSNNGraph(deng, k = 5)
deng15 <- buildSNNGraph(deng, k = 15)
deng25 <- buildSNNGraph(deng, k = 25)
par(mfrow=c(1,3))
plot(deng5, vertex.size = 4, vertex.label = NA)
title("5-NN" ,line = -33, cex.main = 3)
plot(deng15, vertex.size = 4, vertex.label = NA)
title("15-NN" ,line = -33, cex.main = 3)
plot(deng25, vertex.size = 4, vertex.label = NA)
title("25-NN" ,line = -33, cex.main = 3)

Perform Louvain clustering:

cl <- igraph::cluster_louvain(deng15)$membership
colData(deng)$cl <- factor(cl)
mclust::adjustedRandIndex(colData(deng)$cell_type2, colData(deng)$cl)

## [1] 0.4197754

Reaches very high similarity with the labels provided in the original paper.

However, it tend to merge small clusters into larger ones.

table(deng$cell_type2, cl)

##             cl
##               1  2  3
##   16cell     49  0  1
##   4cell       0 14  0
##   8cell      36  0  1
##   early2cell  0  8  0
##   earlyblast  0  0 43
##   late2cell   0 10  0
##   lateblast   0  0 30
##   mid2cell    0 12  0
##   midblast    0  0 60
##   zy          0  4  0

10.2.2 Example 2. Graph-based clustering (segerstolpe dataset)

muraro <- readRDS("data/pancreas/muraro.rds")
## PCA
var.fit <- suppressWarnings(trendVar(muraro, parametric=TRUE, use.spikes=F))
muraro <- suppressWarnings(denoisePCA(muraro, technical=var.fit$trend))
dim(reducedDim(muraro, "PCA"))

## [1] 2126    5

## Build graph and clustering
gr <- buildSNNGraph(muraro, use.dimred="PCA", k = 30)
cl <- igraph::cluster_louvain(gr)$membership
colData(muraro)$cl <- factor(cl)
mclust::adjustedRandIndex(colData(muraro)$cell_type1, colData(muraro)$cl)

## [1] 0.4845618

table(muraro$cell_type1, cl)

##              cl
##                 1   2   3   4   5   6   7   8   9
##   acinar        0   0   0   0   0   0 218   0   1
##   alpha       202 306 274   5  15   9   1   0   0
##   beta          1   0   0   5 195  21   2 220   4
##   delta         0   0   0   0  18 174   0   1   0
##   ductal        0   0   0 215   0   1   7   3  19
##   endothelial   0   0   0   0   0   0   0   0  21
##   epsilon       0   0   0   0   0   3   0   0   0
##   gamma         1   0   1   0   0  97   2   0   0
##   mesenchymal   0   0   0   1   0   0   0   0  79
##   unclear       0   0   0   4   0   0   0   0   0

10.2.3 Example 3. SC3

Let’s run SC3 clustering on the Deng data. The advantage of the SC3 is that it can directly ingest a SingleCellExperiment object.

SC3 can estimate a number of clusters:

deng <- sc3_estimate_k(deng)

## Estimating k...

metadata(deng)$sc3$k_estimation

## [1] 6

Next we run SC3 (we also ask it to calculate biological properties of the clusters):

deng <- sc3(deng, ks = 10, biology = TRUE, n_cores = 1)

## Setting SC3 parameters...

## Calculating distances between the cells...

## Performing transformations and calculating eigenvectors...

## Performing k-means clustering...

## Calculating consensus matrix...

## Calculating biology...

SC3 result consists of several different outputs (please look in (Kiselev et al. 2017) and SC3 vignette for more details). Here we show some of them:

Consensus matrix:

sc3_plot_consensus(deng, k = 10, show_pdata = "cell_type2")

Silhouette plot:

sc3_plot_silhouette(deng, k = 10)

Heatmap of the expression matrix:

sc3_plot_expression(deng, k = 10, show_pdata = "cell_type2")

Identified marker genes:

sc3_plot_markers(deng, k = 10, show_pdata = "cell_type2")

PCA plot with highlighted SC3 clusters:

plotPCA(deng, colour_by = "sc3_10_clusters")

Compare the results of SC3 clustering with the original publication cell type labels:

adjustedRandIndex(colData(deng)$cell_type2, colData(deng)$sc3_10_clusters)

## [1] 0.7804189

Note SC3 can also be run in an interactive Shiny session:

sc3_interactive(deng)

This command will open SC3 in a web browser.

Note Due to direct calculation of distances SC3 becomes very slow when the number of cells is \(>5000\). For large datasets containing up to \(10^5\) cells we recomment using Seurat (see chapter 16).

10.3 An alternative to clustering: Automatic cell annotation

10.3.1 SingleR

10.3.1.1 Methodology

Step1. Find variable gene
1-1. For every gene, obtain median grouped by label. 1-2. Select genes that makes at least one label different:. For example, if we are looking for the genes that makes label “green” different from label “red”, we substract the second column by the first, and pick the top \(N\) highest and positive values. All analysis onwards use only the selected variable genes.
Step2. Spearman’s correlation
Spearman’s correlation \(\in [-1, 1]\) is a measure of the strength of a linear or monotonic relationship between paired data. We compute the Spearman’s correlation for all pairs of cells in the test and reference dataset, and obtain an \(n_{\text{test}} \times n_{\text{ref}}\) correlation matrix, where \(n\) is the number of cells (see the first matrix in Step3).
Step3. Scoring
We want to know how each cell in the test data is correlated to the labels in the reference data, instead of each reference cell. So we take the correlations of a cell in the test data with all the cells with a certain label in the reference data, and summarize them into one number or a score, in SingleR, the default is to take the \(80\%\) quantile.
Step4. Fine tuning
We stop here and assign each cell with label that score the highest, actually, if we set the argument fine.tune = FALSE, that is exactly what the package function SingleR does. But there is one more question, what if the second highest score is very close to the highest? say, 1, 1, 1, 9.5, 10. SingleR set a threshold to define how close is “very close”, the default is 0.05. For (only) the cells that falls into this category, it goes back to Step2.

10.3.1.2 Example

(Note: SingleR is not yet available in the released version of Bioconductor. It will be possible to run it as shown once the next Bioconductor release is made in late October.)

library(scRNAseq)
library(SingleR)
segerstolpe <- readRDS("data/pancreas/segerstolpe.rds")
sceM <- suppressMessages(MuraroPancreasData())
sceM <- sceM[,!is.na(sceM$label)]
sceM <- logNormCounts(sceM)
## find common gene
rownames(sceM) <- gsub("__.*","",rownames(sceM))
common <- intersect(rownames(sceM), rownames(segerstolpe))
sceM <- sceM[common,]
segerstolpe <- segerstolpe[common,]
## Prepare reference
out <- pairwiseTTests(logcounts(sceM), sceM$label, direction="up")
markers <- getTopMarkers(out$statistics, out$pairs, n=10)
## Annotation
pred <- SingleR(test=segerstolpe, ref=sceM, labels=sceM$label, genes=markers)
## View result 
plotScoreHeatmap(pred, show.labels = TRUE,  annotation_col=data.frame(
  row.names=rownames(pred)))

10.3.2 scmap

## Load data
segerstolpe <- readRDS("data/pancreas/segerstolpe.rds")  # test
library(scRNAseq)
sceM <- readRDS("data/pancreas/muraro.rds") # reference
rownames(sceM) <- gsub("__.*","",rownames(sceM))

Select the most informative features (genes) using the dropout feature selection method. By default select 500 features.

library(scmap)
rowData(sceM)$feature_symbol <- rownames(sceM)
sceM <- selectFeatures(sceM, suppress_plot = TRUE)

Index of a reference dataset is created by finding the median gene expression for each cluster. First, chop the total of 500 features into \(M = 50\) chuncks/ low-dimensional subspace. Second, cluster each chunk into \(k = \sqrt{N}\) clusters, where \(N\) is the number of cells.

By default scmap uses the cell_type1 column of the colData slot in the reference to identify clusters.

sceM <- indexCell(sceM)

The function indexCluster writes the scmap_cluster_index item of the meta data slot of the reference dataset sceM. This step has two outputs:

names(metadata(sceM)$scmap_cell_index)

## [1] "subcentroids" "subclusters"

subcentroids returns cluster centers:

cat(length(metadata(sceM)$scmap_cell_index$subcentroids), " chunks \n")

## 50  chunks

cat("The dimension of cluster centers in each chunk: ", dim(metadata(sceM)$scmap_cell_index$subcentroids[[1]]), "\n")

## The dimension of cluster centers in each chunk:  10 46

subclusters contains information about which cluster (label) the cells belong to

dim(metadata(sceM)$scmap_cell_index$subclusters)

## [1]   50 2126

metadata(sceM)$scmap_cell_index$subclusters[1:5,1:5]

##      D28.1_1 D28.1_13 D28.1_15 D28.1_17 D28.1_2
## [1,]       8       14        5        3       3
## [2,]      15       40        8       35       4
## [3,]       7       43       21        1      20
## [4,]      27       42       34       27      30
## [5,]      45       38        4       41      36

Projection: Once the scmap-cell indexes have been generated we can use them to project the test dataset.

scmapCell_results <- scmapCell(
  projection = segerstolpe,
  index_list = list(
    sceM = metadata(sceM)$scmap_cell_index
  ) )
names(scmapCell_results)

## [1] "sceM"

The cells matrix contains the top 10 (scmap default) cell IDs of the cells of the reference dataset that a given cell of the projection dataset is closest to:

dim(scmapCell_results$sceM$cells)

## [1]   10 3514

Cell annotation: If cell cluster annotation is available for the reference datasets, scmap-cell can also annotate the cells from the projection dataset using the labels of the reference. It does so by looking at the top 3 nearest neighbours (scmap default) and if they all belong to the same cluster in the reference and their maximum similarity is higher than a threshold (0.5 is the scmap default), then a projection cell is assigned to the corresponding reference cluster:

scmapCell_clusters <- scmapCell2Cluster(
  scmapCell_results,
  list(
    colData(sceM)$cell_type1
  ))

Plot result Compare the annotated result with the original label in the segerstolpe dataset.

plot(
  getSankey(
    segerstolpe$cell_type1,
    scmapCell_clusters$combined_labs, plot_height = 400 )
)

10.3.3 sessionInfo()

References

Blondel, Vincent D, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. 2008. “Fast Unfolding of Communities in Large Networks.” Journal of Statistical Mechanics: Theory and Experiment 2008 (10). IOP Publishing: P10008.

Deng, Q., D. Ramskold, B. Reinius, and R. Sandberg. 2014. “Single-Cell RNA-Seq Reveals Dynamic, Random Monoallelic Gene Expression in Mammalian Cells.” Science 343 (6167). American Association for the Advancement of Science (AAAS): 193–96. https://doi.org/10.1126/science.1245316.

Freytag, Saskia, Luyi Tian, Ingrid Lönnstedt, Milica Ng, and Melanie Bahlo. 2018. “Comparison of Clustering Tools in R for Medium-Sized 10x Genomics Single-Cell Rna-Sequencing Data.” F1000Research 7. Faculty of 1000 Ltd.

Good, Benjamin H, Yves-Alexandre De Montjoye, and Aaron Clauset. 2010. “Performance of Modularity Maximization in Practical Contexts.” Physical Review E 81 (4). APS: 046106.

Kiselev, Vladimir Yu, Kristina Kirschner, Michael T Schaub, Tallulah Andrews, Andrew Yiu, Tamir Chandra, Kedar N Natarajan, et al. 2017. “SC3: Consensus Clustering of Single-Cell RNA-Seq Data.” Nat Meth 14 (5). Springer Nature: 483–86. https://doi.org/10.1038/nmeth.4236.

Newman, Mark EJ, and Michelle Girvan. 2004. “Finding and Evaluating Community Structure in Networks.” Physical Review E 69 (2). APS: 026113.

Traag, Vincent A, Ludo Waltman, and Nees Jan van Eck. 2019. “From Louvain to Leiden: Guaranteeing Well-Connected Communities.” Scientific Reports 9. Nature Publishing Group.