8 Handling sparsity

The material below is reproduced from (Laehnemann et al. 2019):

  • Laehnemann, D. et al. (2019) 12 Grand challenges in single-cell data science. PeerJ Preprints. https://doi.org/10.7287/peerj.preprints.27885v1

8.1 Challenge: Handling sparsity in single-cell RNA sequencing

A comprehensive characterization of the transcriptional status of individual cells enables us to gain full insight into the interplay of transcripts within single cells. However, scRNA-seq measurements typically suffer from large fractions of observed zeros, where a given gene in a given cell has no unique molecule identifiers or reads mapping to it. These observed zero values can represent either missing data (i.e. a gene is expressed but not detected by the sequencing technology) or true absence of expression. The proportion of zeros, or degree of sparsity, is thought to be due to imperfect reverse transcription and amplification, and other technical limitations, and depends on the scRNA-seq platform used, the sequencing depth and the underlying expression level of the gene. The term “dropout” is often used to denote observed zero values in scRNA-seq data, but this term conflates zero values attributable to methodological noise and biologically-true zero expression, so we recommend against its use as a catch-all term for observed zeros.
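
As a rough illustration of how the degree of sparsity can be quantified, the sketch below (Python with NumPy; the count matrix is simulated and all sizes and parameters are illustrative assumptions, not from any real dataset) computes the overall fraction of observed zeros and the per-gene zero fraction, which typically decreases with a gene’s underlying expression level.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical UMI count matrix: 500 cells x 2,000 genes; low per-gene means
    # mimic the shallow sampling typical of droplet-based scRNA-seq.
    gene_means = rng.lognormal(mean=-1.0, sigma=1.5, size=2000)
    counts = rng.poisson(gene_means, size=(500, 2000))

    # Overall degree of sparsity: fraction of entries that are observed zeros.
    overall_sparsity = np.mean(counts == 0)

    # Per-gene zero fraction, which typically decreases with mean expression.
    gene_zero_fraction = np.mean(counts == 0, axis=0)
    gene_mean = counts.mean(axis=0)

    print(f"overall fraction of zeros: {overall_sparsity:.2f}")
    print("zero fraction of the 5 most highly expressed genes:",
          np.round(gene_zero_fraction[np.argsort(gene_mean)[-5:]], 2))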

Sparsity in scRNA-seq data can hinder downstream analyses, but it is challenging to model or handle it appropriately, and thus, there remains an ongoing need for improved methods. Sparsity pervades all aspects of scRNA-seq data analysis, but here we focus on the linked problems of learning latent spaces and “imputing” expression values from scRNA-seq data. Imputation, “data smoothing” and “data reconstruction” approaches are closely linked to the challenges of normalization. But whereas normalization generally aims to make expression values between cells more comparable to each other, imputation and data smoothing approaches aim to achieve adjusted data values that—it is hoped—better represent the true expression values. Imputation methods could therefore be used for normalization, but do not entail all possible or useful approaches to normalization.

8.2 Status

The imputation of missing values has been very successful for genotype data. Crucially, when imputing genotypes we often know which data are missing (e.g. when no genotype call is possible due to lack of coverage at a locus) and rich sources of external information are available (e.g. haplotype reference panels). Thus, genotype imputation is now highly accurate and a commonly used step in data processing for genetic association studies.

The situation is somewhat different for scRNA-seq data, as we do not routinely have external reference information to apply. In addition, we can never be sure which observed zeros represent “missing data” and which accurately represent a true gene expression level in the cell. Observed zeros can either be “biological” zeros, i.e. those present because the true expression level of a gene in a cell was zero, or they can be the result of methodological noise, which arises when a gene has true non-zero expression in a cell but no counts are observed due to failures at any point in the complicated process of turning mRNA transcripts in cells into mapped reads. Such noise can lead to artefactual zeros that are either more systematic (e.g. sequence-specific mRNA degradation during cell lysis) or that occur by chance (e.g. barely expressed transcripts that at the same expression level will sometimes be detected and sometimes not, due to sampling variation, e.g. in the sequencing). The high degree of sparsity in scRNA-seq data therefore arises from both technical zeros and true biological zeros, which are difficult to distinguish from one another.
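
To make the sampling argument concrete, here is a minimal sketch assuming a simple Poisson sampling model of transcript capture (the mean value is made up for illustration): a gene with low but non-zero true expression is recorded as zero in a large fraction of cells purely by chance.

    import numpy as np

    rng = np.random.default_rng(1)
    n_cells = 10_000

    # A gene with true (non-zero) mean expression of 0.5 captured molecules per cell.
    true_mean = 0.5

    # Simple Poisson sampling model of the capture/sequencing process.
    observed = rng.poisson(true_mean, size=n_cells)

    empirical_zero_fraction = np.mean(observed == 0)
    expected_zero_fraction = np.exp(-true_mean)  # P(X = 0) for Poisson(0.5)

    print(f"empirical fraction of zeros:   {empirical_zero_fraction:.3f}")
    print(f"theoretical fraction of zeros: {expected_zero_fraction:.3f}")  # ~0.607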

In general, two broad approaches can be applied to tackle this problem of sparsity:

  1. use statistical models that inherently model the sparsity, sampling variation and noise modes of scRNA-seq data with an appropriate data generative model; or
  2. attempt to “impute” values for observed zeros (ideally the technical zeros; sometimes also non-zero values) that better approximate the true gene expression levels.

We prefer to use the first option where possible, and for many single-cell data analysis problems, statistical models appropriate for sparse count data exist and should be used (e.g. for differential expression analysis). However, there are many cases where appropriate models are not available and accurate imputation of technical zeros would allow better results from downstream methods and algorithms that cannot handle sparse count data. For example, imputation could be particularly useful for many dimension reduction, visualization and clustering applications. It is therefore desirable to improve both statistical methods that work on sparse count data directly and approaches for data imputation for scRNA-seq data, whether by refining existing techniques or developing new ones.
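
As a sketch of the first option, the example below fits a negative binomial GLM directly to sparse counts for a single (simulated) gene to test for a difference between two groups of cells, without any imputation. It uses statsmodels; the counts, group labels and dispersion value are illustrative assumptions, not a recommended analysis pipeline.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    n_cells = 200
    group = np.repeat([0, 1], n_cells // 2)           # two hypothetical groups of cells

    # Simulate sparse, overdispersed counts for one gene (Gamma-Poisson mixture,
    # i.e. negative binomial); group 1 has twice the mean expression of group 0.
    mu = np.where(group == 0, 0.4, 0.8)
    counts = rng.poisson(rng.gamma(shape=2.0, scale=mu / 2.0))

    # Fit a negative binomial GLM to the raw, sparse counts; no imputation involved.
    X = sm.add_constant(group.astype(float))
    result = sm.GLM(counts, X, family=sm.families.NegativeBinomial(alpha=0.5)).fit()

    print("coefficients (intercept, group effect):", np.round(result.params, 3))
    print("p-values:", np.round(result.pvalues, 4))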

We define three broad (and sometimes overlapping) categories of methods that can be used to “impute” scRNA-seq data in the absence of an external reference:

  1. Model-based imputation methods for technical zeros use probabilistic models to identify which observed zeros represent technical rather than biological zeros and aim to impute expression levels only for these technical zeros, leaving other observed expression levels untouched.
  2. Data-smoothing methods define sets of “similar” cells (e.g. cells that are neighbors in a graph or occupy a small region in a latent space) and adjust expression values for each cell based on expression values in similar cells. These methods adjust all expression values, including technical zeros, biological zeros and observed non-zero values.
  3. Data-reconstruction methods typically aim to define a latent space representation of the cells. This is often done through matrix factorization (e.g. principal component analysis) or, increasingly, through machine learning approaches (e.g. variational autoencoders that exploit deep neural networks to capture non-linear relationships). Although a broad class of methods, both matrix factorization methods and autoencoders (among others) are able to “reconstruct” the observed data matrix from low-rank or simplified representations. The reconstructed data matrix will typically no longer be sparse (with many zeros) and the implicitly “imputed” data can be used for downstream applications that cannot handle sparse count data.

The first category of methods generally seeks to infer a probabilistic model that captures the data generation mechanism. Such generative models can be used to identify, probabilistically, which observed zeros correspond to technical zeros (to be imputed) and which correspond to biological zeros (to be left alone). There are many model-based imputation methods already available that use ideas from clustering (e.g. k-means), dimension reduction, regression and other techniques to impute technical zeros, oftentimes combining ideas from several of these approaches. These include SAVER, ScImpute, bayNorm, scRecover, and VIPER. Clustering methods that implicitly impute values, such as CIDR and BISCUIT, are closely related to this class of imputation methods.
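
The toy calculation below illustrates the general idea rather than any specific published method: assuming a zero-inflated negative binomial model with known (made-up) parameters, Bayes’ rule gives the probability that an observed zero arises from the zero-inflation component, often interpreted as a technical zero, rather than from negative binomial sampling.

    import numpy as np
    from scipy.stats import nbinom

    def prob_zero_is_technical(mu, theta, pi):
        """P(zero came from the inflation component | a zero was observed), under a
        zero-inflated NB with mean mu, dispersion theta and inflation weight pi."""
        # scipy's nbinom uses (n, p); with n = theta and p = theta / (theta + mu)
        # the negative binomial mean equals mu.
        p_nb_zero = nbinom.pmf(0, theta, theta / (theta + mu))
        return pi / (pi + (1 - pi) * p_nb_zero)

    # Illustrative parameters for three genes with increasing mean expression:
    # zeros at high mean expression are much more likely to be "technical".
    for mu in [0.1, 1.0, 10.0]:
        post = prob_zero_is_technical(mu=mu, theta=2.0, pi=0.1)
        print(f"mean {mu:5.1f}: P(technical | observed zero) = {post:.2f}")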

Data-smoothing methods, which adjust all gene expression levels based on expression levels in “similar” cells, have also been proposed to handle imputation problems. We might regard these approaches as “denoising” methods. To take a simplified example, we might imagine that single cells originally correspond to points in two-dimensional space, but are likely to lie along a one-dimensional curve; projecting data points onto that curve then allows imputation of the “missing” values (but all points are adjusted, or smoothed, not just true technical zeros). Prominent data-smoothing approaches to handling sparse counts include the following (a minimal smoothing sketch follows the list):

  • diffusion-based MAGIC
  • k-nearest neighbor-based knn-smooth
  • network diffusion-based netSmooth
  • clustering-based DrImpute
  • locality sensitive imputation in LSImpute
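
The following is a minimal sketch of the general k-nearest-neighbor smoothing idea on simulated data, not a reimplementation of any of the methods above: neighbors are found in a PCA space computed from library-size-normalized, log-transformed counts, and each cell’s profile is replaced by the average over its neighborhood.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(3)
    counts = rng.poisson(0.3, size=(300, 1000)).astype(float)   # hypothetical cells x genes

    # Normalize library sizes and log-transform before finding neighbors.
    size_factors = counts.sum(axis=1, keepdims=True)
    size_factors[size_factors == 0] = 1.0
    lognorm = np.log1p(counts / size_factors * np.median(counts.sum(axis=1)))

    # Find each cell's k nearest neighbors in a low-dimensional PCA space.
    k = 15
    pcs = PCA(n_components=20).fit_transform(lognorm)
    _, neighbor_idx = NearestNeighbors(n_neighbors=k).fit(pcs).kneighbors(pcs)

    # Smooth: replace each cell's counts by the mean over its neighborhood
    # (this adjusts every value, including biological zeros and non-zero counts).
    smoothed = counts[neighbor_idx].mean(axis=1)
    print("fraction of zeros before:", np.round(np.mean(counts == 0), 2))
    print("fraction of zeros after: ", np.round(np.mean(smoothed == 0), 2))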

A major task in the analysis of high-dimensional single-cell data is to find low-dimensional representations of the data that capture the salient biological signals and render the data more interpretable and amenable to further analyses. As it happens, the matrix factorization and latent-space learning methods used for that task also provide another route to imputation, through their ability to reconstruct the observed data matrix from simplified representations of it. PCA is one standard matrix factorization method that can be applied to scRNA-seq data (preferably after suitable data normalization), as are other widely-used general statistical methods like ICA and NMF. As (linear) matrix factorization methods, PCA, ICA and NMF decompose the observed data matrix into a “small” number of factors in two low-rank matrices, one representing cell-by-factor weights and the other gene-by-factor loadings. Many matrix factorization methods with tweaks for single-cell data have been proposed in recent years (a minimal reconstruction sketch follows the list), including:

  • ZIFA, a zero-inflated factor analysis
  • f-scLVM, a sparse Bayesian latent variable model
  • GPLVM, a Gaussian process latent variable model
  • ZINB-WaVE, a zero-inflated negative binomial factor model
  • scCoGAPS, a single-cell extension of the CoGAPS Bayesian non-negative matrix factorization approach
  • consensus NMF (cNMF), a meta-analysis approach to NMF
  • pCMF, probabilistic count matrix factorization with a Poisson model
  • SDA, sparse decomposition of arrays; another sparse Bayesian method.
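
To show how a factorization implicitly “imputes”, the sketch below reconstructs a log-transformed count matrix from a small number of principal components; the reconstructed matrix is dense. This is plain PCA on simulated data, not one of the specialized single-cell factor models above, and all sizes are illustrative.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(4)
    counts = rng.poisson(0.3, size=(300, 1000)).astype(float)   # hypothetical cells x genes
    logcounts = np.log1p(counts)

    # Factorize into 10 components (cell-by-factor scores, gene-by-factor loadings) ...
    pca = PCA(n_components=10)
    scores = pca.fit_transform(logcounts)        # cells x factors
    loadings = pca.components_                   # factors x genes

    # ... and reconstruct the full matrix from the low-rank representation.
    reconstructed = pca.inverse_transform(scores)   # equivalently scores @ loadings + mean

    print("zeros in the observed matrix:     ", np.mean(logcounts == 0).round(2))
    print("zeros in the reconstructed matrix:", np.mean(reconstructed == 0).round(2))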

Some data reconstruction approaches have been specifically proposed for imputation (a simplified sketch follows the list), including:

  • ENHANCE, denoising with an aggregation step
  • ALRA, SVD with adaptive thresholding
  • scRMD, robust matrix decomposition
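
A much-simplified sketch loosely inspired by the thresholding idea behind ALRA, not a faithful reimplementation: after a truncated-SVD reconstruction, entries whose magnitude falls below a per-gene threshold (here, the magnitude of the most negative reconstructed value for that gene) are reset to zero, so that presumed biological zeros are preserved. The data, rank and normalization are all illustrative assumptions.

    import numpy as np
    from sklearn.decomposition import TruncatedSVD

    rng = np.random.default_rng(5)
    counts = rng.poisson(0.3, size=(300, 1000)).astype(float)
    logcounts = np.log1p(counts)

    # Low-rank reconstruction via truncated SVD.
    svd = TruncatedSVD(n_components=10, random_state=0)
    scores = svd.fit_transform(logcounts)
    reconstructed = scores @ svd.components_

    # Per-gene threshold: the magnitude of the most negative reconstructed value.
    # Entries below this magnitude are treated as noise around zero and reset to
    # zero, keeping presumed biological zeros at zero.
    threshold = np.abs(np.minimum(reconstructed.min(axis=0), 0.0))
    imputed = np.where(np.abs(reconstructed) < threshold, 0.0, reconstructed)

    print("zeros before thresholding:", np.mean(reconstructed == 0).round(2))
    print("zeros after thresholding: ", np.mean(imputed == 0).round(2))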

Recently, machine learning methods have emerged that apply autoencoders and deep neural networks, or ensemble learning, to impute expression values.

Additionally, many deep learning methods have been proposed for single-cell data analysis that can, but need not, use probabilistic data generative processes to capture low-dimensional or latent space representations of a dataset. Even if imputation is not a main focus, such methods can generate “imputed” expression values as an upshot of a model primarily focused on other tasks like learning latent spaces, clustering, batch correction, or visualization (and often several of these tasks simultaneously). The latter set includes tools such as the following (a bare-bones autoencoder sketch follows the list):

  • DCA, an autoencoder with a zero-inflated negative binomial distribution
  • scVI, a variational autoencoder with a zero-inflated negative binomial model
  • LATE
  • VASC
  • compscVAE
  • scScope
  • Tybalt
  • SAUCIE
  • scvis
  • net-SNE
  • BERMUDA, focused on batch correction
  • DUSC
  • Expression Saliency
  • others
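
For concreteness, below is a bare-bones autoencoder of the kind underlying several of the tools above, written with PyTorch on simulated data. The layer sizes, training settings and mean-squared-error loss are arbitrary illustrative choices; most of the listed methods instead use count-based likelihoods (e.g. zero-inflated negative binomial), so this is a generic sketch rather than a reimplementation of any of them.

    import numpy as np
    import torch
    from torch import nn

    rng = np.random.default_rng(6)
    counts = rng.poisson(0.3, size=(300, 1000)).astype(np.float32)
    x = torch.from_numpy(np.log1p(counts))        # cells x genes, log-transformed

    # Encoder and decoder with one hidden layer each; 16-dimensional latent space.
    encoder = nn.Sequential(nn.Linear(1000, 128), nn.ReLU(), nn.Linear(128, 16))
    decoder = nn.Sequential(nn.Linear(16, 128), nn.ReLU(), nn.Linear(128, 1000))
    params = list(encoder.parameters()) + list(decoder.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-3)
    loss_fn = nn.MSELoss()

    for epoch in range(200):                      # arbitrary, small training budget
        optimizer.zero_grad()
        latent = encoder(x)                       # low-dimensional representation
        reconstruction = decoder(latent)          # dense "imputed" matrix
        loss = loss_fn(reconstruction, x)
        loss.backward()
        optimizer.step()

    with torch.no_grad():
        reconstruction = decoder(encoder(x)).numpy()
    print("fraction of exact zeros after reconstruction:",
          float((reconstruction == 0).mean()))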

Besides the three categories described above, a small number of scRNA-seq imputation methods have been developed to incorporate information external to the current dataset for imputation. These include: ADImpute, which uses gene regulatory network information from external sources; SAVER-X, a transfer learning method for denoising and imputation that can use information from atlas-type resources; and methods that borrow information from matched bulk RNA-seq data, such as URSM and SCRABBLE.

8.3 Open problems

A major challenge in this context is the circularity that arises when imputation relies solely on information that is internal to the imputed dataset. This circularity can artificially amplify the signal contained in the data, leading to inflated correlations between genes and/or cells. In turn, this can introduce false positives in downstream analyses such as differential expression testing and gene network inference. Handling batch effects and potential confounders requires further work to ensure that imputation methods do not mistake unwanted variation from technical sources for biological signal. In a similar vein, single-cell experiments are affected by various sources of uncertainty. Approaches that allow quantification and propagation of the uncertainties associated with expression measurements may help to avoid problems associated with ‘overimputation’ and the introduction of spurious signals.

To avoid this circularity, it is important to identify reliable external sources of information that can inform the imputation process. One possibility is to exploit external reference panels, as in the context of genetic association studies. Such panels are not generally available for scRNA-seq data, but ongoing efforts to develop large-scale cell atlases could provide a valuable resource for this purpose. Systematic integration of known biological network structures is desirable and may also help to avoid circularity. A possible approach is to encode network structure knowledge as prior information, as attempted in netSmooth and ADImpute. Another alternative is to explore complementary types of data that can inform scRNA-seq imputation. This idea was adopted in SCRABBLE and URSM, where an external reference is defined by bulk expression measurements from the same population of cells for which imputation is performed. Yet another possibility could be to incorporate orthogonal information provided by different types of molecular measurements. Methods designed to integrate multi-omics data could then be extended to enable scRNA-seq imputation, e.g. through generative models that explicitly link scRNA-seq with other data types or by inferring a shared low-dimensional latent structure that could be used within a data-reconstruction framework.

With the proliferation of alternative methods, comprehensive benchmarking is urgently required, as for all areas of single-cell data analysis. Early benchmarking efforts provide valuable insights into the performance of methods available at the time, but many more methods have since been proposed and even more comprehensive benchmarking platforms are needed. Many methods, especially those using deep learning, depend strongly on the choice of hyperparameters. Here, more detailed comparisons that explore parameter spaces would be helpful, extending earlier work comparing dimensionality reduction methods. Learning from exemplary benchmarking studies, it would be immensely beneficial to develop a community-supported benchmarking platform with a wide range of synthetic and experimental ground-truth datasets (or as close to ground truth as possible, in the case of experimental data) and a variety of thoughtful metrics for evaluating performance. Ideally, such a benchmarking platform would remain dynamic beyond an initial publication to allow ongoing comparison of methods as new approaches are proposed. Detailed benchmarking would also help to establish when normalization methods derived from explicit count models may be preferable to imputation.

Finally, scalability for large numbers of cells remains an ongoing concern for imputation, data smoothing and data reconstruction methods, as for all high-throughput single-cell methods and software.

References

Laehnemann, David, Johannes Köster, Ewa Szczurek, Davis McCarthy, Stephanie C Hicks, Mark D Robinson, Catalina A Vallejos, et al. 2019. “12 Grand challenges in single-cell data science.” e27885v1. PeerJ Preprints; PeerJ Inc. https://doi.org/10.7287/peerj.preprints.27885v1.