4 Datasets

Here we provide brief descriptions of the core datasets used in this course and a more detailed description of the Tabula Muris (mouse cell atlas) data, how it can be downloaded and how it can be used.

4.1 Deng

A single-cell RNA-seq dataset of 268 individual cells dissociated from in vivo F1 embryos, spanning the oocyte to blastocyst stages of mouse preimplantation development. Single-cell transcriptome profiles were generated from each individual cell, with spike-ins, using either Smart-seq or Smart-seq2 (NB: both protocols were used, for different sets of cells in the dataset). The cells analysed here have been annotated with their developmental stages according to the original publication.

Deng, Qiaolin, et al. “Single-cell RNA-seq reveals dynamic, random monoallelic gene expression in mammalian cells.” Science 343.6167 (2014) 193-196.

4.2 Tung

A dataset of induced pluripotent stem cells generated from three different individuals with replicates (Tung et al. 2017) in Yoav Gilad’s lab at the University of Chicago. The data were generated using the Fluidigm C1 platform; to facilitate quantification, both unique molecular identifiers (UMIs) and ERCC spike-ins were used. The data files are located in the tung folder in your working directory. These files are copies of the original files made on 15/03/16. We will use these copies for reproducibility purposes.

Tung, Po-Yuan, et al. “Batch effects and the effective design of single-cell gene expression studies.” Scientific reports 7 (2017): 39921.

4.3 Pancreas

We have included two human pancreas datasets: Muraro et al. (2016) and Segerstolpe et al. (2016). Since the pancreas has been widely studied, these datasets are well annotated.

4.3.1 Muraro

Single-cell CEL-seq2 data were generated using a customised automated platform that uses FACS, robotics, and the CEL-Seq2 protocol to obtain the transcriptomes of thousands of single pancreatic cells from four deceased organ donors. Cell surface markers can be used for sorting and enriching certain cell types (Muraro et al. 2016).

Muraro, M.J. et al. (2016) A Single-Cell Transcriptome Atlas of the Human Pancreas. Cell Syst., 3, 385–394.e3.

4.3.2 Segerstolpe

Single-cell RNA-seq dataset of human pancreatic cells from patients with type 2 diabetes and healthy controls. Single cells were prepared using the Smart-seq2 protocol and sequenced on an Illumina HiSeq 2000 (Segerstolpe et al. 2016).

Segerstolpe, Å. et al. (2016) Single-Cell Transcriptome Profiling of Human Pancreatic Islets in Health and Type 2 Diabetes. Cell Metab., 24, 593–607.

4.4 Heart

data/sce/Heart_10X.rds is a SingleCellExperiment (SCE) object containing cells from heart tissue from the Tabula Muris dataset (details below), generated using the 10X protocol.

4.5 Thymus

data/sce/Thymus_10X.rds is a SingleCellExperiment (SCE) object containing cells from thymus tissue from the Tabula Muris dataset (details below), generated using the 10X protocol.

4.6 Tabula Muris

4.7 Introduction

To give you hands-on experience analyzing a single-cell RNA-seq dataset from start to finish, we will use data from the Tabula Muris initial release as an example. The Tabula Muris is an international collaboration that aims to profile every cell type in the mouse using a standardized method. It combines high-throughput but low-coverage 10X data with lower-throughput but high-coverage data from FACS-sorted cells sequenced with Smartseq2.

The initial release of the data (20 Dec 2017) contains almost 100,000 cells across 20 different tissues/organs. You might like to choose a tissue to focus on for a detailed analysis.

4.8 Downloading the data

Unlike most single-cell RNA-seq data, the Tabula Muris data have been released through the figshare platform rather than being uploaded to GEO or ArrayExpress. You can find the data using the DOIs in the paper: 10.6084/m9.figshare.5715040 for FACS/Smartseq2 and 10.6084/m9.figshare.5715025 for 10X data. The data can be downloaded manually by clicking the DOI links or by using the command-line commands below:

Terminal-based download of FACS data:

wget https://ndownloader.figshare.com/files/10038307
unzip 10038307
wget https://ndownloader.figshare.com/files/10038310
mv 10038310 FACS_metadata.csv
wget https://ndownloader.figshare.com/files/10039267
mv 10039267 FACS_annotations.csv

Terminal-based download of 10X data:

wget https://ndownloader.figshare.com/files/10038325
unzip 10038325
wget https://ndownloader.figshare.com/files/10038328
mv 10038328 droplet_metadata.csv
wget https://ndownloader.figshare.com/files/10039264
mv 10039264 droplet_annotation.csv

Note: if you download the data by hand, you should unzip and rename the files as above before continuing.

You should now have two folders, “FACS” and “droplet”, plus one annotation file and one metadata file for each. To inspect these files you can use the head command to print the first few lines of a text file:
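If you would rather stay inside R, the same downloads can be sketched with base R's download.file(). This is only a sketch: the helper names figshare_url() and fetch_files(), and the local file name FACS.zip, are assumptions made for illustration; the file IDs and base URL are the ones used in the wget commands above.

```r
# Hypothetical helper: build the figshare download URL for a given file ID.
figshare_url <- function(file_id) {
  paste0("https://ndownloader.figshare.com/files/", file_id)
}

# Target file name -> figshare file ID (IDs copied from the FACS commands above;
# "FACS.zip" is an assumed local name for the zipped count matrices).
facs_files <- c(
  "FACS.zip"             = "10038307",
  "FACS_metadata.csv"    = "10038310",
  "FACS_annotations.csv" = "10039267"
)

# Hypothetical helper: download each file unless it already exists locally.
fetch_files <- function(files, dest_dir = ".") {
  for (target in names(files)) {
    dest <- file.path(dest_dir, target)
    if (!file.exists(dest)) {
      download.file(figshare_url(files[[target]]), destfile = dest, mode = "wb")
    }
  }
  invisible(file.path(dest_dir, names(files)))
}

# Not run here because it needs network access:
# fetch_files(facs_files)
# unzip("FACS.zip")
```

The 10X files could be handled the same way by swapping in the IDs from the droplet commands.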

head -n 10 droplet_metadata.csv

You can also check the number of rows in each file using:

wc -l droplet_annotation.csv

Exercise How many cells do we have annotations for from FACS? From 10X?

Answer FACS: 54,838 cells; Droplet: 42,193 cells
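When counting rows it is worth remembering that a line count includes the header line, whereas the number of data rows does not. The following sketch illustrates the difference in R on a small made-up file; the column names and contents are invented for the example and are not the real annotation columns.

```r
# Toy stand-in for an annotation file (contents are made up for illustration).
toy_file <- tempfile(fileext = ".csv")
writeLines(c("cell,tissue,cell_type",
             "cellA,Kidney,type1",
             "cellB,Kidney,type2",
             "cellC,Heart,type1"), toy_file)

# Number of lines in the file, header included (comparable to `wc -l`).
n_lines <- length(readLines(toy_file))

# Number of data rows, i.e. annotated cells, with the header excluded.
n_cells <- nrow(read.csv(toy_file, header = TRUE))

n_lines
n_cells
```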

4.9 Reading the data (Smartseq2)

We can now read in the relevant count matrix from the comma-separated file. Then inspect the resulting dataframe:

dat <- read.delim("FACS/Kidney-counts.csv", sep=",", header=TRUE)
dat[1:5,1:5]

We can see that the first column in the dataframe is the gene names, so first we move these to the rownames so we have a numeric matrix:

dim(dat)
rownames(dat) <- dat[,1]
dat <- dat[,-1]

Since this is a Smart-seq2 dataset it may contain spike-ins, so let's check:

rownames(dat)[grep("^ERCC-", rownames(dat))]

Now we can extract much of the metadata for this data from the column names:

cellIDs <- colnames(dat)
cell_info <- strsplit(cellIDs, "\\.")
Well <- lapply(cell_info, function(x){x[1]})
Well <- unlist(Well)
Plate <- unlist(lapply(cell_info, function(x){x[2]}))
Mouse <- unlist(lapply(cell_info, function(x){x[3]}))
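To see what this splitting does, here is a small sketch on made-up column names. The IDs below are hypothetical; they only assume the Well.Plate.Mouse ordering used above, followed by extra fields that are ignored. Using vapply() instead of unlist(lapply()) is a design choice that guarantees one string per cell.

```r
# Hypothetical column names laid out as Well.Plate.Mouse.<other fields>.
toy_ids <- c("A1.PLATE01.3_8_M.1.1",
             "B2.PLATE01.3_8_M.1.1",
             "A1.PLATE02.3_9_M.1.1")

# Split on literal dots; fixed = TRUE avoids having to escape the dot.
parts <- strsplit(toy_ids, ".", fixed = TRUE)

# Extract the i-th field of every ID as a character vector.
field <- function(i) vapply(parts, function(x) x[i], character(1))

toy_meta <- data.frame(well = field(1), plate = field(2), mouse = field(3),
                       stringsAsFactors = FALSE)
toy_meta
```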

We can check the distributions of each of these metadata classifications:

summary(factor(Mouse))

We can also check if any technical factors are confounded:

table(Mouse, Plate)
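Reading a large contingency table by eye can be error-prone, so one option is to test for confounding programmatically. The sketch below uses toy vectors and a hypothetical helper, is_one_to_one(), which checks one strict notion of confounding: every level of one factor co-occurs with exactly one level of the other.

```r
# Hypothetical helper: TRUE if the two factors map one-to-one onto each other.
is_one_to_one <- function(a, b) {
  tab <- table(a, b)
  all(rowSums(tab > 0) == 1) && all(colSums(tab > 0) == 1)
}

toy_mouse <- c("m1", "m1", "m2", "m2", "m3")

# Each plate holds cells from a single mouse and vice versa: confounded.
toy_plate_nested <- c("p1", "p1", "p2", "p2", "p3")

# Plates are shared between mice: mouse and plate effects can be separated.
toy_plate_crossed <- c("p1", "p2", "p1", "p2", "p1")

is_one_to_one(toy_mouse, toy_plate_nested)
is_one_to_one(toy_mouse, toy_plate_crossed)
```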

Lastly, we will read the computationally inferred cell-type annotations and match them to the cells in our expression matrix:

ann <- read.table("FACS_annotations.csv", sep=",", header=TRUE)
ann <- ann[match(cellIDs, ann[,1]),]
celltype <- ann[,3]
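Note that match() returns NA for any cell that has no entry in the annotation table, so unannotated cells end up with a missing cell type rather than being dropped. A toy illustration (cell names and types are made up):

```r
# Cells in the order they appear in the expression matrix (toy example).
toy_cells <- c("c3", "c1", "c4")

# Toy annotation table; "c4" is deliberately missing.
toy_ann <- data.frame(cell = c("c1", "c2", "c3"),
                      type = c("T cell", "B cell", "podocyte"),
                      stringsAsFactors = FALSE)

# Position of each cell in the annotation table (NA if absent).
idx <- match(toy_cells, toy_ann$cell)

# Reordered annotations, aligned with toy_cells.
toy_type <- toy_ann$type[idx]

idx
toy_type
sum(is.na(toy_type))  # number of cells without an annotation
```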

4.10 Building a SingleCellExperiment object

To create a SingleCellExperiment object we must put all the cell annotations together into a single dataframe. Since the experimental batch (PCR plate) is completely confounded with donor mouse, we will only keep one of them.

library("SingleCellExperiment")
library("scater")
cell_anns <- data.frame(mouse = Mouse, well=Well, type=celltype)
rownames(cell_anns) <- colnames(dat)
sceset <- SingleCellExperiment(assays = list(counts = as.matrix(dat)), colData=cell_anns)

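Finally, if the dataset contains spike-ins, we set a hidden variable in the SingleCellExperiment object to track them. (Note that isSpike() has been deprecated in more recent releases of SingleCellExperiment in favour of alternative experiments, see altExp(); the line below assumes a version that still provides it.)

isSpike(sceset, "ERCC") <- grepl("^ERCC-", rownames(sceset))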

4.11 Reading the data (10X)

Due to the large size and sparsity of 10X data (up to 90% of the expression matrix may be 0s), it is typically stored as a sparse matrix. The default output format for CellRanger is an .mtx file, which stores this sparse matrix as a column of row coordinates, a column of column coordinates, and a column of expression values > 0. Note that if you look at the .mtx file you will see two header lines followed by a line detailing the total number of rows, columns and counts for the full matrix. Since only the coordinates are stored in the .mtx file, the names of each row and column must be stored separately in the “genes.tsv” and “barcodes.tsv” files respectively.

We will be using the “Matrix” package to store matrices in sparse-matrix format in R.
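The coordinate layout described above can be illustrated with a tiny made-up example. Here three (row, column, value) triplets are turned into a 3 x 4 sparse matrix with sparseMatrix() from the Matrix package; the gene and cell names are invented for the illustration.

```r
library(Matrix)

# Three non-zero entries given as (row, column, value) triplets,
# mirroring the three columns of an .mtx file.
i <- c(1, 3, 2)   # row coordinates
j <- c(1, 1, 3)   # column coordinates
x <- c(5, 2, 7)   # expression values > 0

toy <- sparseMatrix(i = i, j = j, x = x, dims = c(3, 4),
                    dimnames = list(c("geneA", "geneB", "geneC"),
                                    paste0("cell", 1:4)))

toy             # zeros are printed as "."
nnzero(toy)     # number of stored non-zero entries
as.matrix(toy)  # the equivalent dense matrix
```

Only the three non-zero values are stored; the remaining nine entries are implied by the matrix dimensions.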

The SingleCellExperiment class naturally handles sparse matrices, and many downstream tools including scater, scran and DropletUtils also handle data stored in sparse matrices, reducing the memory requirements for many early steps in an analysis. The SingleCellExperiment class can also use data in HDF5 format, which allows large non-sparse matrices to be stored and accessed on disk in an efficient manner rather than loading the whole thing into RAM.

library("Matrix")
cellbarcodes <- read.table("droplet/Kidney-10X_P4_5/barcodes.tsv")
genenames <- read.table("droplet/Kidney-10X_P4_5/genes.tsv")
molecules <- readMM("droplet/Kidney-10X_P4_5/matrix.mtx")

Now we will add the appropriate row and column names. However, if you inspect the read cellbarcodes you will see that they are just the barcode sequence associated with each cell. This is a problem since each batch of 10X data uses the same pool of barcodes so if we need to combine data from multiple 10X batches the cellbarcodes will not be unique. Hence we will attach the batch ID to each cell barcode:

head(cellbarcodes)

rownames(molecules) <- genenames[,1]
colnames(molecules) <- paste("10X_P4_5", cellbarcodes[,1], sep="_")
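The effect of prefixing can be checked on a toy example. The barcodes and batch names below are made up; the point is only that a barcode shared between two batches stops clashing once the batch ID is attached.

```r
# Made-up barcodes from two batches; "AAAC-1" occurs in both.
raw_a <- c("AAAC-1", "AAAG-1")
raw_b <- c("AAAC-1", "TTTG-1")

# Number of barcodes shared between the batches before prefixing.
n_shared_raw <- length(intersect(raw_a, raw_b))

# Attach a (hypothetical) batch ID to every barcode.
ids_a <- paste("batchA", raw_a, sep = "_")
ids_b <- paste("batchB", raw_b, sep = "_")

# anyDuplicated() returns 0 when all IDs are unique.
n_shared_raw
anyDuplicated(c(ids_a, ids_b))
```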

Now let's get the metadata and computational annotations for this data:

meta <- read.delim("droplet_metadata.csv", sep=",", header = TRUE)
head(meta)

Here we can see that we need to use 10X_P4_5 to find the metadata for this batch. Note also that the format of the mouse ID is different in this metadata table, with hyphens instead of underscores and with the sex in the middle of the ID. From the methods section of the accompanying paper we know that the same 8 mice were used for both droplet and plate-based techniques, so we need to fix the mouse IDs to be consistent with those used in the FACS experiments.

meta[meta$channel == "10X_P4_5",]
mouseID <- "3_8_M"

Note: depending on the tissue you choose you may have 10X data from mixed samples : e.g. mouse id = 3-M-5/6. You should still reformat these to be consistent but they will not match mouse ids from the FACS data which may affect your downstream analysis. If the mice weren’t from an inbred strain it would be possible to assign individual cells to a specific mouse using exonic-SNPs but that is beyond the scope of this course.
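Rather than typing each mouse ID by hand, the reformatting could be sketched as a small function. This assumes the droplet IDs have exactly three hyphen-separated fields with the sex in the middle (e.g. 3-M-8, inferred from the examples above); reformat_mouse_id() is a hypothetical helper, and anything that does not fit this pattern is returned as NA so that it can be inspected manually.

```r
# Hypothetical helper: convert droplet-style IDs ("3-M-8") to the
# FACS-style layout ("3_8_M"). Assumes three hyphen-separated fields.
reformat_mouse_id <- function(id) {
  parts <- strsplit(id, "-", fixed = TRUE)
  vapply(parts, function(p) {
    if (length(p) != 3) return(NA_character_)
    paste(p[1], p[3], p[2], sep = "_")
  }, character(1), USE.NAMES = FALSE)
}

# A single mouse, a mixed sample, and a malformed ID.
reformat_mouse_id(c("3-M-8", "3-M-5/6", "unknown"))
```

Mixed samples keep their "5/6" field, so they are formatted consistently but, as noted above, will still not match any single FACS mouse ID.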

ann <- read.delim("droplet_annotation.csv", sep=",", header=TRUE)
head(ann)

Again you will find a slight formatting difference between the cellID in the annotation and the cellbarcodes, which we will have to correct before matching them.

ann[,1] <- paste(ann[,1], "-1", sep="")
ann_subset <- ann[match(colnames(molecules), ann[,1]),]
celltype <- ann_subset[,3]

Now let's build the cell-metadata dataframe:

cell_anns <- data.frame(mouse = rep(mouseID, times=ncol(molecules)), type=celltype)
rownames(cell_anns) <- colnames(molecules)

Exercise Repeat the above for the other 10X batches for your tissue.

Answer

4.12 Building a SingleCellExperiment object for the 10X data

Now that we have read the 10X data in multiple batches we need to combine them into a single SingleCellExperiment object. First we will check that the gene names are the same and in the same order across all batches:

identical(rownames(molecules1), rownames(molecules2))
identical(rownames(molecules1), rownames(molecules3))

Now we’ll check that there aren’t any repeated cellIDs:

sum(colnames(molecules1) %in% colnames(molecules2))
sum(colnames(molecules1) %in% colnames(molecules3))
sum(colnames(molecules2) %in% colnames(molecules3))

Everything is ok, so we can go ahead and combine them:

all_molecules <- cbind(molecules1, molecules2, molecules3)
all_cell_anns <- as.data.frame(rbind(cell_anns1, cell_anns2, cell_anns3))
all_cell_anns$batch <- rep(c("10X_P4_5", "10X_P4_6","10X_P7_5"), times = c(nrow(cell_anns1), nrow(cell_anns2), nrow(cell_anns3)))
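The checks and the combination step can be rehearsed on a toy example before running them on the real batches. Everything below (gene names, barcodes, mouse and batch labels) is made up; the final check confirms that the columns of the combined matrix and the rows of the combined annotation table are in the same order, which is what SingleCellExperiment will expect.

```r
library(Matrix)

genes <- c("g1", "g2", "g3")

# Two tiny sparse "batches" with batch-prefixed (made-up) cell IDs.
m1 <- sparseMatrix(i = c(1, 2), j = c(1, 2), x = c(4, 1), dims = c(3, 2),
                   dimnames = list(genes, c("b1_AAAC", "b1_AAAG")))
m2 <- sparseMatrix(i = 3, j = 1, x = 9, dims = c(3, 1),
                   dimnames = list(genes, "b2_AAAC"))

anns1 <- data.frame(mouse = rep("mouseA", 2), row.names = colnames(m1))
anns2 <- data.frame(mouse = "mouseB", row.names = colnames(m2))

# Same genes in the same order, and no repeated cell IDs.
stopifnot(identical(rownames(m1), rownames(m2)))
stopifnot(!any(colnames(m1) %in% colnames(m2)))

# Combine counts column-wise and annotations row-wise.
all_m <- cbind(m1, m2)
all_anns <- rbind(anns1, anns2)
all_anns$batch <- rep(c("b1", "b2"), times = c(nrow(anns1), nrow(anns2)))

# Cells must line up between the matrix and the annotation table.
stopifnot(identical(colnames(all_m), rownames(all_anns)))
dim(all_m)
```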

Exercise How many cells are in the whole dataset?

Answer

Now build the SingleCellExperiment object. One of the advantages of the SingleCellExperiment class is that it is capable of storing data in normal matrix or sparse matrix format, as well as HDF5 format which allows large non-sparse matrices to be stored & accessed on disk in an efficient manner rather than loading the whole thing into RAM.

# Convert the sparse matrix to a dense matrix once, then build the object
all_molecules <- as.matrix(all_molecules)
sceset <- SingleCellExperiment(
    assays = list(counts = all_molecules),
    colData = all_cell_anns
)
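Converting to a dense matrix is simple but can be memory-hungry for larger tissues. As a rough illustration of the difference, the sketch below compares the in-memory size of a simulated sparse matrix with its dense equivalent. The dimensions and the 5% density are arbitrary assumptions, and the values are random numbers rather than real counts.

```r
library(Matrix)

set.seed(1)

# Simulated 2000 x 500 sparse matrix with roughly 5% non-zero entries.
sparse_counts <- rsparsematrix(nrow = 2000, ncol = 500, density = 0.05)

# The same data stored as an ordinary dense matrix.
dense_counts <- as.matrix(sparse_counts)

# Compare memory footprints in bytes.
sparse_bytes <- as.numeric(object.size(sparse_counts))
dense_bytes <- as.numeric(object.size(dense_counts))

format(object.size(sparse_counts), units = "Kb")
format(object.size(dense_counts), units = "Kb")
```

If downstream tools accept sparse input, passing the sparse matrix directly as the counts assay avoids this overhead.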

Since this is 10X data it will not contain spike-ins, so we just save the data:

saveRDS(sceset, "kidney_droplet.rds")

4.13 Advanced Exercise

Write an R function/script which will fully automate this procedure for each data-type for any tissue.

References

Muraro, Mauro J., Gitanjali Dharmadhikari, Dominic Grün, Nathalie Groen, Tim Dielen, Erik Jansen, Leon van Gurp, et al. 2016. “A Single-Cell Transcriptome Atlas of the Human Pancreas.” Cell Systems 3 (4). Elsevier BV: 385–394.e3. https://doi.org/10.1016/j.cels.2016.09.002.

Segerstolpe, Åsa, Athanasia Palasantza, Pernilla Eliasson, Eva-Marie Andersson, Anne-Christine Andréasson, Xiaoyan Sun, Simone Picelli, et al. 2016. “Single-Cell Transcriptome Profiling of Human Pancreatic Islets in Health and Type 2 Diabetes.” Cell Metabolism 24 (4). Elsevier BV: 593–607. https://doi.org/10.1016/j.cmet.2016.08.020.