griph identifies cell types in single cell RNA-seq datasets, allowing to control for unwanted sources of variance (e.g. cell cycle) that may confound the cell types. The most important function that combines all the steps of cell type identification from griph is griph_cluster. The vignette illustrates its use applied to several publicly available datasets.
griph 0.1.1
Graph Inference of Population Heterogeneity (griph) is an R package for the analysis of single cell RNA-sequencing data. It can be used to automatically identify different cell types or states, even in the presence of confounding sources of variance such as cell cycle stages or batch effects.
griph is currently available through https://github.com. It can be installed using:
library(devtools)
install_git("git://github.com/ppapasaikas/griph.git", subdir = "griph")
In order for griph to work, some additional packages may have to be installed. You’ll need all the ones under “Needed”. You can do without the ones under “Optional”, as they are only used in the vignette:
## Needed :
## Matrix, methods, igraph, QUIC, coop, corpcor, foreach, doParallel, parallel, bigmemory, RColorBrewer, Rcpp, RcppProgress, RcppArmadillo, Matrix
## Optional :
## BiocStyle, knitr, rmarkdown, testthat, Rtsne, gtools
We will use human induced pluripotency cells from Tung et al. for this example (see details in the “Sample datasets” section). The dataset is included in the package, together with the iPS cell line labels:
M <- readRDS(system.file("extdata", "tung_UMIs_top10k.rds", package = "griph"))
dim(M) # genes by cells
## [1] 10000 864
trueLabel <- attr(M, "label2")
table(trueLabel)
## trueLabel
## NA19098 NA19101 NA19239
## 288 288 288
Cell types can be identified using griph_cluster:
library(griph)
res <- griph_cluster(M, ClassAssignment = trueLabel, use.par = FALSE,
plot = TRUE, fsuffix = 'tung')
table(res$MEMB)
##
## 1 2 3 4 5 6
## 102 99 94 282 98 189
The automatic generation of a 2D-embedding plot (plot
argument) is switched on here and will create a png (default) or pdf (image.format
and fsuffix
arguments) file in the current working directory:
list.files("./", pattern = "tung")
## [1] "graph_tung.png" "Lvis_tung.png"
The estimated cell graph (as an igraph object) and the coordinages of the 2D-embedding (when using plot=TRUE
) are returned by griph_cluster:
summary(res$GRAO) # full graph
## IGRAPH a1af978 U-W- 864 41672 --
## + attr: class (v/c), membership (v/n), community.size (v/n), labels
## | (v/c), weight (e/n)
head(res$plotLVis) # 2D embedding
## x y
## NA19098.r1.A01 -12.422459 -12.15378
## NA19098.r1.A02 -13.228306 -11.99606
## NA19098.r1.A03 -9.097008 -12.05729
## NA19098.r1.A04 -11.683206 -15.24782
## NA19098.r1.A05 -13.078734 -12.74092
## NA19098.r1.A06 -12.846079 -14.07115
In the graph returned by griph_cluster, cells are represented by graph vertices. All cells are contained in the full graph (res$GRAO
), and a representatitve subset of cells in the simplified graph (res$plotGRAO
), which will be generated by plotGraph and used for plotting. The graph can be easily plotted again, for example to separately illustrate known and predicted cell types:
g.true <- plotGraph(res, fill.type = "true", line.type = "none")
res$plotGRAO <- g.true # store in res for re-use
g.pred <- plotGraph(res, fill.type = "predicted", line.type = "none")
In plotGRAO, weak edges are pruned, vertex attributes are added for visualization of predicted and known class assignments (if given) and a subset of the vertices is sampled if the graph exceeds the maxG
argument. When plotG
is set to FALSE
, the plotGRAO component in the return value of griph_cluster is NULL
and the simplified graph is generated later when necessary by plotGraph.
It is also possible to draw polygons around cells of each class:
g <- plotGraph(res, line.type = "none", mark.type = "true")
… or to collapse all the cells that are of the same class into a single vertex in the graph (maybe useful for large graphs with many predicted classes):
g <- plotGraph(res, collapse.type = "predicted")
Alternatively (typically for large graph with more than 1000 cells), results can be visualized by applying a dimensionality reduction/projection technique such as t-SNE to the affinity matrix returned by griph:
library(Rtsne)
out <- Rtsne(as.dist(1-res$DISTM), pca = FALSE,
perplexity = 5, is_distance = TRUE)
plot(out$Y, xlab="tSNE-1", ylab="tSNE-2", col=res$MEMB,
pch=as.numeric(trueLabel))
or more conveniently using the wrapper function provided by griph
:
plotTsne(res, fill.type = "predicted", line.type = "none", mark.type = "true",
perplexity = 7)
For even larger graphs (thousands of cells), a more efficient visualiztion uses LargeVis (also see , implementing the LargeVis algorithm described by Tang et al. (2016) in )):
plotLVis(res, fill.type = "predicted", line.type = "none", mark.type = "true")
Finally, the plotting functions can also map numerical or categorical values to the color of each cell. An example is to visualize the fraction of detected genes:
plotLVis(res, fill.type = colMeans(M > 0), line.type = "none", mark.type = "true")
griph_cluster identified 6 cell types in the data, and the confusion matrix returned in the result summarizes how they relate to the known cell types or states:
res$ConfMatrix # confusion matrix
## true
## predicted NA19098 NA19101 NA19239
## 1 100 2 0
## 2 96 3 0
## 3 92 2 0
## 4 0 281 1
## 5 0 0 98
## 6 0 0 189
When comparing different classifications with each other, they may differ in granularity, yet be consistent with each other. This can be seen for example in the confusion matrix above where for example NA190998 cells are found in multiple predicted classes. griph calculates the classification error by assigning each identified class to the major known class (for example NA19098 for predicted class 2) and only counting the 3 cells in 2 as wrongly classified that are not NA19098:
res$miscl # misclassification error
## [1] 0.009259259
Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Buettner F, Natarajan KN, Casale FP, Proserpio V, Scialdone A, Theis FJ, Teichmann SA, Marioni JC, Stegle O. Nat Biotechnol. 2015 Feb;33(2):155-60. doi: 10.1038/nbt.3102.
fname <- system.file("extdata", "buettner_top10k.rds", package = "griph")
M <- readRDS(fname)
dim(M)
## [1] 10000 288
label <- attr(M, "label")
table(label)
## label
## G1 G2M S
## 96 96 96
Single Cell RNA-Sequencing of Pluripotent States Unlocks Modular Transcriptional Variation. Kolodziejczyk AA, Kim JK, Tsang JC, Ilicic T, Henriksson J, Natarajan KN, Tuck AC, Gao X, Bühler M, Liu P, Marioni JC, Teichmann SA. Cell Stem Cell. 2015 Oct 1;17(4):471-85. doi: 10.1016/j.stem.2015.09.011.
fname <- system.file("extdata", "kolodziejck_top10k.rds", package = "griph")
M <- readRDS(fname)
dim(M)
## [1] 10000 704
label <- attr(M, "label")
table(label)
## label
## lif 2i a2i
## 250 295 159
Unbiased classification of sensory neuron types by large-scale single-cell RNA sequencing. Usoskin D, Furlan A, Islam S, Abdo H, Lönnerberg P, Lou D, Hjerling-Leffler J, Haeggström J, Kharchenko O, Kharchenko PV, Linnarsson S, Ernfors P. Nat Neurosci. 2015 Jan;18(1):145-53. doi: 10.1038/nn.3881.
fname <- system.file("extdata", "usoskin_top10k.rds", package = "griph")
M <- readRDS(fname)
dim(M)
## [1] 10000 799
label <- attr(M, "label")
table(label)
## label
## Central, unsolved NF NF outlier NoN
## 39 139 9 109
## NoN outlier NP PEP TH
## 2 169 81 233
## TH outlier
## 18
label2 <- attr(M, "label2")
table(label2)
## label2
## Central, unsolved NF1 NF2/3 NF4/5
## 39 31 60 48
## NF outlier NoN NoN outlier NP1
## 9 109 2 125
## NP2 PEP1 PEP2 TH
## 44 64 17 233
## TH outlier
## 18
label3 <- attr(M, "label3")
table(label3)
## label3
## Central, unsolved NF1 NF2 NF3
## 39 31 48 12
## NF4 NF5 NF outlier NoN
## 22 26 9 109
## NoN outlier NP1 NP2 NP3
## 2 125 32 12
## PEP1 PEP2 TH TH outlier
## 64 17 233 18
Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Zeisel A, Muñoz-Manchado AB, Codeluppi S, Lönnerberg P, La Manno G, Juréus A, Marques S, Munguba H, He L, Betsholtz C, Rolny C, Castelo-Branco G, Hjerling-Leffler J, Linnarsson S. Science. 2015 Mar 6;347(6226):1138-42. doi: 10.1126/science.aaa1934.
fname <- system.file("extdata", "zeisel_top10k.rds", package = "griph")
M <- readRDS(fname)
dim(M)
## [1] 10000 3005
label <- attr(M, "label")
table(label)
## label
## astrocytes_ependymal endothelial-mural interneurons
## 224 235 290
## microglia oligodendrocytes pyramidal CA1
## 98 820 939
## pyramidal SS
## 399
label2 <- attr(M, "label2")
table(label2)
## label2
## Astro1 Astro2 CA1Pyr1 CA1Pyr2 CA1PyrInt CA2Pyr2 Choroid
## 68 61 380 447 49 41 10
## ClauPyr Epend Int1 Int10 Int11 Int12 Int13
## 5 20 12 21 10 21 15
## Int14 Int15 Int16 Int2 Int3 Int4 Int5
## 22 18 20 24 10 15 20
## Int6 Int7 Int8 Int9 Mgl1 Mgl2 (none)
## 22 23 26 11 17 16 189
## Oligo1 Oligo2 Oligo3 Oligo4 Oligo5 Oligo6 Peric
## 45 98 87 106 125 359 21
## Pvm1 Pvm2 S1PyrDL S1PyrL23 S1PyrL4 S1PyrL5 S1PyrL5a
## 32 33 81 74 26 16 28
## S1PyrL6 S1PyrL6b SubPyr Vend1 Vend2 Vsmc
## 39 21 22 32 105 62
Batch effects and the effective design of single-cell gene expression studies. Po-Yuan Tung, John D. Blischak, Chiaowen Joyce Hsiao, David A. Knowles, Jonathan E. Burnett, Jonathan K. Pritchard & Yoav Gilad. Scientific Reports 7, Article number: 39921 (2017). doi:10.1038/srep39921.
fname <- system.file("extdata", "tung_UMIs_top10k.rds", package = "griph")
M <- readRDS(fname)
dim(M)
## [1] 10000 864
label <- attr(M, "label")
table(label)
## label
## NA19098.r1 NA19098.r2 NA19098.r3 NA19101.r1 NA19101.r2 NA19101.r3 NA19239.r1
## 96 96 96 96 96 96 96
## NA19239.r2 NA19239.r3
## 96 96
label2 <- attr(M, "label2")
table(label2)
## label2
## NA19098 NA19101 NA19239
## 288 288 288
Here is the output of sessionInfo() on the system on which this document was compiled:
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] Rtsne_0.13 griph_0.1.1 BiocStyle_2.4.1
##
## loaded via a namespace (and not attached):
## [1] igraph_1.1.2 Rcpp_0.12.12 bigmemory.sri_0.1.3
## [4] knitr_1.17 magrittr_1.5 doParallel_1.0.10
## [7] lattice_0.20-35 coop_0.6-0 foreach_1.4.3
## [10] bigmemory_4.5.19 stringr_1.2.0 tools_3.4.1
## [13] QUIC_1.1 parallel_3.4.1 grid_3.4.1
## [16] corpcor_1.6.9 htmltools_0.3.6 iterators_1.0.8
## [19] yaml_2.1.14 rprojroot_1.2 digest_0.6.12
## [22] bookdown_0.5 Matrix_1.2-11 RColorBrewer_1.1-2
## [25] codetools_0.2-15 evaluate_0.10.1 rmarkdown_1.6
## [28] stringi_1.1.5 compiler_3.4.1 backports_1.1.0
## [31] pkgconfig_2.0.1