1 Introduction and installation

Graph Inference of Population Heterogeneity (griph) is an R package for the analysis of single cell RNA-sequencing data. It can be used to automatically identify different cell types or states, even in the presence of confounding sources of variance such as cell cycle stages or batch effects.

griph is currently available through https://github.com. It can be installed using:

library(devtools)
install_git("git://github.com/ppapasaikas/griph.git", subdir = "griph")

In order for griph to work, some additional packages may have to be installed. You’ll need all the ones under “Needed”. You can do without the ones under “Optional”, as they are only used in the vignette:

## Needed :
##   Matrix, methods, igraph, QUIC, coop, corpcor, foreach, doParallel, parallel, bigmemory, RColorBrewer, Rcpp, RcppProgress, RcppArmadillo, Matrix 
## Optional :
##   BiocStyle, knitr, rmarkdown, testthat, Rtsne, gtools

2 Quickstart: A sample analysis

We will use human induced pluripotency cells from Tung et al. for this example (see details in the “Sample datasets” section). The dataset is included in the package, together with the iPS cell line labels:

M <- readRDS(system.file("extdata", "tung_UMIs_top10k.rds", package = "griph"))
dim(M) # genes by cells
## [1] 10000   864
trueLabel <- attr(M, "label2")
table(trueLabel)
## trueLabel
## NA19098 NA19101 NA19239 
##     288     288     288

Cell types can be identified using griph_cluster:

library(griph)
res <- griph_cluster(M, ClassAssignment = trueLabel, use.par = FALSE,
                     plot = TRUE, fsuffix = 'tung')
table(res$MEMB)
## 
##   1   2   3   4   5   6 
## 102  99  94 282  98 189

The automatic generation of a 2D-embedding plot (plot argument) is switched on here and will create a png (default) or pdf (image.format and fsuffix arguments) file in the current working directory:

list.files("./", pattern = "tung")
## [1] "graph_tung.png" "Lvis_tung.png"

The estimated cell graph (as an igraph object) and the coordinages of the 2D-embedding (when using plot=TRUE) are returned by griph_cluster:

summary(res$GRAO) # full graph
## IGRAPH a1af978 U-W- 864 41672 -- 
## + attr: class (v/c), membership (v/n), community.size (v/n), labels
## | (v/c), weight (e/n)
head(res$plotLVis) # 2D embedding
##                         x         y
## NA19098.r1.A01 -12.422459 -12.15378
## NA19098.r1.A02 -13.228306 -11.99606
## NA19098.r1.A03  -9.097008 -12.05729
## NA19098.r1.A04 -11.683206 -15.24782
## NA19098.r1.A05 -13.078734 -12.74092
## NA19098.r1.A06 -12.846079 -14.07115

3 Visualizing griph results

In the graph returned by griph_cluster, cells are represented by graph vertices. All cells are contained in the full graph (res$GRAO), and a representatitve subset of cells in the simplified graph (res$plotGRAO), which will be generated by plotGraph and used for plotting. The graph can be easily plotted again, for example to separately illustrate known and predicted cell types:

g.true <- plotGraph(res, fill.type = "true", line.type = "none")

res$plotGRAO <- g.true # store in res for re-use
g.pred <- plotGraph(res, fill.type = "predicted", line.type = "none")

In plotGRAO, weak edges are pruned, vertex attributes are added for visualization of predicted and known class assignments (if given) and a subset of the vertices is sampled if the graph exceeds the maxG argument. When plotG is set to FALSE, the plotGRAO component in the return value of griph_cluster is NULL and the simplified graph is generated later when necessary by plotGraph.

It is also possible to draw polygons around cells of each class:

g <- plotGraph(res, line.type = "none", mark.type = "true")

… or to collapse all the cells that are of the same class into a single vertex in the graph (maybe useful for large graphs with many predicted classes):

g <- plotGraph(res, collapse.type = "predicted")

Alternatively (typically for large graph with more than 1000 cells), results can be visualized by applying a dimensionality reduction/projection technique such as t-SNE to the affinity matrix returned by griph:

library(Rtsne)
out <- Rtsne(as.dist(1-res$DISTM), pca = FALSE,
             perplexity = 5, is_distance = TRUE)
plot(out$Y, xlab="tSNE-1", ylab="tSNE-2", col=res$MEMB,
     pch=as.numeric(trueLabel))

or more conveniently using the wrapper function provided by griph:

plotTsne(res, fill.type = "predicted", line.type = "none", mark.type = "true",
         perplexity = 7)

For even larger graphs (thousands of cells), a more efficient visualiztion uses LargeVis (also see , implementing the LargeVis algorithm described by Tang et al. (2016) in )):

plotLVis(res, fill.type = "predicted", line.type = "none", mark.type = "true")

Finally, the plotting functions can also map numerical or categorical values to the color of each cell. An example is to visualize the fraction of detected genes:

plotLVis(res, fill.type = colMeans(M > 0), line.type = "none", mark.type = "true")

4 Accessing the cell classification

griph_cluster identified 6 cell types in the data, and the confusion matrix returned in the result summarizes how they relate to the known cell types or states:

res$ConfMatrix # confusion matrix
##          true
## predicted NA19098 NA19101 NA19239
##         1     100       2       0
##         2      96       3       0
##         3      92       2       0
##         4       0     281       1
##         5       0       0      98
##         6       0       0     189

When comparing different classifications with each other, they may differ in granularity, yet be consistent with each other. This can be seen for example in the confusion matrix above where for example NA190998 cells are found in multiple predicted classes. griph calculates the classification error by assigning each identified class to the major known class (for example NA19098 for predicted class 2) and only counting the 3 cells in 2 as wrongly classified that are not NA19098:

res$miscl # misclassification error
## [1] 0.009259259

5 Sample datasets included in this package

5.1 Buettner et al., Nature Biotechnology 2015

Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Buettner F, Natarajan KN, Casale FP, Proserpio V, Scialdone A, Theis FJ, Teichmann SA, Marioni JC, Stegle O. Nat Biotechnol. 2015 Feb;33(2):155-60. doi: 10.1038/nbt.3102.

fname <- system.file("extdata", "buettner_top10k.rds", package = "griph")
M <- readRDS(fname)
dim(M)
## [1] 10000   288
label <- attr(M, "label")
table(label)
## label
##  G1 G2M   S 
##  96  96  96

5.2 Kolodziejczyk et al., Cell Stem Cell 2015

Single Cell RNA-Sequencing of Pluripotent States Unlocks Modular Transcriptional Variation. Kolodziejczyk AA, Kim JK, Tsang JC, Ilicic T, Henriksson J, Natarajan KN, Tuck AC, Gao X, Bühler M, Liu P, Marioni JC, Teichmann SA. Cell Stem Cell. 2015 Oct 1;17(4):471-85. doi: 10.1016/j.stem.2015.09.011.

fname <- system.file("extdata", "kolodziejck_top10k.rds", package = "griph")
M <- readRDS(fname)
dim(M)
## [1] 10000   704
label <- attr(M, "label")
table(label)
## label
## lif  2i a2i 
## 250 295 159

5.3 Usoskin et al., Nat Neurosci. 2015

Unbiased classification of sensory neuron types by large-scale single-cell RNA sequencing. Usoskin D, Furlan A, Islam S, Abdo H, Lönnerberg P, Lou D, Hjerling-Leffler J, Haeggström J, Kharchenko O, Kharchenko PV, Linnarsson S, Ernfors P. Nat Neurosci. 2015 Jan;18(1):145-53. doi: 10.1038/nn.3881.

fname <- system.file("extdata", "usoskin_top10k.rds", package = "griph")
M <- readRDS(fname)
dim(M)
## [1] 10000   799
label <- attr(M, "label")
table(label)
## label
## Central, unsolved                NF        NF outlier               NoN 
##                39               139                 9               109 
##       NoN outlier                NP               PEP                TH 
##                 2               169                81               233 
##        TH outlier 
##                18
label2 <- attr(M, "label2")
table(label2)
## label2
## Central, unsolved               NF1             NF2/3             NF4/5 
##                39                31                60                48 
##        NF outlier               NoN       NoN outlier               NP1 
##                 9               109                 2               125 
##               NP2              PEP1              PEP2                TH 
##                44                64                17               233 
##        TH outlier 
##                18
label3 <- attr(M, "label3")
table(label3)
## label3
## Central, unsolved               NF1               NF2               NF3 
##                39                31                48                12 
##               NF4               NF5        NF outlier               NoN 
##                22                26                 9               109 
##       NoN outlier               NP1               NP2               NP3 
##                 2               125                32                12 
##              PEP1              PEP2                TH        TH outlier 
##                64                17               233                18

5.4 Zeisel et al., Science 2015

Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Zeisel A, Muñoz-Manchado AB, Codeluppi S, Lönnerberg P, La Manno G, Juréus A, Marques S, Munguba H, He L, Betsholtz C, Rolny C, Castelo-Branco G, Hjerling-Leffler J, Linnarsson S. Science. 2015 Mar 6;347(6226):1138-42. doi: 10.1126/science.aaa1934.

fname <- system.file("extdata", "zeisel_top10k.rds", package = "griph")
M <- readRDS(fname)
dim(M)
## [1] 10000  3005
label <- attr(M, "label")
table(label)
## label
## astrocytes_ependymal    endothelial-mural         interneurons 
##                  224                  235                  290 
##            microglia     oligodendrocytes        pyramidal CA1 
##                   98                  820                  939 
##         pyramidal SS 
##                  399
label2 <- attr(M, "label2")
table(label2)
## label2
##    Astro1    Astro2   CA1Pyr1   CA1Pyr2 CA1PyrInt   CA2Pyr2   Choroid 
##        68        61       380       447        49        41        10 
##   ClauPyr     Epend      Int1     Int10     Int11     Int12     Int13 
##         5        20        12        21        10        21        15 
##     Int14     Int15     Int16      Int2      Int3      Int4      Int5 
##        22        18        20        24        10        15        20 
##      Int6      Int7      Int8      Int9      Mgl1      Mgl2    (none) 
##        22        23        26        11        17        16       189 
##    Oligo1    Oligo2    Oligo3    Oligo4    Oligo5    Oligo6     Peric 
##        45        98        87       106       125       359        21 
##      Pvm1      Pvm2   S1PyrDL  S1PyrL23   S1PyrL4   S1PyrL5  S1PyrL5a 
##        32        33        81        74        26        16        28 
##   S1PyrL6  S1PyrL6b    SubPyr     Vend1     Vend2      Vsmc 
##        39        21        22        32       105        62

5.5 Tung et al., Sci Rep 2017

Batch effects and the effective design of single-cell gene expression studies. Po-Yuan Tung, John D. Blischak, Chiaowen Joyce Hsiao, David A. Knowles, Jonathan E. Burnett, Jonathan K. Pritchard & Yoav Gilad. Scientific Reports 7, Article number: 39921 (2017). doi:10.1038/srep39921.

fname <- system.file("extdata", "tung_UMIs_top10k.rds", package = "griph")
M <- readRDS(fname)
dim(M)
## [1] 10000   864
label <- attr(M, "label")
table(label)
## label
## NA19098.r1 NA19098.r2 NA19098.r3 NA19101.r1 NA19101.r2 NA19101.r3 NA19239.r1 
##         96         96         96         96         96         96         96 
## NA19239.r2 NA19239.r3 
##         96         96
label2 <- attr(M, "label2")
table(label2)
## label2
## NA19098 NA19101 NA19239 
##     288     288     288

Session info

Here is the output of sessionInfo() on the system on which this document was compiled:

## R version 3.4.1 (2017-06-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] Rtsne_0.13      griph_0.1.1     BiocStyle_2.4.1
## 
## loaded via a namespace (and not attached):
##  [1] igraph_1.1.2        Rcpp_0.12.12        bigmemory.sri_0.1.3
##  [4] knitr_1.17          magrittr_1.5        doParallel_1.0.10  
##  [7] lattice_0.20-35     coop_0.6-0          foreach_1.4.3      
## [10] bigmemory_4.5.19    stringr_1.2.0       tools_3.4.1        
## [13] QUIC_1.1            parallel_3.4.1      grid_3.4.1         
## [16] corpcor_1.6.9       htmltools_0.3.6     iterators_1.0.8    
## [19] yaml_2.1.14         rprojroot_1.2       digest_0.6.12      
## [22] bookdown_0.5        Matrix_1.2-11       RColorBrewer_1.1-2 
## [25] codetools_0.2-15    evaluate_0.10.1     rmarkdown_1.6      
## [28] stringi_1.1.5       compiler_3.4.1      backports_1.1.0    
## [31] pkgconfig_2.0.1