Handling large H5AD datasets
On-disk versus in-memory storage
Developed by Gabriel Hoffman
Source:vignettes/h5ad_on_disk.Rmd
h5ad_on_disk.Rmd
Introduction
As the scale of single cell RNA-seq datasets continues to increase,
loading the entire dataset into memory is often impractical. Dreamlet
can take advantage of efficient data storage through the Bioconductor’s SingleCellExperiment
interface to dramatically reduce memory usage. There are two general
methods for storing large single-cell datasets. SingleCellExperiment,
and therefore dreamlet
, are compatible with both:
On-disk storage: The data remains on the physical disk and is only read into memory when requested. In R there is little data duplication in memory since operations are delayed until the result is requested using
DelayedMatrix
. Memory usage is minimized at the cost of performance since disk access can be slow.In-memory storage: Single cell read counts are generally sprase with, say, 20% of read counts being non-zero. The counts can be stored as a
sparseMatrix
that stores only non-zero entries. Performance is higher since data is in memory, but excessive memory usage can be a problem.
On-disk storage: zellkonverter
zellkonverter takes advantage of the H5AD file format built on the HDF5 format in order to dramatically reduce memory usage while still retaining performance. The zellkonverter package uses a DelayedArray backend to provide a seamless interface to an on-disk H5AD dataset through the interface of the SingleCellExperiment class.
zellkonverter
is an interface for python code. Note that
the python dependencies are installed the first time
readH5AD()
is run, rather than when the package is
installed. So the first run of readH5AD()
can take a few
minutes, but subsequent runs will be fast.
Standard usage
library(zellkonverter)
library(SingleCellExperiment)
# Create SingleCellExperiment object that points to on-disk H5AD file
sce <- readH5AD(h5ad_file, use_hdf5 = TRUE)
Access alternative data
Some software, including pegasus, store
normalized data in the default portion of the file and save raw counts
in the raw/
field. Here, load a H5AD file generated by
pegasus.
# read raw/ from H5AD file
# raw = TRUE tells readH5AD() to read alternative data
# Must use zellkonverter >=1.3.3
sce_in <- readH5AD(h5ad_file, use_hdf5 = TRUE, raw = TRUE)
# use `raw` as counts
sce <- swapAltExp(sce_in, "raw") # use raw as main
counts(sce) <- assay(sce, "X") # set counts assay to data in X
assay(sce, "X") <- NULL # free X
In-memory storage: Seurat
Seurat can also convert
and import H5AD files, and then convert to SingleCellExperiment.
The resulting SingleCellExperiment
object stores data as a
sparseMatrix
.
Comparing interfaces
The SingleCellExperiment
interface to
zellkonverter
and Seurat
hides the backend
differences from the typical R user. The usage of dreamlet
is the same in both cases. For small to medium datasets, the performance
differences should be minimal. However, for large datasets there can be
a substantial difference in performance. Use of
zellkonverter
and DelayedMatrix
minimize
memory usage at the cost of computationl performance. Use of
Seurat
and sparseMatrix
can be faster if there
is sufficent memory.
aggregateToPseudoBulk()
for computing pseudobulk data
uses different backend implements for on-disk DelayedMatrix
and in-memory sparseMatrix
. Both versions are highly
optimized compared to a naive implementation.