Using crumblr in practice

Introduction

Changes in cell type composition play an important role in health and disease. Recent advances in single cell technology have enabled measurement of cell type composition at increasing cell lineage resolution across large cohorts of individuals. Yet this raises new challenges for statistical analysis of these compositional data to identify changes associated with a phenotype. We introduce crumblr, a scalable statistical method for analyzing count ratio data using precision-weighted linear models incorporating random effects for complex study designs. Uniquely, crumblr performs tests of association at multiple levels of the cell lineage hierarchy using multivariate regression to increase power over tests of a single cell component. In simulations, crumblr increases power compared to existing methods, while controlling the false positive rate.

The crumblr package integrates with the variancePartition and dreamlet packages in the Bioconductor ecosystem.

Here we consider counts for 8 cell types from quantified using single cell RNA-seq data from unstimulated and interferon β stimulated PBMCs from 8 subjects (Kang, et al., 2018).

The functions here incorporate the precision weights:

Installation

To install this package, start R and enter:

# 1) Make sure Bioconductor is installed
if (!require("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# 2) Install crumblr and dependencies:
# From Bioconductor
BiocManager::install("crumblr")

Analysis workflow

Process data

Here we evaluate whether the observed cell proportions change in response to interferon β. Given the results here, we cannot reject the null hypothesis that interferon β does not affect the cell type proportions.

library(crumblr)

# Load cell counts, clustering and metadata
# from Kang, et al. (2018) https://doi.org/10.1038/nbt.4042
data(IFNCellCounts)

# Apply crumblr transformation
# cobj is an EList object compatable with limma workflow
# cobj$E stores transformed values
# cobj$weights stores precision weights
#    corresponding to the regularized inverse variance
cobj <- crumblr(df_cellCounts)

Variance partitioning

Decomposing the variance illustrates that more variation is explained by subject than stimulation status.

library(variancePartition)

# Partition variance into components for Subject (i.e. ind)
#   and stimulation status, and residual variation
form <- ~ (1 | ind) + (1 | StimStatus)
vp <- fitExtractVarPartModel(cobj, form, info)

# Plot variance fractions
fig.vp <- plotPercentBars(vp)
fig.vp

PCA

Performing PCA on the transformed cell counts indicates that the samples cluster based on subject rather than stimulation status. Here, we use standardize() so that each observation has approximately equal variance (i.e. homoscedasticity) by dividing the CLR transformed frequencies by their estimated sampling standard deviation. Transforming the data to be approximately homoscedastic has been shown to improve performance of PCA.

library(ggplot2)

# Perform PCA
# use crumblr::standardize() to get values with
# approximately equal sampling variance,
# which is a key property for downstream PCA and clustering analysis.
pca <- prcomp(t(standardize(cobj)))

# merge with metadata
df_pca <- merge(pca$x, info, by = "row.names")

# Plot PCA
#   color by Subject
#   shape by Stimulated vs unstimulated
ggplot(df_pca, aes(PC1, PC2, color = as.character(ind), shape = StimStatus)) +
  geom_point(size = 3) +
  theme_classic() +
  theme(aspect.ratio = 1) +
  scale_color_discrete(name = "Subject") +
  xlab("PC1") +
  ylab("PC2")

Hierarchical clustering

The samples from the same subject also cluster together.

heatmap(cobj$E)

Differential testing

# Use variancePartition workflow to analyze each cell type
# Perform regression on each cell type separately
#  then use eBayes to shrink residual variance
# Also compatible with limma::lmFit()
fit <- dream(cobj, ~ StimStatus + ind, info)
fit <- eBayes(fit)

# Extract results for each cell type
topTable(fit, coef = "StimStatusstim", number = Inf)

##                         logFC    AveExpr          t     P.Value  adj.P.Val         B
## CD8 T cells       -0.25085170  0.0857175 -4.0787416 0.002436375 0.01949100 -1.279815
## Dendritic cells    0.37386979 -2.1849234  3.1619195 0.010692544 0.02738587 -2.638507
## CD14+ Monocytes   -0.10525402  1.2698117 -3.1226341 0.011413912 0.02738587 -2.709377
## B cells           -0.10478652  0.5516882 -3.0134349 0.013692935 0.02738587 -2.940542
## CD4 T cells       -0.07840101  2.0201947 -2.2318104 0.050869691 0.08139151 -4.128069
## FCGR3A+ Monocytes  0.07425165 -0.2567492  1.6647681 0.128337022 0.17111603 -4.935304
## NK cells           0.10270672  0.3797777  1.5181860 0.161321761 0.18436773 -5.247806
## Megakaryocytes     0.01377768 -1.8655172  0.1555131 0.879651456 0.87965146 -6.198336

Multivariate testing along a tree

We can gain power by jointly testing multiple cell types using a multivariate statistical model, instead of testing one cell type at a time. Here we construct a hierarchical clustering between cell types based on gene expression from pseudobulk, and perform a multivariate test for each internal node of the tree based on its leaf nodes. The results for the leaves are the same as from topTable() above. At each internal node treeTest() performs a fixed effects meta-analysis of the coefficients of the leaves while modeling the covariance between coefficient estimates. In the backend, this is implemented using variancePartition::mvTest() and remaCor package.

Here the hierarchical clustering, hcl, is precomputed from pseudobulk gene expression using buildClusterTreeFromPB().

# Perform multivariate test across the hierarchy
res <- treeTest(fit, cobj, hcl, coef = "StimStatusstim")

# Plot hierarchy and testing results
plotTreeTest(res)

# Plot hierarchy and regression coefficients
plotTreeTestBeta(res)

Combined plotting

plotTreeTestBeta(res) +
  theme(legend.position = "bottom", legend.box = "vertical") |
  plotForest(res, hide = FALSE) |
  fig.vp

Hierarchical clustering

The hierarchical clustering used by treeTest() can be computed a number of ways, depending on the available data and biological question. For example, see Article for details about how hierarchical clustering was run in this dataset.

In general, hierarchical clustering can be computed from

pseudobulked single cell gene expression

hcl <- buildClusterTreeFromPB(pb)

cell type frequencies

# correlation matrix between all cell types
C <- cor(t(standardize(cobj)))

# convert to distance
dm <- as.dist(1 - abs(C))

# eval hierarchical clustering
hcl <- hclust(dm)

Newick formated tree computed from external data

# Make sure packages are installed
# BiocManager::install(c("ctc", "ape", "phylogram"))
library(ape)
library(ctc)
library(phylogram)
library(tidyverse)

# Write tree to file, 
# edit manually
# then read back into R
# 

# Specify tree as text in Newick format
txt = "((CD14+ Monocytes,(B cells,(Dendritic cells,Megakaryocytes))),(CD8 T cells,(NK cells,(CD4 T cells,FCGR3A+ Monocytes))));"

# read from text
hcl_from_txt <- read.tree(text = txt) %>%
                  as.dendrogram.phylo %>%
                  as.hclust

# Alternatively, 
# write existing tree to file
# and edit manaully
write(hc2Newick(hcl),file='hcl.nwk')

hcl_from_txt2 <- read.tree(file = 'hcl.nwk') %>%
                  as.dendrogram.phylo %>%
                  as.hclust

Considerations

Computational scaling

The crumblr() workflow is very fast, especially compared to more demanding differential expression analysies. Applying crumblr() requires <1 sec, even for very large datsets. Differential testing with dream() takes < 10 seconds for fixed effects models and < 30 seconds for mixed models with typical sample sizes and number of cell types. Running treeTest() can be a little more demanding, but should finish in < 30 seconds with 20 cell types.

Complex study designs

The crumblr() workflow can handle complex study designs with repeated measures or multiple random effects. dream() uses lme4::lmer() in the backend to fit weighted linear mixed models. Considerations about defining the regression model are described in documentation to variancePartition or this book by the author of lme4.

Tuning parameters

crumblr() uses two tuning parameters accessable to the user. These are fixed to default values in all simulations and data analysis. The user is strongly recommended to keep these dfault values.

In order to deal with zero counts, crumblr() uses a pseudocount (default: 0.5) added to all observed counts.
For real data, the asymptotic variance formula can give weights that vary substantially across samples and give very high weights for a subset of samples. In order to address this, we regularize the weights to reduce the variation in the weights to have a maximum ratio (default of 5) between the maximum and specified quantile value (default of 5%).

Session Info

## R version 4.5.1 (2025-06-13)
## Platform: aarch64-apple-darwin23.6.0
## Running under: macOS Sonoma 14.7.1
## 
## Matrix products: default
## BLAS/LAPACK: /opt/homebrew/Cellar/openblas/0.3.30/lib/libopenblasp-r0.3.30.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] muscat_1.22.0            crumblr_0.99.22          variancePartition_1.37.4
## [4] BiocParallel_1.42.1      limma_3.64.3             ggplot2_3.5.2           
## [7] BiocStyle_2.36.0        
## 
## loaded via a namespace (and not attached):
##   [1] splines_4.5.1               bitops_1.0-9                ggplotify_0.1.2            
##   [4] tibble_3.3.0                lifecycle_1.0.4             Rdpack_2.6.4               
##   [7] edgeR_4.6.3                 doParallel_1.0.17           globals_0.18.0             
##  [10] lattice_0.22-7              MASS_7.3-65                 backports_1.5.0            
##  [13] magrittr_2.0.3              sass_0.4.10                 rmarkdown_2.29             
##  [16] jquerylib_0.1.4             yaml_2.3.10                 sctransform_0.4.2          
##  [19] minqa_1.2.8                 RColorBrewer_1.1-3          abind_1.4-8                
##  [22] EnvStats_3.1.0              glmmTMB_1.1.11              GenomicRanges_1.60.0       
##  [25] purrr_1.1.0                 BiocGenerics_0.54.0         yulab.utils_0.2.0          
##  [28] circlize_0.4.16             GenomeInfoDbData_1.2.14     IRanges_2.42.0             
##  [31] S4Vectors_0.46.0            ggrepel_0.9.6               pbkrtest_0.5.5             
##  [34] irlba_2.3.5.1               listenv_0.9.1               tidytree_0.4.6             
##  [37] parallelly_1.45.1           pkgdown_2.1.3               codetools_0.2-20           
##  [40] DelayedArray_0.34.1         scuttle_1.18.0              tidyselect_1.2.1           
##  [43] shape_1.4.6.1               aplot_0.2.8                 UCSC.utils_1.4.0           
##  [46] farver_2.1.2                lme4_1.1-37                 ScaledMatrix_1.16.0        
##  [49] viridis_0.6.5               matrixStats_1.5.0           stats4_4.5.1               
##  [52] jsonlite_2.0.0              GetoptLong_1.0.5            BiocNeighbors_2.2.0        
##  [55] scater_1.36.0               iterators_1.0.14            systemfonts_1.2.3          
##  [58] foreach_1.5.2               tools_4.5.1                 progress_1.2.3             
##  [61] treeio_1.32.0               ragg_1.4.0                  Rcpp_1.1.0                 
##  [64] blme_1.0-6                  glue_1.8.0                  gridExtra_2.3              
##  [67] SparseArray_1.8.1           xfun_0.52                   mgcv_1.9-3                 
##  [70] DESeq2_1.48.1               MatrixGenerics_1.20.0       GenomeInfoDb_1.44.1        
##  [73] dplyr_1.1.4                 withr_3.0.2                 numDeriv_2016.8-1.1        
##  [76] BiocManager_1.30.26         fastmap_1.2.0               boot_1.3-31                
##  [79] caTools_1.18.3              digest_0.6.37               rsvd_1.0.5                 
##  [82] R6_2.6.1                    gridGraphics_0.5-1          textshaping_1.0.1          
##  [85] colorspace_2.1-1            gtools_3.9.5                RhpcBLASctl_0.23-42        
##  [88] tidyr_1.3.1                 generics_0.1.4              data.table_1.17.8          
##  [91] corpcor_1.6.10              prettyunits_1.2.0           httr_1.4.7                 
##  [94] htmlwidgets_1.6.4           S4Arrays_1.8.1              pkgconfig_2.0.3            
##  [97] gtable_0.3.6                ComplexHeatmap_2.24.1       SingleCellExperiment_1.30.1
## [100] XVector_0.48.0              remaCor_0.0.18              htmltools_0.5.8.1          
## [103] bookdown_0.43               TMB_1.9.17                  zigg_0.0.2                 
## [106] clue_0.3-66                 scales_1.4.0                Biobase_2.68.0             
## [109] png_0.1-8                   fANCOVA_0.6-1               reformulas_0.4.1           
## [112] ggfun_0.2.0                 knitr_1.50                  reshape2_1.4.4             
## [115] rjson_0.2.23                nlme_3.1-168                nloptr_2.2.1               
## [118] cachem_1.1.0                GlobalOptions_0.1.2         stringr_1.5.1              
## [121] KernSmooth_2.23-26          parallel_4.5.1              vipor_0.4.7                
## [124] desc_1.4.3                  pillar_1.11.0               grid_4.5.1                 
## [127] vctrs_0.6.5                 gplots_3.2.0                BiocSingular_1.24.0        
## [130] beachmat_2.24.0             cluster_2.1.8.1             beeswarm_0.4.0             
## [133] evaluate_1.0.4              mvtnorm_1.3-3               cli_3.6.5                  
## [136] locfit_1.5-9.12             compiler_4.5.1              rlang_1.1.6                
## [139] crayon_1.5.3                future.apply_1.20.0         labeling_0.4.3             
## [142] plyr_1.8.9                  fs_1.6.6                    ggbeeswarm_0.7.2           
## [145] stringi_1.8.7               viridisLite_0.4.2           lmerTest_3.1-3             
## [148] lazyeval_0.2.2              aod_1.3.3                   Matrix_1.7-3               
## [151] hms_1.1.3                   patchwork_1.3.1             future_1.67.0              
## [154] statmod_1.5.0               SummarizedExperiment_1.38.1 rbibutils_2.3              
## [157] Rfast_2.1.5.1               broom_1.0.9                 RcppParallel_5.1.10.9000   
## [160] bslib_0.9.0                 ggtree_3.16.3               ape_5.8-1

Developed by Gabriel Hoffman

Run on 2025-08-14 20:44:44.222783

Introduction

Installation

Analysis workflow

Process data

Variance partitioning

PCA

Hierarchical clustering

Differential testing

Multivariate testing along a tree

Combined plotting

Hierarchical clustering

Considerations

Computational scaling

Complex study designs

Tuning parameters

Session Info