Testing non-linear effects
Categories and continuous variables
Developed by Gabriel Hoffman
Run on 2024-11-05 15:35:38
Source: vignettes/non_lin_eff.Rmd
Introduction
Typical analysis using regression models assumes a linear effect of the covariate on the response. Here we consider testing non-linear effects in the case of 1) continuous and 2) ordered categorical variables.
We demonstrate this feature on a lightly modified analysis of PBMCs from 8 individuals stimulated with interferon-β (Kang, et al, 2018, Nature Biotech).
Standard processing
Here is the code from the main vignette:
library(dreamlet)
library(muscat)
library(ExperimentHub)
library(scater)
# Download data, specifying EH2259 for the Kang, et al study
eh <- ExperimentHub()
sce <- eh[["EH2259"]]
# remove genes with no counts, and keep only singlet cells
sce <- sce[rowSums(counts(sce) > 0) > 0, ]
sce <- sce[, colData(sce)$multiplets == "singlet"]
# compute QC metrics
qc <- perCellQCMetrics(sce)
# remove cells with few or many detected genes
ol <- isOutlier(metric = qc$detected, nmads = 2, log = TRUE)
sce <- sce[, !ol]
# set variable indicating stimulated (stim) or control (ctrl)
sce$StimStatus <- sce$stim
sce$id <- paste0(sce$StimStatus, sce$ind)
# Create pseudobulk
pb <- aggregateToPseudoBulk(sce,
  assay = "counts",
  cluster_id = "cell",
  sample_id = "id",
  verbose = FALSE
)
Continuous variable
Consider the continuous variable Age. Typical analysis only considers linear effects using a single regression coefficient, but we also want to consider the non-linear effects of age. We can perform a basis expansion using splines and use 3 coefficients, instead of 1, to model the age effect.
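To make the basis expansion concrete, here is a minimal sketch in base R, independent of dreamlet. The variable names age, basis and X are illustrative and not part of the vignette. It shows that a natural spline with 3 degrees of freedom turns one covariate into 3 columns of the design matrix, which is why 3 coefficients are estimated.

```r
library(splines)

# Simulated ages for 16 hypothetical samples (illustration only)
set.seed(1)
age <- runif(16, 18, 65)

# Natural cubic spline basis with 3 degrees of freedom:
# one column per basis function
basis <- ns(age, df = 3)
dim(basis)

# The design matrix has an intercept plus the 3 spline columns
X <- model.matrix(~ ns(age, df = 3))
colnames(X)
```

Each spline column gets its own regression coefficient, so the age effect is described by 3 numbers rather than a single slope.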
# Simulate age between 18 and 65
pb$Age <- runif(ncol(pb), 18, 65)
# formula includes non-linear effects of Age
# by using a natural cubic spline with 3 degrees of freedom
# This corresponds to using 3 coefficients instead of 1
form <- ~ splines::ns(Age, 3)
# Normalize and apply voom/voomWithDreamWeights
res.proc <- processAssays(pb, form, min.count = 5)
# Differential expression analysis within each assay
res.dl <- dreamlet(res.proc, form)
# The spline basis has 3 degrees of freedom, so there are 3 coefficients
# estimated for Age effects
coefNames(res.dl)
## [1] "(Intercept)" "splines::ns(Age, 3)1" "splines::ns(Age, 3)2"
## [4] "splines::ns(Age, 3)3"
# Jointly test effects of the 3 spline components
# The test of the 3 coefficients is performed with an F-statistic
topTable(res.dl, coef = coefNames(res.dl)[2:4], number = 3)
## DataFrame with 3 rows and 9 columns
## assay ID splines..ns.Age..3.1 splines..ns.Age..3.2
## <character> <character> <numeric> <numeric>
## 1 CD4 T cells GTF3A 0.751933 2.24124
## 2 CD4 T cells RGS2 -0.591826 -3.72216
## 3 CD4 T cells HLA-DRB1_ENSG0000019.. -0.924649 -1.94493
## splines..ns.Age..3.3 AveExpr F P.Value adj.P.Val
## <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 -0.418741 8.63391 16.4736 2.82474e-05 0.178952
## 2 0.385533 6.97805 14.9336 5.15218e-05 0.178952
## 3 1.251838 5.24781 14.7922 5.45632e-05 0.178952
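The idea behind jointly testing the 3 spline coefficients can be sketched with ordinary least squares in base R. This is an analogy under simplifying assumptions, not dreamlet's implementation: dreamlet computes its test within its own precision-weighted pipeline, so its statistics would differ. The simulated response y and the object names are hypothetical.

```r
library(splines)

# Simulate a response with a non-linear dependence on age (illustration only)
set.seed(1)
n <- 100
age <- runif(n, 18, 65)
y <- sin(age / 10) + rnorm(n, sd = 0.5)

# Null model: no age effect
fit0 <- lm(y ~ 1)

# Alternative model: age effect modeled with 3 spline coefficients
fit1 <- lm(y ~ ns(age, df = 3))

# F-test comparing the nested models: all 3 coefficients are tested at once
tab <- anova(fit0, fit1)
tab
```

The Df column of the comparison reports 3, reflecting that a single F-statistic summarizes evidence across all three spline coefficients.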
Ordered categorical
We can also test non-linear effects in the case of categorical variables with a natural ordering to the categories. Consider time course data with 4 time points. Each time point is a category and has a natural ordering from first to last.
We have multiple options to model the time course.
- Continuous: Modeling time point as a continuous variable uses a single regression coefficient to model the linear effects of the time course. This is simple and models the order of the time points, but ignores non-linear effects.
  Model using as.numeric(TimePoint)
- Categorical: Including time point as a typical categorical variable estimates the mean response value for each category, so it estimates 4 coefficients. While this can be useful for comparing two categories, it ignores the order of the time points.
  Model using factor(TimePoint)
- Ordered categorical: Here, the trend across ordered time points is modeled using orthogonal polynomials. The trend is decomposed into independent linear, quadratic, etc., effects that can be tested either jointly or by themselves.
  Model using ordered(TimePoint)
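The three encodings can be compared directly by inspecting the design matrices that base R builds for each one. This is a minimal sketch using a hypothetical vector tp with 4 time points and R's default contrast settings; it does not involve dreamlet.

```r
# 16 hypothetical samples across 4 ordered time points
tp <- ordered(paste0("time_", rep(1:4, 4)))

# Continuous: intercept plus a single slope
X_num <- model.matrix(~ as.numeric(tp))

# Categorical: intercept plus one indicator per non-reference level
X_cat <- model.matrix(~ factor(tp, ordered = FALSE))

# Ordered: intercept plus linear, quadratic and cubic polynomial contrasts
X_ord <- model.matrix(~tp)
colnames(X_ord)
## [1] "(Intercept)" "tp.L"        "tp.Q"        "tp.C"

# The polynomial contrasts for 4 levels are orthonormal,
# so their cross-product is the identity matrix
C <- contr.poly(4)
round(crossprod(C), 10)
```

Because the polynomial contrast columns are orthogonal in this balanced design, the linear, quadratic and cubic components can be interpreted and tested separately.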
Here we simulate 4 time points and perform differential expression analysis.
# Consider data generated across 4 time points
# While there are no time points in the real data
# we can add some for demonstration purposes
pb$TimePoint <- ordered(paste0("time_", rep(1:4, 4)))
# examine the ordering
pb$TimePoint
## [1] time_1 time_2 time_3 time_4 time_1 time_2 time_3 time_4 time_1 time_2
## [11] time_3 time_4 time_1 time_2 time_3 time_4
## Levels: time_1 < time_2 < time_3 < time_4
# Use formula including time point
form <- ~TimePoint
# Normalize and apply voom/voomWithDreamWeights
res.proc <- processAssays(pb, form, min.count = 5)
# Differential expression analysis within each assay
res.dl <- dreamlet(res.proc, form)
# Examine the estimated coefficients
# For TimePoint, the model estimates
# linear (i.e. L)
# quadratic (i.e. Q)
# and cubic (i.e. C) effects
coefNames(res.dl)
## [1] "(Intercept)" "TimePoint.L" "TimePoint.Q" "TimePoint.C"
# Test only linear effect
topTable(res.dl, coef = "TimePoint.L", number = 3)
## DataFrame with 3 rows and 9 columns
## assay ID logFC AveExpr t P.Value adj.P.Val
## <character> <character> <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 CD4 T cells DCXR -0.671645 6.52867 -5.42880 4.78666e-05 0.393058
## 2 CD4 T cells GGA2 -0.890112 4.98987 -5.26372 6.69370e-05 0.393058
## 3 CD8 T cells FTH1 -0.759875 14.55380 -4.99199 8.13672e-05 0.393058
## B z.std
## <numeric> <numeric>
## 1 1.686147 -5.42880
## 2 0.936906 -5.26372
## 3 1.627753 -4.99199
# Test linear, quadratic and cubic effects
coefs <- c("TimePoint.L", "TimePoint.Q", "TimePoint.C")
topTable(res.dl, coef = coefs, number = 3)
## DataFrame with 3 rows and 9 columns
## assay ID TimePoint.L TimePoint.Q TimePoint.C
## <character> <character> <numeric> <numeric> <numeric>
## 1 CD8 T cells CD52 0.903032 -1.528456 -0.984065
## 2 CD8 T cells CCL5_ENSG00000161570 0.626201 -0.986502 -0.645468
## 3 CD8 T cells CD2 0.887578 -1.152415 -0.195616
## AveExpr F P.Value adj.P.Val
## <numeric> <numeric> <numeric> <numeric>
## 1 8.69458 16.0946 1.90868e-05 0.0853298
## 2 11.88811 16.0433 1.94982e-05 0.0853298
## 3 9.67181 15.7793 2.17777e-05 0.0853298
Sample filtering
Due to variation in cell and read count for each sample, processAssays() filters out some samples. This filtering is summarized here:
details(res.dl)
## assay n_retain formula formDropsTerms n_genes n_errors
## 1 B cells 16 ~TimePoint FALSE 1961 0
## 2 CD14+ Monocytes 16 ~TimePoint FALSE 3087 0
## 3 CD4 T cells 16 ~TimePoint FALSE 5262 0
## 4 CD8 T cells 16 ~TimePoint FALSE 1030 0
## 5 Dendritic cells 13 ~TimePoint FALSE 164 0
## 6 FCGR3A+ Monocytes 16 ~TimePoint FALSE 1160 0
## 7 Megakaryocytes 13 ~TimePoint FALSE 172 0
## 8 NK cells 16 ~TimePoint FALSE 1656 0
## error_initial
## 1 FALSE
## 2 FALSE
## 3 FALSE
## 4 FALSE
## 5 FALSE
## 6 FALSE
## 7 FALSE
## 8 FALSE
While all 16 samples are retained in B cells, only 13 are retained for Megakaryocytes. Filtering of this kind can result in a time point being dropped, and so the polynomial expansion for some cell types can have a lower degree. The combined results will then have NA values for these coefficients. For example, for TIMP1 in Megakaryocytes there is not enough data to fit the cubic term, so TimePoint.C is NA.
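The loss of a polynomial term when a time point disappears can be illustrated in base R. This is a sketch under the assumption that the unused level is dropped before the model is fit. How dreamlet handles this internally is not shown here, and tp, keep and tp_sub are hypothetical names.

```r
# 16 hypothetical samples across 4 ordered time points
tp <- ordered(paste0("time_", rep(1:4, 4)))

# Suppose filtering removes every sample from the last time point
keep <- tp != "time_4"
tp_sub <- droplevels(tp[keep])

# With only 3 levels left, the expansion stops at the quadratic term
colnames(model.matrix(~tp_sub))
## [1] "(Intercept)" "tp_sub.L"    "tp_sub.Q"
```

With 3 remaining levels there is no cubic column to estimate, which is consistent with a cubic coefficient being reported as NA for that cell type when results are combined.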
Session Info
## R version 4.4.0 (2024-04-24)
## Platform: aarch64-apple-darwin23.5.0
## Running under: macOS Sonoma 14.6.1
##
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Users/gabrielhoffman/prog/R-4.4.0/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] muscData_1.18.0 scater_1.32.1
## [3] scuttle_1.14.0 ExperimentHub_2.12.0
## [5] AnnotationHub_3.12.0 BiocFileCache_2.12.0
## [7] dbplyr_2.5.0 muscat_1.18.0
## [9] dreamlet_1.3.3 SingleCellExperiment_1.26.0
## [11] SummarizedExperiment_1.34.0 Biobase_2.64.0
## [13] GenomicRanges_1.56.1 GenomeInfoDb_1.40.1
## [15] IRanges_2.38.1 S4Vectors_0.42.1
## [17] BiocGenerics_0.50.0 MatrixGenerics_1.16.0
## [19] matrixStats_1.4.1 variancePartition_1.35.5
## [21] BiocParallel_1.38.0 limma_3.60.4
## [23] ggplot2_3.5.1 BiocStyle_2.32.1
##
## loaded via a namespace (and not attached):
## [1] fs_1.6.4 bitops_1.0-8
## [3] httr_1.4.7 RColorBrewer_1.1-3
## [5] doParallel_1.0.17 Rgraphviz_2.48.0
## [7] numDeriv_2016.8-1.1 tools_4.4.0
## [9] sctransform_0.4.1 backports_1.5.0
## [11] utf8_1.2.4 R6_2.5.1
## [13] metafor_4.6-0 mgcv_1.9-1
## [15] GetoptLong_1.0.5 withr_3.0.1
## [17] prettyunits_1.2.0 gridExtra_2.3
## [19] cli_3.6.3 textshaping_0.4.0
## [21] sass_0.4.9 KEGGgraph_1.64.0
## [23] SQUAREM_2021.1 mvtnorm_1.3-1
## [25] blme_1.0-6 pkgdown_2.1.1
## [27] mixsqp_0.3-54 systemfonts_1.1.0
## [29] zenith_1.6.0 parallelly_1.38.0
## [31] invgamma_1.1 RSQLite_2.3.7
## [33] generics_0.1.3 shape_1.4.6.1
## [35] gtools_3.9.5 dplyr_1.1.4
## [37] Matrix_1.7-0 metadat_1.2-0
## [39] ggbeeswarm_0.7.2 fansi_1.0.6
## [41] abind_1.4-8 lifecycle_1.0.4
## [43] yaml_2.3.10 edgeR_4.2.1
## [45] mathjaxr_1.6-0 gplots_3.1.3.1
## [47] SparseArray_1.4.8 grid_4.4.0
## [49] blob_1.2.4 crayon_1.5.3
## [51] lattice_0.22-6 beachmat_2.20.0
## [53] msigdbr_7.5.1 annotate_1.82.0
## [55] KEGGREST_1.44.1 pillar_1.9.0
## [57] knitr_1.48 ComplexHeatmap_2.20.0
## [59] rjson_0.2.23 boot_1.3-31
## [61] corpcor_1.6.10 future.apply_1.11.2
## [63] codetools_0.2-20 glue_1.8.0
## [65] data.table_1.16.0 vctrs_0.6.5
## [67] png_0.1-8 Rdpack_2.6.1
## [69] gtable_0.3.5 assertthat_0.2.1
## [71] cachem_1.1.0 xfun_0.47
## [73] mime_0.12 rbibutils_2.2.16
## [75] S4Arrays_1.4.1 Rfast_2.1.0
## [77] iterators_1.0.14 statmod_1.5.0
## [79] nlme_3.1-166 pbkrtest_0.5.3
## [81] bit64_4.5.2 filelock_1.0.3
## [83] progress_1.2.3 EnvStats_3.0.0
## [85] bslib_0.8.0 TMB_1.9.15
## [87] irlba_2.3.5.1 vipor_0.4.7
## [89] KernSmooth_2.23-24 colorspace_2.1-1
## [91] rmeta_3.0 DBI_1.2.3
## [93] DESeq2_1.44.0 tidyselect_1.2.1
## [95] curl_5.2.3 bit_4.5.0
## [97] compiler_4.4.0 graph_1.82.0
## [99] BiocNeighbors_1.22.0 desc_1.4.3
## [101] DelayedArray_0.30.1 bookdown_0.40
## [103] scales_1.3.0 caTools_1.18.3
## [105] remaCor_0.0.18 rappdirs_0.3.3
## [107] stringr_1.5.1 digest_0.6.37
## [109] minqa_1.2.8 rmarkdown_2.28
## [111] aod_1.3.3 XVector_0.44.0
## [113] RhpcBLASctl_0.23-42 htmltools_0.5.8.1
## [115] pkgconfig_2.0.3 lme4_1.1-35.5
## [117] sparseMatrixStats_1.16.0 mashr_0.2.79
## [119] fastmap_1.2.0 rlang_1.1.4
## [121] GlobalOptions_0.1.2 htmlwidgets_1.6.4
## [123] UCSC.utils_1.0.0 DelayedMatrixStats_1.26.0
## [125] jquerylib_0.1.4 jsonlite_1.8.9
## [127] BiocSingular_1.20.0 RCurl_1.98-1.16
## [129] magrittr_2.0.3 GenomeInfoDbData_1.2.12
## [131] munsell_0.5.1 Rcpp_1.0.13
## [133] babelgene_22.9 viridis_0.6.5
## [135] EnrichmentBrowser_2.34.1 RcppZiggurat_0.1.6
## [137] stringi_1.8.4 zlibbioc_1.50.0
## [139] MASS_7.3-61 plyr_1.8.9
## [141] listenv_0.9.1 parallel_4.4.0
## [143] ggrepel_0.9.6 Biostrings_2.72.1
## [145] splines_4.4.0 hms_1.1.3
## [147] circlize_0.4.16 locfit_1.5-9.10
## [149] reshape2_1.4.4 ScaledMatrix_1.12.0
## [151] BiocVersion_3.19.1 XML_3.99-0.17
## [153] evaluate_1.0.0 RcppParallel_5.1.9
## [155] BiocManager_1.30.25 nloptr_2.1.1
## [157] foreach_1.5.2 tidyr_1.3.1
## [159] purrr_1.0.2 future_1.34.0
## [161] clue_0.3-65 scattermore_1.2
## [163] ashr_2.2-63 rsvd_1.0.5
## [165] broom_1.0.6 xtable_1.8-4
## [167] fANCOVA_0.6-1 viridisLite_0.4.2
## [169] ragg_1.3.3 truncnorm_1.0-9
## [171] tibble_3.2.1 lmerTest_3.1-3
## [173] glmmTMB_1.1.9 memoise_2.0.1
## [175] beeswarm_0.4.0 AnnotationDbi_1.66.0
## [177] cluster_2.1.6 globals_0.16.3
## [179] GSEABase_1.66.0