rm(list = ls())
suppressPackageStartupMessages({
library(cevoDatasets)
library(cevomod)
library(tidyverse)
})theme_set(theme_minimal())
Sampling-oriented modelling of cancer evolution
Introduction
The rapid development of Next Generation Sequencing (NGS) methods resulted in the development of many methods for the deconvolution of tumor subclonal structure from the NGS data. Whereas many of them figure out tumor subclonal structure by clustering the mutations with similar allelic frequencies, recently published MOBSTER (Caravagna et al. 2020) applies the population genetics model not only to recognize the tumor clonal structure but also to infer the evolutionary parameters of identified subclones. Despite the criticism (Tarabichi et al. 2018; Tung and Durrett 2021) of the model’s assumptions, it offers an interesting way to investigate the process of tumor evolution.
An important limitation of this method is its requirement for deeply sequenced Whole Genome Sequencing (WGS) data. MOBSTER might be unable to recognize the low-frequency neutral mutations tail if many of them were removed by the variant calling algorithm due to data quality issues like sequencing depth. In our work, we reimplemented this model aware of this sampling event to fit the power- law neutral mutations tail into Whole Exome Sequencing data (WES), cheaper and more often available than WGS data.
Materials
We used the sequencing results from two sets of data:
- Breast Cancer study (unpublished yet, denoted BRCA)
- 10 patients
- 2-3 tumor samples + 1 control sample per patient
- Whole Exome Sequencing
- Acute Myleoid Leukaemia study (Shlush et al. 2017) (AMLRO)
- 11 patients
- 2 tumor samples + 1 control sample per patient
- Whole Genome Sequencing
Site Frequency Spectras for all the samples are shown in the Figure 2
<- calc_SFS(AMLRO)
amlro <- calc_SFS(OPUS_BRCA)
brca
plot_SFS(amlro, geom = "bar") +
hide_legend() +
facet_wrap(~sample_id, scales = "free_y") +
labs(title = "")
plot_SFS(brca, geom = "bar") +
hide_legend() +
facet_wrap(~sample_id, scales = "free_y") +
labs(title = "")
MOBSTER
We used MOBSTER (Caravagna et al. 2020) to identify clusters of neutral tail, clonal, and subclonal mutations. As shown in the figure Figure 3, MOBSTER was unable to recognize the neutral tail of mutations in any sample.
cevomod
In cevomod we fit the MOBSTER-like mixture of power-law and binomial distributions in several steps.
Neutral tail fitting
First, we use (Williams et al. 2016) formula and plot the cumulative number of mutations \(M(f)\) against the reciprocal of the mutations’ frequency \(1/f\) (Figure 4).
<- calc_Mf_1f(amlro)
amlro |>
amlro plot_Mf_1f(scale = FALSE) +
facet_wrap(~sample_id, scales = "free") +
hide_legend()
The relationship \(M(f) \sim 1/f\) should be linear for the mutations in the neutrally growing population, thus its slope can be used to fit the power-law curver of the neutral tail mutations on the Site Frequency Spectra plot.
We use sliding window algorithm to fit a number of linear \(M(f) = \alpha/f\) models to the 0.05-wide ranges of allelic frequencies and filter out the fits with R_squared lower than 0.98 (Williams et al. 2016). From the remaining models, we select the one with the lowest \(\alpha\) (Figure 5 and Figure 6, Table 1)
<- fit_neutral_models(amlro)
amlro |>
amlro plot_Mf_1f(scale = FALSE, alpha = 0.5) +
layer_lm_fits(amlro) +
facet_wrap(~sample_id, scales = "free") +
hide_legend()
|>
amlro plot_neutral_A_coefficients() +
labs(
x = "Variant Allele Frequency",
y = "alpha coefficient",
color = "selected coefficient"
)
$models$neutral_models |>
amlrofilter(best) |>
select(sample_id, component, slope = A, intercept = b) |>
head() |>
::kable() knitr
sample_id | component | slope | intercept |
---|---|---|---|
AMLRO-1_Dx | Neutral tail | 839.9501 | 22202.16263 |
AMLRO-1_Rx | Neutral tail | 865.9478 | 21987.18911 |
AMLRO-10_Dx | Neutral tail | 404.5073 | 81.72711 |
AMLRO-10_Rx | Neutral tail | 462.5176 | 736.69332 |
AMLRO-11_Dx | Neutral tail | 335.2720 | 1503.61682 |
AMLRO-11_Rx | Neutral tail | 301.7115 | 675.66577 |
Now we can use the selected \(\alpha\) coefficients to draw the power-law curves of the neutral tails on the SFS plots (Figure 7). Since we select the lowest fitted \(\alpha\), values represented by the curve should be higher or equal to the real SFS for most of the possible VAF values, with the exception of the lowest VAF range where the curve detaches from the SFS and predicts the numbers of mutations from expected from the model.
|>
amlro plot_models() +
hide_legend()
Sampling rate
Since the residuals of the neutral model for the lowest allelic frequencies represent the difference between the predicted and real number of low-frequency mutations, we can use those values to estimate the sampling rate, fraction of mutations filtered out by the variant caller due to low evidence reasons (Figure 8).
|>
amlro plot_sampling_rate()
Fitting clonal distributions
Model residuals for higher VAF values represent the numbers of mutations unpredicted by the neutral model. Thus, those values can be used to identify clones and sublones in the sample (Figure 9; Table 2)
<- fit_subclones(amlro)
amlro
plot_residuals_neutral_model(amlro) +
facet_wrap(~sample_id, scales = "free_y") +
hide_legend()
$models$binomial_models |>
amlrofilter(best) |>
select(sample_id, component, cellularity, N_mutations, BIC, mean_DP, sd_DP) |>
head() |>
::kable() knitr
sample_id | component | cellularity | N_mutations | BIC | mean_DP | sd_DP |
---|---|---|---|---|---|---|
AMLRO-1_Dx | Clone | 0.7492541 | 328 | 37002.14 | 61.54762 | 15.32128 |
AMLRO-1_Dx | Subclone 1 | 0.4964259 | 19290 | 37002.14 | 58.36380 | 14.47672 |
AMLRO-1_Dx | Subclone 2 | 0.1759814 | 3296 | 37002.14 | 54.44811 | 47.91583 |
AMLRO-1_Rx | Clone | 0.5192472 | 8522 | 35519.95 | 60.30717 | 13.05149 |
AMLRO-1_Rx | Subclone 1 | 0.4836764 | 10938 | 35519.95 | 60.36889 | 13.17624 |
AMLRO-1_Rx | Subclone 2 | 0.1803055 | 3242 | 35519.95 | 56.47942 | 59.49609 |
For convenence the entire process, the process of model fitting can be run with a single command run_cevomod()
, which takes a SNVs dataset (many samples can be provided to a single call of the function).
<- brca |>
brca run_cevomod()
Then, plot_models()
function can be used to plot the results.
plot_models(amlro) +
hide_legend() +
labs(title = "")
plot_models(brca) +
hide_legend() +
labs(title = "")
To facilitate the exploration of cevomod results for the users not familiar with R we prepared a browser app written in Shiny (Figure 11). In order to use it one should save the cevomod
results into $HOME/.cevomod/data.Rds
with saveRDS()
and then run the browser using run_browser()
function (Listing 1)
Listing 1: Browser usage
saveRDS(amlro, "~/.cevomod/data.Rds")
run_browser()
Additional plots
Main poster plot
Let’s use AMLRO-3_Dx sample with slightly different geoms to make in look better.
<- amlro |>
cd filter(sample_id == "AMLRO-3_Dx")
<- get_neutral_models(cd) |>
neutral_models select(.data$sample_id, .data$from, .data$to)
<- cevomod:::get_residuals(cd) |>
resid left_join(neutral_models, by = "sample_id") |>
group_by(.data$sample_id) |>
mutate(ylim = Inf) |>
ungroup() |>
mutate(neutr = .data$VAF >= .data$from & .data$VAF <= .data$to)
<- c(
colors data = "#00496FFF",
`Neutral Tail` = "#0F85A0FF",
Clones = "#ED8B00FF",
`Final fit` = "#DD4124FF"
)
plot_SFS(cd, geom = "bar", fill = "#00496F", color = "#00496F", alpha = 0.8, show.legend = FALSE) +
geom_area(
aes(.data$VAF, .data$neutral_pred),
data = resid |> filter(.data$neutral_pred < .data$ylim),
color = "#0F85A0FF", size = 1, fill = "#0F85A0FF", alpha = 0.5
+
) geom_area(
aes(.data$VAF, .data$binom_pred),
data = resid,
color = "#ED8B00FF", size = 1, fill = "#ED8B00FF", alpha = 0.3
+
) geom_line(
aes(.data$VAF, .data$model_pred),
data = resid |> filter(.data$neutral_pred < .data$ylim),
size = 1, color = "#DD4124FF"
+
) scale_color_manual(values = colors) +
scale_fill_manual(values = colors) +
coord_cartesian(ylim = c(0, 1400)) +
labs(title = "")
Session Info
::session_info() devtools
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.2.1 (2022-06-23)
os Ubuntu 22.04.1 LTS
system x86_64, linux-gnu
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz Europe/Warsaw
date 2022-11-15
pandoc 2.18 @ /usr/lib/rstudio/bin/quarto/bin/tools/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.2.1)
backports 1.4.1 2021-12-13 [1] CRAN (R 4.2.1)
BiocGenerics 0.42.0 2022-04-26 [1] Bioconductor
broom 1.0.0 2022-07-01 [1] CRAN (R 4.2.1)
cachem 1.0.6 2021-08-19 [1] CRAN (R 4.2.1)
callr 3.7.1 2022-07-13 [1] CRAN (R 4.2.1)
cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.2.1)
cevoDatasets * 0.2.3 2022-10-12 [1] local
cevomod * 0.3.3 2022-11-11 [1] local
circlize 0.4.15 2022-05-10 [1] CRAN (R 4.2.1)
cli 3.3.0 2022-04-25 [1] CRAN (R 4.2.1)
clue 0.3-61 2022-05-30 [1] CRAN (R 4.2.1)
cluster 2.1.2 2021-04-17 [4] CRAN (R 4.1.1)
codetools 0.2-18 2020-11-04 [4] CRAN (R 4.0.3)
colorspace 2.0-3 2022-02-21 [1] CRAN (R 4.2.1)
ComplexHeatmap 2.12.1 2022-08-09 [1] Bioconductor
crayon 1.5.1 2022-03-26 [1] CRAN (R 4.2.1)
DBI 1.1.3 2022-06-18 [1] CRAN (R 4.2.1)
dbplyr 2.2.1 2022-06-27 [1] CRAN (R 4.2.1)
devtools 2.4.4 2022-07-20 [1] CRAN (R 4.2.1)
digest 0.6.29 2021-12-01 [1] CRAN (R 4.2.1)
doParallel 1.0.17 2022-02-07 [1] CRAN (R 4.2.1)
dplyr * 1.0.9 2022-04-28 [1] CRAN (R 4.2.1)
ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.1)
evaluate 0.16 2022-08-09 [1] CRAN (R 4.2.1)
fansi 1.0.3 2022-03-24 [1] CRAN (R 4.2.1)
fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.2.1)
forcats * 0.5.2 2022-08-19 [1] CRAN (R 4.2.1)
foreach 1.5.2 2022-02-02 [1] CRAN (R 4.2.1)
fs 1.5.2 2021-12-08 [1] CRAN (R 4.2.1)
gargle 1.2.0 2021-07-02 [1] CRAN (R 4.2.1)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.1)
GetoptLong 1.0.5 2020-12-15 [1] CRAN (R 4.2.1)
ggplot2 * 3.3.6 2022-05-03 [1] CRAN (R 4.2.1)
GlobalOptions 0.1.2 2020-06-10 [1] CRAN (R 4.2.1)
glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.1)
googledrive 2.0.0 2021-07-08 [1] CRAN (R 4.2.1)
googlesheets4 1.0.1 2022-08-13 [1] CRAN (R 4.2.1)
gtable 0.3.0 2019-03-25 [1] CRAN (R 4.2.1)
haven 2.5.0 2022-04-15 [1] CRAN (R 4.2.1)
hms 1.1.2 2022-08-19 [1] CRAN (R 4.2.1)
htmltools 0.5.3 2022-07-18 [1] CRAN (R 4.2.1)
htmlwidgets 1.5.4 2021-09-08 [1] CRAN (R 4.2.1)
httpuv 1.6.5 2022-01-05 [1] CRAN (R 4.2.1)
httr 1.4.4 2022-08-17 [1] CRAN (R 4.2.1)
IRanges 2.30.1 2022-08-18 [1] Bioconductor
iterators 1.0.14 2022-02-05 [1] CRAN (R 4.2.1)
jsonlite 1.8.0 2022-02-22 [1] CRAN (R 4.2.1)
knitr 1.39 2022-04-26 [1] CRAN (R 4.2.1)
later 1.3.0 2021-08-18 [1] CRAN (R 4.2.1)
lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.2.1)
lubridate 1.8.0 2021-10-07 [1] CRAN (R 4.2.1)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.1)
matrixStats 0.62.0 2022-04-19 [1] CRAN (R 4.2.1)
mclust 5.4.10 2022-05-20 [1] CRAN (R 4.2.1)
memoise 2.0.1 2021-11-26 [1] CRAN (R 4.2.1)
mime 0.12 2021-09-28 [1] CRAN (R 4.2.1)
miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.2.1)
modelr 0.1.9 2022-08-19 [1] CRAN (R 4.2.1)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.2.1)
paletteer 1.4.1.9000 2022-08-22 [1] Github (EmilHvitfeldt/paletteer@aae3e75)
pillar 1.8.1 2022-08-19 [1] CRAN (R 4.2.1)
pkgbuild 1.3.1 2021-12-20 [1] CRAN (R 4.2.1)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.1)
pkgload 1.3.0 2022-06-27 [1] CRAN (R 4.2.1)
png 0.1-7 2013-12-03 [1] CRAN (R 4.2.1)
PNWColors 0.1.0 2022-08-22 [1] Github (jakelawlor/PNWColors@1e3ce47)
prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.2.1)
processx 3.7.0 2022-07-07 [1] CRAN (R 4.2.1)
profvis 0.3.7 2020-11-02 [1] CRAN (R 4.2.1)
promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.2.1)
ps 1.7.1 2022-06-18 [1] CRAN (R 4.2.1)
purrr * 0.3.4 2020-04-17 [1] CRAN (R 4.2.1)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.1)
RColorBrewer 1.1-3 2022-04-03 [1] CRAN (R 4.2.1)
Rcpp 1.0.9 2022-07-08 [1] CRAN (R 4.2.1)
readr * 2.1.2 2022-01-30 [1] CRAN (R 4.2.1)
readxl 1.4.1 2022-08-17 [1] CRAN (R 4.2.1)
rematch2 2.1.2 2020-05-01 [1] CRAN (R 4.2.1)
remotes 2.4.2 2021-11-30 [1] CRAN (R 4.2.1)
reprex 2.0.2 2022-08-17 [1] CRAN (R 4.2.1)
rjson 0.2.21 2022-01-09 [1] CRAN (R 4.2.1)
rlang 1.0.4 2022-07-12 [1] CRAN (R 4.2.1)
rmarkdown 2.15 2022-08-16 [1] CRAN (R 4.2.1)
rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.2.1)
rvest 1.0.3 2022-08-19 [1] CRAN (R 4.2.1)
S4Vectors 0.34.0 2022-04-26 [1] Bioconductor
scales 1.2.1 2022-08-20 [1] CRAN (R 4.2.1)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.1)
shape 1.4.6 2021-05-19 [1] CRAN (R 4.2.1)
shiny 1.7.2 2022-07-19 [1] CRAN (R 4.2.1)
stringi 1.7.8 2022-07-11 [1] CRAN (R 4.2.1)
stringr * 1.4.1 2022-08-20 [1] CRAN (R 4.2.1)
tibble * 3.1.8 2022-07-22 [1] CRAN (R 4.2.1)
tidyr * 1.2.0 2022-02-01 [1] CRAN (R 4.2.1)
tidyselect 1.1.2 2022-02-21 [1] CRAN (R 4.2.1)
tidyverse * 1.3.2 2022-07-18 [1] CRAN (R 4.2.1)
tzdb 0.3.0 2022-03-28 [1] CRAN (R 4.2.1)
urlchecker 1.0.1 2021-11-30 [1] CRAN (R 4.2.1)
usethis 2.1.6 2022-05-25 [1] CRAN (R 4.2.1)
utf8 1.2.2 2021-07-24 [1] CRAN (R 4.2.1)
vctrs 0.4.1 2022-04-13 [1] CRAN (R 4.2.1)
withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.1)
xfun 0.32 2022-08-10 [1] CRAN (R 4.2.1)
xml2 1.3.3 2021-11-30 [1] CRAN (R 4.2.1)
xtable 1.8-4 2019-04-21 [1] CRAN (R 4.2.1)
yaml 2.3.5 2022-02-21 [1] CRAN (R 4.2.1)
[1] /home/pkus/R/x86_64-pc-linux-gnu-library/4.2
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library
──────────────────────────────────────────────────────────────────────────────