Sampling-oriented modelling of cancer evolution

Author

Paweł Kuś

Published

November 15, 2022

rm(list = ls())
suppressPackageStartupMessages({
  library(cevoDatasets)
  library(cevomod)
  library(tidyverse)
})
theme_set(theme_minimal())

Introduction

Figure 1: Schema of the fitted model. While other methods like MOBSTER attempt to model the mutations from so called ‘Neutral tail’, ignoring the massive removal of low-frequency mutations by variants’ caller results in inability to model it.

The rapid development of Next Generation Sequencing (NGS) methods resulted in the development of many methods for the deconvolution of tumor subclonal structure from the NGS data. Whereas many of them figure out tumor subclonal structure by clustering the mutations with similar allelic frequencies, recently published MOBSTER (Caravagna et al. 2020) applies the population genetics model not only to recognize the tumor clonal structure but also to infer the evolutionary parameters of identified subclones. Despite the criticism (Tarabichi et al. 2018; Tung and Durrett 2021) of the model’s assumptions, it offers an interesting way to investigate the process of tumor evolution.

An important limitation of this method is its requirement for deeply sequenced Whole Genome Sequencing (WGS) data. MOBSTER might be unable to recognize the low-frequency neutral mutations tail if many of them were removed by the variant calling algorithm due to data quality issues like sequencing depth. In our work, we reimplemented this model aware of this sampling event to fit the power- law neutral mutations tail into Whole Exome Sequencing data (WES), cheaper and more often available than WGS data.

Materials

We used the sequencing results from two sets of data:

Breast Cancer study (unpublished yet, denoted BRCA)
- 10 patients
- 2-3 tumor samples + 1 control sample per patient
- Whole Exome Sequencing
Acute Myleoid Leukaemia study (Shlush et al. 2017) (AMLRO)
- 11 patients
- 2 tumor samples + 1 control sample per patient
- Whole Genome Sequencing

Site Frequency Spectras for all the samples are shown in the Figure 2

amlro <- calc_SFS(AMLRO)
brca <- calc_SFS(OPUS_BRCA)

plot_SFS(amlro, geom = "bar") +
  hide_legend() +
  facet_wrap(~sample_id, scales = "free_y") +
  labs(title = "")

plot_SFS(brca, geom = "bar") +
  hide_legend() +
  facet_wrap(~sample_id, scales = "free_y") +
  labs(title = "")

MOBSTER

We used MOBSTER (Caravagna et al. 2020) to identify clusters of neutral tail, clonal, and subclonal mutations. As shown in the figure Figure 3, MOBSTER was unable to recognize the neutral tail of mutations in any sample.

cevomod

In cevomod we fit the MOBSTER-like mixture of power-law and binomial distributions in several steps.

Neutral tail fitting

First, we use (Williams et al. 2016) formula and plot the cumulative number of mutations $M(f)$ against the reciprocal of the mutations’ frequency $1/f$ (Figure 4).

amlro <- calc_Mf_1f(amlro)
amlro |> 
  plot_Mf_1f(scale = FALSE) +
  facet_wrap(~sample_id, scales = "free") +
  hide_legend()

Figure 4: $M(f) \sim 1/f$ plots from samples in AMLRO dataset

The relationship $M(f) \sim 1/f$ should be linear for the mutations in the neutrally growing population, thus its slope can be used to fit the power-law curver of the neutral tail mutations on the Site Frequency Spectra plot.

We use sliding window algorithm to fit a number of linear $M(f) = \alpha/f$ models to the 0.05-wide ranges of allelic frequencies and filter out the fits with R_squared lower than 0.98 (Williams et al. 2016). From the remaining models, we select the one with the lowest $\alpha$ (Figure 5 and Figure 6, Table 1)

amlro <- fit_neutral_models(amlro)
amlro |> 
  plot_Mf_1f(scale = FALSE, alpha = 0.5) +
  layer_lm_fits(amlro) +
  facet_wrap(~sample_id, scales = "free") +
  hide_legend()

Figure 5: Fitted linear models with the lowest slope coefficients

amlro |> 
  plot_neutral_A_coefficients() +
  labs(
    x = "Variant Allele Frequency",
    y = "alpha coefficient",
    color = "selected coefficient"
  )

Figure 6: Slope coefficients for $M(f) = \alpha/f$ models fitted to different ranges of allelic frequencies. Line-ends on the X-axis represent the Variant Allele Frequency interval considered by the model, Y-values represent the slope coefficients.

amlro$models$neutral_models |> 
  filter(best) |> 
  select(sample_id, component, slope = A, intercept = b) |>
  head() |> 
  knitr::kable()

Table 1: Coefficients of the selected *neutral tail* models
sample_id	component	slope	intercept
AMLRO-1_Dx	Neutral tail	839.9501	22202.16263
AMLRO-1_Rx	Neutral tail	865.9478	21987.18911
AMLRO-10_Dx	Neutral tail	404.5073	81.72711
AMLRO-10_Rx	Neutral tail	462.5176	736.69332
AMLRO-11_Dx	Neutral tail	335.2720	1503.61682
AMLRO-11_Rx	Neutral tail	301.7115	675.66577

Now we can use the selected $\alpha$ coefficients to draw the power-law curves of the neutral tails on the SFS plots (Figure 7). Since we select the lowest fitted $\alpha$, values represented by the curve should be higher or equal to the real SFS for most of the possible VAF values, with the exception of the lowest VAF range where the curve detaches from the SFS and predicts the numbers of mutations from expected from the model.

amlro |> 
  plot_models() + 
  hide_legend()

Figure 7: Site Frequency Spectra for AMLRO samples with the fitted *neutral tails*

Sampling rate

Since the residuals of the neutral model for the lowest allelic frequencies represent the difference between the predicted and real number of low-frequency mutations, we can use those values to estimate the sampling rate, fraction of mutations filtered out by the variant caller due to low evidence reasons (Figure 8).

amlro |> 
  plot_sampling_rate()

Figure 8: Sampling rate. Fractions of mutations filtered out by variant callers as function of $f$

Fitting clonal distributions

Model residuals for higher VAF values represent the numbers of mutations unpredicted by the neutral model. Thus, those values can be used to identify clones and sublones in the sample (Figure 9; Table 2)

amlro <- fit_subclones(amlro)

plot_residuals_neutral_model(amlro) +
  facet_wrap(~sample_id, scales = "free_y") +
  hide_legend()

Figure 9: Binomial distributions representing clones/subclones in the AMLRO dataset fitted to the neutral model residuals.

amlro$models$binomial_models |> 
  filter(best) |> 
  select(sample_id, component, cellularity, N_mutations, BIC, mean_DP, sd_DP) |> 
  head() |> 
  knitr::kable()

Table 2: Examples of clones fitted to the AMLRO dataset.
sample_id	component	cellularity	N_mutations	BIC	mean_DP	sd_DP
AMLRO-1_Dx	Clone	0.7492541	328	37002.14	61.54762	15.32128
AMLRO-1_Dx	Subclone 1	0.4964259	19290	37002.14	58.36380	14.47672
AMLRO-1_Dx	Subclone 2	0.1759814	3296	37002.14	54.44811	47.91583
AMLRO-1_Rx	Clone	0.5192472	8522	35519.95	60.30717	13.05149
AMLRO-1_Rx	Subclone 1	0.4836764	10938	35519.95	60.36889	13.17624
AMLRO-1_Rx	Subclone 2	0.1803055	3242	35519.95	56.47942	59.49609

For convenence the entire process, the process of model fitting can be run with a single command run_cevomod(), which takes a SNVs dataset (many samples can be provided to a single call of the function).

brca <- brca |> 
  run_cevomod()

Then, plot_models() function can be used to plot the results.

plot_models(amlro) +
  hide_legend() +
  labs(title = "")

plot_models(brca) +
  hide_legend() +
  labs(title = "")

To facilitate the exploration of cevomod results for the users not familiar with R we prepared a browser app written in Shiny (Figure 11). In order to use it one should save the cevomod results into $HOME/.cevomod/data.Rds with saveRDS() and then run the browser using run_browser() function (Listing 1)

Listing 1: Browser usage

saveRDS(amlro, "~/.cevomod/data.Rds")
run_browser()

Additional plots

Main poster plot

Let’s use AMLRO-3_Dx sample with slightly different geoms to make in look better.

cd <- amlro |> 
  filter(sample_id == "AMLRO-3_Dx")

neutral_models <- get_neutral_models(cd) |>
  select(.data$sample_id, .data$from, .data$to)

resid <- cevomod:::get_residuals(cd) |>
  left_join(neutral_models, by = "sample_id") |>
  group_by(.data$sample_id) |>
  mutate(ylim = Inf) |>
  ungroup() |>
  mutate(neutr = .data$VAF >= .data$from & .data$VAF <= .data$to)

colors <- c(
  data = "#00496FFF",
  `Neutral Tail` = "#0F85A0FF",
  Clones = "#ED8B00FF",
  `Final fit` = "#DD4124FF"
)

plot_SFS(cd, geom = "bar", fill = "#00496F", color = "#00496F", alpha = 0.8, show.legend = FALSE) +
  geom_area(
    aes(.data$VAF, .data$neutral_pred),
    data = resid |> filter(.data$neutral_pred < .data$ylim),
    color = "#0F85A0FF", size = 1, fill = "#0F85A0FF", alpha = 0.5
  ) +
  geom_area(
    aes(.data$VAF, .data$binom_pred),
    data = resid,
    color = "#ED8B00FF", size = 1, fill = "#ED8B00FF", alpha = 0.3
  ) +
  geom_line(
    aes(.data$VAF, .data$model_pred),
    data = resid |> filter(.data$neutral_pred < .data$ylim),
    size = 1, color = "#DD4124FF"
  ) +
  scale_color_manual(values = colors) +
  scale_fill_manual(values = colors) +
  coord_cartesian(ylim = c(0, 1400)) +
  labs(title = "")

Figure 12: Schema of the fitted model. While other methods like MOBSTER attempt to model the mutations from so called ‘Neutral tail’, ignoring the massive removal of low-frequency mutations by variants’ caller results in inability to model it.

Session Info

devtools::session_info()

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.2.1 (2022-06-23)
 os       Ubuntu 22.04.1 LTS
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/Warsaw
 date     2022-11-15
 pandoc   2.18 @ /usr/lib/rstudio/bin/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package        * version    date (UTC) lib source
 assertthat       0.2.1      2019-03-21 [1] CRAN (R 4.2.1)
 backports        1.4.1      2021-12-13 [1] CRAN (R 4.2.1)
 BiocGenerics     0.42.0     2022-04-26 [1] Bioconductor
 broom            1.0.0      2022-07-01 [1] CRAN (R 4.2.1)
 cachem           1.0.6      2021-08-19 [1] CRAN (R 4.2.1)
 callr            3.7.1      2022-07-13 [1] CRAN (R 4.2.1)
 cellranger       1.1.0      2016-07-27 [1] CRAN (R 4.2.1)
 cevoDatasets   * 0.2.3      2022-10-12 [1] local
 cevomod        * 0.3.3      2022-11-11 [1] local
 circlize         0.4.15     2022-05-10 [1] CRAN (R 4.2.1)
 cli              3.3.0      2022-04-25 [1] CRAN (R 4.2.1)
 clue             0.3-61     2022-05-30 [1] CRAN (R 4.2.1)
 cluster          2.1.2      2021-04-17 [4] CRAN (R 4.1.1)
 codetools        0.2-18     2020-11-04 [4] CRAN (R 4.0.3)
 colorspace       2.0-3      2022-02-21 [1] CRAN (R 4.2.1)
 ComplexHeatmap   2.12.1     2022-08-09 [1] Bioconductor
 crayon           1.5.1      2022-03-26 [1] CRAN (R 4.2.1)
 DBI              1.1.3      2022-06-18 [1] CRAN (R 4.2.1)
 dbplyr           2.2.1      2022-06-27 [1] CRAN (R 4.2.1)
 devtools         2.4.4      2022-07-20 [1] CRAN (R 4.2.1)
 digest           0.6.29     2021-12-01 [1] CRAN (R 4.2.1)
 doParallel       1.0.17     2022-02-07 [1] CRAN (R 4.2.1)
 dplyr          * 1.0.9      2022-04-28 [1] CRAN (R 4.2.1)
 ellipsis         0.3.2      2021-04-29 [1] CRAN (R 4.2.1)
 evaluate         0.16       2022-08-09 [1] CRAN (R 4.2.1)
 fansi            1.0.3      2022-03-24 [1] CRAN (R 4.2.1)
 fastmap          1.1.0      2021-01-25 [1] CRAN (R 4.2.1)
 forcats        * 0.5.2      2022-08-19 [1] CRAN (R 4.2.1)
 foreach          1.5.2      2022-02-02 [1] CRAN (R 4.2.1)
 fs               1.5.2      2021-12-08 [1] CRAN (R 4.2.1)
 gargle           1.2.0      2021-07-02 [1] CRAN (R 4.2.1)
 generics         0.1.3      2022-07-05 [1] CRAN (R 4.2.1)
 GetoptLong       1.0.5      2020-12-15 [1] CRAN (R 4.2.1)
 ggplot2        * 3.3.6      2022-05-03 [1] CRAN (R 4.2.1)
 GlobalOptions    0.1.2      2020-06-10 [1] CRAN (R 4.2.1)
 glue             1.6.2      2022-02-24 [1] CRAN (R 4.2.1)
 googledrive      2.0.0      2021-07-08 [1] CRAN (R 4.2.1)
 googlesheets4    1.0.1      2022-08-13 [1] CRAN (R 4.2.1)
 gtable           0.3.0      2019-03-25 [1] CRAN (R 4.2.1)
 haven            2.5.0      2022-04-15 [1] CRAN (R 4.2.1)
 hms              1.1.2      2022-08-19 [1] CRAN (R 4.2.1)
 htmltools        0.5.3      2022-07-18 [1] CRAN (R 4.2.1)
 htmlwidgets      1.5.4      2021-09-08 [1] CRAN (R 4.2.1)
 httpuv           1.6.5      2022-01-05 [1] CRAN (R 4.2.1)
 httr             1.4.4      2022-08-17 [1] CRAN (R 4.2.1)
 IRanges          2.30.1     2022-08-18 [1] Bioconductor
 iterators        1.0.14     2022-02-05 [1] CRAN (R 4.2.1)
 jsonlite         1.8.0      2022-02-22 [1] CRAN (R 4.2.1)
 knitr            1.39       2022-04-26 [1] CRAN (R 4.2.1)
 later            1.3.0      2021-08-18 [1] CRAN (R 4.2.1)
 lifecycle        1.0.1      2021-09-24 [1] CRAN (R 4.2.1)
 lubridate        1.8.0      2021-10-07 [1] CRAN (R 4.2.1)
 magrittr         2.0.3      2022-03-30 [1] CRAN (R 4.2.1)
 matrixStats      0.62.0     2022-04-19 [1] CRAN (R 4.2.1)
 mclust           5.4.10     2022-05-20 [1] CRAN (R 4.2.1)
 memoise          2.0.1      2021-11-26 [1] CRAN (R 4.2.1)
 mime             0.12       2021-09-28 [1] CRAN (R 4.2.1)
 miniUI           0.1.1.1    2018-05-18 [1] CRAN (R 4.2.1)
 modelr           0.1.9      2022-08-19 [1] CRAN (R 4.2.1)
 munsell          0.5.0      2018-06-12 [1] CRAN (R 4.2.1)
 paletteer        1.4.1.9000 2022-08-22 [1] Github (EmilHvitfeldt/paletteer@aae3e75)
 pillar           1.8.1      2022-08-19 [1] CRAN (R 4.2.1)
 pkgbuild         1.3.1      2021-12-20 [1] CRAN (R 4.2.1)
 pkgconfig        2.0.3      2019-09-22 [1] CRAN (R 4.2.1)
 pkgload          1.3.0      2022-06-27 [1] CRAN (R 4.2.1)
 png              0.1-7      2013-12-03 [1] CRAN (R 4.2.1)
 PNWColors        0.1.0      2022-08-22 [1] Github (jakelawlor/PNWColors@1e3ce47)
 prettyunits      1.1.1      2020-01-24 [1] CRAN (R 4.2.1)
 processx         3.7.0      2022-07-07 [1] CRAN (R 4.2.1)
 profvis          0.3.7      2020-11-02 [1] CRAN (R 4.2.1)
 promises         1.2.0.1    2021-02-11 [1] CRAN (R 4.2.1)
 ps               1.7.1      2022-06-18 [1] CRAN (R 4.2.1)
 purrr          * 0.3.4      2020-04-17 [1] CRAN (R 4.2.1)
 R6               2.5.1      2021-08-19 [1] CRAN (R 4.2.1)
 RColorBrewer     1.1-3      2022-04-03 [1] CRAN (R 4.2.1)
 Rcpp             1.0.9      2022-07-08 [1] CRAN (R 4.2.1)
 readr          * 2.1.2      2022-01-30 [1] CRAN (R 4.2.1)
 readxl           1.4.1      2022-08-17 [1] CRAN (R 4.2.1)
 rematch2         2.1.2      2020-05-01 [1] CRAN (R 4.2.1)
 remotes          2.4.2      2021-11-30 [1] CRAN (R 4.2.1)
 reprex           2.0.2      2022-08-17 [1] CRAN (R 4.2.1)
 rjson            0.2.21     2022-01-09 [1] CRAN (R 4.2.1)
 rlang            1.0.4      2022-07-12 [1] CRAN (R 4.2.1)
 rmarkdown        2.15       2022-08-16 [1] CRAN (R 4.2.1)
 rstudioapi       0.13       2020-11-12 [1] CRAN (R 4.2.1)
 rvest            1.0.3      2022-08-19 [1] CRAN (R 4.2.1)
 S4Vectors        0.34.0     2022-04-26 [1] Bioconductor
 scales           1.2.1      2022-08-20 [1] CRAN (R 4.2.1)
 sessioninfo      1.2.2      2021-12-06 [1] CRAN (R 4.2.1)
 shape            1.4.6      2021-05-19 [1] CRAN (R 4.2.1)
 shiny            1.7.2      2022-07-19 [1] CRAN (R 4.2.1)
 stringi          1.7.8      2022-07-11 [1] CRAN (R 4.2.1)
 stringr        * 1.4.1      2022-08-20 [1] CRAN (R 4.2.1)
 tibble         * 3.1.8      2022-07-22 [1] CRAN (R 4.2.1)
 tidyr          * 1.2.0      2022-02-01 [1] CRAN (R 4.2.1)
 tidyselect       1.1.2      2022-02-21 [1] CRAN (R 4.2.1)
 tidyverse      * 1.3.2      2022-07-18 [1] CRAN (R 4.2.1)
 tzdb             0.3.0      2022-03-28 [1] CRAN (R 4.2.1)
 urlchecker       1.0.1      2021-11-30 [1] CRAN (R 4.2.1)
 usethis          2.1.6      2022-05-25 [1] CRAN (R 4.2.1)
 utf8             1.2.2      2021-07-24 [1] CRAN (R 4.2.1)
 vctrs            0.4.1      2022-04-13 [1] CRAN (R 4.2.1)
 withr            2.5.0      2022-03-03 [1] CRAN (R 4.2.1)
 xfun             0.32       2022-08-10 [1] CRAN (R 4.2.1)
 xml2             1.3.3      2021-11-30 [1] CRAN (R 4.2.1)
 xtable           1.8-4      2019-04-21 [1] CRAN (R 4.2.1)
 yaml             2.3.5      2022-02-21 [1] CRAN (R 4.2.1)

 [1] /home/pkus/R/x86_64-pc-linux-gnu-library/4.2
 [2] /usr/local/lib/R/site-library
 [3] /usr/lib/R/site-library
 [4] /usr/lib/R/library

──────────────────────────────────────────────────────────────────────────────

Bibliography

Caravagna, Giulio, Timon Heide, Marc J Williams, Luis Zapata, Daniel Nichol, Ketevan Chkhaidze, William Cross, et al. 2020. “Subclonal Reconstruction of Tumors by Using Machine Learning and Population Genetics.” Nat. Genet. 52 (9): 898–907. https://doi.org/10.1038/s41588-020-0675-5.

Shlush, Liran I, Amanda Mitchell, Lawrence Heisler, Sagi Abelson, Stanley W K Ng, Aaron Trotman-Grant, Jessie J F Medeiros, et al. 2017. “Tracing the Origins of Relapse in Acute Myeloid Leukaemia to Stem Cells.” Nature 547 (7661): 104–8. https://doi.org/doi.org/10.1038/nature22993.

Tarabichi, Maxime, Iñigo Martincorena, Moritz Gerstung, Armand M Leroi, Florian Markowetz, PCAWG Evolution and Heterogeneity Working Group, Paul T Spellman, et al. 2018. “Neutral Tumor Evolution?” Nat. Genet. 50 (12): 1630–33. https://doi.org/10.1038/s41588-018-0258-x.

Tung, Hwai-Ray, and Rick Durrett. 2021. “Signatures of Neutral Evolution in Exponentially Growing Tumors: A Theoretical Perspective.” PLOS Computational Biology. https://doi.org/10.1371/journal.pcbi.1008701.

Williams, Marc J, Benjamin Werner, Chris P Barnes, Trevor A Graham, and Andrea Sottoriva. 2016. “Identification of Neutral Tumor Evolution Across Cancer Types.” Nat. Genet. 48 (3): 238–44. https://doi.org/10.1038/ng.3489.