Sampling-oriented modelling of cancer evolution

Author

Paweł Kuś

Published

November 15, 2022

rm(list = ls())
suppressPackageStartupMessages({
  library(cevoDatasets)
  library(cevomod)
  library(tidyverse)
})
theme_set(theme_minimal())

Introduction

Figure 1: Schema of the fitted model. While other methods like MOBSTER attempt to model the mutations from so called ‘Neutral tail’, ignoring the massive removal of low-frequency mutations by variants’ caller results in inability to model it.

The rapid development of Next Generation Sequencing (NGS) methods resulted in the development of many methods for the deconvolution of tumor subclonal structure from the NGS data. Whereas many of them figure out tumor subclonal structure by clustering the mutations with similar allelic frequencies, recently published MOBSTER (Caravagna et al. 2020) applies the population genetics model not only to recognize the tumor clonal structure but also to infer the evolutionary parameters of identified subclones. Despite the criticism (Tarabichi et al. 2018; Tung and Durrett 2021) of the model’s assumptions, it offers an interesting way to investigate the process of tumor evolution.

An important limitation of this method is its requirement for deeply sequenced Whole Genome Sequencing (WGS) data. MOBSTER might be unable to recognize the low-frequency neutral mutations tail if many of them were removed by the variant calling algorithm due to data quality issues like sequencing depth. In our work, we reimplemented this model aware of this sampling event to fit the power- law neutral mutations tail into Whole Exome Sequencing data (WES), cheaper and more often available than WGS data.

Materials

We used the sequencing results from two sets of data:

  • Breast Cancer study (unpublished yet, denoted BRCA)
    • 10 patients
    • 2-3 tumor samples + 1 control sample per patient
    • Whole Exome Sequencing
  • Acute Myleoid Leukaemia study (Shlush et al. 2017) (AMLRO)
    • 11 patients
    • 2 tumor samples + 1 control sample per patient
    • Whole Genome Sequencing

Site Frequency Spectras for all the samples are shown in the Figure 2

amlro <- calc_SFS(AMLRO)
brca <- calc_SFS(OPUS_BRCA)

plot_SFS(amlro, geom = "bar") +
  hide_legend() +
  facet_wrap(~sample_id, scales = "free_y") +
  labs(title = "")

plot_SFS(brca, geom = "bar") +
  hide_legend() +
  facet_wrap(~sample_id, scales = "free_y") +
  labs(title = "")

(a) AMLRO

(b) BRCA

Figure 2: Site Frequency Spectras for the analysed datasets

MOBSTER

We used MOBSTER (Caravagna et al. 2020) to identify clusters of neutral tail, clonal, and subclonal mutations. As shown in the figure Figure 3, MOBSTER was unable to recognize the neutral tail of mutations in any sample.

MOBSTER results for AMLRO dataset

MOBSTER results for BRCA dataset

Figure 3: MOBSTER results for datasets. Colors: red - clonal mutations, blue - subclonal mutations. No mutations were classified to the neutral tails (white in MOBSTER color convention) - MOBSTER was unable to recognize those gropus of mutations due to the removal of low-frequency and low-evidence mutations.

cevomod

In cevomod we fit the MOBSTER-like mixture of power-law and binomial distributions in several steps.

Neutral tail fitting

First, we use (Williams et al. 2016) formula and plot the cumulative number of mutations \(M(f)\) against the reciprocal of the mutations’ frequency \(1/f\) (Figure 4).

amlro <- calc_Mf_1f(amlro)
amlro |> 
  plot_Mf_1f(scale = FALSE) +
  facet_wrap(~sample_id, scales = "free") +
  hide_legend()

Figure 4: \(M(f) \sim 1/f\) plots from samples in AMLRO dataset

The relationship \(M(f) \sim 1/f\) should be linear for the mutations in the neutrally growing population, thus its slope can be used to fit the power-law curver of the neutral tail mutations on the Site Frequency Spectra plot.

We use sliding window algorithm to fit a number of linear \(M(f) = \alpha/f\) models to the 0.05-wide ranges of allelic frequencies and filter out the fits with R_squared lower than 0.98 (Williams et al. 2016). From the remaining models, we select the one with the lowest \(\alpha\) (Figure 5 and Figure 6, Table 1)

amlro <- fit_neutral_models(amlro)
amlro |> 
  plot_Mf_1f(scale = FALSE, alpha = 0.5) +
  layer_lm_fits(amlro) +
  facet_wrap(~sample_id, scales = "free") +
  hide_legend()

Figure 5: Fitted linear models with the lowest slope coefficients

amlro |> 
  plot_neutral_A_coefficients() +
  labs(
    x = "Variant Allele Frequency",
    y = "alpha coefficient",
    color = "selected coefficient"
  )

Figure 6: Slope coefficients for \(M(f) = \alpha/f\) models fitted to different ranges of allelic frequencies. Line-ends on the X-axis represent the Variant Allele Frequency interval considered by the model, Y-values represent the slope coefficients.

amlro$models$neutral_models |> 
  filter(best) |> 
  select(sample_id, component, slope = A, intercept = b) |>
  head() |> 
  knitr::kable()
Table 1: Coefficients of the selected neutral tail models
sample_id component slope intercept
AMLRO-1_Dx Neutral tail 839.9501 22202.16263
AMLRO-1_Rx Neutral tail 865.9478 21987.18911
AMLRO-10_Dx Neutral tail 404.5073 81.72711
AMLRO-10_Rx Neutral tail 462.5176 736.69332
AMLRO-11_Dx Neutral tail 335.2720 1503.61682
AMLRO-11_Rx Neutral tail 301.7115 675.66577

Now we can use the selected \(\alpha\) coefficients to draw the power-law curves of the neutral tails on the SFS plots (Figure 7). Since we select the lowest fitted \(\alpha\), values represented by the curve should be higher or equal to the real SFS for most of the possible VAF values, with the exception of the lowest VAF range where the curve detaches from the SFS and predicts the numbers of mutations from expected from the model.

amlro |> 
  plot_models() + 
  hide_legend()

Figure 7: Site Frequency Spectra for AMLRO samples with the fitted neutral tails

Sampling rate

Since the residuals of the neutral model for the lowest allelic frequencies represent the difference between the predicted and real number of low-frequency mutations, we can use those values to estimate the sampling rate, fraction of mutations filtered out by the variant caller due to low evidence reasons (Figure 8).

amlro |> 
  plot_sampling_rate()

Figure 8: Sampling rate. Fractions of mutations filtered out by variant callers as function of \(f\)

Fitting clonal distributions

Model residuals for higher VAF values represent the numbers of mutations unpredicted by the neutral model. Thus, those values can be used to identify clones and sublones in the sample (Figure 9; Table 2)

amlro <- fit_subclones(amlro)

plot_residuals_neutral_model(amlro) +
  facet_wrap(~sample_id, scales = "free_y") +
  hide_legend()

Figure 9: Binomial distributions representing clones/subclones in the AMLRO dataset fitted to the neutral model residuals.

amlro$models$binomial_models |> 
  filter(best) |> 
  select(sample_id, component, cellularity, N_mutations, BIC, mean_DP, sd_DP) |> 
  head() |> 
  knitr::kable()
Table 2: Examples of clones fitted to the AMLRO dataset.
sample_id component cellularity N_mutations BIC mean_DP sd_DP
AMLRO-1_Dx Clone 0.7492541 328 37002.14 61.54762 15.32128
AMLRO-1_Dx Subclone 1 0.4964259 19290 37002.14 58.36380 14.47672
AMLRO-1_Dx Subclone 2 0.1759814 3296 37002.14 54.44811 47.91583
AMLRO-1_Rx Clone 0.5192472 8522 35519.95 60.30717 13.05149
AMLRO-1_Rx Subclone 1 0.4836764 10938 35519.95 60.36889 13.17624
AMLRO-1_Rx Subclone 2 0.1803055 3242 35519.95 56.47942 59.49609

For convenence the entire process, the process of model fitting can be run with a single command run_cevomod(), which takes a SNVs dataset (many samples can be provided to a single call of the function).

brca <- brca |> 
  run_cevomod()

Then, plot_models() function can be used to plot the results.

plot_models(amlro) +
  hide_legend() +
  labs(title = "")

plot_models(brca) +
  hide_legend() +
  labs(title = "")

(a) AMLRO

(b) BRCA

Figure 10: Fitted models for the analysed datasets

To facilitate the exploration of cevomod results for the users not familiar with R we prepared a browser app written in Shiny (Figure 11). In order to use it one should save the cevomod results into $HOME/.cevomod/data.Rds with saveRDS() and then run the browser using run_browser() function (Listing 1)

Listing 1: Browser usage

saveRDS(amlro, "~/.cevomod/data.Rds")
run_browser()

Figure 11: cevomod results browser

Additional plots

Main poster plot

Let’s use AMLRO-3_Dx sample with slightly different geoms to make in look better.

cd <- amlro |> 
  filter(sample_id == "AMLRO-3_Dx")

neutral_models <- get_neutral_models(cd) |>
  select(.data$sample_id, .data$from, .data$to)

resid <- cevomod:::get_residuals(cd) |>
  left_join(neutral_models, by = "sample_id") |>
  group_by(.data$sample_id) |>
  mutate(ylim = Inf) |>
  ungroup() |>
  mutate(neutr = .data$VAF >= .data$from & .data$VAF <= .data$to)

colors <- c(
  data = "#00496FFF",
  `Neutral Tail` = "#0F85A0FF",
  Clones = "#ED8B00FF",
  `Final fit` = "#DD4124FF"
)

plot_SFS(cd, geom = "bar", fill = "#00496F", color = "#00496F", alpha = 0.8, show.legend = FALSE) +
  geom_area(
    aes(.data$VAF, .data$neutral_pred),
    data = resid |> filter(.data$neutral_pred < .data$ylim),
    color = "#0F85A0FF", size = 1, fill = "#0F85A0FF", alpha = 0.5
  ) +
  geom_area(
    aes(.data$VAF, .data$binom_pred),
    data = resid,
    color = "#ED8B00FF", size = 1, fill = "#ED8B00FF", alpha = 0.3
  ) +
  geom_line(
    aes(.data$VAF, .data$model_pred),
    data = resid |> filter(.data$neutral_pred < .data$ylim),
    size = 1, color = "#DD4124FF"
  ) +
  scale_color_manual(values = colors) +
  scale_fill_manual(values = colors) +
  coord_cartesian(ylim = c(0, 1400)) +
  labs(title = "")

Figure 12: Schema of the fitted model. While other methods like MOBSTER attempt to model the mutations from so called ‘Neutral tail’, ignoring the massive removal of low-frequency mutations by variants’ caller results in inability to model it.

Session Info

devtools::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.2.1 (2022-06-23)
 os       Ubuntu 22.04.1 LTS
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/Warsaw
 date     2022-11-15
 pandoc   2.18 @ /usr/lib/rstudio/bin/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package        * version    date (UTC) lib source
 assertthat       0.2.1      2019-03-21 [1] CRAN (R 4.2.1)
 backports        1.4.1      2021-12-13 [1] CRAN (R 4.2.1)
 BiocGenerics     0.42.0     2022-04-26 [1] Bioconductor
 broom            1.0.0      2022-07-01 [1] CRAN (R 4.2.1)
 cachem           1.0.6      2021-08-19 [1] CRAN (R 4.2.1)
 callr            3.7.1      2022-07-13 [1] CRAN (R 4.2.1)
 cellranger       1.1.0      2016-07-27 [1] CRAN (R 4.2.1)
 cevoDatasets   * 0.2.3      2022-10-12 [1] local
 cevomod        * 0.3.3      2022-11-11 [1] local
 circlize         0.4.15     2022-05-10 [1] CRAN (R 4.2.1)
 cli              3.3.0      2022-04-25 [1] CRAN (R 4.2.1)
 clue             0.3-61     2022-05-30 [1] CRAN (R 4.2.1)
 cluster          2.1.2      2021-04-17 [4] CRAN (R 4.1.1)
 codetools        0.2-18     2020-11-04 [4] CRAN (R 4.0.3)
 colorspace       2.0-3      2022-02-21 [1] CRAN (R 4.2.1)
 ComplexHeatmap   2.12.1     2022-08-09 [1] Bioconductor
 crayon           1.5.1      2022-03-26 [1] CRAN (R 4.2.1)
 DBI              1.1.3      2022-06-18 [1] CRAN (R 4.2.1)
 dbplyr           2.2.1      2022-06-27 [1] CRAN (R 4.2.1)
 devtools         2.4.4      2022-07-20 [1] CRAN (R 4.2.1)
 digest           0.6.29     2021-12-01 [1] CRAN (R 4.2.1)
 doParallel       1.0.17     2022-02-07 [1] CRAN (R 4.2.1)
 dplyr          * 1.0.9      2022-04-28 [1] CRAN (R 4.2.1)
 ellipsis         0.3.2      2021-04-29 [1] CRAN (R 4.2.1)
 evaluate         0.16       2022-08-09 [1] CRAN (R 4.2.1)
 fansi            1.0.3      2022-03-24 [1] CRAN (R 4.2.1)
 fastmap          1.1.0      2021-01-25 [1] CRAN (R 4.2.1)
 forcats        * 0.5.2      2022-08-19 [1] CRAN (R 4.2.1)
 foreach          1.5.2      2022-02-02 [1] CRAN (R 4.2.1)
 fs               1.5.2      2021-12-08 [1] CRAN (R 4.2.1)
 gargle           1.2.0      2021-07-02 [1] CRAN (R 4.2.1)
 generics         0.1.3      2022-07-05 [1] CRAN (R 4.2.1)
 GetoptLong       1.0.5      2020-12-15 [1] CRAN (R 4.2.1)
 ggplot2        * 3.3.6      2022-05-03 [1] CRAN (R 4.2.1)
 GlobalOptions    0.1.2      2020-06-10 [1] CRAN (R 4.2.1)
 glue             1.6.2      2022-02-24 [1] CRAN (R 4.2.1)
 googledrive      2.0.0      2021-07-08 [1] CRAN (R 4.2.1)
 googlesheets4    1.0.1      2022-08-13 [1] CRAN (R 4.2.1)
 gtable           0.3.0      2019-03-25 [1] CRAN (R 4.2.1)
 haven            2.5.0      2022-04-15 [1] CRAN (R 4.2.1)
 hms              1.1.2      2022-08-19 [1] CRAN (R 4.2.1)
 htmltools        0.5.3      2022-07-18 [1] CRAN (R 4.2.1)
 htmlwidgets      1.5.4      2021-09-08 [1] CRAN (R 4.2.1)
 httpuv           1.6.5      2022-01-05 [1] CRAN (R 4.2.1)
 httr             1.4.4      2022-08-17 [1] CRAN (R 4.2.1)
 IRanges          2.30.1     2022-08-18 [1] Bioconductor
 iterators        1.0.14     2022-02-05 [1] CRAN (R 4.2.1)
 jsonlite         1.8.0      2022-02-22 [1] CRAN (R 4.2.1)
 knitr            1.39       2022-04-26 [1] CRAN (R 4.2.1)
 later            1.3.0      2021-08-18 [1] CRAN (R 4.2.1)
 lifecycle        1.0.1      2021-09-24 [1] CRAN (R 4.2.1)
 lubridate        1.8.0      2021-10-07 [1] CRAN (R 4.2.1)
 magrittr         2.0.3      2022-03-30 [1] CRAN (R 4.2.1)
 matrixStats      0.62.0     2022-04-19 [1] CRAN (R 4.2.1)
 mclust           5.4.10     2022-05-20 [1] CRAN (R 4.2.1)
 memoise          2.0.1      2021-11-26 [1] CRAN (R 4.2.1)
 mime             0.12       2021-09-28 [1] CRAN (R 4.2.1)
 miniUI           0.1.1.1    2018-05-18 [1] CRAN (R 4.2.1)
 modelr           0.1.9      2022-08-19 [1] CRAN (R 4.2.1)
 munsell          0.5.0      2018-06-12 [1] CRAN (R 4.2.1)
 paletteer        1.4.1.9000 2022-08-22 [1] Github (EmilHvitfeldt/paletteer@aae3e75)
 pillar           1.8.1      2022-08-19 [1] CRAN (R 4.2.1)
 pkgbuild         1.3.1      2021-12-20 [1] CRAN (R 4.2.1)
 pkgconfig        2.0.3      2019-09-22 [1] CRAN (R 4.2.1)
 pkgload          1.3.0      2022-06-27 [1] CRAN (R 4.2.1)
 png              0.1-7      2013-12-03 [1] CRAN (R 4.2.1)
 PNWColors        0.1.0      2022-08-22 [1] Github (jakelawlor/PNWColors@1e3ce47)
 prettyunits      1.1.1      2020-01-24 [1] CRAN (R 4.2.1)
 processx         3.7.0      2022-07-07 [1] CRAN (R 4.2.1)
 profvis          0.3.7      2020-11-02 [1] CRAN (R 4.2.1)
 promises         1.2.0.1    2021-02-11 [1] CRAN (R 4.2.1)
 ps               1.7.1      2022-06-18 [1] CRAN (R 4.2.1)
 purrr          * 0.3.4      2020-04-17 [1] CRAN (R 4.2.1)
 R6               2.5.1      2021-08-19 [1] CRAN (R 4.2.1)
 RColorBrewer     1.1-3      2022-04-03 [1] CRAN (R 4.2.1)
 Rcpp             1.0.9      2022-07-08 [1] CRAN (R 4.2.1)
 readr          * 2.1.2      2022-01-30 [1] CRAN (R 4.2.1)
 readxl           1.4.1      2022-08-17 [1] CRAN (R 4.2.1)
 rematch2         2.1.2      2020-05-01 [1] CRAN (R 4.2.1)
 remotes          2.4.2      2021-11-30 [1] CRAN (R 4.2.1)
 reprex           2.0.2      2022-08-17 [1] CRAN (R 4.2.1)
 rjson            0.2.21     2022-01-09 [1] CRAN (R 4.2.1)
 rlang            1.0.4      2022-07-12 [1] CRAN (R 4.2.1)
 rmarkdown        2.15       2022-08-16 [1] CRAN (R 4.2.1)
 rstudioapi       0.13       2020-11-12 [1] CRAN (R 4.2.1)
 rvest            1.0.3      2022-08-19 [1] CRAN (R 4.2.1)
 S4Vectors        0.34.0     2022-04-26 [1] Bioconductor
 scales           1.2.1      2022-08-20 [1] CRAN (R 4.2.1)
 sessioninfo      1.2.2      2021-12-06 [1] CRAN (R 4.2.1)
 shape            1.4.6      2021-05-19 [1] CRAN (R 4.2.1)
 shiny            1.7.2      2022-07-19 [1] CRAN (R 4.2.1)
 stringi          1.7.8      2022-07-11 [1] CRAN (R 4.2.1)
 stringr        * 1.4.1      2022-08-20 [1] CRAN (R 4.2.1)
 tibble         * 3.1.8      2022-07-22 [1] CRAN (R 4.2.1)
 tidyr          * 1.2.0      2022-02-01 [1] CRAN (R 4.2.1)
 tidyselect       1.1.2      2022-02-21 [1] CRAN (R 4.2.1)
 tidyverse      * 1.3.2      2022-07-18 [1] CRAN (R 4.2.1)
 tzdb             0.3.0      2022-03-28 [1] CRAN (R 4.2.1)
 urlchecker       1.0.1      2021-11-30 [1] CRAN (R 4.2.1)
 usethis          2.1.6      2022-05-25 [1] CRAN (R 4.2.1)
 utf8             1.2.2      2021-07-24 [1] CRAN (R 4.2.1)
 vctrs            0.4.1      2022-04-13 [1] CRAN (R 4.2.1)
 withr            2.5.0      2022-03-03 [1] CRAN (R 4.2.1)
 xfun             0.32       2022-08-10 [1] CRAN (R 4.2.1)
 xml2             1.3.3      2021-11-30 [1] CRAN (R 4.2.1)
 xtable           1.8-4      2019-04-21 [1] CRAN (R 4.2.1)
 yaml             2.3.5      2022-02-21 [1] CRAN (R 4.2.1)

 [1] /home/pkus/R/x86_64-pc-linux-gnu-library/4.2
 [2] /usr/local/lib/R/site-library
 [3] /usr/lib/R/site-library
 [4] /usr/lib/R/library

──────────────────────────────────────────────────────────────────────────────

Bibliography

Caravagna, Giulio, Timon Heide, Marc J Williams, Luis Zapata, Daniel Nichol, Ketevan Chkhaidze, William Cross, et al. 2020. “Subclonal Reconstruction of Tumors by Using Machine Learning and Population Genetics.” Nat. Genet. 52 (9): 898–907. https://doi.org/10.1038/s41588-020-0675-5.
Shlush, Liran I, Amanda Mitchell, Lawrence Heisler, Sagi Abelson, Stanley W K Ng, Aaron Trotman-Grant, Jessie J F Medeiros, et al. 2017. “Tracing the Origins of Relapse in Acute Myeloid Leukaemia to Stem Cells.” Nature 547 (7661): 104–8. https://doi.org/doi.org/10.1038/nature22993.
Tarabichi, Maxime, Iñigo Martincorena, Moritz Gerstung, Armand M Leroi, Florian Markowetz, PCAWG Evolution and Heterogeneity Working Group, Paul T Spellman, et al. 2018. “Neutral Tumor Evolution?” Nat. Genet. 50 (12): 1630–33. https://doi.org/10.1038/s41588-018-0258-x.
Tung, Hwai-Ray, and Rick Durrett. 2021. “Signatures of Neutral Evolution in Exponentially Growing Tumors: A Theoretical Perspective.” PLOS Computational Biology. https://doi.org/10.1371/journal.pcbi.1008701.
Williams, Marc J, Benjamin Werner, Chris P Barnes, Trevor A Graham, and Andrea Sottoriva. 2016. “Identification of Neutral Tumor Evolution Across Cancer Types.” Nat. Genet. 48 (3): 238–44. https://doi.org/10.1038/ng.3489.