Data preparation
<cevodata>
S3 class
Most of the cevomod functions use the <cevodata>
S3 class which is designed to store both, the data and the cevomod
result. The <cevodata>
object can be easily created
using the init_cevodata(name)
constructor and populated
with data using the add_*_data()
methods.
In this tutorial we will use the test_data
dataset
provided with cevomod, which contains 3 male samples and 1 female
sample.
suppressPackageStartupMessages({
library(cevomod)
library(tidyverse)
})
theme_set(theme_minimal())
snvs <- SNVs(test_data)
cnvs <- CNVs(test_data)
sample_data <- test_data$metadata
SNVs tibble should contain at least sample_id, chrom, pos and VAF columns. Any other columns are optional.
snvs
#> <cevo_snvs> tibble
#> # A tibble: 16,000 × 11
#> sample_id chrom pos gene_symbol ref alt ref_reads alt_reads impact
#> * <chr> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <chr>
#> 1 Sample 1 chr1 1 NA NA NA 25 29 NA
#> 2 Sample 1 chr1 2 NA NA NA 22 19 NA
#> 3 Sample 1 chr1 3 NA NA NA 21 20 NA
#> 4 Sample 1 chr1 4 NA NA NA 34 2 NA
#> 5 Sample 1 chr1 5 NA NA NA 62 1 NA
#> 6 Sample 1 chr1 6 NA NA NA 4 3 NA
#> 7 Sample 1 chr1 7 NA NA NA 33 34 NA
#> 8 Sample 1 chr1 8 NA NA NA 31 30 NA
#> 9 Sample 1 chr1 9 NA NA NA 25 17 NA
#> 10 Sample 1 chr1 10 NA NA NA 41 31 NA
#> # ℹ 15,990 more rows
#> # ℹ 2 more variables: VAF <dbl>, DP <int>
We will use also some sample metadata which associates samples to patients and stores the data on patients’ sex and samples’ purity.
sample_data
#> # A tibble: 4 × 4
#> sample_id patient_id sex purity
#> <chr> <chr> <chr> <dbl>
#> 1 Sample 1 Patient 1 male 1
#> 2 Sample 2 Patient 2 male 0.7
#> 3 Sample 3 Patient 3 female 1
#> 4 Sample 4 Patient 4 male 1
cevomod functions are pipe-oriented (in general), so we can create and populate the object through the small pipeline:
cd <- init_cevodata(name = "Demo data") |>
add_SNV_data(snvs) |>
add_sample_data(sample_data)
cd
#> <cevodata> dataset: Demo data
#> Genome: unknown
#> SNV assays: snvs (default)
#> CNV assays: None
#> 4 cases, 4 samples, 1 sample per case
#> 16000 mutations total, 4000 +/- 0 mutations per case
#> Active models:
name
can be any string that is informative for the
user.
To facilitate the use of cevomod with the data from popular
variant callers such as Mutect2, Strelka2, ASCAT, or FACETS, we have
implemented a readthis
package. readthis functions are designed for bulk reading of many
output variant files (they accept a path to a single file, named vector
of file paths, or a path to a directory containing many files). Data
objects read with readthis functions can be added to the cevodata object
with a single call of general add_data()
function.
For more information see the readthis
page.
Variant Frequency Spectra
cevomod fits models to the distributions of variant frequencies in the sample. Variant Allele Frequency is a basic measure of variant frequency provided by next-generation sequencing and it’s equal to the fraction of reads supporting the alternate allele in all reads covering the mutation site
\[VAF = \frac{alt\_reads}{alt\_reads + ref\_reads}\]
VAF spectra can be plotted using the plot_SFS()
function
(SFS - Site Frequency Spectrum). Like most of the cevomod
plotting functions, plot_SFS()
returns a ggplot object
which can be easily modified. Additional aesthetics can be added using
the aes()
and it can make use of the columns in the
metadata tibble, which is left-joined to the plot data.
plot_SFS(cd) +
aes(fill = sex) +
scale_fill_manual(values = c(male = "#DD4124", female = "#00496F")) +
labs(title = "Variant Allele Frequency Spectra")
#> Calculating SFS statistics
#> Calculating f intervals, using VAF column
#> Warning in geom_bar(join_aes(bar_mapping, mapping), stat = "identity", alpha =
#> alpha, : Ignoring unknown aesthetics: width
We can see that there are clear peaks of clonal mutations in all 4 samples. In 3 of them, the average frequency of the clonal variants equals 0.5, corresponding to a sample purity of 1. In Sample 2, the average frequency of clonal mutations is lower, which is in line with the lower purity of this sample:
cd$metadata |>
select(sample_id, purity)
#> # A tibble: 4 × 2
#> sample_id purity
#> <chr> <dbl>
#> 1 Sample 1 1
#> 2 Sample 2 0.7
#> 3 Sample 3 1
#> 4 Sample 4 1
We can also notice small groups of mutations with VAF close to 1.0, which likely result from the looses of heterozygocity.
Calculation of the Cancer Cell Fraction (optional)
Although VAF is often used to model cancer evolution from the bulk
sequencing data, it is not an accurate measure of the true mutation
frequency. VAF values are affected by sample purity and copy number
alterations, as we saw in the VAF spectrum above. If we have the
reliable measurements of sample purity and copy number values, we can
take them into account and calculate the Cancer Cell Fraction (CCF),
fraction of cancer cells in which the mutation is present. A convenient
formula for calculating CCF was published by Dentro et al. in Principles of
Reconstructing the Subclonal Architecture of Cancers (2015). To
calculate CCF, cevomod requires the purity
column in the
metadata tibble and the CNVs tibble with the following columns:
- sample_id
- chrom
- start
- end
- total_cn - containing information about the total number of sequence copies,
- normal_cn - containing information about the sequence ploidy in normal cells.
cnvs
#> # A tibble: 8 × 8
#> sample_id chrom start end total_cn major_cn minor_cn normal_cn
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Sample 1 chr1 1 4000 2 1 0 2
#> 2 Sample 1 chr2 1 4000 1 1 0 2
#> 3 Sample 2 chr1 1 4000 2 1 0 2
#> 4 Sample 2 chr2 1 4000 1 1 0 2
#> 5 Sample 3 chr1 1 4000 2 1 0 2
#> 6 Sample 3 chr2 1 4000 1 1 0 2
#> 7 Sample 4 chr1 1 4000 2 1 0 2
#> 8 Sample 4 chr2 1 4000 1 1 0 2
We can add the CNV data to the cevodata object using the
add_CNV_data()
function.
cd <- cd |>
add_CNV_data(cnvs)
cd
#> <cevodata> dataset: Demo data
#> Genome: unknown
#> SNV assays: snvs (default)
#> CNV assays: cnvs (default)
#> 4 cases, 4 samples, 1 sample per case
#> 16000 mutations total, 4000 +/- 0 mutations per case
#> Active models:
Now the CCF values can be calculated using the
calc_mutation_frequencies()
function:
cd <- cd |>
calc_mutation_frequencies()
#> 0 variants (0 %), have NA CCF value
which will add CCF and CCF/2 columns to the SNVs tibble. cevomod will use the CCF/2 values by default starting from now.
plot_SFS(cd) +
aes(fill = sex) +
scale_fill_manual(values = c(male = "#DD4124", female = "#00496F"))
#> Calculating SFS statistics
#> Calculating f intervals, using CCF/2 column
#> Warning in geom_bar(join_aes(bar_mapping, mapping), stat = "identity", alpha =
#> alpha, : Ignoring unknown aesthetics: width
In the new mutation frequency spectra, we can see that an average CCF/2 values of clonal peaks are now equal to 0.5 in all samples, and that no mutation has the CCF/2 value close to 1.
Intervalization of variant frequencies
Before fitting the models, we need to binarize the mutation frequency
values, which we can do using the
intervalize_mutation_frequencies()
function. cevomod will
use CCF/2 values if CCF has been calculated before, and VAF if the CCF
column does not is not found in the tibble.
intervalize_mutation_frequencies()
adds the f
and f_interval
column to the SNVs tibble. By default, the
number of bins is equal to the median sequencing coverage of the
variants in the sample. This allows to reduce aliasing noise in samples
with low sequencing depth and to analyze samples with higher coverage
with higher resolution. The desired number of bins can also be specified
manually using the bins
argument. We will also calculate
the SFS spectra in this step, so the downstream functions do not have to
do it on their own.
cd <- cd |>
intervalize_mutation_frequencies() |>
calc_SFS()
#> Calculating f intervals, using CCF/2 column
#> Calculating SFS statistics
SNVs(cd) |>
glimpse()
#> Rows: 16,000
#> Columns: 15
#> $ sample_id <chr> "Sample 1", "Sample 1", "Sample 1", "Sample 1", "Sample 1"…
#> $ chrom <chr> "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "c…
#> $ pos <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
#> $ gene_symbol <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ ref <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ alt <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ ref_reads <dbl> 25, 22, 21, 34, 62, 4, 33, 31, 25, 41, 42, 64, 32, 29, 42,…
#> $ alt_reads <dbl> 29, 19, 20, 2, 1, 3, 34, 30, 17, 31, 1, 1, 25, 31, 40, 1, …
#> $ impact <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ VAF <dbl> 0.536, 0.462, 0.488, 0.078, 0.030, 0.444, 0.507, 0.483, 0.…
#> $ CCF <dbl> 1.072, 0.924, 0.976, 0.156, 0.060, 0.888, 1.014, 0.966, 0.…
#> $ `CCF/2` <dbl> 0.536, 0.462, 0.488, 0.078, 0.030, 0.444, 0.507, 0.483, 0.…
#> $ f_interval <chr> "(0.529,0.549]", "(0.451,0.471]", "(0.471,0.49]", "(0.0588…
#> $ f <dbl> 0.5390, 0.4610, 0.4805, 0.0686, 0.0294, 0.4410, 0.5000, 0.…
#> $ DP <int> 54, 41, 41, 36, 63, 7, 67, 61, 42, 72, 43, 65, 57, 60, 82,…
Now we are ready to fit the cevomod models.
Model fitting
Once the SNVs are prepared, the models can be fitted using the
fit_*()
functions. Full models consist of the power-law
component and one or more binomial components, which are fitted
sequentially.
In this example, we first fit the power-law model with an exponent of 2, and then the mixture of binomial models. By default, clonal and subclonal components are fitted using BMix package (Caravagna et al., 2020). See other methods for fitting these components in Fitting models/Binomial components vignette.
cd <- cd |>
fit_powerlaw_tail_fixed() |>
fit_subclones()
#> Fitting binomial models using BMix
#> Warning: replacing previous import 'cli::num_ansi_colors' by
#> 'crayon::num_ansi_colors' when loading 'BMix'
#> Warning: replacing previous import 'crayon::%+%' by 'ggplot2::%+%' when loading
#> 'BMix'
#> ✔ Loading BMix, 'Binomial and Beta-Binomial univariate mixtures'. Support : <https://caravagnalab.github.io/BMix/>
#> Fitting williams neutral models...
#> Mf_1f's not calculated yet. Calculating with default bins
#> Calculating Williams's M(f) ~ 1/f statistics, using CCF/2 column
#> Warning: There was 1 warning in `reframe()`.
#> ℹ In argument: `fit_binomial_models_BMix(.data$data, N, pb, verbose)`.
#> ℹ In row 1.
#> Caused by warning:
#> ! replacing previous import 'cli::num_ansi_colors' by 'crayon::num_ansi_colors' when loading 'easypar'
cd
#> <cevodata> dataset: Demo data
#> Genome: unknown
#> SNV assays: snvs (default)
#> CNV assays: cnvs (default)
#> 4 cases, 4 samples, 1 sample per case
#> 16000 mutations total, 4000 +/- 0 mutations per case
#> Active models: powerlaw_fixed_subclones
A list of all fitted models can be obtained with:
get_model_names(cd)
#> [1] "SFS" "powerlaw_fixed"
#> [3] "powerlaw_fixed_subclones"
and the last fitted model is the active one:
active_models(cd)
#> [1] "powerlaw_fixed_subclones"
Models can be viewed with the get_models()
function:
get_models(cd, which = "powerlaw_fixed")
#> # A tibble: 4 × 11
#> sample_id model component from to length A b alpha rsquared best
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
#> 1 Sample 1 power… Neutral … 0.045 0.095 0.05 46.6 2339. 2 1.00 TRUE
#> 2 Sample 2 power… Neutral … 0.245 0.295 0.05 15.1 360. 2 0.989 TRUE
#> 3 Sample 3 power… Neutral … 0.285 0.335 0.0500 104. 609. 2 0.986 TRUE
#> 4 Sample 4 power… Neutral … 0.245 0.295 0.0500 78.8 747. 2 0.996 TRUE
The active model is returned if the which
argument is
left empty:
get_models(cd)
#> # A tibble: 13 × 16
#> sample_id model component from to length A b alpha rsquared
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Sample 1 binomia… Clone NA NA NA NA NA NA NA
#> 2 Sample 1 powerla… Neutral … 0.045 0.095 0.05 46.6 2339. 2 1.00
#> 3 Sample 2 binomia… Clone NA NA NA NA NA NA NA
#> 4 Sample 2 binomia… Subclone… NA NA NA NA NA NA NA
#> 5 Sample 2 powerla… Neutral … 0.245 0.295 0.05 15.1 360. 2 0.989
#> 6 Sample 3 binomia… Clone NA NA NA NA NA NA NA
#> 7 Sample 3 binomia… Subclone… NA NA NA NA NA NA NA
#> 8 Sample 3 binomia… Subclone… NA NA NA NA NA NA NA
#> 9 Sample 3 powerla… Neutral … 0.285 0.335 0.0500 104. 609. 2 0.986
#> 10 Sample 4 binomia… Clone NA NA NA NA NA NA NA
#> 11 Sample 4 binomia… Subclone… NA NA NA NA NA NA NA
#> 12 Sample 4 binomia… Subclone… NA NA NA NA NA NA NA
#> 13 Sample 4 powerla… Neutral … 0.245 0.295 0.0500 78.8 747. 2 0.996
#> # ℹ 6 more variables: best <lgl>, N <int>, cellularity <dbl>,
#> # N_mutations <int>, BIC <dbl>, sequencing_DP <dbl>
The tibble lists the components for all models and samples. Binomial components are called either the clones or the subclones. The component with the highest cellular frequency is called the clone, and the remaining components are called subclones. This distinction is important when estimating the subclonal evolutionary parameters: the main clones are not under positive selection in the tumor, so we only estimate the evolutionary parameters for the subclones.
Plot models
The fitted models can be visualized with the
plot_models()
function:
plot_models(cd) +
aes(fill = sex) +
scale_fill_manual(values = c(male = "#DD4124", female = "#00496F")) +
labs(title = "cevomod models")
#> Warning in geom_bar(join_aes(bar_mapping, mapping), stat = "identity", alpha =
#> alpha, : Ignoring unknown aesthetics: width
The model fitted to sample 2 is very inaccurate. The slope of the distribution is too steep to be approximated by the power-law exponent of 2. This sample should not be fitted with the model with the fixed power-law component. We will ignore this fit when estimating the evolutionary parameters.
Evolutionary parameters
The evolutionary parameters for the subclones can be calculated using the equations provided by Williams et al. (2018). Williams assumes an exponential tumor growth and a constant mutation rate, under which the power-law exponent equals 2, so we can use the equations with the models we fitted. In cevomod, we use the code and functions implemented in the MOBSTER package to calculate the parameters. We should not use the model for Sample 2, though. The mutation rate is severely underestimated, and the model fit is inaccurate. One should evaluate the fitted models carefully, before continuing with the analysis.
Mutation rates can be obtained with
get_mutation_rates()
, and the selection coefficients with
get_selection_coefficients()
functions. Both functions can
be run on the cevodata objects (with the correct models fitted), or on
the model tibbles directly. This allows us to manually correct the model
tibbles before the calculation of the evolutionary parameters.
For example, one can get filter Sample 2 out from the cevodata object:
cd |>
filter(sample_id != "Sample 2") |>
get_models() |>
get_mutation_rates()
#> # A tibble: 3 × 2
#> sample_id mutation_rate_williams
#> <chr> <dbl>
#> 1 Sample 1 46.6
#> 2 Sample 3 104.
#> 3 Sample 4 78.8
or from the models tibble:
cd |>
get_models() |>
filter(sample_id != "Sample 2") |>
get_selection_coefficients()
#> # A tibble: 4 × 8
#> sample_id mutation_rate_williams component N_mutations subclone_frequency
#> <chr> <dbl> <chr> <int> <dbl>
#> 1 Sample 3 104. Subclone 1 725 0.981
#> 2 Sample 3 104. Subclone 2 785 0.232
#> 3 Sample 4 78.8 Subclone 1 824 0.998
#> 4 Sample 4 78.8 Subclone 2 366 0.178
#> # ℹ 3 more variables: emergence_time <dbl>, time_end <dbl>, selection <dbl>