4. clindet workflow#
clindet post-processess outputs of cancer variant calling analysis pipelines from BAM from Tumor-Normal paired (Tumor-only model) and generates reports for researchers and curators at UMCCR.
It takes as input results from the UMCCR DRAGEN Tumor/Normal and DRAGEN Germline variant calling workflows:
BAM files from both samples
somatic small variant calls
germline small variant calls
somatic structural variant calls
4.1. Small variants (SNVs/Indels) (Somatic)#
This part of the workflow filters and prioritises small somatic variant calls. The idea of filtering is to remove most of the artefacts and germline leakage, but at the same time to be permissive towards known clinically important sites even if the variants are of low quality.
4.1.1. Summary#
Call candidate variants using multiple softwares.
Keep variants that pass DRAGEN’s default quality control thresholds.
Keep variants that are in the auto/sex/mito chromosomes (1-22, X, Y, M).
Rescue variants that are in hotspot sites according to SAGE.
Annotate variants with info from databases/files.
If the sample is hypermutated (i.e. has > 500K variants):
remove non-hotspot variants with
gnomad_AF >= 0.01
remove variants not overlapping gene regions (specified in a BED file)
Filter variants based on certain thresholds, for example:
Keep those with PCGR Tiers 1/2, known hotspots
Dump those with low AF (
TUMOR_AF < 0.1
)
4.1.2. Details#
Steps are:
Extract passing calls (with
PASS
in FILTER)Extract calls on main chromosomes (chr1-chr22, chrX, chrY, chrM)
Run SAGE v1.0 and add the result to the VCF. SAGE is a low-frequency variant caller with a high precision, created by Hartwig Medical Foundation. Instead of the whole genome, it targets only coding regions for inframe indels, and known hotspot sites from the following list:
Annotate the VCF against the reference sources:
High-confidence regions from the Genome in a Bottle benchmark;
GnomAD whole genome “common” variants (max population frequency > 1%);
Low complexity regions (LCR):
Homopolymers, STRs, VNTRs and other repetitive sequences, compiled from TRDB,
Regions compiled by Heng Li from 3 separate masks: low-complexity regions by mDUST and from UCSC repeatMasker plus flanking regions, structural mask (HWE+depth mask from 1000g plus flanking regions), and 75bp mappability mask.
Low and high-GC regions (regions < 30% or > 65% GC content), compiled by GA4GH
Bad promoter regions (compiled by GA4GH: “Anecdotal results suggested that many transcription start sites or first exons in the human genome tend to have poor coverage. By a systematic analysis of these regions we defined the 1,000 with the lowest relative coverage based on low coverage by an Illumina data set, which we term the ‘bad promoters’ list. The bad promoters are, like many exons, GC-rich, averaging 79% GC composition”);
Segmental duplication regions (UCSC);
UMCCR panel of normals, build from tumor-only mutect2 calls from ~200 normal samples.
If after removing non-hotspot GnomAD variants there are still > 500k somatic variants left flag the sample as highly mutated (or FFPE) and limit all calls to to cancer genes only (to avoid downstream R performance problems).
Standardise the VCF fields: add new
INFO
fieldsTUMOR_AF
,NORMAL_AF
,TUMOR_DP
,NORMAL_DP
,TUMOR_VD
,NORMAL_VD
(for use with PCGR), andAD FORMAT
field (for use with PURPLE).Run PCGR to annotate VCF against the external sources, and classify by tiers based on annotations and functional impact. At the end, this step adds
INFO
fields into the VCF:TIER
,SYMBOL
,CONSEQUENCE
,MUTATION_HOTSPOT
,INTOGEN_DRIVER_MUT
,TCGA_PANCANCER_COUNT
,CLINVAR_CLNSIG
,ICGC_PCAWG_HITS
,COSMIC_CNT
. The list of external sources used at this step is:VEP to infer the functional impact
TCGA and ICGC recurrent variants
ClinVar - Database of variants with clinical significance
CancerMine - Literature-derived database of tumor suppressor genes/proto-oncogenes
DoCM - Database of curated mutations
CBMDB - Cancer Biomarkers database
DisGeNET - Database of gene-tumor type associations
Cancer Hotspots - Resource for statistically significant mutations in cancer
dBNSFP - Database of non-synonymous functional predictions
UniProt/SwissProt KnowledgeBase - Resource on protein sequence and functional information
Pfam - Database of protein families and domains
DGIdb - Database of targeted cancer drugs
ChEMBL - Manually curated database of bioactive molecules
Filter variants to remove putative germline variants and artefacts, but make sure to keep known hotspots/actionable variants:
Keep variants called by SAGE in known hotspots (CGI, CiViC, OncoKB) regardless of other evidence;
Keep variants PCGR TIER 1 and 2 (strong and potential clinical significance, according to ACMG standard guidelines) regardless of other evidence;
Keep all driver mutations (Intogen); mutation hotspots; ClinVar pathogenic or uncertain (to include the mixed evidence); COSMIC count >=10; TCGA pancancer count >=5; ICGC PCAWG count >= 3 (all annotated by PCGR), regardless of other evidence;
For all other variants, apply the following LCR, PoN, depth and AF filters. Remove variants for which one or more of the following conditions apply:
AF<10%
,Common variant in GnomAD (max
population AF>=1%
), add into the germline set (see below);Present in >=5 samples of the Panel of Normal set;
InDel in a “bad promoter” regions (GA4GH: “Anecdotal results suggested that many transcription start sites or first exons in the human genome tend to have poor coverage. By a systematic analysis of these regions we defined the 1,000 with the lowest relative coverage based on low coverage by an Illumina data set, which we term the ‘bad promoters’ list. The bad promoters are, like many exons, GC-rich, averaging 79% GC composition);
Overlapping the ENCODE blacklist,
Variant depth
VD<4
;Variant depth
VD<6
, and the variant overlaps a low complexity region (see step 4 above);VarDict strand-biased variants (single strand support for ALT, while REF has both; or REF and ALT have opposite supporting strands), unless supported by all other callers.
Report passing variants using PCGR, classified by the ACMG tier system.
4.2. SNPs and small indels (Germline)#
The idea is to simply bring germline variants in cancer predisposition genes:
Take passing “ensemble” germline VCF from bcbio. “Ensemble” has variants supported by at least 2 of 3 callers (we use strelka2, vardict, and GATK Haplotype Caller).
Add back variants from somatic calling filtered as common GnomAD.
Subset variants to a list of ~200 cancer predisposition genes, which is build by CPSR from 3 curated sources: TCGA pan-cancer study, COSMIC CGC, and Norwegian Cancer Genomics Consortium.
Report variants using CPSR, which classifies variants in the context of cancer predisposition by overlapping with ClinVar pathogenic and VUS variants and GnomAD rare variants. It also ranks variants according pathogenicity score by ACMG and cancer-specific criteria. When a variant is annotated as having multiple functional depending on a context of a gene and a trascript, a higher impact events are prioritised, and if all things equal, APPRIS principal transcripts are preferred. In any case, one event per variant is reported. Events are reported in 4 tiers according to their significance:
ClinVar variants: pre-classified variants according to a five-level tier scheme (Pathogenic to Benign):
Non-ClinVar variants: classified by CPSR according to a five-level tier scheme (Pathogenic to Benign)
Secondary findings (optional) - pathogenic ClinVar variants in the ACMG recommended list for reporting of incidental findings,
GWAS hits (optional) - variants overlapping with previously identified hits in genome-wide association studies (GWAS) of cancer phenotypes (i.e. low to moderate risk conferring alleles), using NHGRI-EBI Catalog of published genome-wide association studies as the underlying source. The unclassified non-ClinVar variants are assigned a pathogenicity level based on the aggregation of scores according to previously established ACMG criteria. The ACMG criteria includes cancer-specific criteria, as outlined and specified in several previous studies.
4.3. Structural variants#
The idea is to report gene fusions, exon deletions, high impact and LoF events in tumor suppressors, and prioritise events in cancer genes.
Start with the somatic SV VCF from DRAGEN (called internally by a private Manta version). Keep only PASS variants.
Annotate variants:
with SnpEff.
with VEP
Use the Ensembl gene model and Sequence ontology terminology.
Subset annotations to APPRIS principal transcripts, keeping one main isoform per gene.
Prioritise variants with umccr/vcf_stuff. on a 4 tier system - 1 (high) - 2 (moderate) - 3 (low) - 4 (no interest):
exon loss
on cancer gene list (1)
other (2)
gene fusion
paired (hits two genes)
on list of known pairs (1) (curated by HMF)
one gene is a known promiscuous fusion gene (1) (curated by HMF)
on list of FusionCatcher known pairs (2)
other:
one or two genes on cancer gene list (2)
neither gene on cancer gene list (3)
unpaired (hits one gene)
on cancer gene list (2)
others (3)
upstream or downstream (a specific type of fusion, e.g. one gene is got into control of another gene’s promoter and get over-expressed (oncogene) or underexpressed (tsgene))
on cancer gene list genes (2)
LoF or HIGH impact in a tumor suppressor
on cancer gene list (2)
other TS gene (3)
other (4)
If the number of events is over 50K (e.g. FFPE), progressively remove TIER 4/3/2 SVs (in that order).
Refine SVs using Hartwig’s break-point-inspector, which locally re-assembles SV loci to get more accurate breakpoint positions and AF estimates.
Filter low-quality calls:
keep PASS FILTER
dump TIER 3/4 where
SR < 5 & PR < 5
dump TIER 3/4 where
SR < 10 & PR < 10 & (AF0 < 0.1 | AF1 < 0.1)
Use filtered variants as a guidance for PURPLE CNV calling (see below). PURPLE will adjust and recover breakpoints at copy number transitions, and adjust AF based on copy number, purity and ploidy.
Run umccr/vcf_stuff on the PURPLE output SVs.
Report tiered variants in the UMCCR cancer report.
4.4. Copy Number Variants#
The idea is to report significant CNV changes in key cancer genes and disruptions in tumor suppressors. And also calculate sample purity and ploidy profile.
We almost entirely rely on Hartwig’s PURPLE workflow in this step. The PURPLE pipeline outlines as follows:
Calculate B-allele frequencies (BAF) using AMBER subworkflow,
Calculate read depth ratios using COBALT subworkflow,
Perform segmentation (uses structural variant breakpoints for better guidance),
Estimate the purity and copy number profile (uses somatic variants for better fitting),
Plot a circos plot that visualises the CN/ploidy profile, as well as somatic variants, SVs, and BAFs,
Rescue structural variants in copy number transitions and filter single breakends,
Estimate overall tumor samples purity range,
Determine gender,
Report QC status of the sample, that will fail if the structural variants do not correspond to CN transitions, and gender is inconsistently called from BAFs and from the coverage.
From the PURPLE output, we report in the cancer report:
Circos plot
Minimal and maximal copy numbers in key cancer genes, that indicate amplifications/deletions as well as CN transitions that should match SVs
QC status
We also use Purity to adjust coverage reporting thresholds.
Genome-wide CNV segments with breakpoint information. Includes segment CN, minor/major allele ploidy, type of SV support at start/end of segment, CN determination method, BAF/BAF count, GC%, Cobalt window coverage
4.5. RNA-seq#
4.6. MultiQC#
MultiQC aggregates QC from different tools. We report the following:
Sample contamination level (for both tumor and normal) and T/N concordance (by Conpair),
Ancestry and sex (by Peddy),
Mapping QC: the number of mapped reads, paired reads, secondary or duplicated alignments, average coverage (using samtools stats and mosdepth in bcbio),
Viral content and integration sites (oncoviruses),
PURPLE stats: QC, sex, purity/ploidy, TMB and MSI statuses
Number of pre- and post-filtered SNPs and indels (by clindet) which indicates germline leakage,
Coverage profile by goleft,
Variants QC for filtered germline and somatic variants (by bcftools),
Reads QC (by FastQC).
We also include reference “good” samples in the background for comparison.
4.7. Coverage#
The idea is to see if we can miss variants due to abnormal coverage (e.g. because of copy numbers or abnormal GC).
Run
goleft indexcov
to plot fast global coverage overview from a BAM or CRAM index.Run
mosdepth
to calculate quantized coverage in exons of cancer genes if interest, using 4 groups for quantization: NO_COVERAGE, LOW_COVERAGE, CALLABLE, HIGH_COVERAGE. For tumor, the thresholds are adjusted by average purity (as reported by PURPLE). The low coverage threshold is 12x divided by purity (the minimal coverage to call a pure heterozugous variant), the high coverage threshold is 100 divided by purity (suspiciously high coverage, might indicate an LCR, a repeat, or a CN).Run CACAO using the same thresholds to calculate coverage in loci of interest. For germline variants, use ClinVar predisposition variants. For somatic variants, use CiViC actionable variants and cancerhotspots.org somatic hotspots. This step generates two HTML reports (one for somatic, one for germline variants).
4.8. Reports#
clindet produces 6 reports:
PCGR containing small somatic variants (SNPs and indels) classified according to ACMG guidelines, and MSI status of the sample.
CPSR containing small germline variants (SNPs and indels) in cancer predisposition genes, ranked by ACMG guidelines and cancer-specific criteria.
CACAO for tumor sample, reporting coverage for clinically actionable and pathogenic loci in cancer
CACAO for normal sample, reporting coverage in likely pathogenic variants cancer predisposition protein-coding genes
MultiQC report with QC stats and plots
UMCCR cancer report containing:
Somatic mutation profile (global and in cancer genes),
Mutational signatures (by the MutationalPatterns R package),
Structural variants,
Copy number variants,
PURPLE QC status,
Circos plot,
Oncoviral content and integration sites.
4.9. Key cancer genes#
For reporting and variant prioritization, we prepared a UMCCR cancer key genes set. It has been built off several sources:
Cancermine with at least 2 publication with at least 3 citations,
NCG known cancer genes,
Tier 1 COSMIC Cancer Gene Census (CGC),
CACAO hotspot genes (curated from ClinVar, CiViC, cancerhotspots),
At least 2 matches in the following 5 sources and 8 clinical panels:
Cancer predisposition genes (CPSR list)
COSMIC Cancer Gene Census (tier 2)
AZ300
Familial Cancer
OncoKB annotated
MSKC-IMPACT
MSKC-Heme
PMCC-CCP
Illumina-TS500
TEMPUS
Foundation One
Foundation Heme
Vogelstein
The result is a list of 1248 genes.
4.10. Snakemake Rules#
4.10.1. somatic.smk
#
run_sage
Runs
sage-2.2.jar
(jar included). Don’t actually think this gets run by default, probably was used in a project-specific context.input: BAMs (T/N),
hotspots_vcf
,coding_bed
,high_conf_bed
output:
work/{batch}/small_variants/sage2/{batch}.vcf.gz
somatic_vcf_pass_sort
Filters to only PASS and sorts
input:
batch_by_name[wc.batch].somatic_vcf
or above output VCFoutput:
work/{batch}/small_variants/pass_sort/{batch}-somatic.vcf.gz
somatic_vcf_select_noalt
Filters to variants within noalt regions (chr1-chr22, chrX, chrY, chrM)
input: above output VCF,
noalts_bed
output:
work/{batch}/small_variants/noalt/{batch}-somatic.vcf.gz
somatic_vcf_sage1
Runs
sage v1.0
sage script: umccr/vcf_stuff
sage Snakemake subworkflow: umccr/vcf_stuff
jar: umccr/vcf_stuff
input: above output (noalt) VCF, BAMs (T/N)
output:
work/{batch}/small_variants/sage1/{batch}-somatic.vcf.gz
{batch}/small_variants/sage1/{batch}-sage.vcf.gz
Snakemake subworkflow:
run_sage
sage v1.0 with T/N BAMs, hotspots, coding regions,
output:
work/call/{SAMPLE}-sage.vcf.gz
sage_rename_anno
sed
HOTSPOT -> SAGE_HOTSPOT
output:
work/rename_anno/{SAMPLE}-sage.vcf.gz
sage_reorder_samples
reorder samples, tumor first
output:
work/sage_reorder_samples/{SAMPLE}-sage.vcf.gz
(final Sage VCF)
sage_pass
keep PASS
output:
work/sage_pass/{SAMPLE}-sage.vcf.gz
sage_pass_novel
intersect Sage PASS VCF with noalt VCF, keep only those ‘novel’ variants found in Sage
output:
work/sage_pass_novel/{SAMPLE}-sage.vcf.gz
add_novel_sage_calls
concatenate above ‘novel’ with noalt VCF
output:
work/add_novel_sage_calls/{SAMPLE}.vcf.gz
sort_saged
sort above
output:
work/sort_saged/{SAMPLE}.vcf.gz
annotate_from_sage
input: above output VCF ‘vcf’ and final Sage VCF (‘sage’)
output:
work/annotate_from_sage/{SAMPLE}.vcf.gz
iterate through ‘sage’ variants and create a
sage_calls
dict with chr/pos/ref/alt keys and the cyvcf2 record as values, then for ‘vcf’ variants that are PASSed, annotate ‘SAGE_HOTSPOT’ and use a ‘PASS’ FILTER, else use the ‘SAGE_lowconf’ FILTER tag. Then set theFORMAT/DP
andFORMAT/AD
based on the ‘sage’ call. Something like that.
copy_result
copy above output VCF to
work/{batch}/small_variants/sage1/{batch}-somatic.vcf.gz
sage
copy
work/call/{SAMPLE}-sage.vcf.gz
to{batch}/small_variants/sage1/{batch}-sage.vcf.gz
somatic_vcf_annotate
input: above (work) output file if
batch_by_name[wc.batch].somatic_vcf
, else thenoalt
output:
work/{batch}/small_variants/annotate/{batch}-somatic.vcf.gz
work/{batch}/small_variants/somatic_anno/subset_highly_mutated_stats.yaml
Runs
anno_somatic_vcf
fromvcf_stuff
Snakemake subworkflow:
prep_hmf_hotspots
Filters HMF hotspots file from 17,875 in total down to 10,209 HMF variants
input:
hotspots
reference fileoutput:
somatic_anno/hmf_hotspot.vcf.gz
prep_anno_toml
Prepares TOML file for use with vcfanno
input: lots of reference files
output:
somatic_anno/tricky_vcfanno.toml
somatic_vcf_regions_anno
Runs vcfanno
input: TOML from above and the original (
sage1
) input VCFoutput:
somatic_anno/vcfanno/{SAMPLE}-somatic.vcf.gz
maybe_subset_highly_mutated
input: above output vcfanno’d VCF
output:
somatic_anno/subset/{SAMPLE}-somatic.vcf.gz
,stats.yaml
Count total vars
If that’s
<= 500K
, all good, just copy that to the next stepIf that’s
> 500K
:grab the
INFO/gnomad_AF
fieldif that exists, is
>= 0.01
, and is not a HMF/SAGE HOTSPOT, discard it
Write the count stats into the yaml output
somatic_vcf_clean_info
input: above output subset VCF
output:
somatic_anno/clean_info/{SAMPLE}-somatic.vcf.gz
process:
Add a
TRICKY
INFO field in the VCF metadata (proc_hdr
func)Remove
TRICKY_*
andANN
fields from VCF metadata (postproc_hdr
func)Remove
ANN
andTRICKY_*
INFO fields from variants, but join theTRICKY_*
fields into a single pipe separated field underINFO/TRICKY
.This basically reduces the VCF size substantially.
somatic_vcf_prep
input: above output cleaned VCF
output:
somatic_anno/prep/{SAMPLE}-somatic.vcf.gz
annotate
TUMOR/NORMAL_
-AF/DP/VD
fields for PCGR
sage_pon
input: above output VCF,
hmf_pon
VCF withPON_COUNT
/PON_MAX
/PON_TOTAL
columns for annotationoutput:
somatic_anno/sage_pon/{SAMPLE}-somatic.vcf.gz
annotates VCF with
SageGermlinePon.hg38.98x.vcf.gz
HMF PoN counts
somatic_vcf_pon_anno
input: above output VCF, and
panel_of_normals_dir
(withpon.snps.vcf.gz
andpon.indels.vcf.gz
)output:
somatic_anno/pon/{SAMPLE}-somatic.vcf.gz
annotates with
INFO/PoN_CNT
, optionallyINFO/PoN_CNT>=filter_hits with FILTER=PoN
somatic_vcf_pcgr_round1
input: above output VCF, and PCGR reference data (
pcgr_data
)output:
somatic_anno/pcgr_run/{SAMPLE}-somatic.pcgr_ready.vep.vcf.gz
somatic_anno/pcgr_run/{SAMPLE}-somatic.pcgr.snvs_indels.tiers.tsv
Runs PCGR (first time) via umccr/clindet
somatic_vcf_pcgr_anno
input:
above output VCF + tiers files
annotated VCF from
somatic_vcf_pon_anno
step
output:
somatic_anno/pcgr_ann/{SAMPLE}-somatic.vcf.gz
annotate with
PCGR_
-SYMBOL
,TIER
,CONSEQUENCE
,MUTATION_HOTSPOT
,PUTATIVE_DRIVER_MUTATION
,TCGA_PANCANCER_COUNT
,CLINVAR_CLNSIG
, andCOSMIC_CNT
,ICGC_PCAWG_HITS
,CSQ
annotate
input: above output VCF
output:
work/{batch}/small_variants/annotate/{batch}-somatic.vcf.gz
simply copies input to output
somatic_vcf_filter
Runs
filter_somatic_vcf
fromvcf_stuff
input: above final output VCF (
work/{batch}/small_variants/annotate/{batch}-somatic.vcf.gz
)output:
{batch}/small_variants/{batch}-somatic.vcf.gz
steps:
Keep where
INFO/PCGR_TIER in [1, 2]
Keep where
INFO/SAGE_HOTSPOT == 'known'
Keep where
INFO/TIER == 'HOTSPOT'
Dump where
INFO/TUMOR_VD < 4
(too few reads support variant)Dump where
INFO/PoN_CNT >= 5
(in PoN, likely germline or artefact)Report as Germline if is otherwise PASSed and
INFO/TUMOR_AF >= 0.2
See the
germline_leakage
rule ingermline.smk
Keep potential hotspots:
HMF_HOTSPOT
PCGR_INTOGEN_DRIVER_MUT
PCGR_MUTATION_HOTSPOT
PCGR_CLINVAR_CLNSIG
is ‘pathogenic’ or ‘uncertain’COSMIC_CNT >= 10
PCGR_TCGA_PANCANCER_COUNT >= 5
ICGC_PCAWG_HITS >= 5
Dump where
INFO/TUMOR_AF < 0.1
Dump where has
INFO/ENCODE
flag (hits ENCODE blocklist)Dump indels in homopolymers (
INFO/MSILEN
andINFO/MSI
plus formula)Dump where
INFO/TUMOR_VD < 6
and:variant in low complexity region (LCR)
no
INFO/HMF_GIAB_CONF
flagINFO/HMF_MAPPABILITY < 0.9
Dump indels in bad promoters
Dump strand biased variants based on Vardict
(DRAGEN) Dump where
INFO/TLOD < 15
Dump where
INFO/gnomAD_AF >= 0.01
Report as Germline if is otherwise PASSed
See the
germline_leakage
rule ingermline.smk
somatic_vcf_filter_pass
Keep only PASS from above filtered VCF
input: above output VCF
output:
{batch}/small_variants/{batch}-somatic-PASS.vcf.gz
The
{batch}/small_variants/{batch}-somatic-PASS.vcf.gz
VCF is also passed as input to:Pierian
vcf2maf.pl
bcftools stats
PCGR
4.10.2. pcgr.smk
#
run_pcgr
run PCGR with purity and ploidy as inferred by PURPLE
uses the
scripts/pcgr
wrapperraw PCGR outputs get renamed to remove the
_acmg.hg38
suffix
input:
{batch}/small_variants/{batch}-somatic-PASS.vcf.gz
PCGR reference data (
pcgr_data
)work/{batch}/purple/{batch}.purple.purity.tsv
output:
HTML:
work/{batch}/pcgr/{batch}-somatic.pcgr.html
VCF:
work/{batch}/pcgr/{batch}-somatic.pcgr.pass.vcf.gz
TSV:
work/{batch}/pcgr/{batch}-somatic.pcgr.snvs_indels.tiers.tsv
pcgr_copy_report
copy HTML and TSV (not VCF) into final
clindetd/{batch}/
directoryinput: above HTML and TSV outputs
output:
{batch}/{batch}-somatic.pcgr.html
{batch}/small_variants/{batch}-somatic.pcgr.snvs_indels.tiers.tsv
PCGR notes
Step 0: Validate input data and options
Runs
scripts/pcgr_validate_input.py
input: raw VCF input to PCGR
output:
pcgr_ready.vcf.gz
decomposed with no FORMAT or sample columns
Step 1: VEP annotation
Runs
vep
with lots of optionsinput:
pcgr_ready.vcf.gz
output:
pcgr_ready.vep.vcf.gz
Step 2: vcfanno precision oncology
Runs
pcgr_vcfanno.py
input:
pcgr_ready.vep.vcf.gz
output:
pcgr_ready.vep.vcfanno.vcf.gz
Step 3: Cancer gene annotations with pcgr-summarise
pcgr_summarise.py
vcf2tsv.py
Step 4: Generation of outputs/reports
Uses
pcgrr
functions via thepcgrr.R
CLI
4.10.3. structural.smk
#
sv_keep_pass
Reset INFO (
SIMPLE_ANN
,SV_HIGHEST_TIER
) and FILTER (Intergenic
,MissingAnn
,REJECT
) fields. Keep only PASS variants.input: final Manta VCF from DRAGEN tumor/normal workflow
output:
work/{batch}/structural/keep_pass/{batch}-manta.vcf.gz
sv_snpeff_maybe
Run snpEff if there is not an
INFO/ANN
annotation.use
-hgvs
,-cancer
,-csvStats
,-s
,-dataDir
optionselse, just copy input to output
input:
above output VCF
snpEff database
output:
work/{batch}/structural/snpeff/{batch}-sv-snpeff.vcf.gz
sv_vep
Run VEP on final Manta VCF
input:
final Manta VCF from DRAGEN tumor/normal workflow
PCGR data directory (for the VEP reference data)
FASTA file:
grch38/.vep/homo_sapiens/105_GRCh38/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
output:
work/{batch}/structural/vep/{batch}-sv-vep.vcf.gz
fix_snpeff
Replace the wrongly capitalised ALT fields with lowercase to avoid downstream bugs
TODO: turn the multiple
sed
commands into a single one with-e
.
input: snpEff output VCF
output:
work/{batch}/structural/snpeff/{batch}-sv-snpeff-fix.vcf.gz
sv_prioritize
Run umccr/vcf_stuff
input: above fixed snpEff VCF
output:
work/{batch}/structural/prioritize/{batch}-sv-eff-prio.vcf.gz
sv_subsample_if_too_many
If >= 50,000 SVs, progressively remove TIER 4/3/2 SVs (in that order).
TODO: compress output VCF
input: above sv prioritised VCF
output:
work/{batch}/structural/sv_subsample_if_too_many/{batch}-manta.vcf
sv_bpi_maybe
Run BPI on above output VCF
input: above output VCF, T/N BAMs
output:
work/{batch}/structural/maybe_bpi/{batch}-manta.vcf
filter_sv_vcf
Filters in sequence:
keep PASS FILTER
dump TIER 3/4 where
SR < 5 & PR < 5
dump TIER 3/4 where
SR < 10 & PR < 10 & (AF0 < 0.1 | AF1 < 0.1)
TODO: reorg this
input: above BPI output VCF
output:
work/{batch}/structural/filt/{batch}-manta.vcf.gz
This is fed to PURPLE.
reprioritize_rescued_svs
Run umccr/vcf_stuff after PURPLE. Also remove
INFO/ANN
annotation.input:
purple SVs:
work/{batch}/purple/{batch}.purple.sv.vcf.gz
output:
work/{batch}/structural/sv_after_purple/{batch}-manta.vcf.gz
copy_sv_vcf_ffpe_mode
Copy either the filtered or the purple prioritised to the final output
input: filtered or purple prioritised VCF
output:
{batch}/structural/{batch}-manta.vcf.gz
prep_sv_tsv
input: above final VCF
output:
{batch}/structural/{batch}-manta.tsv
Parse info from the VCF and export to TSV
give non-tiered variants an
INFO/SV_TOP_TIER
of4
.set
simple_ann = INFO/SIMPLE_ANN
, andPURPLE_status = ''
for purple inferred variants:
simple_ann = {INFO/SVTYPE}||||From_CNV|{tier}
PURPLE_status = 'INFERRED'
for purple recovered variants:
PURPLE_status = 'RECOVERED'
there are 23 columns:
caller = 'manta'
,sample = tumor name
,chrom = CHROM
,start = POS
,end = INFO/END
svtype = INFO/SVTYPE
split_read_support, paired_support_PE/PR = FORMAT/SR-PE-PR for tumor
AF_BPI = INFO/BPI_AF
,somaticscore = INFO/SOMATICSCORE
tier = INFO/SV_TOP_TIER
(or4
if missing)annotation = INFO/SIMPLE_ANN
PURPLE_
:AF
,CN
,CN_CHANGE
,PLOIDY
,STATUS
(INFERRED
orRECOVERED
)START/END_BPI = BPI_START/END
ID = ID
MATEID = INFO/MATEID
ALT = ALT[0]
4.10.4. oncoviruses.smk
#
viral_content
input:
tumor BAM
refdata
output1:
prioritized_oncoviruses.tsv
## Viral sequences (from https://gdc.cancer.gov/about-data/data-harmonization-and-generation/gdc-reference-files) found in unmapped reads
## Sample: MDX230148
## Minimal completeness: 50.0% at 1x or 1000bp at 5x
#virus size depth 1x 5x 25x significance
HPV45 7858 67.1 0.768 0.767 0.767 significant
HCV-2 9711 33.0 0.012 0.011 0.010 .
HPV71 8037 20.8 0.011 0.011 0.009 .
HCV-1 9646 6.3 0.011 0.010 0.010 .
HPV19 7685 1.9 0.010 0.009 0.009 .
[...]
output2:
present_viruses.txt
grab the
significant
viruses from above
HPV45
command:
oviraptor \
{tumor_bam} \
-o {work_dir} \
-s {tumor_name} \
--genomes-dir {refdata} \
--only-detect # just detect without finding integration sites, do that in next step
viral_integration_sites
input:
tumor BAM
present_viruses.txt
output from above
output1:
{batch}/oncoviruses/oncoviral_breakpoints.vcf
chr7 19105901 3_1 N N[HPV45:3062[ . PASS SVTYPE=BND;STRANDS=+-:44;EVENT=3;MATEID=3_2;CIPOS=-3,2;CIEND=-10,9;CIPOS95=0,0;CIEND95=0,0;SU=44;PE=34;SR=10;DisruptedGenes=TWIST1;GenesWithin100kb=AC003986.2,AC003986.3,TWIST1,AC003986.1 GT:SU:PE:SR ./.:44:34:10
chr7 19105936 4_1 N ]HPV45:1236]N . PASS SVTYPE=BND;STRANDS=-+:91;EVENT=4;MATEID=4_2;CIPOS=0,0;CIEND=-10,3;CIPOS95=0,1;CIEND95=0,0;IMPRECISE;SU=91;PE=64;SR=27;DisruptedGenes=TWIST1;GenesWithin100kb=AC003986.2,AC003986.3,TWIST1,AC003986.1 GT:SU:PE:SR ./.:91:64:27
chr13 98758355 5_1 N N]HPV45:5033] . PASS SVTYPE=BND;STRANDS=++:5;EVENT=5;MATEID=5_2;CIPOS=-10,9;CIEND=-10,9;CIPOS95=0,0;CIEND95=0,0;SU=5;PE=0;SR=5;GenesWithin100kb=SLC15A1,DOCK9-AS1 GT:SU:PE:SR ./.:5:0:5
HPV45 1237 4_2 N N[chr7:19105935[ . PASS SVTYPE=BND;STRANDS=+-:91;SECONDARY;EVENT=4;MATEID=4_1;CIPOS=-10,3;CIEND=0,0;CIPOS95=0,0;CIEND95=0,1;IMPRECISE;SU=91;PE=64;SR=27;DisruptedGenes=TWIST1;GenesWithin100kb=AC003986.2,AC003986.3,TWIST1,AC003986.1 GT:SU:PE:SR ./.:91:64:27
HPV45 3063 3_2 N ]chr7:19105900]N . PASS SVTYPE=BND;STRANDS=-+:44;SECONDARY;EVENT=3;MATEID=3_1;CIPOS=-10,9;CIEND=-3,2;CIPOS95=0,0;CIEND95=0,0;SU=44;PE=34;SR=10;DisruptedGenes=TWIST1;GenesWithin100kb=AC003986.2,AC003986.3,TWIST1,AC003986.1 GT:SU:PE:SR ./.:44:34:10
HPV45 5034 5_2 N N]chr13:98758354] . PASS SVTYPE=BND;STRANDS=++:5;SECONDARY;EVENT=5;MATEID=5_1;CIPOS=-10,9;CIEND=-10,9;CIPOS95=0,0;CIEND95=0,0;SU=5;PE=0;SR=5;GenesWithin100kb=SLC15A1,DOCK9-AS1 GT:SU:PE:SR ./.:5:0:5
command:
oviraptor \
{tumor_bam} \
-o {work_dir} \
-s {tumor_name} \
--genomes-dir {genomes_dir} \
-v $(cat present_viruses.txt)
oncoviruses_breakpoints_tsv
input:
{batch}/oncoviruses/oncoviral_breakpoints.vcf
output from abovework/{batch}/oncoviruses/present_viruses.txt
output from above-above
command:
Parse VCF from above, grab FORMAT PE/SR fields and add to
PAIR_COUNT
output1:
work/{batch}/oncoviruses/oncoviral_breakpoints.tsv
# A tibble: 6 × 10
sample contig start end svtype PAIR_COUNT DisruptedGenes UpstreamGenes ID MATEID
<chr> <chr> <dbl> <lgl> <chr> <dbl> <chr> <chr> <chr> <chr>
1 MDX230148 chr7 19105901 NA BND 44 TWIST1 AC003986.2, AC003986.3, AC003986.1 3_1 3_2
2 MDX230148 chr7 19105936 NA BND 91 TWIST1 AC003986.2, AC003986.3, AC003986.1 4_1 4_2
3 MDX230148 chr13 98758355 NA BND 5 NA SLC15A1, DOCK9-AS1 5_1 5_2
4 MDX230148 HPV45 1237 NA BND 91 NA . 4_2 4_1
5 MDX230148 HPV45 3063 NA BND 44 NA . 3_2 3_1
6 MDX230148 HPV45 5034 NA BND 5 NA . 5_2 5_1
4.10.4.1. oviraptor - umccr/oviraptor#
extract_unmapped_and_mate_unmapped_reads
input: tumor BAM
output:
step1_host_unmapped_or_mate_unmapped.namesorted.bam
command:
sambamba view -F 'unmapped or mate_is_unmapped' ... | samtools sort -n ...
unmapped_and_mate_unmapped_reads_to_fastq
input: output BAM from above rule
output:
step2_host_unmapped_or_mate_unmapped.R1.fq
step2_host_unmapped_or_mate_unmapped.R2.fq
step2_host_unmapped_or_mate_unmapped.single.fq
command:
samtools fastq input_bam ...
bwa_unmapped_and_mate_unmapped_reads_to_gdc
input:
paired FASTQs from above
gds viral FASTA (this is where EBV gets lost; see umccr/clindet#143)
output:
detect_viral_reference/host_unmapped_or_mate_unmapped_to_gdc.bam
command:
minimap2 \
-ax sr \
-Y \
-t{threads} \
-R '{params.rg}' \
{input.gdc_fa} {input.fq1} {input.fq2} \
| samtools sort -@{threads} -Obam -o {output.gdc_bam}
index_virus_bam
Just
samtools index
above output BAM
mosdepth
Runs mosdepth using the GDS viral FASTA index
output1:
thresholds.bed.gz
#chrom start end region 1X 5X 25X
CMV 0 235646 unknown 27 0 0
HBV 0 3215 unknown 0 0 0
HCV-1 0 9646 unknown 104 98 94
HCV-2 0 9711 unknown 116 104 98
HIV-1 0 9181 unknown 0 0 0
HIV-2 0 10359 unknown 0 0 0
KSHV 0 137969 unknown 28 27 0
HTLV-1 0 8507 unknown 0 0 0
MCV 0 5387 unknown 0 0 0
[...]
output2:
regions.bed.gz
CMV 0 235646 0.00
HBV 0 3215 0.00
HCV-1 0 9646 6.28
HCV-2 0 9711 33.05
HIV-1 0 9181 0.00
HIV-2 0 10359 0.00
KSHV 0 137969 0.00
HTLV-1 0 8507 0.00
MCV 0 5387 0.00
SV40 0 5243 0.00
[...]
prioritize_viruses
input: above mosdepth output files
output:
prioritized_oncoviruses.tsv