Strelka

December 3, 2020

Germline and somatic variant caller made by Illumina. All methods are optimized by default for whole genome DNA-Seq. RNA-Seq is still in development and not fully supported. For best somatic indel performance, Strelka is deisgned to be run with Manta. Somatic calls require a matched normal sample to make variant calls. The matached normal sample helps Strelka identify the germline variants versus somatic variants.

Manta

A structural variant and indel caller. Manta provides additional indel candidates to Strelka up to a given maximum indel size (49 by default).

Input

Strelka accepts BAM or CRAM. Reads lengths above 400bp have not been tested. The default settings in all workflows assume a whole genome DNA-Seq analysis. Input other than paired-end reads are ignored by default.

  • All input alignment and reference sequence files must contain the same chromosome names in the same order.
  • Alignments cannot contain the “=” character in the SEQ field.
  • RG (read group) tags are ignored – each alignment file must represent one sample.
  • Alignments with basecall quality values greater than 70 will trigger a runtime error (these are not supported on the assumption that the high basecall quality indicates an offset error)

Output

Put in output/strelka/results/variants/.

VCF 4.1 format.

Germline analysis is reported to the following variant files:

  • variants.vcf.gz

    • This describes all potential variant loci across all samples. Note this file includes non-variant loci if they have a non-trivial level of variant evidence or contain one or more alleles for which genotyping has been forced. Please see the multi-sample variants VCF section below for additional details on interpreting this file.
  • genome.S${N}.vcf.gz This is the genome VCF output for sample ${N}, which includes both variant records and compressed non-variant blocks. The sample index, ${N} is 1-indexed and corresponds to the input order of alignment files on the configuration command-line.

Somatic analysis provides somatic variants in the following two files:

  • somatic.snvs.vcf.gz
    • All somatic SNVs inferred in the tumor sample.
  • somatic.indels.vcf.gz
    • All somatic indels inferred in the tumor sample.

Run

Strelka is run in a two step procedure: 1) configuration and 2) execution.

In the configure step, you pass in the alignment file, reference data, output directory, and more like this:

{STRELKA_INSTALL_PATH}/bin/configureStrelkaGermlineWorkflow.py \
--bam NA12878.bam \
--referenceFasta hg19.fa \
--runDir ${STRELKA_ANALYSIS_PATH}

which creates a runWorkflow.py script with those settings in output/strelka/ using Pyflow. runWorkflow.py is then run in the execution step where you can pass in parameters like number of jobs:

{STRELKA_ANALYSIS_PATH}/runWorkflow.py -m local -j 8