Read Pre-Processing

September 15, 2020

Pre-processing is an umbrella term that can mean filtering out or trimming low-quality reads and reads contaminated with adapter sequences, and/or correcting for sequencing biases such as batch or lane effects.

Sequencing Biases

  • Batch effects include any errors introduced between random fragmentation of the DNA and loading of the sample onto the flow cell.
    • e.g., PCR amplification and reverse transcription artifacts.
  • Lane effects include any errors that occur from the point at which the sample is input to the flow cell until data are output from the sequencing machine.
    • e.g., systematically bad sequencing cycles and errors in base calling.
  • Source

Ways to Trim Reads

  • Hard Trim
    • Cut off reads after a certain number of bases
    • Computationally easy
    • Makes all reads the same length
  • Soft Trim
    • Trimming that considers quality scores in a sliding window that is usually ~10bp wide
    • Tools that do this
      • Sickle from UC Davis Bioinformatics
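
To make the difference between the two trimming styles concrete, here is a minimal Python sketch. It assumes Phred+33 quality strings, and the function names, the 10-base window, and the mean-quality threshold of 20 are illustrative choices of mine. This is a simplification and not Sickle's actual algorithm.

```python
def hard_trim(seq, qual, length):
    """Hard trim: keep only the first `length` bases and their quality characters."""
    return seq[:length], qual[:length]


def sliding_window_trim(seq, qual, window=10, min_mean_q=20, offset=33):
    """Soft trim: cut the read at the first window whose mean quality is too low.

    Assumes Phred+33 encoded qualities; window and threshold are illustrative.
    """
    scores = [ord(c) - offset for c in qual]
    window = min(window, len(scores))  # short reads are judged as one window
    if window == 0:
        return seq, qual
    for start in range(len(scores) - window + 1):
        mean_q = sum(scores[start:start + window]) / window
        if mean_q < min_mean_q:
            # Everything from the start of the failing window onward is dropped.
            return seq[:start], qual[:start]
    return seq, qual
```

Hard trimming ignores quality entirely, while the sliding window keeps good bases for as long as the local average holds up, so reads end up with different lengths.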

Sequence Duplication

  • Duplicates should be removed
  • There are two ways duplication can happen:
    • PCR duplication from uneven amplification
    • Optical duplication
      • One sequencing colony can look like two colonies to the sequencing machine
      • Illumina made this less of an issue on the NovaSeq by patterning the flow cells with nanowells, which fix where colonies can form
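
As a rough illustration of duplicate removal, the sketch below drops reads whose sequence has already been seen. This is a deliberately naive assumption on my part: it only catches exact sequence duplicates and cannot tell PCR duplicates from optical ones. Dedicated tools typically rely on alignment positions or flow cell coordinates instead. The function name and the (name, sequence, quality) record layout are hypothetical.

```python
def drop_exact_duplicates(records):
    """Keep the first read for each distinct sequence.

    `records` is an iterable of (name, sequence, quality) tuples
    (a hypothetical layout chosen for this example).
    """
    seen = set()
    unique = []
    for name, seq, qual in records:
        if seq in seen:
            continue  # an identical sequence was already kept
        seen.add(seq)
        unique.append((name, seq, qual))
    return unique
```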

Adapter Trimming

  • Adapters are almost always trimmed from the 3’ end instead of the 5’ end. Illumina reads don’t need the 5’ adapter trimmed because the sequencing primer binds to the adapter at the 5’ end, so sequencing by synthesis begins with the first base of the DNA insert.
  • Most adapter trimmers take into account the quality of the sequence at the 3’ end to better determine whether adapter sequence is present.
  • It’s good practice to run FastQC before and after trimming for comparison
  • Illumina adapter sequences
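
The idea of 3’ adapter trimming can be sketched in a few lines of Python. This is a minimal version under strong assumptions: it uses exact matching only, with no mismatch tolerance and no use of quality scores, which real trimmers do handle. The function name and the `min_overlap` parameter are mine.

```python
def trim_3prime_adapter(seq, qual, adapter, min_overlap=3):
    """Remove adapter sequence from the 3' end of a read (exact matches only)."""
    # Case 1: the whole adapter occurs inside the read (insert shorter than read).
    idx = seq.find(adapter)
    if idx != -1:
        return seq[:idx], qual[:idx]

    # Case 2: the read ends partway into the adapter, so look for the longest
    # adapter prefix that is also a suffix of the read.
    longest = min(len(adapter) - 1, len(seq))
    for k in range(longest, max(min_overlap, 1) - 1, -1):
        if seq.endswith(adapter[:k]):
            keep = len(seq) - k
            return seq[:keep], qual[:keep]

    # No adapter evidence found: leave the read untouched.
    return seq, qual
```

The `min_overlap` guard is there because a one- or two-base match at the end of a read happens by chance quite often, and trimming on it would throw away real insert sequence.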

Skewer

  • 2015
  • Adapter trimmer
  • Less popular, but cited in good papers, which suggests quality
  • Not maintained; GitHub issues are more or less ignored
  • Poor documentation

Trimmomatic

  • 2014
  • Adapter trimmer for Illumina reads
  • Fairly well documented
  • Not actively maintained by main developer
  • Somewhat maintained by other unrelated contributors
  • Popular, praised in the community

Cutadapt

  • 2011
  • Adapter trimmer
  • Well documented
  • Actively maintained
  • Some doubts about quality in the community

Fastp

  • 2018
  • https://github.com/OpenGene/fastp
  • All around pre-processor, many utilities
  • Very fast, supposedly 2x-5x faster than other pre-processors / trimmers
  • Well documented
  • Actively maintained. 800 stars on GitHub, >100 issues, >100 issues resolved
  • Hot right now

For each of these tools, you can find at least one source arguing that it is the best adapter trimmer / pre-processing tool. I have the highest hopes for fastp.

Run fastp

Parameters to use for paired-end short-read sequencing:

  • --adapter_sequence
  • --adapter_sequence_r2
  • --in1
  • --in2
  • --out1
  • --out2
  • --unpaired1
  • --unpaired2
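
Putting the parameters above together, a paired-end run could be assembled roughly as follows. This is a sketch: the helper name and the file names are hypothetical, and the adapter sequences shown are the commonly used Illumina TruSeq ones, which is an assumption about the library prep. Substitute the adapters that match your kit.

```python
import shlex


def build_fastp_command(in1, in2, out1, out2, unpaired1, unpaired2,
                        adapter_r1, adapter_r2):
    """Build the argument list for a paired-end fastp run (hypothetical helper)."""
    return [
        "fastp",
        "--in1", in1,
        "--in2", in2,
        "--out1", out1,
        "--out2", out2,
        "--unpaired1", unpaired1,
        "--unpaired2", unpaired2,
        "--adapter_sequence", adapter_r1,
        "--adapter_sequence_r2", adapter_r2,
    ]


# File names are placeholders; adapters are assumed to be TruSeq.
cmd = build_fastp_command(
    in1="sample_R1.fastq.gz",
    in2="sample_R2.fastq.gz",
    out1="sample_R1.trimmed.fastq.gz",
    out2="sample_R2.trimmed.fastq.gz",
    unpaired1="sample_R1.unpaired.fastq.gz",
    unpaired2="sample_R2.unpaired.fastq.gz",
    adapter_r1="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA",
    adapter_r2="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT",
)
print(shlex.join(cmd))  # show the shell command that would be run
```

To actually run it, pass the list to `subprocess.run(cmd, check=True)` on a machine where fastp is installed.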