Bioinformatics for the Molecular Lab

Overview

Next-generation sequencing produces enormous numbers of short DNA reads, but a pile of reads is not a result. Before a clinician can act on a finding, those reads must be cleaned, aligned to a known reference, scanned for differences, and interpreted. That conversion of raw instrument output into reportable variants is the work of bioinformatics, and it is as much a part of the assay as the chemistry on the instrument. This lesson builds directly on next-generation sequencing, which described how massively parallel chemistries generate millions of reads; here we follow those reads through the analysis pipeline [1].

It is useful to divide the work into three stages. Primary analysis happens on the instrument: raw signals (light, voltage, or fluorescence) are converted into base calls with associated confidence values. Secondary analysis takes those base calls and produces aligned reads and a list of variants. Tertiary analysis annotates and interprets the variants so they can be reviewed and reported. The sections below trace the secondary and tertiary stages, the file formats that carry data between them, and the quality metrics that tell you whether the result can be trusted.

The File Formats, in Order

Data moves through a sequencing pipeline as a short series of standard text or binary files. Knowing what each one holds — and the order they appear in — is the fastest way to understand the whole workflow.

  • FASTQ holds the raw reads. Each read is stored as four lines: an identifier, the base sequence, a separator, and a per-base string of quality characters. FASTQ is the output of primary analysis and the input to everything downstream.
  • FASTA holds the reference genome. It is a simple sequence file — a header line followed by the reference bases — against which the patient’s reads will be compared. The reference is fixed and curated, not generated per run.
  • SAM/BAM holds the alignments. SAM (Sequence Alignment/Map) is the human-readable text form; BAM is its compressed binary equivalent. These files record where each read mapped on the reference and how well it fit.
  • VCF holds the variant calls. The Variant Call Format lists positions where the sample differs from the reference, one variant per line, with supporting evidence and quality.

The pipeline reads FASTQ, consults the FASTA reference, writes BAM, and from the BAM produces a VCF. A tiny FASTQ record makes the raw input concrete:

@RUN1:lane1:read00042            <- identifier (the read name)
GATCACAGGTCTATCACCCTATTAACCAC    <- the called bases
+                                <- separator
IIIIIIIIIFFFFFAAAA<<<<7777,,,    <- per-base quality (one char per base)

The fourth line is the key idea of the next section: every single base carries its own confidence score.
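
The four-line layout can be read with a few lines of Python. What follows is a minimal sketch rather than a production parser: `FastqRecord` and `parse_fastq` are names chosen here for illustration, and the code assumes every record occupies exactly four lines, with no wrapped sequences.

```python
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str       # read identifier, without the leading "@"
    sequence: str   # called bases
    quality: str    # one quality character per base


def parse_fastq(lines: list[str]) -> Iterator[FastqRecord]:
    """Yield one record per group of four lines (assumes unwrapped records)."""
    cleaned = [line.rstrip("\n") for line in lines if line.strip()]
    if len(cleaned) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    for i in range(0, len(cleaned), 4):
        header, sequence, separator, quality = cleaned[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record number {i // 4 + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield FastqRecord(header[1:], sequence, quality)


# The record shown above, without the explanatory arrows.
example = [
    "@RUN1:lane1:read00042",
    "GATCACAGGTCTATCACCCTATTAACCAC",
    "+",
    "IIIIIIIIIFFFFFAAAA<<<<7777,,,",
]
records = list(parse_fastq(example))
print(records[0].name, len(records[0].sequence), len(records[0].quality))
```

The length check reflects the rule that the quality string carries exactly one character per called base.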

Quality Scores: Phred and Its Logarithmic Scale

A base call is only as useful as our confidence in it, and that confidence is captured by the Phred quality score, written Q. Phred is defined on a logarithmic scale relating the score to the probability P that the base call is wrong:

\[Q = -10 \times \log_{10}(P).\]

Because the scale is logarithmic, every increase of 10 in Q means a tenfold drop in error probability. The two values worth memorizing are:

  • Q20 corresponds to an error probability of 1 in 100 — about 99% accuracy for that base.
  • Q30 corresponds to an error probability of 1 in 1000 — about 99.9% accuracy.

Q30 is the common benchmark for high-quality data; a base called at Q30 is expected to be wrong only once in a thousand times. In the FASTQ file each quality character encodes one such score for the base in the same position, so quality is tracked base by base, not just per read.
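
The conversion between Q and P, and the decoding of a quality string, can be sketched in Python. The function names are illustrative, and the sketch assumes the widely used Phred+33 encoding, in which a character's ASCII code minus 33 gives its score; the text above does not state which offset a given instrument uses, so confirm it for your data.

```python
import math

PHRED_OFFSET = 33  # assumption: Phred+33 ASCII encoding


def phred_to_error_probability(q: float) -> float:
    """Invert Q = -10 * log10(P) to get P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Apply Q = -10 * log10(P) directly."""
    return -10 * math.log10(p)


def decode_quality(quality: str, offset: int = PHRED_OFFSET) -> list[int]:
    """Turn a FASTQ quality string into one integer score per base."""
    return [ord(ch) - offset for ch in quality]


scores = decode_quality("IIIIIIIIIFFFFFAAAA<<<<7777,,,")
print(scores[0], scores[-1])            # 'I' decodes to 40, ',' decodes to 11
print(phred_to_error_probability(30))   # about 0.001, i.e. 1 in 1000
```

Under this encoding the example read starts at Q40 and decays to Q11 at its 3' end, which is the kind of tail that trimming targets.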

Per-base quality drives the first real processing step. Reads often degrade toward their ends, and adapter sequence can be left over from library preparation. Trimming removes low-quality tails and adapter sequence, and filtering discards reads that fall below a quality threshold entirely. Cleaning the data here prevents low-confidence bases from generating spurious variant calls later [1].
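
A simplified sketch of tail trimming and read filtering is shown below. It is not the algorithm of any particular trimming tool: it removes trailing bases one at a time until it meets a base at or above the threshold, it does not handle adapters, and the thresholds (`min_q`, `min_mean_q`, `min_length`) and the Phred+33 offset are assumptions chosen for illustration.

```python
def trim_low_quality_tail(sequence: str, quality: str,
                          min_q: int = 20, offset: int = 33) -> tuple[str, str]:
    """Drop bases from the 3' end while their score is below min_q."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_q:
        end -= 1
    return sequence[:end], quality[:end]


def passes_filter(quality: str, min_mean_q: float = 20.0,
                  min_length: int = 20, offset: int = 33) -> bool:
    """Keep a read only if it is long enough and its mean score is adequate.

    Averaging Phred scores arithmetically is a simplification; it is not
    the same as the score of the average error probability.
    """
    if len(quality) < min_length:
        return False
    mean_q = sum(ord(ch) - offset for ch in quality) / len(quality)
    return mean_q >= min_mean_q


seq, qual = trim_low_quality_tail(
    "GATCACAGGTCTATCACCCTATTAACCAC",
    "IIIIIIIIIFFFFFAAAA<<<<7777,,,",
)
print(seq, len(seq), passes_filter(qual))
```

With these settings the three Q11 bases at the end of the example read are removed, leaving 26 bases that pass the filter.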

The Secondary Analysis Pipeline

With clean reads in hand, secondary and tertiary analysis proceed as an ordered pipeline. Each step consumes the output of the previous one.

FASTQ  --QC/trim-->  reads  --align-->  BAM  --call-->  VCF  --annotate-->  report
 raw                 clean              mapped          variants           interpreted

  1. Quality control of reads. Trim and filter as above, so only trustworthy bases enter alignment.
  2. Alignment (mapping). Each read is placed at its best-matching location on the reference genome. Aligners must tolerate real biological differences and sequencing errors while still finding the correct position, even across repetitive regions. The result is the BAM file.
  3. Variant calling. The aligned reads are examined position by position. Where the reads consistently disagree with the reference, the caller records a variant (a single-base substitution, an insertion, or a deletion) in the VCF. Distinguishing a true variant from a sequencing artifact is a statistical judgment based on how many reads support the change and how good their quality is [1].
  4. Annotation. Raw variant positions mean little on their own. Annotation adds biological and clinical context: which gene the variant falls in, its predicted effect on the protein (for example, silent, missense, or nonsense), how common it is in the general population (its population frequency), and what curated databases say about it. A variant common in healthy populations is unlikely to cause disease; a rare variant in a relevant gene warrants closer attention. This step rests on the molecular biology of how DNA changes alter proteins, the central-dogma reasoning covered earlier in the program [2].
  5. Filtering and interpretation. Finally, variants are filtered by quality, depth, and frequency, and the survivors are reviewed by a qualified analyst against clinical criteria to decide what, if anything, is reported.
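
The filtering in step 5 can be pictured as a simple predicate applied to each annotated variant. The sketch below is illustrative only: the `AnnotatedVariant` fields, the gene names, and every threshold are hypothetical, and a validated pipeline would take its cutoffs from its own validation data. Passing the filter only queues a variant for analyst review; it does not decide what is reported.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedVariant:
    gene: str                    # hypothetical gene label
    effect: str                  # e.g. "silent", "missense", "nonsense"
    call_quality: float          # caller's confidence in the variant
    depth: int                   # reads covering the position
    population_frequency: float  # fraction of the general population carrying it


def keep_for_review(variant: AnnotatedVariant,
                    min_quality: float = 30.0,
                    min_depth: int = 100,
                    max_population_frequency: float = 0.01) -> bool:
    """Apply quality, depth, and frequency cutoffs (all values hypothetical)."""
    return (variant.call_quality >= min_quality
            and variant.depth >= min_depth
            and variant.population_frequency <= max_population_frequency)


candidates = [
    AnnotatedVariant("GENE_A", "missense", 55.0, 480, 0.0002),
    AnnotatedVariant("GENE_B", "silent", 60.0, 510, 0.31),      # common in the population
    AnnotatedVariant("GENE_C", "nonsense", 12.0, 18, 0.0001),   # low quality and depth
]
for_review = [v for v in candidates if keep_for_review(v)]
print([v.gene for v in for_review])  # ['GENE_A']
```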

Read Depth, Coverage, and Variant Allele Fraction

A single read agreeing with or differing from the reference proves little; confidence comes from many reads covering the same place. Read depth (or coverage) at a position is simply the number of reads that span it. A position covered by 500 reads supports a far more reliable call than one covered by 5.

This is why pipelines enforce a minimum-coverage threshold: positions below it are flagged as having insufficient data rather than reported as reference. Low coverage does not mean “no variant” — it means “not enough information to say.” Equally important is uniformity of coverage: an assay that averages 500-fold depth is still unreliable if some clinically important positions receive only a handful of reads. Reporting average depth alone can hide such gaps, so labs examine coverage across every region of interest.
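
A small numerical sketch shows how an average can hide a gap. The region names, the depth values, and the minimum-coverage threshold of 100 are invented for illustration.

```python
def coverage_summary(depths_by_region: dict[str, list[int]],
                     min_depth: int = 100) -> tuple[float, dict[str, int]]:
    """Return the overall mean depth and, per region, the count of
    positions that fall below min_depth (regions with none are omitted)."""
    all_depths = [d for region in depths_by_region.values() for d in region]
    mean_depth = sum(all_depths) / len(all_depths)
    low_positions = {}
    for name, region in depths_by_region.items():
        n_low = sum(1 for d in region if d < min_depth)
        if n_low:
            low_positions[name] = n_low
    return mean_depth, low_positions


# Hypothetical per-position depths for two regions of interest.
depths = {
    "region_A": [800, 850, 900, 850],
    "region_B": [6, 4, 300, 290],
}
mean_depth, low_positions = coverage_summary(depths)
print(mean_depth)      # 500.0
print(low_positions)   # {'region_B': 2}
```

The mean is 500-fold, yet two positions in the second region have fewer than ten reads and would need to be flagged as insufficient data.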

Depth also sets the limit of detection through the variant allele fraction (VAF), the proportion of reads carrying the variant:

\[\text{VAF} = \frac{\text{variant reads}}{\text{total reads at that position}}.\]

A germline heterozygous variant sits near a VAF of 0.5, but somatic variants in a tumor, or low-level mixtures, can appear at much smaller fractions. The lowest VAF an assay can reliably detect is bounded by its depth: you cannot reliably detect a 1% variant if only 20 reads cover the position, because the expected number of variant reads at that depth is just 0.2. Most of the time a true variant would contribute no reads at all, and a single supporting read cannot be told apart from a sequencing error. A worked illustration:

position chr1:100,500   depth = 1000 reads
  reference (G):  950 reads
  variant   (A):   50 reads
  VAF = 50 / 1000 = 0.05  (5%)

At depth 1000, a 5% variant is supported by ~50 reads -> detectable.
At depth 20, that same 5% variant expects ~1 read -> indistinguishable from noise.

The practical rule is that detecting low-fraction variants requires deep, even coverage; the required depth rises sharply as the target VAF falls [1].
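
The link between depth and detectability can be made quantitative with a simple model. The sketch below assumes that each read independently carries the variant with probability equal to the VAF (a binomial model) and ignores sequencing error; the requirement of at least five supporting reads is a hypothetical calling threshold, not one taken from this lesson.

```python
from math import comb


def prob_at_least(k_min: int, depth: int, vaf: float) -> float:
    """Probability of observing at least k_min variant reads out of
    `depth` reads under a binomial model with success probability vaf."""
    p_fewer = sum(
        comb(depth, k) * vaf ** k * (1 - vaf) ** (depth - k)
        for k in range(k_min)
    )
    return 1 - p_fewer


# A 5% variant, requiring at least 5 supporting reads (hypothetical threshold).
print(prob_at_least(5, 1000, 0.05))  # very close to 1
print(prob_at_least(5, 20, 0.05))    # well below 1%

# Chance of seeing no variant read at all for a 1% variant at depth 20.
print(1 - prob_at_least(1, 20, 0.01))  # about 0.82
```

Under these assumptions the 5% variant is almost certain to clear the threshold at depth 1000 and almost certain to miss it at depth 20, while a 1% variant at depth 20 leaves no trace in roughly four out of five samples.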

A Laboratory-Operations Perspective

Because the pipeline is part of the assay, it is held to the same standards as any other laboratory process. A clinical bioinformatics pipeline must be validated end to end — its accuracy, limit of detection, and reproducibility established before patient use — and version-controlled, so that every reported result can be traced to the exact software and settings that produced it. Changing an aligner, a caller, or a filter setting can change which variants are reported, so such changes are documented and revalidated rather than made silently.

The reference databases used in annotation demand the same discipline. Gene definitions and the population-frequency and clinical-significance databases are updated continuously, so a lab must curate which versions it relies on and record them with each result. The broader framework for validating assays, controlling versions, and managing reference data is the subject of the laboratory-operations course; the point to carry forward here is that the bioinformatics is not an afterthought to sequencing but an integral, regulated component of the test.

Summary

Bioinformatics turns raw reads into reportable variants through an ordered pipeline: FASTQ reads are quality-controlled, aligned to a FASTA reference to produce a BAM, scanned for differences into a VCF, and then annotated and interpreted. Phred quality scores measure per-base confidence on a logarithmic scale, with Q30 meaning roughly 99.9% accuracy. Read depth and uniform coverage determine how reliable a call is and, through the variant allele fraction, set the lowest detectable variant level. Because all of this is part of the diagnostic assay, the pipeline must be validated, version-controlled, and built on curated reference data.

References

  1. Lela Buckingham. Molecular Diagnostics: Fundamentals, Methods, and Clinical Applications. 3rd ed. F.A. Davis Company. 2019.
  2. Bruce Alberts, Rebecca Heald, Alexander Johnson, David Morgan, Martin Raff, Keith Roberts, Peter Walter. Molecular Biology of the Cell. 7th ed. W. W. Norton & Company. 2022.