BWA.aln (v2) BETA

This module is currently in beta release. The module and/or documentation may be incomplete.

A fast and accurate short-read alignment tool that allows for mismatches and gaps. Alignments are output in a SAM format file, which provides Phred-scale quality scores for each alignment.

Author: Heng Li, Broad Institute

Contact:

gp-help@broadinstitute.org

Algorithm Version: BWA 0.7.4

Summary

Burrows-Wheeler Aligner (BWA.aln) is a fast, lightweight tool that aligns relatively short nucleotide sequences against a long reference sequence such as the human genome. It works for query sequences shorter than 200 bp and performs gapped alignment. BWA.aln is usually faster and more accurate on queries with low error rates.

This document is adapted from the BWA documentation for release 0.7.4.  For more information about BWA.aln, see the BWA project site. BWA.aln was developed at the Wellcome Trust Sanger Institute and the Broad Institute.

Note: Index files created with BWA version 0.5.x or earlier are not compatible with the aligners of version 0.6.x and newer.  Likewise, the BWA 0.6.x and newer index files are not compatible with the 0.5.x aligners.  The BWA 0.7.x aligners are able to use index files created with 0.6.x, however.
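
The compatibility rules in the note above can be expressed as a small check. This is a minimal sketch and not part of the module: the function name index_usable and the version-string parsing are illustrative assumptions, and any combination the note does not cover returns None rather than a guess.

```python
def _series(version):
    """Turn a version string such as '0.7.4' into a (major, minor) tuple."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def index_usable(index_version, aligner_version):
    """Return True/False when the note above decides the case, else None."""
    idx = _series(index_version)
    aln = _series(aligner_version)
    idx_old = idx <= (0, 5)   # index built with 0.5.x or earlier
    aln_old = aln <= (0, 5)   # aligner from 0.5.x or earlier
    if idx_old != aln_old:
        # Stated in the note: old and new generations do not mix.
        return False
    if idx == aln:
        # Assumption: an index built by the same release series is usable.
        return True
    if idx == (0, 6) and aln == (0, 7):
        # Stated in the note: 0.7.x aligners can use 0.6.x indexes.
        return True
    return None  # not covered by the note


if __name__ == "__main__":
    for pair in [("0.5.9", "0.7.4"), ("0.6.2", "0.7.4"), ("0.7.4", "0.6.2")]:
        print(pair, index_usable(*pair))
```

Returning None for uncovered cases keeps the sketch from claiming more than the note states, for example whether a 0.6.x aligner can read a 0.7.x index.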

Speed

Alignment speed is largely determined by the error rate of the query sequences: alignment is faster for near-perfect hits and slower for higher error rates. Pairing is slower for shorter reads, mostly because shorter reads have more spurious hits.

In experimental runs, BWA was able to map 2 million 32bp reads to:

  • a bacterial genome in several minutes
  • the human X chromosome in 8-15 minutes
  • the human genome in 15-25 minutes

References

BWA manual page: http://bio-bwa.sourceforge.net/bwa.shtml.

Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics. 2009;25:1754-1760. [PMID: 19451168] (http://www.ncbi.nlm.nih.gov/pubmed/19451168)

Parameters

Name Description
BWA index * A BWA index. You can select from a list of hosted indexes or provide a custom index in the form of a ZIP bundle (as generated by the BWA.indexer module).
reads pair 1 * Single-end or first paired-end reads file in FASTA, FASTQ, or BAM format.  For paired-end data, this should be the forward ("*_1" or "left") input file.  Note: the FASTA or FASTQ can be gzipped.
reads pair 2 The reverse ("*_2" or "right") input file for paired-end reads in FASTA, FASTQ, or BAM format. Note: the FASTA or FASTQ can be gzipped.
bam mapping Specifies how to map BAM input. This is only required if the input file is in BAM format.
max edit distance The max edit distance. This specifies a threshold of the maximum number of deletions, insertions, and substitutions needed to transform the reference sequence into the read sequence.
max num gap Maximum number of gap opens. This specifies a threshold of the maximum number of gaps that can be initiated to match a given read to the reference.
max gap extension Maximum number of gap extensions. This specifies a threshold of the maximum number of bases by which gaps in a read can be extended.
max deletion length Disallow a long deletion within this many bp of the 3' end.
max indel length Disallow an indel within this many bp of the ends.
seed length The number of bases at the high-quality (left) end of the read to use as the seed.
max seed edit distance Maximum edit distance in the seed; that is, the maximum number of changes required to transform the reference sequence of the seed into the read sequence of the seed.
mismatch penalty Mismatch penalty.
gap open penalty Gap open penalty.  The gap open penalty is the score taken away for the initiation of the gap in sequence. To make the match more significant you can try to make the gap penalty larger.
gap extension penalty Gap extension penalty.  The gap extension penalty is added to the standard gap penalty for each base or residue in the gap.  To reduce long gaps, increase the extension gap penalty. A few long gaps are expected, rather than many short gaps, so the gap extension penalty should be lower than the gap penalty. (The exception to this rule is where one or both sequences are single reads with possible sequencing errors, in which case many single base gaps are expected. To cope with this, try setting the gap open penalty very low and using the gap extension penalty to control gap scoring.)
max best hits Proceed with suboptimal alignments if there are no more than this many equally best hits. This option only affects paired-end mapping.  Increasing this threshold helps to improve the pairing accuracy at the cost of speed, especially for short reads (~32 bp).
iterative search * Whether to disable iterative search.  Disabling iterative search makes BWA find all hits within the max edit distance, which slows the alignment process.
trim reads Specifies a quality threshold for read trimming.  The trimming algorithm in BWA scans from the right of the read, accumulating a penalty sum (or "area") for each position that is lower quality than this threshold and reducing this area for each position above that threshold.  The read is trimmed to the position where the penalty area is greatest.
Illumina 1 3 format * The input is in the Illumina 1.3+ read format.
barcode length Length of the barcode, starting from the 5' end.
max insert size Specifies the maximum insert size for a read pair to be considered properly mapped.
max occurrences Specifies the maximum number of occurrences of a read for pairing.
max alignments Maximum number of alignments to output in the XA tag for properly paired reads.
max dc alignments Maximum number of alignments to output in the XA tag for discordant read pairs (excluding singletons).
output prefix * Prefix to use for the output file name.

* - required
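
The trimming rule described for the trim reads parameter can be illustrated with a short sketch. It follows the description in the table (scan from the right, accumulate threshold minus quality, keep the position where the running area is greatest) and is not a copy of BWA's source. The function names, the min_len parameter, and the example quality string are hypothetical; the offset of 64 for Illumina 1.3+ input is the usual convention for that format, while 33 is the standard FASTQ offset.

```python
def decode_quals(qual_string, offset=33):
    """Convert a FASTQ quality string to integer Phred scores.

    Use offset=64 for Illumina 1.3+ encoded input (assumption: the caller
    knows which encoding the file uses).
    """
    return [ord(ch) - offset for ch in qual_string]


def trim_length(quals, threshold, min_len=0):
    """Return how many leading bases to keep after quality trimming.

    min_len is a hypothetical lower bound on the retained read length.
    """
    if threshold < 1 or not quals:
        return len(quals)          # trimming disabled or nothing to trim
    area = 0                       # running penalty area from the 3' end
    best_area = 0
    keep = len(quals)              # default: keep the whole read
    for pos in range(len(quals) - 1, min_len - 1, -1):
        area += threshold - quals[pos]   # low quality adds, high quality subtracts
        if area < 0:
            break                  # good bases now outweigh the bad tail
        if area > best_area:
            best_area = area
            keep = pos             # trim everything from pos onward
    return keep


if __name__ == "__main__":
    quals = decode_quals("IIII++&")      # made-up read: 40,40,40,40,10,10,5
    print(trim_length(quals, threshold=20))
```

In this made-up example the three low-quality bases at the right end are removed and the first four bases are kept.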

Input Files

  1. BWA index
    A set of BWA index files bundled as a ZIP archive, as produced by the BWA.indexer module.  The GenePattern FTP site also hosts a number of index bundles, available in a dropdown selection (requires GenePattern 3.7.0+).
    Note: these index files must have been produced by BWA version 0.6.x or 0.7.x.
  2. reads pair 1
    Single-end or first paired-end reads file in FASTA, FASTQ, or BAM format.  For paired-end data, this should be the forward ("*_1" or "left") input file.  Note: the FASTA or FASTQ can be gzipped.
  3. reads pair 2
    The reverse ("*_2" or "right") input file for paired-end reads in FASTA, FASTQ, or BAM format. Note: the FASTA or FASTQ can be gzipped.
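
For readers who want to see how these inputs relate to the underlying tool, the sketch below assembles the kind of bwa aln and bwa samse/sampe command lines that correspond to the inputs and parameters above. It is an illustration under assumptions, not the module's actual wrapper: the mapping from module parameter names to command-line flags is inferred from the BWA manual page, the function and file names are hypothetical, and the index is assumed to have been unzipped so that it can be addressed by a prefix. The commands are only built as strings, not executed.

```python
import shlex

# Assumed mapping of module parameter names to bwa aln flags (per the BWA manual).
ALN_FLAGS = {
    "max edit distance": "-n", "max num gap": "-o", "max gap extension": "-e",
    "max deletion length": "-d", "max indel length": "-i", "seed length": "-l",
    "max seed edit distance": "-k", "mismatch penalty": "-M",
    "gap open penalty": "-O", "gap extension penalty": "-E",
    "max best hits": "-R", "trim reads": "-q", "barcode length": "-B",
}
# Assumed mapping of the pairing parameters to bwa sampe flags.
SAMPE_FLAGS = {
    "max insert size": "-a", "max occurrences": "-o",
    "max alignments": "-n", "max dc alignments": "-N",
}


def _flags(params, table):
    """Translate the parameters found in `table` into flag/value arguments."""
    args = []
    for name, value in params.items():
        if name in table:
            args += [table[name], str(value)]
    return args


def build_commands(index_prefix, reads1, reads2=None, params=None, prefix="out"):
    """Return shell command strings for a single-end or paired-end run."""
    params = params or {}
    aln = ["bwa", "aln"] + _flags(params, ALN_FLAGS)
    sai1 = prefix + "_1.sai"
    sam = prefix + ".sam"
    cmds = [shlex.join(aln + [index_prefix, reads1]) + " > " + shlex.quote(sai1)]
    if reads2 is None:
        samse = ["bwa", "samse", index_prefix, sai1, reads1]
        cmds.append(shlex.join(samse) + " > " + shlex.quote(sam))
    else:
        sai2 = prefix + "_2.sai"
        cmds.append(shlex.join(aln + [index_prefix, reads2]) + " > " + shlex.quote(sai2))
        sampe = (["bwa", "sampe"] + _flags(params, SAMPE_FLAGS)
                 + [index_prefix, sai1, sai2, reads1, reads2])
        cmds.append(shlex.join(sampe) + " > " + shlex.quote(sam))
    return cmds


if __name__ == "__main__":
    for cmd in build_commands("hg19", "r_1.fq", "r_2.fq",
                              {"max edit distance": 2, "max insert size": 500},
                              prefix="sample"):
        print(cmd)
```

Paired-end input produces two aln steps followed by one sampe step; single-end input produces one aln step followed by samse.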

Output Files

  1. SAM file
    The aligned sequences are output in SAM format. For more details on this alignment file, see the SAM format specification at http://samtools.sourceforge.net/SAM-1.3.pdf.
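
As a reading aid for the output file, the sketch below splits one SAM alignment line into its eleven mandatory tab-separated fields and converts the Phred-scaled mapping quality into an error probability. The record type, the function names, and the example line are made up for illustration; header lines (those starting with "@") are not handled.

```python
from collections import namedtuple

SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual tags",
)


def parse_sam_line(line):
    """Parse one alignment line of a SAM file into a SamRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    return SamRecord(
        qname=fields[0], flag=int(fields[1]), rname=fields[2],
        pos=int(fields[3]), mapq=int(fields[4]), cigar=fields[5],
        rnext=fields[6], pnext=int(fields[7]), tlen=int(fields[8]),
        seq=fields[9], qual=fields[10],
        tags=fields[11:],          # optional TAG:TYPE:VALUE fields, e.g. XA
    )


def mapq_error_probability(mapq):
    """Phred scale: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-mapq / 10)


if __name__ == "__main__":
    # Made-up example line, not real module output.
    line = "read1\t0\tchrX\t100\t30\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0"
    rec = parse_sam_line(line)
    print(rec.rname, rec.pos, rec.mapq, mapq_error_probability(rec.mapq))
```

A mapping quality of 30 corresponds to roughly a 1 in 1000 chance that the reported position is wrong. Note that the SAM specification reserves a mapping quality of 255 to mean the value is unavailable, so it should not be converted this way.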

Platform Dependencies

Task Type:
RNA-seq

CPU Type:
any

Operating System:
any

Language:
C;Perl

Version Comments

Version Release Date Description
2 2014-07-17 Beta: Updated to BWA 0.7.4, changed to use dynamic FTP-hosted index files, and switched to HTML-based doc
1 2011-05-02