GenePattern - BWA.aln (v2) BETA

This module is currently in beta release. The module and/or documentation may be incomplete.

A fast and accurate short-read alignment tool that allows for mismatches and gaps. Alignments are output in a SAM format file, which provides Phred-scale quality scores for each alignment.

Author: Heng Li, Broad Institute

Contact:

gp-help@broadinstitute.org

Algorithm Version: BWA 0.7.4

Summary

Burrows-Wheeler Aligner (BWA.aln) is a fast, light-weight tool that aligns relatively short nucleotide sequences against a long reference sequence such as the human genome. It works for query sequences shorter than 200bp, and does gapped alignment. BWA.aln is usually faster and more accurate on queries with low error rates.

This document is adapted from the BWA documentation for release 0.7.4. For more information about BWA.aln, see the BWA project site. BWA.aln was developed at the Wellcome Trust Sanger Institute and the Broad Institute.

Note: Index files created with BWA version 0.5.x or earlier are not compatible with the aligners of version 0.6.x and newer. Likewise, the BWA 0.6.x and newer index files are not compatible with the 0.5.x aligners. The BWA 0.7.x aligners are able to use index files created with 0.6.x, however.
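
The compatibility rule in the note above can be sketched as a small check. This is an illustration only: the function names are hypothetical, and the sketch assumes that any two versions on the same side of the 0.5.x/0.6.x boundary are compatible, which the note confirms only for 0.6.x indexes used with 0.7.x aligners.

```python
def index_generation(version):
    """Classify a BWA version string such as '0.7.4' by index format generation."""
    major, minor = (int(part) for part in version.split(".")[:2])
    # Per the note: 0.5.x and earlier use one index format, 0.6.x and newer another.
    return "old" if (major, minor) <= (0, 5) else "new"


def index_is_compatible(index_version, aligner_version):
    """Assumption: an index is usable when both versions share a generation."""
    return index_generation(index_version) == index_generation(aligner_version)
```

For example, index_is_compatible("0.6.2", "0.7.4") returns True, while index_is_compatible("0.5.9", "0.7.4") returns False.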

Speed

Alignment speed is largely determined by the error rate of the query sequences: it is faster for near-perfect hits and slower for higher error rates. Pairing is slower for shorter reads, mostly because shorter reads have more spurious hits.

In experimental runs, BWA was able to map 2 million 32bp reads to:

a bacterial genome in several minutes
the human X chromosome in 8-15 minutes
the human genome in 15-25 minutes

References

BWA manual page: http://bio-bwa.sourceforge.net/bwa.shtml.

Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics. 2009;25:1754-1760. [PMID: 19451168] (http://www.ncbi.nlm.nih.gov/pubmed/19451168)

Parameters

Name	Description
BWA index *	A BWA index. You can select from a list of hosted indexes or provide a custom index in the form of a ZIP bundle (as generated by the BWA.indexer module).
reads pair 1 *	Single-end or first paired-end reads file in FASTA, FASTQ, or BAM format. For paired-end data, this should be the forward ("*_1" or "left") input file. Note: the FASTA or FASTQ can be gzipped.
reads pair 2	The reverse ("*_2" or "right") input file for paired-end reads in FASTA, FASTQ, or BAM format. Note: the FASTA or FASTQ can be gzipped.
bam mapping	Specifies how to map BAM input. This is only required if the input file is in BAM format.
max edit distance	The max edit distance. This specifies a threshold of the maximum number of deletions, insertions, and substitutions needed to transform the reference sequence into the read sequence.
max num gap	Maximum number of gap opens. This specifies a threshold of the maximum number of gaps that can be initiated to match a given read to the reference.
max gap extension	Maximum number of gap extensions. This specifies a threshold of the maximum number of bases by which gaps in a read can be extended.
max deletion length	Disallow a long deletion within this many bp of the 3' end.
max indel length	Disallow an indel within this many bp of either end of the read.
seed length	Length of the seed. The bases at the high-quality (left) end of the read, up to this length, are taken as the seed.
max seed edit distance	Maximum edit distance in the seed; that is, the maximum number of changes required to transform the reference sequence of the seed into the read sequence of the seed.
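mismatch penalty	Mismatch penalty. BWA will not search for suboptimal hits with a score lower than the best score minus this penalty.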
gap open penalty	Gap open penalty. The gap open penalty is the score taken away for the initiation of the gap in sequence. To make the match more significant you can try to make the gap penalty larger.
gap extension penalty	Gap extension penalty. The gap extension penalty is added to the standard gap penalty for each base or residue in the gap. To reduce long gaps, increase the extension gap penalty. A few long gaps are expected, rather than many short gaps, so the gap extension penalty should be lower than the gap penalty. (The exception to this rule is where one or both sequences are single reads with possible sequencing errors, in which case many single base gaps are expected. To cope with this, try setting the gap open penalty very low and using the gap extension penalty to control gap scoring.)
max best hits	Proceed with suboptimal alignments if there are no more than this many equally best hits. This option only affects paired-end mapping. Increasing this threshold helps to improve the pairing accuracy at the cost of speed, especially for short reads (~32 bp).
iterative search *	Whether to disable iterative search. With iterative search disabled, all hits within the max edit distance are found, which makes alignment much slower.
trim reads	Specifies a quality threshold for read trimming. The trimming algorithm in BWA scans from the right of the read, accumulating a penalty sum (or "area") for each position that is lower quality than this threshold and reducing this area for each position above that threshold. The read is trimmed to the position where the penalty area is greatest.
Illumina 1 3 format *	Specifies whether the input is in the Illumina 1.3+ read format.
barcode length	Length of the barcode, starting from the 5' end.
max insert size	Specifies the maximum insert size for a read pair to be considered properly mapped.
max occurrences	Specifies the maximum occurrences of a read for pairing.
max alignments	Maximum number of alignments to output in the XA tag for properly paired reads.
max dc alignments	Maximum number of alignments to output in the XA tag for discordant read pairs (excluding singletons).
output prefix *	Prefix to use for the output file name.

* - required
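
The trimming procedure described for the trim reads parameter can be sketched as follows. This is a minimal illustration, not the module's code. The function names are hypothetical. Stopping the scan once the running sum turns negative and never trimming below a minimum length (min_len, set to 35 here) are assumptions about how BWA applies the rule. The decoding helper assumes the usual ASCII offsets: 33 for standard FASTQ and 64 for the Illumina 1.3+ format.

```python
def decode_qualities(qual_string, offset=33):
    """Convert an ASCII quality string to integer Phred scores.

    Assumption: offset is 33 for standard FASTQ and 64 for Illumina 1.3+.
    """
    return [ord(ch) - offset for ch in qual_string]


def trim_length(quals, threshold, min_len=35):
    """Return how many leading bases to keep after quality trimming."""
    if threshold < 1 or not quals:
        return len(quals)  # trimming disabled or nothing to trim
    area = 0               # running penalty sum, scanned from the right end
    best_area = 0
    keep = len(quals)
    for pos in range(len(quals) - 1, min_len - 1, -1):
        # Below-threshold bases add to the area; above-threshold bases reduce it.
        area += threshold - quals[pos]
        if area < 0:       # assumed stop condition once the sum turns negative
            break
        if area > best_area:
            best_area = area
            keep = pos     # trim at the position where the area is greatest
    return keep
```

With a threshold of 20, a 45-base read whose last five bases have quality 2 and whose other bases have quality 30 is trimmed to 40 bases, while a read with quality 30 throughout is left at full length.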

Input Files

BWA index
A set of BWA index files bundled as a ZIP archive, as produced by the BWA.indexer module. The GenePattern FTP site also hosts a number of index bundles, available in a dropdown selection (requires GenePattern 3.7.0+).
Note: these index files must have been produced by BWA version 0.6.x or 0.7.x.
reads pair 1
Single-end or first paired-end reads file in FASTA, FASTQ, or BAM format. For paired-end data, this should be the forward ("*_1" or "left") input file. Note: the FASTA or FASTQ can be gzipped.
reads pair 2
The reverse ("*_2" or "right") input file for paired-end reads in FASTA, FASTQ, or BAM format. Note: the FASTA or FASTQ can be gzipped.
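
Before submitting a custom index, it can help to confirm that the ZIP bundle holds a complete set of index files. The sketch below assumes the bundle contains the five files that BWA 0.6.x and 0.7.x write when indexing (.amb, .ann, .bwt, .pac, .sa); the exact layout produced by BWA.indexer is not described here, and the function names are hypothetical.

```python
import zipfile

# Suffixes written by BWA 0.6.x/0.7.x indexing (assumed complete for this sketch).
REQUIRED_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def missing_index_suffixes(names):
    """Return the required suffixes that no file name in `names` ends with."""
    return [
        suffix
        for suffix in REQUIRED_SUFFIXES
        if not any(name.endswith(suffix) for name in names)
    ]


def check_index_bundle(zip_path):
    """List the index suffixes missing from a ZIP bundle (empty list if none)."""
    with zipfile.ZipFile(zip_path) as bundle:
        return missing_index_suffixes(bundle.namelist())
```

An empty list from check_index_bundle suggests the bundle is complete; it does not verify which BWA version produced the files.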

Output Files

SAM file
The aligned sequences are output in SAM format. For more details on this alignment file, see the SAM format specification at http://samtools.sourceforge.net/SAM-1.3.pdf.
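
As a rough illustration of reading the output, the sketch below splits one SAM alignment line into its 11 mandatory fields and converts the Phred-scale mapping quality into an error probability. The helper names are hypothetical, and the field names follow the current SAM specification (older versions name the mate and insert-size fields MRNM, MPOS, and ISIZE).

```python
SAM_FIELDS = (
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
)
INT_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")


def parse_sam_line(line):
    """Parse one tab-separated alignment line (not a '@' header line)."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    for key in INT_FIELDS:
        record[key] = int(record[key])
    record["TAGS"] = cols[11:]  # optional fields such as NM:i:0 or XA:Z:...
    return record


def mapq_to_error_probability(mapq):
    """Phred scale: MAPQ = -10 * log10(P(wrong mapping)); 255 means unavailable."""
    return 10 ** (-mapq / 10)
```

For instance, a mapping quality of 20 corresponds to an estimated 1 in 100 chance that the read is placed incorrectly.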

Platform Dependencies

Task Type:
RNA-seq

CPU Type:
any

Operating System:
any

Language:
C;Perl

Version Comments

Version	Release Date	Description
2	2014-07-17	Beta: Updated to BWA 0.7.4, changed to use dynamic FTP-hosted index files, and switched to HTML-based doc
1	2011-05-02