RNA-seq Analysis


Overview

GenePattern offers a set of tools to support a wide variety of RNA-seq analyses, including short-read mapping, identification of splice junctions, transcript and isoform detection, quantitation, differential expression, quality control metrics, visualization, and file utilities. The tools released as GenePattern modules are widely used. We continue to release new and updated tools as they become available. To be informed when new capabilities are added, check this page or sign up for our Twitter feed.

How to Use the RNA-seq Tools

We recommend running these modules on a local GenePattern server because their input files are typically large. You can upload your data and use the file management features introduced in GenePattern 3.6, but large files take a while to upload, depending on your connection speed, data size, and available bandwidth. Alternatively, if you have a GenomeSpace account with data already stored there, you can link it to your account on the public GenePattern server and use the improved file management features in GenePattern 3.6.

COMPATIBILITY NOTE: A number of tools are built for Unix-based (Mac and Linux) systems and will not run on Windows machines. They are the Tuxedo suite tools (Bowtie, TopHat, Cufflinks, Cuffmerge, Cuffcompare, and Cuffdiff) and BWA.

You can install a local GenePattern server by doing the following:

  1. If you have not downloaded GenePattern and installed it on your local machine, follow the instructions on the Download GenePattern page.
  2. If you have already downloaded and installed a GenePattern server, you can install any of these modules from the GenePattern public repository, available from Modules & Pipelines > Install From Repository in the navigation bar of your GenePattern server.
  3. Enable the browsing of your GenePattern server's file path. This will allow you to send RNA-seq files to GenePattern modules without uploading them. See these instructions for more details.

Internal Broad Institute Server

Broad Institute members and collaborators can use the GPBroad server to send RNA-seq files directly to analysis modules. Community members can contact gp-help@broadinstitute.org to enable access to their RNA-seq files.

Reference Genomes

The TopHat, Bowtie, and BWA GenePattern modules provide pre-built reference genome indexes for a number of species. If you need an index for a species that is not hosted, email us at gp-help@broadinstitute.org. See this FAQ for more information on how to find other reference genome indexes.

Several of the modules accept reference genome annotation files (GTF files) and/or whole genome FASTA files.  A list of these is available on our FTP site:

To use one of these files in a GenePattern module, click the Specify URL radio button under the input box for the GTF file parameter, and paste in the URL for the annotation file you want to use.

RNA-seq Tools in GenePattern

Tuxedo Suite

GenePattern provides support for the Tuxedo suite of Bowtie, TopHat, and Cufflinks, as described in Trapnell et al. (2012) (Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks).

  • Bowtie (Version in GenePattern public repository:  2.1.0)

Bowtie is a short-read aligner geared toward quickly aligning large sets of short DNA sequences (reads) to large genomes. For more information, please refer to the Bowtie documentation. The GenePattern Bowtie modules consist of the following tools:

  • Bowtie.aligner: the Bowtie short-read alignment algorithm
  • Bowtie.indexer: Bowtie requires an indexed genome to run. The Bowtie.indexer module accepts a FASTA file containing the target genome to which reads will be mapped, and builds the required index files. The GenePattern module provides a large number of pre-built indexes.

  • TopHat (Version in GenePattern public repository: 2.0.11)

TopHat is a fast splice junction mapper.  TopHat uses Bowtie to map RNA-seq reads to a reference genome, then analyzes the mapping results to identify splice junctions between exons. For more information about the algorithm, please refer to the TopHat documentation.

  • Cufflinks (Version in GenePattern public repository: 2.0.2)

Cufflinks assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-seq samples. It accepts aligned RNA-seq reads and assembles the alignments into a parsimonious set of transcripts. Cufflinks then estimates the relative abundances of these transcripts based on how many reads support each one. For more information, please refer to the Cufflinks documentation. Cufflinks contains several accessory tools:

  • Cuffcompare (GenePattern module name: Cufflinks.cuffcompare): Cuffcompare helps analyze the transcribed fragments (transfrags) in an assembly by comparing assembled transcripts to a reference annotation and tracking Cufflinks transcripts across multiple experiments (e.g., across a time course). For more information, please refer to the Cufflinks documentation. (Note that this module is currently at Cufflinks version 1.3.0. The update to version 2.1.0 should be available in July. Please contact us for more information.)
  • Cuffmerge (GenePattern module name: Cufflinks.cuffmerge): Cuffmerge merges together several Cufflinks assemblies. It also runs Cuffcompare and automatically filters a number of transfrags that are probably artifacts.
  • Cuffdiff (GenePattern module name: Cufflinks.cuffdiff): Cuffdiff identifies significant changes in transcript expression, splicing, and promoter use.
  • Fpkm_trackingToGct: This module converts a Cufflinks FPKM_tracking file to GCT format, which can be used in many other tools in GenePattern. The FPKM_tracking file format is a tab-delimited format produced by Cufflinks.
  • Read_group_trackingToGct: This module converts a Cuffdiff v2.0.2 read_group_tracking file to a GCT file and a CLS class file. You can select which expression value column to use: raw fragments, internally scaled fragments, externally scaled fragments, or normalized FPKM values.
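
The kind of conversion that Fpkm_trackingToGct performs can be illustrated with a minimal Python sketch. This is not the module's implementation. The function name fpkm_tracking_to_gct is hypothetical, and the sketch assumes the usual Cufflinks/Cuffdiff column names (tracking_id, gene_short_name, and either FPKM or <sample>_FPKM) and the GCT 1.2 layout: a #1.2 version line, a line with the row and sample counts, and a header starting with Name and Description.

```python
import csv
import io


def fpkm_tracking_to_gct(tracking_text):
    """Convert the text of an FPKM tracking file to GCT 1.2 text (illustrative only)."""
    reader = csv.reader(io.StringIO(tracking_text), delimiter="\t")
    header = next(reader)

    # Expression columns are named "FPKM" (Cufflinks) or "<sample>_FPKM" (Cuffdiff).
    # Confidence-interval and status columns are ignored.
    fpkm_cols = [i for i, name in enumerate(header)
                 if name == "FPKM" or name.endswith("_FPKM")]
    samples = [header[i] if header[i] == "FPKM" else header[i][:-len("_FPKM")]
               for i in fpkm_cols]

    id_col = header.index("tracking_id")
    desc_col = header.index("gene_short_name")
    rows = [row for row in reader if row]

    lines = [
        "#1.2",                                         # GCT version line
        f"{len(rows)}\t{len(samples)}",                 # number of rows, number of samples
        "\t".join(["Name", "Description"] + samples),   # column header
    ]
    for row in rows:
        values = [row[i] for i in fpkm_cols]
        lines.append("\t".join([row[id_col], row[desc_col]] + values))
    return "\n".join(lines) + "\n"
```

To try it, read a tracking file into a string, pass it to fpkm_tracking_to_gct, and write the result to a file with a .gct extension.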

BWA (Version in GenePattern: 0.5.9)

 For more information, please refer to the BWA documentation. The GenePattern BWA modules consist of the following tools:

  • BWA.aln: This module executes the "aln" alignment option of BWA, which aligns Illumina sequence reads of up to 100 bp.
  • BWA.bwasw: This module executes the "bwasw" alignment option of BWA, which aligns sequences of 70 bp to 1 Mbp.
  • BWA.indexer: This module builds a BWA-compatible index from a set of DNA sequences in FASTA format.

Scripture

Scripture is a method for transcriptome reconstruction that relies solely on RNA-seq reads and an assembled genome to build a transcriptome ab initio. Scripture has been implemented in GenePattern as a pipeline containing several of the functions wrapped as individual modules. Please note: the modules must be executed as part of the Scripture pipeline. For more information, please refer to the Scripture documentation. Available Scripture pipelines are:

  • ScripturePrealigned
  • ScripturePipeline

RNA-SeQC

This module calculates useful metrics for determining the quality of RNA-seq data such as depth of coverage, rRNA contamination, continuity of coverage, and GC bias.  For more information, including a suggested workflow for preprocessing your data files, see the in-depth article about RNA-seq QC in GenePattern.

Integrative Genomics Viewer (IGV)

IGV is a visualization tool for interactive exploration of large, integrated genomic datasets. It supports a wide variety of data types including sequence alignments, microarrays, and genomic annotations. For more information, please refer to the IGV documentation.

Picard

The Picard tools are widely used utilities for manipulating SAM/BAM files, and we have wrapped a number of them for GenePattern. For more information on the SAM/BAM file format, see the SAMtools page. For more information about the Picard command-line tools, see the Picard site.

  • Picard.AddOrReplaceReadGroups: This module replaces all read groups in an input SAM or BAM file with a new read group provided by the user and assigns all reads to this read group in the output.
  • Picard.BamToSam: This module converts a BAM file to a SAM file. BAM is the binary version of SAM.
  • Picard.CreateSequenceDictionary: This module reads FASTA or FASTA.GZ files containing reference sequences, and writes them as a SAM file containing a sequence dictionary.
  • Picard.FastqToSam: This module converts a FASTQ file to SAM or BAM format. FASTQ format stores sequences and Phred quality scores in a single file.
  • Picard.MarkDuplicates: This module examines aligned records in a SAM or BAM file to locate duplicate reads. All records are then written to the output file with the duplicate records flagged.
  • Picard.ReorderSam: This module reorders reads in a SAM or BAM file to match the contig ordering in a provided reference file, as determined by exact name matching of contigs. Reads mapped to contigs absent from the new reference are dropped.
  • Picard.SamToBam: This module converts a SAM file to a BAM file. BAM is the binary version of SAM.
  • Picard.SamToFastq: This module converts a SAM or BAM file to FASTQ format with the Picard file conversion tool. FASTQ format stores sequences and Phred quality scores in a single file.
  • Picard.SortSam: This module sorts a SAM or BAM file according to a parameter specified by the user and outputs a sorted SAM or BAM file.
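
Records flagged as duplicates can be recognized downstream through the SAM FLAG field, where bit 0x400 marks a PCR or optical duplicate. The following minimal Python sketch counts flagged records in SAM text. The function name count_duplicates is hypothetical, and the sketch neither uses nor reproduces Picard. It only reads the flag that a duplicate-marking tool has already set.

```python
DUPLICATE_FLAG = 0x400  # SAM FLAG bit: PCR or optical duplicate


def count_duplicates(sam_text):
    """Return (duplicate_count, total_count) for the alignment records in SAM text."""
    duplicates = 0
    total = 0
    for line in sam_text.splitlines():
        if not line or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        total += 1
        if flag & DUPLICATE_FLAG:
            duplicates += 1
    return duplicates, total
```

This works on SAM text only. A BAM file would first need to be converted to SAM, for example with Picard.BamToSam.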

SAMtools

SAMtools are widely used utilities for manipulating alignments in the SAM format, including sorting, merging, indexing, and generating alignments in a per-position format. We have started to wrap these tools for GenePattern, and will continue to add to the SAMtools modules. For more information on the SAM/BAM file format or about the SAMtools utilities, see the SAMtools site.

  • SamTools.FastaIndex: This module indexes a reference sequence in FASTA format. The index file is given the extension FAI.
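
As an illustration of what such an index holds, the sketch below computes the five columns of a FAI record in Python: sequence name, length in bases, byte offset of the first base, bases per line, and bytes per line. The function name build_fai is hypothetical and this is not the SAMtools implementation. Unlike SAMtools, it assumes rather than checks that every line of a sequence except the last has the same length.

```python
def build_fai(fasta_bytes):
    """Return a list of (name, length, offset, linebases, linewidth) tuples."""
    records = []
    current = None
    pos = 0  # byte offset of the start of the current line
    for line in fasta_bytes.splitlines(keepends=True):
        stripped = line.rstrip(b"\r\n")
        if line.startswith(b">"):
            if current is not None:
                records.append(tuple(current))
            # The sequence name is the header text up to the first whitespace.
            name = stripped[1:].split()[0].decode()
            # The offset points at the first base: the byte after the header line.
            current = [name, 0, pos + len(line), 0, 0]
        elif current is not None and stripped:
            if current[3] == 0:
                current[3] = len(stripped)  # bases per line
                current[4] = len(line)      # bytes per line, including the newline
            current[1] += len(stripped)
        pos += len(line)
    if current is not None:
        records.append(tuple(current))
    return records


def format_fai(records):
    """Render records as tab-delimited FAI text, one sequence per line."""
    return "".join("\t".join(str(field) for field in rec) + "\n" for rec in records)
```

The byte offset and line widths are what allow a tool to seek directly to any position in a sequence without reading the whole FASTA file.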

Legacy Tool

ExprToGct: This module converts a file in EXPR format to GCT format. The EXPR file format is a tab-delimited format produced by Cufflinks version 1 (deprecated in Cufflinks version 2 and higher).
