GenePattern - CummeRbund.SelectedGeneReport (v1) BETA

This module is currently in beta release. The module and/or documentation may be incomplete.

Cuffdiff visualization package providing plots based on a single user-specified gene.

Author: Loyal Goff, MIT Computer Science and Artificial Intelligence Lab

Contact:

gp-help@broadinstitute.org

Algorithm Version: 2.0.0

Summary

CummeRbund is a visualization package designed to help you navigate through the many inter-related files produced from a Cuffdiff RNA-Seq differential expression analysis and visualize the relevant results. CummeRbund helps promote rapid analysis of RNA-Seq data by aggregating, indexing, and allowing you to easily visualize and create publication-ready figures of your RNA-Seq data.

CummeRbund works with the output of the Cuffdiff module, processing its output files into a database to be used for reporting and plotting. The results are indexed to speed up access to specific feature data (genes, isoforms, transcript start sites, coding sequences, etc.) and to preserve the various relationships between these features. Creation of this database means that the expression values and other results are stored in a rapidly accessible form, quickly searchable for future use in the other CummeRbund reporting modules or downloadable for direct use in CummeRbund with R for custom reports. For more details about CummeRbund, see the website and manual.

There are four CummeRbund modules available, each allowing you to examine your Cuffdiff results from a different perspective. All of the modules allow reporting at either the aggregate or the replicate level and can present quantification metrics at the level of genes, isoforms, transcription start sites, or coding sequences. The CummeRbund.QcReport provides high-level visualizations allowing for comparisons across all conditions and all genes - for example, to look at the distribution of expression values across conditions - to spot similarities and differences and to see the relationship between conditions.

The other modules allow you to focus on specific conditions and/or genes. The CummeRbund.SelectedConditionReport provides visualizations across all genes, but limited to a specific set of conditions so that you can compare individual condition pairs. The CummeRbund.GeneSetReport allows you to focus on a specific list of genes to be visualized, while the CummeRbund.SelectedGeneReport is focused on a single user-chosen gene. Both the GeneSetReport and the SelectedGeneReport can be further constrained to a selected set of conditions. The plots provided by each module differ based on the slice of data being examined; the available visualizations vary for reasons of both performance and practicality of visual presentation.

CummeRbund is a collaborative effort between the Computational Biology group led by Manolis Kellis at MIT's Computer Science and Artificial Intelligence Laboratory, and the Rinn Lab at the Harvard University department of Stem Cells and Regenerative Medicine (see http://compbio.mit.edu/cummeRbund/). This document is adapted from the CummeRbund manual for release 2.0.0.

Usage

Unlike most modules in GenePattern, the CummeRbund reporting modules require the entire output of a Cuffdiff job as they work with not just one or two files but rather with all of the Cuffdiff output files. Simply drag the top-level Cuffdiff job folder into the 'cuffdiff.input' parameter from the 'Jobs' tab ('Recent Jobs' in versions of GenePattern before 3.8.0) or from the Job Results page. The CummeRbund modules can also be directly accessed from the context menu of jobs in either of these locations. Remember, you are submitting the entire job folder and not just a single file.

Alternatively, once a given job has been run through any one of the CummeRbund reporting modules, a reusable database file named cuffData.db will be produced that can be submitted in place of the Cuffdiff job for other CummeRbund reports. You can use this file for job submission via all of the usual GenePattern mechanisms, or you can submit the entire CummeRbund job folder for a subsequent CummeRbund job in the same way as described above for Cuffdiff jobs. You are highly encouraged to reuse these database files wherever possible, as your jobs will run much faster and use less storage space than jobs started from scratch with a Cuffdiff job.
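The Output Files section describes cuffData.db as an RSQLite database, so it should be readable as an ordinary SQLite file outside of GenePattern. The following is a minimal Python sketch for checking what a downloaded copy contains. The table names inside the file are not documented on this page, so the sketch makes no assumption about them and simply lists whatever it finds; the function name list_tables is hypothetical.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return the sorted names of all tables in a SQLite database file."""
    # Build a file: URI and open it read-only, so inspecting the
    # database can never modify it.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Calling list_tables("cuffData.db") on a downloaded copy returns the table names as a list of strings, which is a quick way to confirm that the file is intact before submitting it to another CummeRbund job.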

CummeRbund.SelectedGeneReport will produce a variety of result files in the form of both plots and text tables; these are described further in the Output Files section below. You can use the feature.level parameter to control whether these should be generated at the level of genes, isoforms, transcript start sites (TSS), or coding sequences (CDS), although the Similarity plots will always be generated at the "genes" level.

The report.as.aggregate parameter controls whether reporting is done for individual replicates or for aggregate condition/sample values. The default is to use aggregate sample values. As with feature.level, however, the Similarity plots ignore this setting and are always generated for aggregate samples.
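The Similarity plots rank genes by how closely their expression profile across conditions matches that of the selected gene. This page does not state which similarity measure is used. The sketch below assumes the Jensen-Shannon distance between FPKM profiles normalized to sum to 1, based on the cummeRbund package's description of its findSimilar function; treat it as an illustration of the idea rather than the module's actual implementation. All function names and gene IDs are hypothetical.

```python
import math


def _normalize(profile):
    """Scale a non-negative expression profile so that it sums to 1."""
    total = sum(profile)
    if total <= 0:
        raise ValueError("profile must have a positive total")
    return [value / total for value in profile]


def _entropy(dist):
    """Shannon entropy in bits, treating 0 * log(0) as 0."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


def js_distance(profile_a, profile_b):
    """Jensen-Shannon distance (square root of the JS divergence)."""
    p = _normalize(profile_a)
    q = _normalize(profile_b)
    midpoint = [(x + y) / 2 for x, y in zip(p, q)]
    divergence = _entropy(midpoint) - (_entropy(p) + _entropy(q)) / 2
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(divergence, 0.0))


def find_similar(target_id, profiles, count):
    """Return up to `count` gene IDs whose profiles are closest to the target."""
    target = profiles[target_id]
    scored = [
        (js_distance(target, profile), gene_id)
        for gene_id, profile in profiles.items()
        # Skip the target itself and genes with no expression at all.
        if gene_id != target_id and sum(profile) > 0
    ]
    scored.sort()
    return [gene_id for _, gene_id in scored[:count]]
```

Because profiles are normalized before comparison, two genes with the same shape across conditions but very different absolute FPKM values count as highly similar under this measure.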

For more information on using RNA-seq modules in GenePattern, see the RNA-seq Analysis page.

References

Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nature Biotechnology. 2013;31:46-53.

Links

CummeRbund website and manual.

Parameters

Name	Description
cuffdiff.input *	A Cuffdiff job, a previous CummeRbund job, or a cuffData.db file from a previous CummeRbund job.
feature.id *	The gene or feature of interest. This can be a gene symbol (short name), gene ID, isoform_id, tss_group_id, or cds_id.
selected.conditions	Specifies the conditions (samples) to be used in the plots. This should be a comma-separated list of conditions, using the same names as in the upstream Cuffdiff job; leave this blank to use all conditions. The Similarity plots will always operate across all conditions.
find.similar	Optionally, find and plot the top genes (up to this count) with an expression profile most similar to that of the given gene of interest. If blank, this step is skipped.
output.format *	The output file format.
feature.level *	Feature level for the report. Note that the Similarity plots will always be generated at the "genes" level.
report.as.aggregate *	Controls whether reporting should be done for individual replicates or aggregate condition/sample values. The default is to use aggregate sample values. Note that the Similarity plots always show aggregate samples.
log.transform *	Whether to log-transform the FPKM values. When enabled, the y-axis is drawn on a log10 scale.

* - required
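To illustrate what the log transform option in the table above does to the plotted values, here is a small Python sketch of a log10 transform of FPKM values. An FPKM of 0 has no logarithm, so the sketch adds a pseudocount before taking the log. Both the pseudocount and its default of 1 are assumptions made for this example; this page does not say how the module handles zero values. The function name log10_fpkm is hypothetical.

```python
import math


def log10_fpkm(values, pseudocount=1.0):
    """Return log10(value + pseudocount) for each FPKM value.

    The pseudocount is an assumption of this sketch; it keeps
    zero FPKM values from producing log10(0).
    """
    if pseudocount <= 0:
        raise ValueError("pseudocount must be positive")
    return [math.log10(value + pseudocount) for value in values]
```

With the default pseudocount, FPKM values of 0, 9, 99, and 999 map to roughly 0, 1, 2, and 3, which shows how a log10 axis keeps genes that differ by orders of magnitude readable on the same plot.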

Input Files

<cuffdiff.input> (required)
A Cuffdiff job, a previous CummeRbund job, or a cuffData.db file from a previous CummeRbund job. Unlike most modules in GenePattern, the CummeRbund reporting modules require the entire output of a Cuffdiff job as they work with not just one or two files but rather with all of the Cuffdiff output files. Simply drag the top-level Cuffdiff job folder into the 'cuffdiff.input' parameter from the 'Jobs' tab ('Recent Jobs' in versions of GenePattern before 3.8.0) or from the Job Results page. The CummeRbund modules can also be directly accessed from the context menu of jobs in either of these locations. Remember, unless you use a cuffData.db file, you are submitting the entire job folder and not just a single file.

Output Files

cuffData.db
The RSQLite database created from the original Cuffdiff job. This file can be used in other CummeRbund jobs to avoid the need for extra computation and storage, in which case the new job will instead hold a link back to the file from the original job.
SelectedGene.ExpressionBarplot
A barplot of the FPKM values (Fragments Per Kilobase of transcript per Million mapped reads) with confidence intervals, calculated for the selected gene across all samples (or replicates) in the Cuffdiff dataset. The value for each replicate is noted as a black dot along the confidence interval, while the top of the bar represents the value for a given aggregate sample. See the explanation on the Cufflinks website and the corresponding entry in the Cufflinks FAQ for more information about FPKM values.
SelectedGene.ExpressionPlot
A line plot of the FPKM values with confidence intervals, calculated for the selected gene across all samples (or replicates) in the Cuffdiff dataset. The value for each replicate is noted as a dot of matching color along the confidence interval, while the aggregate sample value is noted as a black dot within the confidence interval. At this time, the line plots can only display tracking IDs rather than gene symbols.
SelectedGene.SimilarityExpressionBarplot
An FPKM barplot of a number of genes (up to the find.similar count) with an expression profile most similar to that of the selected gene. The Similarity plots will always use aggregate sample values.
SelectedGene.SimilarityExpressionPlot
An FPKM line plot of a number of genes (up to the find.similar count) with an expression profile most similar to that of the selected gene. The Similarity plots will always use aggregate sample values. At this time, the line plots can only display tracking IDs rather than gene symbols.
stdout.txt (and stderr.txt)
A log of output (and errors) produced during the database creation and plotting process. In case of an error, check both of these files for more details. The module has been designed to skip those plots where it encounters a problem along the way, continuing on to the next; if a given plot is missing, it should be noted in one of these files along with a reason if one could be determined.
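The plots listed above all report FPKM, which the SelectedGene.ExpressionBarplot entry expands as Fragments Per Kilobase of transcript per Million mapped reads. The Python sketch below works through that basic definition for a single transcript. It is only the textbook arithmetic: Cufflinks and Cuffdiff estimate FPKM with a statistical model, so the values in your results will not be reproduced by this formula alone. The function name and the numbers in the usage note are made up for illustration.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Basic FPKM: fragments per kilobase of transcript per million mapped."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total mapped fragments must be positive")
    kilobases = transcript_length_bp / 1_000
    millions_mapped = total_mapped_fragments / 1_000_000
    return fragments / kilobases / millions_mapped
```

For example, 500 fragments mapped to a 2,000 bp transcript in a library of 20 million mapped fragments gives 500 / 2 / 20 = 12.5 FPKM.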

Example Data

There is an example reusable database file available on our FTP site. This was generated using the example data and workflow from the article "Differential analysis of gene regulation at transcript resolution with RNA-seq" by Trapnell et al., referenced above.

Requirements

CummeRbund.SelectedGeneReport requires R 2.15. When installing this module, GenePattern will automatically check for the presence of this exact version of R and will not proceed without it. See the section of our Administrator's Guide on the R Installer plug-in for details. Installing this module requires a number of supporting R packages from CRAN and Bioconductor; it will also check for their presence and install any that are missing in the process. These packages will be installed in a separate area specific to GenePattern and will not affect any other R library on the machine.

Please install R 2.15.3 rather than R 2.15.2 before installing the module. The GenePattern team has confirmed that this module reproduces its test data under R 2.15.3 as it did under R 2.15.2, and can provide only limited support for other versions. R 2.15.3 fixes significant bugs in R 2.15.2; it must be installed and configured independently, as discussed in Using Different Versions of R and Using the R Installer Plug-in. These sections also describe patch-level fixes that are necessary when additional installations of R are made, and considerations for those who use R outside of GenePattern.

Platform Dependencies

Task Type:
RNA-seq

CPU Type:
any

Operating System:
any

Language:
R

Version Comments

Version	Release Date	Description
0.17	2015-10-13	Updated to make use of the R package installer.