Finds significant changes in transcript expression, splicing, and promoter use
Author: Cole Trapnell et al, University of Maryland Center for Bioinformatics and Computational Biology
Algorithm Version: Cufflinks 2.0.2
Cuffdiff finds significant changes in transcript expression, splicing, and promoter use. You can use it to find differentially expressed genes and transcripts, as well as genes that are being differentially regulated at the transcriptional and post-transcriptional level.
To identify a gene or transcript as differentially expressed, Cuffdiff tests the observed log fold change in its expression against the null hypothesis of no change (i.e., that the true log fold change = zero). Because measurement error, technical variability, and cross-replicate biological variability might result in an observed log fold change that is nonzero even if the gene/transcript is not differentially expressed, Cuffdiff also assesses the significance of each comparison. For more information on the model, see Trapnell et al (2013) or see the "How It Works" page on the Cufflinks site.
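As a small illustration of the quantity being tested, the sketch below computes a log2 fold change from two expression values. It does not reproduce Cuffdiff's statistical model or its significance test; the function name and the handling of zero values are assumptions made for this example only.

```python
import math


def log2_fold_change(fpkm_a: float, fpkm_b: float) -> float:
    """Return log2(fpkm_b / fpkm_a).

    Zero handling here is an assumption for illustration:
    both zero -> NaN, only the first zero -> +inf, only the second zero -> -inf.
    """
    if fpkm_a == 0 and fpkm_b == 0:
        return math.nan
    if fpkm_a == 0:
        return math.inf
    if fpkm_b == 0:
        return -math.inf
    return math.log2(fpkm_b / fpkm_a)


# A fourfold increase corresponds to a log2 fold change of 2;
# identical values give 0, which is the null hypothesis value.
print(log2_fold_change(10.0, 40.0))  # 2.0
print(log2_fold_change(5.0, 5.0))    # 0.0
```

An observed value away from zero is not sufficient on its own; the significance assessment described above decides whether the change is reported as significant.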
Cuffdiff also groups transcripts into biologically meaningful groups, such as transcripts that share the same transcription start site (TSS), in order to identify genes that are differentially regulated at the transcriptional or post-transcriptional level.
Cuffdiff was created at the University of Maryland Center for Bioinformatics and Computational Biology. This document is adapted from the Cufflinks documentation for release 2.0.2.
The Cuffdiff module takes two or more fragment alignment SAM/BAM files (from TopHat or other read aligner), as well as a GTF file containing transcript annotations (such as merged.gtf from Cuffmerge) as input.
Cuffdiff produces a number of output files that contain test results for changes in expression at the level of transcripts, primary transcripts, and genes. It also tracks changes in the relative abundance of transcripts sharing a common transcription start site, and in the relative abundances of the primary transcripts of each gene. Tracking the former shows changes in splicing, and the latter shows changes in relative promoter use within a gene.
Provide your aligned files organized by condition. Use the Add Another Condition button if you have multiple conditions. For each condition you can specify multiple replicates by dragging in the associated BAM files. The condition labels are required and each one should be unique in order for the module to associate them with the replicates. These will be used as the labels in the cuffdiff result files.
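The condition-oriented layout corresponds to the command-line form shown in the Cuffdiff usage text: one comma-separated list of replicate files per condition, plus a comma-separated list of labels passed with -L. The sketch below shows that mapping. It is not the module's actual implementation; the function name, file names, and output directory are hypothetical.

```python
from typing import List, Sequence, Tuple


def build_cuffdiff_command(
    gtf: str,
    conditions: Sequence[Tuple[str, Sequence[str]]],
    output_dir: str = "cuffdiff_out",  # hypothetical default for this sketch
) -> List[str]:
    """Build a cuffdiff argument list from (label, replicate files) pairs."""
    labels = [label for label, _ in conditions]
    if len(conditions) < 2:
        raise ValueError("at least two conditions are required")
    if len(set(labels)) != len(labels):
        raise ValueError("condition labels must be unique")
    if any(len(replicates) == 0 for _, replicates in conditions):
        raise ValueError("every condition needs at least one aligned file")

    # -o and -L are taken from the cuffdiff usage text.
    command = ["cuffdiff", "-o", output_dir, "-L", ",".join(labels), gtf]
    # Replicates of one condition are joined with commas; conditions are
    # passed as separate arguments.
    command += [",".join(replicates) for _, replicates in conditions]
    return command


conditions = [
    ("control", ["c1.bam", "c2.bam"]),
    ("treated", ["t1.bam", "t2.bam"]),
]
print(" ".join(build_cuffdiff_command("merged.gtf", conditions)))
# cuffdiff -o cuffdiff_out -L control,treated merged.gtf c1.bam,c2.bam t1.bam,t2.bam
```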
Cuffdiff has several methods for normalizing library sizes (i.e., sequencing depths): geometric mean (the default), upper quartile, and raw mapped count (classic FPKM).
Regardless of the choice of normalization.method, Cuffdiff reports expression estimates both in units of FPKM (in fpkm_tracking files) and as read counts (in count_tracking files). When scaling to units of FPKM, Cuffdiff requires a count of an RNA-Seq sample library’s “total mapped reads”, i.e., the FPKM denominator. Setting FPKM.scaling to 'compatible-hits' instructs Cuffdiff to include in its count of total mapped reads only those fragments compatible with the transcripts identified in the provided annotation (GTF file). Setting FPKM.scaling to 'total-hits' instructs Cuffdiff to include all of a sample library’s mapped reads in its count of total mapped reads, including those not compatible with any of the transcripts identified in the provided GTF file.
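To make the effect of the FPKM denominator concrete, the sketch below applies the standard FPKM definition (fragments per kilobase of transcript per million mapped fragments) with two different denominators. It is a simplification: it uses the plain transcript length rather than the effective length that Cuffdiff uses by default, and all numbers are invented for illustration.

```python
def fpkm(fragments: float, transcript_length_bp: int, total_mapped_fragments: int) -> float:
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical library: 10 million mapped fragments, of which 8 million are
# compatible with a transcript in the annotation.
fragments_on_transcript = 500
length_bp = 2000
compatible_hits = 8_000_000
total_hits = 10_000_000

print(fpkm(fragments_on_transcript, length_bp, compatible_hits))  # 31.25
print(fpkm(fragments_on_transcript, length_bp, total_hits))       # 25.0
```

Because compatible-hits uses a denominator that is never larger than the total-hits denominator, it yields FPKM values that are equal or higher for the same fragment count.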
The Cuffdiff tool provides a number of additional options and switches that are not directly available through this module's parameters. The additional.cuffdiff.options parameter is provided to pass these through if you need them. To use it, specify the extra option(s), along with any arguments, in the input text field, separated by spaces. At this time, this parameter does not easily support options that require a file argument. Check the Cufflinks manual for more details of the available options. Also note that there may be additional undocumented options; manually running the cuffdiff executable at the command line with no arguments may show even more options. If you feel that a particular missing option would be of broad general interest, please contact the GenePattern team and we will look into adding it. This parameter is recommended for expert use only; use it at your own discretion. The GenePattern team does not explicitly test all of the possible options that may be passed through using this parameter and can provide only limited support.
For more information on using RNA-seq modules in GenePattern, see the RNA-seq Analysis page.
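If you want Cuffdiff to look for changes in primary transcript expression, splicing, coding output, and promoter use, the input GTF transcript file needs to be annotated with certain attributes. These attributes are tss_id (the ID of the transcript's inferred start site) and p_id (the ID of the coding sequence the transcript contains).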
The GTFs hosted on the GenePattern FTP site contain these annotations.
This module may produce some empty files. This does not mean that the algorithm has failed; it may simply mean that no transcripts with differential expression were detected.
It may also be the result of using an input GTF transcript file that does not have the p_id annotation. This attribute is attached to Cuffmerge output only when it is run with a reference annotation that includes coding sequence (CDS) records. Differential CDS analysis in Cuffdiff is only performed when all isoforms of a gene have p_id attributes. The CDS output files will be empty if there is no p_id attribute in the input GTF.
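Before running the module, it can help to check whether the input GTF carries the tss_id and p_id attributes. The sketch below scans GTF lines and reports, for each attribute, the transcripts that never carry it. It assumes the common GTF attribute syntax (key "value"; pairs in the ninth column); the function name and example records are hypothetical.

```python
import re
from collections import defaultdict

# Matches key "value" pairs in the GTF attribute column.
ATTRIBUTE_RE = re.compile(r'(\S+)\s+"([^"]*)"')


def missing_attributes(gtf_lines, required=("tss_id", "p_id")):
    """Return {attribute: set of transcript_ids that never carry it}."""
    seen = defaultdict(set)  # transcript_id -> attribute names seen on its records
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        attributes = dict(ATTRIBUTE_RE.findall(fields[8]))
        transcript_id = attributes.get("transcript_id")
        if transcript_id is not None:
            seen[transcript_id].update(attributes)
    return {
        name: {tid for tid, names in seen.items() if name not in names}
        for name in required
    }


example = [
    'chr1\tCufflinks\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; tss_id "TSS1"; p_id "P1";',
    'chr1\tCufflinks\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T2"; tss_id "TSS1";',
]
print(missing_attributes(example))
# {'tss_id': set(), 'p_id': {'T2'}}
```

In this example, gene G1 has an isoform (T2) without a p_id, so differential CDS analysis would not be performed for that gene.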
Trapnell C, Hendrickson D, Sauvageau M, Goff L, Rinn JL, Pachter L. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nature Biotechnology. 2013;31:46-53.
Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL, Pachter L. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols. 2012;7:562-578.
Roberts A, Pimentel H, Trapnell C, Pachter L. Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics. 2011 Sep 1;27(17):2325-9.
Trapnell C, Williams BA, Pertea G, Mortazavi AM, Kwan G, van Baren MJ, Salzberg SL, Wold B, Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010;28:511-515.
Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009;25:1105-1111.
Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10:R25.
|aligned files *||A set of aligned files grouped by condition.|
|GTF file *||A transcript annotation file in GTF/GFF format produced by cufflinks, cuffcompare, cuffmerge, or other source (such as a reference annotation GTF). See the Input Files section for more information, particularly on required attributes.|
|time series||Analyze the provided samples as a time series, rather than testing for differences between all pairs of samples. Default: no|
|normalization method *||The normalization method to be used. Choices are geometric mean (the default), upper quartile, or raw mapped count (classic FPKM) normalization. See the usage section for a discussion of these methods. Default: geometric|
|FPKM scaling *||Controls how Cuffdiff counts mapped fragments toward the number used in the FPKM denominator. Use compatible-hits (the default) to count only those fragments compatible with some reference transcript. Use total-hits to count all fragments, even those not compatible with a reference transcript.|
|frag bias correct||A genome reference multi-FASTA file for the bias detection and correction algorithm. For more information on this algorithm, see "How It Works" on the Cufflinks website.|
|multi read correct *||Instructs Cuffdiff to use its two-stage read weighting algorithm to more accurately distribute a multi-mapped read's count contribution across the multiple loci the read mapped to. Default: yes. Note that this default differs from the Cuffdiff tool's default. See the Usage section above for details.|
|min alignment count||The minimum number of alignments in a locus needed to conduct significance testing on changes in that locus observed between samples. If no testing is performed, changes in the locus are deemed not significant, and the locus's observed changes do not contribute to correction for multiple testing. Default: 10|
|FDR||The allowed false discovery rate.|
|mask file||This file tells Cuffdiff to ignore all reads that could have come from transcripts in this GTF/GFF file. It is recommended that annotated rRNA, mitochondrial transcripts, and other abundant transcripts you want to ignore in your analysis be included in this file.|
|library type *||The library type used to generate reads. The choices are inferred, fr-unstranded, fr-firststrand, fr-secondstrand, ff-unstranded, ff-firststrand, ff-secondstrand, and transfrags. The default is inferred, meaning that no library type information is passed.|
|skip diff exp *||Tells Cuffdiff to perform quantification and normalization only and to skip its differential expression calculations, which are computationally expensive. The default is to compute differential expression.|
|additional cuffdiff options||Additional options to be passed along to the Cuffdiff program at the command line. This parameter gives you a means to specify otherwise unavailable Cuffdiff options and switches not supported by the module; check the Cufflinks manual for details. Note that the information at this link may refer to a subsequent version of Cufflinks. Recommended for experts only; use this at your own discretion.|
* - required
The following may be useful for advanced users who wish to use the additional.cuffdiff.options parameter. This is the 'usage' output from running cuffdiff at the command-line, which gives a list of all of the available options and switches. Note that this was generated by Cuffdiff v2.0.2 and that the options here may differ from the documentation provided online at the Cufflinks website due to subsequent version updates.
cuffdiff v2.0.2 (3524M)
-----------------------------
Usage: cuffdiff [options] <transcripts.gtf> <sample1_hits.sam> <sample2_hits.sam> [... sampleN_hits.sam]
   Supply replicate SAMs as comma separated lists for each condition: sample1_rep1.sam,sample1_rep2.sam,...sample1_repM.sam
General Options:
  -o/--output-dir                   write all output files to this directory [ default: ./ ]
  --seed                            value of random number generator seed [ default: 0 ]
  -T/--time-series                  treat samples as a time-series [ default: FALSE ]
  -c/--min-alignment-count          minimum number of alignments in a locus for testing [ default: 10 ]
  --FDR                             False discovery rate used in testing [ default: 0.05 ]
  -M/--mask-file                    ignore all alignment within transcripts in this file [ default: NULL ]
  -b/--frag-bias-correct            use bias correction - reference fasta required [ default: NULL ]
  -u/--multi-read-correct           use 'rescue method' for multi-reads (more accurate) [ default: FALSE ]
  -N/--upper-quartile-norm          use upper-quartile normalization [ default: FALSE ]
  --geometric-norm                  use geometric mean normalization [ default: TRUE ]
  --raw-mapped-norm                 use raw mapped count normalized (classic FPKM) [ default: FALSE ]
  -L/--labels                       comma-separated list of condition labels
  -p/--num-threads                  number of threads used during quantification [ default: 1 ]
  --no-diff                         Don't generate differential analysis files [ default: FALSE ]
Advanced Options:
  --library-type                    Library prep used for input reads [ default: below ]
  -m/--frag-len-mean                average fragment length (unpaired reads only) [ default: 200 ]
  -s/--frag-len-std-dev             fragment length std deviation (unpaired reads only) [ default: 80 ]
  --num-importance-samples          number of importance samples for MAP restimation [ DEPRECATED ]
  --num-bootstrap-samples           Number of bootstrap replications [ DEPRECATED ]
  --bootstrap-fraction              Fraction of fragments in each bootstrap sample [ DEPRECATED ]
  --max-mle-iterations              maximum iterations allowed for MLE calculation [ default: 5000 ]
  --compatible-hits-norm            count hits compatible with reference RNAs only [ default: TRUE ]
  --total-hits-norm                 count all hits for normalization [ default: FALSE ]
  --poisson-dispersion              Don't fit fragment counts for overdispersion [ default: FALSE ]
  -v/--verbose                      log-friendly verbose processing (no progress bar) [ default: FALSE ]
  -q/--quiet                        log-friendly quiet processing (no progress bar) [ default: FALSE ]
  --no-update-check                 do not contact server to check for update availability[ default: FALSE ]
  --emit-count-tables               print count tables used to fit overdispersion [ default: FALSE ]
  --max-bundle-frags                maximum fragments allowed in a bundle before skipping [ default: 500000 ]
  --num-frag-count-draws            Number of fragment generation samples [ default: 1000 ]
  --num-frag-assign-draws           Number of fragment assignment samples per generation [ default: 1 ]
  --max-frag-multihits              Maximum number of alignments allowed per fragment [ default: unlim ]
  --min-outlier-p                   Min replicate p value to admit for testing [ default: 0.01 ]
  --min-reps-for-js-test            Replicates needed for relative isoform shift testing [ default: 3 ]
  --no-effective-length-correction  No effective length correction [ default: FALSE ]
  --no-length-correction            No effective length correction [ default: FALSE ]
Debugging use only:
  --read-skip-fraction              Skip a random subset of reads this size [ default: 0.0 ]
  --no-read-pairs                   Break all read pairs [ default: FALSE ]
  --trim-read-length                Trim reads to be this long (keep 5' end) [ default: none ]
Supported library types:
  ff-firststrand
  ff-secondstrand
  ff-unstranded
  fr-firststrand
  fr-secondstrand
  fr-unstranded (default)
  transfrags
A transcript annotation file in GTF/GFF format. Cuffdiff requires that transcripts in the input GTF be annotated with the tss_id and p_id attributes in order to look for changes in primary transcript expression, splicing, coding output, and promoter use. See the Cuffdiff documentation for a discussion of these required attributes. For more information on GTF format, see the specification.
A common file used as input here is merged.gtf from Cuffmerge. The GenePattern FTP site also hosts a number of reference annotation GTFs, available in a dropdown selection (requires GenePattern 3.7.0+).
A text file containing a label for each sample, one label per line. These labels replace the default "q0, q1, ...qN" labeling for each sample in the tracking output files. While this parameter is optional, using it may make downstream analysis of your samples easier. Note: this field should not be used with GP 3.8.0+.
A genome reference multi-FASTA file. This reference genome file instructs Cuffdiff to run the bias detection and correction algorithm. For more information on this algorithm, see "How It Works" on the Cufflinks website. For more information on the FASTA format, see this description.
The GenePattern FTP site hosts a number of reference genomes, available in a dropdown selection (requires GenePattern 3.7.0+).
A GTF/GFF file that specifies transcripts to be ignored.
For more information on the formats of the individual output files, see the Cufflinks Web site.
isoforms.fpkm_tracking: Transcript FPKMs
genes.fpkm_tracking: Gene FPKMs. Tracks the summed FPKM of transcripts sharing each gene ID.
cds.fpkm_tracking: Coding sequence FPKMs. Tracks the summed FPKM of transcripts sharing each p_id (the ID of the coding sequence each transcript contains), independent of tss_id.
tss_groups.fpkm_tracking: Primary transcript FPKMs. Tracks the summed FPKM of transcripts sharing each tss_id (transcription start site [TSS] ID). The tss_id is the ID of the transcript's inferred start site and determines which primary transcript a processed transcript is believed to come from.
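The grouping behind these four files can be pictured as summing transcript-level values by a shared ID. The sketch below shows only that grouping idea and is not how Cuffdiff itself is implemented; the records and their FPKM values are invented for illustration.

```python
from collections import defaultdict


def summed_fpkm(transcripts, group_key):
    """Sum transcript FPKMs over transcripts sharing the same value of group_key."""
    totals = defaultdict(float)
    for transcript in transcripts:
        group = transcript.get(group_key)
        if group is not None:  # transcripts without the attribute are left out
            totals[group] += transcript["fpkm"]
    return dict(totals)


# Hypothetical transcripts of one gene with two start sites and two coding sequences.
transcripts = [
    {"id": "T1", "gene_id": "G1", "tss_id": "TSS1", "p_id": "P1", "fpkm": 10.0},
    {"id": "T2", "gene_id": "G1", "tss_id": "TSS1", "p_id": "P2", "fpkm": 5.0},
    {"id": "T3", "gene_id": "G1", "tss_id": "TSS2", "p_id": "P1", "fpkm": 2.5},
]

print(summed_fpkm(transcripts, "gene_id"))  # {'G1': 17.5}
print(summed_fpkm(transcripts, "tss_id"))   # {'TSS1': 15.0, 'TSS2': 2.5}
print(summed_fpkm(transcripts, "p_id"))     # {'P1': 12.5, 'P2': 5.0}
```

Note that the p_id grouping cuts across start sites: T1 and T3 share a coding sequence but come from different primary transcripts.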
Count tracking files
Cuffdiff estimates the number of fragments that originated from each transcript, primary transcript, and gene in each sample. Primary transcript and gene counts are computed by summing the counts of transcripts in each primary transcript group or gene group. The results are output in the format described here. There are four count tracking files:
isoforms.count_tracking: Transcript counts.
genes.count_tracking: Gene counts. Tracks the summed counts of transcripts sharing each gene ID.
cds.count_tracking: Coding sequence counts. Tracks the summed counts of transcripts sharing each p_id, independent of tss_id.
tss_groups.count_tracking: Primary transcript counts. Tracks the summed counts of transcripts sharing each tss_id.
Read group tracking files
Cuffdiff calculates the expression and fragment count for each transcript, primary transcript, and gene in each replicate. The results are output in per-replicate tracking files in the format described here. There are four read group tracking files:
isoforms.read_group_tracking: Transcript read group tracking.
genes.read_group_tracking: Gene read group tracking. Tracks the summed expression and counts of transcripts sharing each gene ID in each replicate.
cds.read_group_tracking: Coding sequence read group tracking. Tracks the summed expression and counts of transcripts sharing each p_id, independent of tss_id, in each replicate.
tss_groups.read_group_tracking: Primary transcript read group tracking. Tracks the summed expression and counts of transcripts sharing each tss_id in each replicate.
Differential expression tests
These tab-delimited files list the results of differential expression testing between samples for spliced transcripts, primary transcripts, genes, and coding sequences. For each pair of samples x and y, four files are created:
isoform_exp.diff: Transcript differential FPKM.
gene_exp.diff: Gene differential FPKM. Tests differences in the summed FPKM of transcripts sharing each gene_id.
cds_exp.diff: Coding sequence differential FPKM. Tests differences in the summed FPKM of transcripts sharing each p_id, independent of tss_id.
tss_group_exp.diff: Primary transcript differential FPKM. Tests differences in the summed FPKM of transcripts sharing each tss_id.
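A common downstream step is to pull the tested, significant rows out of one of these files. The sketch below reads a tab-delimited file with a header row and keeps rows below a q-value threshold. The column names used here (gene_id, status, q_value, log2(fold_change)) and the status values are assumptions based on the Cufflinks manual and should be checked against your own output; the demo rows are invented.

```python
import csv
import io


def significant_rows(handle, max_q=0.05):
    """Return rows that were tested successfully and have q_value <= max_q.

    Assumes a tab-delimited file with 'status' and 'q_value' columns.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row
        for row in reader
        if row["status"] == "OK" and float(row["q_value"]) <= max_q
    ]


# Invented demo content with a reduced set of columns.
DEMO = (
    "gene_id\tsample_1\tsample_2\tstatus\tlog2(fold_change)\tq_value\n"
    "G1\tcontrol\ttreated\tOK\t2.0\t0.001\n"
    "G2\tcontrol\ttreated\tOK\t0.1\t0.80\n"
    "G3\tcontrol\ttreated\tNOTEST\t0\t1\n"
)

hits = significant_rows(io.StringIO(DEMO))
print([row["gene_id"] for row in hits])  # ['G1']
```

For a real run, pass an open file handle instead of the in-memory demo, for example `open("gene_exp.diff", newline="")`.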
Differential splicing tests: splicing.diff
This tab-delimited file lists, for each primary transcript, the amount of overloading detected among its isoforms, i.e., how much differential splicing exists between isoforms processed from a single primary transcript. Only primary transcripts from which two or more isoforms are spliced are listed in this file.
Differential coding output: cds.diff
This tab-delimited file lists, for each gene, the amount of overloading detected among its coding sequences, i.e., how much differential CDS output exists between samples. Only genes producing two or more distinct CDS (i.e., multi-protein genes) are listed here.
Differential promoter use: promoters.diff
This tab-delimited file lists, for each gene, the amount of overloading detected among its primary transcripts, i.e., how much differential promoter use exists between samples. Only genes producing two or more distinct primary transcripts (i.e., multi-promoter genes) are listed here.
Read group information: read_groups.info
This tab-delimited file lists, for each replicate, key properties used by Cuffdiff during quantification, such as library normalization factors.
Run information: run.info
This tab-delimited file lists information about a Cuffdiff run to help track what options were provided.
|6||2014-04-02||Provides a new condition-oriented UI, adds a parameter to allow pass through of extra Cuffdiff options, adds a parameter to skip differential expression, clarifies the normalization options|
|5||2013-09-20||Added hosted GTF file selectors and HTML-based docs.|
|4||2013-05-07||Updated to Cufflinks version 2.0.2|
|3||2012-01-13||Updated to Cufflinks.cuffdiff version 1.3.0|
|2||2012-12-23||Updated to Cufflinks.cuffdiff version 1.2.1|