Identify differentially expressed genes that can discriminate between distinct classes of samples.
Author: Joshua Gould, Gad Getz, Stefano Monti
Contact:
gphelp@broadinstitute.org
Algorithm Version:
When analyzing genome-wide transcription profiles derived from microarray or RNA-seq experiments, the first step is often to identify differentially expressed genes that can discriminate between distinct classes of samples (usually defined by a phenotype, such as tumor or normal). This process is commonly referred to as marker (or feature) selection. Marker genes are identified by calculating, for each profiled gene, a test statistic (e.g., t-test) that assesses the correlation of the gene's expression profile with a class template. If the value of the test statistic for a specific gene, and thus the degree of differential expression shown by that gene, is significantly greater than what one would expect to see under the null hypothesis (the gene is not differentially expressed between classes), that gene is identified as a statistically significant marker gene.
The ComparativeMarkerSelection module takes as input a dataset of expression profiles from samples belonging to two classes and, implementing the statistical tests described above, identifies marker genes which discriminate between the classes.
The ComparativeMarkerSelection module includes several approaches to determine the features that are most closely correlated with a class template and the significance of that correlation. The module computes significance values for features using several metrics, including FDR(BH), Q-value, maxT, FWER, feature-specific P-value, and Bonferroni. The results from the ComparativeMarkerSelection algorithm can be viewed with the ComparativeMarkerSelectionViewer. ExtractComparativeMarkerResults creates a derived dataset and feature list file from the results of ComparativeMarkerSelection.
By default, ComparativeMarkerSelection expects the data in the input file not to be log-transformed. Some calculations, such as the fold change, are inaccurate when log-transformed data is provided without being flagged. To indicate that your data is log-transformed, set the “log transformed data” parameter to “yes”. Also, ComparativeMarkerSelection requires at least three samples per class to run successfully.
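The following sketch illustrates why the log-transformation flag matters. It assumes, for illustration only, that fold change is the ratio of class means; the sample values and the function name fold_change are hypothetical and are not taken from the module.

```python
import math
from statistics import mean

# Hypothetical expression values for one gene in two classes (raw scale).
class_a = [100.0, 200.0, 400.0]
class_b = [800.0, 1600.0, 3200.0]

def fold_change(numerator, denominator):
    """Assumed definition: ratio of the two class means."""
    return mean(numerator) / mean(denominator)

# On raw data the ratio of means is a meaningful fold change.
raw_fc = fold_change(class_b, class_a)

# The same data after a log2 transform.
log_a = [math.log2(v) for v in class_a]
log_b = [math.log2(v) for v in class_b]

# Taking a ratio of means on log data gives a misleading number.
naive_ratio = fold_change(log_b, log_a)

# On log2 data the fold change corresponds to a difference of means,
# which can be converted back to the raw scale with 2 ** difference.
log_fc = mean(log_b) - mean(log_a)
recovered_fc = 2 ** log_fc
```

In this example raw_fc and recovered_fc agree, while naive_ratio is much smaller, which is the kind of error that an unflagged log-transformed input would introduce.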
The analytic module takes as input a dataset of expression profiles from samples belonging to two phenotypes. If a dataset contains more than two phenotypes, you can perform either all pairwise comparisons or all one-versus-all comparisons. A test statistic (e.g., t-test) is chosen to assess the differential expression between the two classes of samples. Note that technical and biological replicates are handled the same way as independent samples. The significance (nominal P-value) of marker genes is computed using a permutation test, which is a commonly used method for assessing the significance of marker genes; see (4) for details.
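The permutation test described above can be sketched as follows. This is a minimal illustration, not the module's implementation: it assumes a Welch-style two-sample t statistic, a two-sided test, and 1000 permutations for speed; the function names and the sample data are hypothetical.

```python
import random
from statistics import mean, variance

def t_statistic(a, b):
    """Two-sample t statistic with unequal variances (assumed form)."""
    standard_error = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / standard_error

def permutation_p_value(values, labels, n_permutations=1000, seed=0):
    """Two-sided nominal p-value for one gene: the fraction of label
    shuffles whose |t| is at least as large as the observed |t|."""
    rng = random.Random(seed)

    def score(current_labels):
        a = [v for v, l in zip(values, current_labels) if l == 0]
        b = [v for v, l in zip(values, current_labels) if l == 1]
        return abs(t_statistic(a, b))

    observed = score(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # reassign class labels at random
        if score(shuffled) >= observed:
            hits += 1
    return hits / n_permutations

# Hypothetical gene: ten low values followed by ten high values.
values = [1.0, 1.2, 0.9, 1.1, 1.3, 0.8, 1.4, 1.05, 0.95, 1.15,
          5.0, 5.2, 4.9, 5.1, 5.3, 4.8, 5.4, 5.05, 4.95, 5.15]

# Labels that match the low/high split: the gene is a strong marker.
p_separated = permutation_p_value(values, [0] * 10 + [1] * 10)

# Labels that alternate: the gene carries no class information.
p_interleaved = permutation_p_value(values, [0, 1] * 10)
```

With labels that match the expression pattern, very few shuffles reach the observed score, so the nominal p-value is small; with alternating labels most shuffles do, so it is large.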
Selecting class markers is a particular instance of the general multiple hypothesis testing problem. Since several thousand hypotheses are usually tested at once (one per gene), the nominal P-values have to be corrected to account for the increased number of potential false positives. For example, if we test 20,000 genes for differential expression, a nominal P-value threshold of 0.01 only ensures that the expected number of false positives is at most 200 (0.01 x 20,000). ComparativeMarkerSelection includes several methods of correcting for multiple hypothesis testing, including FDR(BH), Q-value, maxT, FWER, feature-specific P-value, and Bonferroni; (4) describes their applicability.
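As an illustration of two of these corrections, the sketch below applies the textbook Bonferroni and Benjamini-Hochberg (FDR) adjustments to a list of nominal p-values. It is a sketch of the standard definitions under that assumption, not the module's code; the function names and the example p-values are hypothetical.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (step-up FDR procedure)."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Expected false positives at an uncorrected threshold, as in the text.
expected_false_positives = 0.01 * 20000

nominal = [0.01, 0.04, 0.03, 0.005]
bonferroni_adjusted = bonferroni(nominal)
fdr_adjusted = benjamini_hochberg(nominal)
```

Bonferroni controls the chance of any false positive and is therefore the more conservative of the two; the FDR adjustment controls the expected proportion of false positives among the genes called significant.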
Name  Description  

input file * 
Note the following constraints:


cls file * 
The class file (CLS format). ComparativeMarkerSelection analyzes two phenotype classes at a time. If the expression data set includes samples from more than two classes, use the phenotype test parameter to analyze each class against all others (one-versus-all) or all class pairs (all pairs).

confounding variable cls file  The class file containing the confounding variable (CLS format).
If you are studying two variables and your data set contains a third variable that might distort the association between the variables of interest, you can use a confounding variable class file to correct for the effect of the third variable. For example, the data set in Lu, Getz, et al. (2005) contains tumor and normal samples from different tissue types. When studying the association between the tumor and normal samples, the authors use a confounding variable class file to correct for the effect of the different tissue types. The phenotype class file identifies the tumor and normal samples (header line: 75 2 1, that is, 75 samples in 2 classes). The confounding variable class file identifies the tissue type of each sample (header line: 75 6 1, that is, 75 samples in 6 classes). Given these two class files, when performing permutations, ComparativeMarkerSelection shuffles the tumor/normal labels only among samples with the same tissue type.

test direction *  The test to perform. By default, ComparativeMarkerSelection performs a two-sided test; that is, the test statistic score is calculated assuming that the differentially expressed gene can be upregulated in either phenotype class. Optionally, use the test direction parameter to specify a one-sided test, where the differentially expressed gene must be upregulated for class 0 or for class 1.
test statistic * 
The statistic to use:


min std 
The minimum standard deviation, used if the selected test statistic includes the min std option. If σ is less than min std, σ is set to min std.

number of permutations *  The number of permutations to perform (use 0 to calculate asymptotic p-values using the standard independent two-sample t-test). ComparativeMarkerSelection uses a permutation test to estimate the significance (p-value) of the test statistic score. The appropriate number of permutations depends on the number of hypotheses being tested and the significance level that you want to achieve (3). If the data set includes at least 10 samples per class, use the default value of 10000 permutations to ensure sufficiently accurate p-values.
If the data set includes fewer than 10 samples in any class, permuting the samples cannot give an accurate p-value. Specify a value of 0 permutations to use asymptotic p-values instead. In this case, ComparativeMarkerSelection computes p-values assuming the test statistic scores follow Student's t-distribution (rather than using the test statistic to create an empirical distribution of the scores). Asymptotic p-values are calculated using the p-value obtained from the standard independent two-sample t-test.

log transformed data *  Whether the input data has been log-transformed. By default, ComparativeMarkerSelection expects the data in the input file not to be log-transformed. Some calculations, such as the fold change, are inaccurate when log-transformed data is provided without being flagged. To indicate that your data is log-transformed, set this parameter to “yes”.
complete *  Whether to perform all possible permutations. When the complete parameter is set to yes, ComparativeMarkerSelection ignores the number of permutations parameter and computes the p-value based on all possible sample permutations. Use this option only with small data sets, where the number of all possible permutations is less than 1000.
balanced *  Whether to perform balanced permutations. When the balanced parameter is set to yes, ComparativeMarkerSelection requires an equal and even number of samples in each class (e.g. 10 samples in each class, not 11 in each class or 10 in one class and 12 in the other).  
random seed *  The seed of the random number generator used to produce permutations  
smooth p values * 
Whether to smooth p-values using Laplace's Rule of Succession. By default, smooth p values is set to yes, which means p-values are always less than 1.0 and greater than 0.0.

phenotype test  Tests to perform when the cls file has more than two classes: one-versus-all, all pairs. (Note: The p-values obtained from the one-versus-all comparison are not fully corrected for multiple hypothesis testing.)
output filename *  The name of the output file. 
*  required
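Three of the permutation-related parameters above (confounding variable cls file, complete, and smooth p values) can be illustrated with a short sketch. It is written under stated assumptions and is not the module's implementation: restricted shuffling is shown as an independent shuffle within each confounder group, "all possible permutations" is counted as the number of distinct two-class label assignments, and smoothing uses the textbook Rule of Succession form (k + 1) / (N + 2). All function names and sample labels are hypothetical.

```python
import math
import random
from collections import defaultdict

def restricted_shuffle(labels, strata, rng):
    """Shuffle class labels only among samples that share a stratum,
    e.g. the same tissue type in the confounding variable class file."""
    groups = defaultdict(list)
    for index, stratum in enumerate(strata):
        groups[stratum].append(index)
    shuffled = list(labels)
    for indices in groups.values():
        permuted = [labels[i] for i in indices]
        rng.shuffle(permuted)
        for i, label in zip(indices, permuted):
            shuffled[i] = label
    return shuffled

def count_label_assignments(n_class0, n_class1):
    """Distinct ways to assign two class labels to the samples (assumed
    meaning of 'all possible permutations')."""
    return math.comb(n_class0 + n_class1, n_class0)

def smoothed_p_value(hits, n_permutations):
    """Rule of Succession estimate; always strictly between 0 and 1."""
    return (hits + 1) / (n_permutations + 2)

# Tumor (1) / normal (0) labels and the tissue type of each sample.
phenotype = [0, 0, 1, 1, 0, 1]
tissue = ["lung", "lung", "lung", "colon", "colon", "colon"]
shuffled = restricted_shuffle(phenotype, tissue, random.Random(1))

# Complete enumeration is practical for 6 vs 6 samples, not for 7 vs 7.
small = count_label_assignments(6, 6)
large = count_label_assignments(7, 7)

# Even when no permutation beats the observed score, p stays above 0.
p_floor = smoothed_p_value(0, 10000)
```

Within each tissue type the number of tumor and normal labels is unchanged after shuffling, which is what keeps the confounder from driving the null distribution.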
Task Type:
Gene List Selection
CPU Type:
any
Operating System:
any
Language:
Java, R
Version  Release Date  Description 

10  2013-12-04  Updated to HTML doc
9  2012-03-26  Changed default number of permutations to 10000
8  2011-08-30  Added parameter to specify whether data is log-transformed
7  2010-05-28  Made improvements to error messages
6  2009-12-30  Fixed bug with using res file with paired t-test
5  2008-10-24  Added Paired T-Test
4  2008-02-19  Added Paired T-Test
3  2006-03-03  Added additional metrics
2  2005-06-08  Added restricted permutations option and maxT p-value