Posted on Sunday, September 30, 2012 at 12:32PM by The GenePattern Team
In GenePattern, you use the ComparativeMarkerSelection module to identify the genes (if any) that are differentially expressed between two phenotype classes. Typically, this is a three-step process:
The GenePattern Differential Expression Analysis protocol provides example files and step-by-step instructions for running ComparativeMarkerSelection and its companion modules. If you are unfamiliar with differential expression analysis or ComparativeMarkerSelection, start here:
The information provided in this section supplements the information provided in the Differential Expression Analysis protocol and the ComparativeMarkerSelection documentation. It assumes that you have walked through the Differential Expression Analysis protocol as described in the Basic Instructions above.
ComparativeMarkerSelection requires gene expression data in the GCT or RES file format.
Phenotype classes
ComparativeMarkerSelection analyzes two phenotype classes at a time. If the expression data set includes samples from more than two classes, use the phenotype test parameter to analyze each class against all others (one-versus-all) or all class pairs (all pairs).
If you are studying two variables and your data set contains a third variable that might distort the association between the variables of interest, you can use a confounding variable class file to correct for the affect of the third variable. For example, the data set in Lu, Getz, et. al. (2005) contains tumor and normal samples from different tissue types. When studying the association between the tumor and normal samples, the authors use a confounding variable class file to correct for the effect of the different tissue types.
The phenotype class file identifies the tumor and normal samples:
75 2 1
# Normal Tumor
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1
The confounding variable class file identifies the tissue type of each sample:
75 6 1
# colon kidney prostate uterus human-lung breast
1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5
Given these two class files, when performing permutations, ComparativeMarkerSelection shuffles the tumor/normal labels only among samples with the same tissue type.
ComparativeMarkerSelection uses a permutation test to estimate the significance (p-value) of the test statistic score. If the data set includes at least 10 samples per class, use the default value of 1000 permutations to ensure sufficiently accurate p-values.
If the data set includes fewer than 10 samples in any class, permuting the samples cannot give an accurate p-value. Specify a value of 0 permutations to use asymptotic p-values instead. In this case, ComparativeMarkerSelection computes p-values assuming the test statistic scores follow Student's t-distribution (rather than using the test statistic to create an empirical distribution of the scores).
ComparativeMarkerSelection also provides two additional options:
By default, ComparativeMarkerSelection expects non-log-transformed data. Some calculations, such as Fold Change, will produce incorrect results when log transformed data is provided and not indicated. To indicate that your data are log transformed, be sure to set the _log transformed data _parameter to "yes".
By default, ComparativeMarkerSelection performs a two-sided test; that is, the test statistic score is calculated assuming that the differentially expressed gene can be up-regulated in either phenotype class. Optionally, use the test direction parameter to specify a one-sided test, where the differentially expressed gene must be up-regulated for class 0 or for class 1.
ComparativeMarkerSelection provides several methods of calculating differential expression. By default, the module uses the t-test statistic. Optionally, you can choose to use the signal-to-noise ratio (SNR) or paired T-test statistic instead.
The T-Test computes the standardized mean difference between the two classes.
ComparativeMarkerSelection also provides variations on the T-Test:
Signal-to-noise ratio is computed by dividing the difference of class means by the sum of their standard deviations.
ComparativeMarkerSelection also provides variations on the signal-to-noise ratio:
The Paired T-Test can be used to analyze paired samples; for example, samples taken from patients before and after treatment. This test is used when the cross-class differences (e.g. the difference before and after treatment) are expected to be smaller than the within-class differences (e.g., the difference between two patients). For example if you are measuring weight gain in a population of people, the weights may be distributed from 90 lbs. to say 300 lbs. and the weight gain/loss (the paired variable) may be on the order of 0-30 lbs. So the cross-class difference ("before" and "after") is less than the within-class difference (person 1 and person 2).
Where the standard T-Test takes the mean of the difference between classes, the Paired T-Test takes the mean of the differences between pairs (for more information, refer to the Wikipedia article on the paired T-Test.)
For the Paired T-Test, paired samples in the expression data file must be arranged by class, where the first samples in each class are paired, the second samples are paired, and so on. For example, sample pairs A1/B1, A2/B2 and A3/B3 would be ordered in an expression data file as A1, A2, A3, B1, B2, B3. Note that your data must contain the same number of samples in each class in order to use this statistic.