PreprocessDataset (v5)

Performs several preprocessing steps on a res, gct, or odf input file

Author: Joshua Gould, Broad Institute


Algorithm Version:


Most analyses of gene expression data derived from microarray experiments begin with preprocessing of the expression data.  Preprocessing removes platform noise and genes that have little variation so that the subsequent analysis can identify interesting variations, such as differential expression between tumor and normal tissue.  GenePattern provides the PreprocessDataset module for this purpose.  While the module's default parameter values are tailored to Affymetrix expression arrays, we provide guidelines below for its use with Illumina expression arrays.  This module has limited applicability to gene expression data derived from RNA-Seq experiments and is typically not employed in RNA-Seq analysis workflows.


This module performs a variety of preprocessing operations, including thresholding/ceiling, variation filtering, normalization, and log2 transform.  It may be applied to datasets in .gct, .res, or .odf formatted files.

The algorithm conducts the following steps in order.  Each step is optional, controlled by the module's parameter settings.

  1. Set floor and ceiling values.  Any expression value lower than the floor value is set to the floor value.  Any expression value higher than the ceiling value is set to the ceiling value.
  2. Sample-count threshold filtering. Remove a gene if the number of samples with values greater than a specified expression threshold is less than a specified sample count threshold.  Too few values above the expression threshold in a gene's expression profile may indicate poor-quality data.
  3. Variation filtering.  Remove a gene if the variation of its expression values across the samples does not meet a minimum threshold.  The module uses two measures of variation: fold change (MAX/MIN) and delta (MAX - MIN).  If a gene's fold change is less than a specified minimum fold change OR its delta is less than a specified minimum delta, the gene will be removed from the data set. Genes with little variation across samples are unlikely to be biologically relevant to the downstream analysis.
  4. Row normalization or log2 transform.  If row normalization is enabled, a gene's expression values across all samples are normalized.  Row normalization adjusts gene expression values to remove systematic variation between microarray experiments.  If log2 transform is enabled, each expression value is converted to the log base 2 of the value. When using ratios to compare gene expression between samples, this transformation brings up- and down-regulated genes to the same scale.  For example, ratios of 2 and 0.5, indicating two-fold changes for up- and down-regulated expression, respectively, become +1 and -1.  Row normalization and log2 transform are mutually exclusive: one cannot take the log2 of zero-centered data due to the presence of negative values.
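
The four steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the module's implementation: the function names and the list-of-rows layout are hypothetical, the mean-zero, unit-variance form of row normalization is an assumption (the text only says the data become zero-centered), and the default values shown are the Affymetrix defaults from the parameter table.

```python
import math
import statistics

def clamp(rows, floor=20.0, ceiling=20000.0):
    """Step 1: raise values below floor to floor, cap values above ceiling."""
    return [[min(max(v, floor), ceiling) for v in row] for row in rows]

def passes_sample_count(row, threshold, min_count):
    """Step 2: keep a gene only if at least min_count samples exceed threshold.
    Uses 'greater than', following the step description above."""
    return sum(1 for v in row if v > threshold) >= min_count

def passes_variation(row, min_fold=3.0, min_delta=100.0):
    """Step 3: a gene is removed if MAX/MIN < min_fold OR MAX - MIN < min_delta."""
    lo, hi = min(row), max(row)
    fold = hi / lo if lo > 0 else math.inf  # guard for data that were not floored
    return fold >= min_fold and (hi - lo) >= min_delta

def row_normalize(row):
    """Step 4a (assumed form): centre on the row mean, scale to unit variance."""
    mean = statistics.mean(row)
    sd = statistics.pstdev(row)
    return [(v - mean) / sd if sd else 0.0 for v in row]

def log2_transform(row):
    """Step 4b: convert each value to its log base 2."""
    return [math.log2(v) for v in row]

def preprocess(rows, threshold, min_count, use_row_norm=False, use_log2=False):
    """Run steps 1 to 4 in order on a list of per-gene rows."""
    if use_row_norm and use_log2:
        raise ValueError("row normalization and log2 transform are mutually exclusive")
    kept = [r for r in clamp(rows)
            if passes_sample_count(r, threshold, min_count) and passes_variation(r)]
    if use_row_norm:
        return [row_normalize(r) for r in kept]
    if use_log2:
        return [log2_transform(r) for r in kept]
    return kept

genes = [
    [10.0, 25.0, 30.0],      # little variation after flooring: removed in step 3
    [50.0, 400.0, 30000.0],  # variable: kept, with 30000 capped at the ceiling
]
print(preprocess(genes, threshold=20.0, min_count=1))
print(log2_transform([2.0, 0.5]))  # ratios 2 and 0.5 become [1.0, -1.0]
```

The threshold and min_count arguments have no defaults here because the text does not state default values for sample-count filtering.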

If thresholding and filtering are disabled, then rows may be selected for inclusion by random sampling (without replacement).  The row sampling rate parameter specifies the fraction of genes that will be selected.  If row sampling rate is set to 1, all genes will be selected.
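
Row sampling without replacement can be sketched as below. The function name, the seed argument, the rounding of the row count, and the choice to preserve the original row order are all assumptions made for illustration; the text only specifies that a fraction of rows is drawn without replacement.

```python
import random

def sample_rows(rows, sampling_rate, seed=None):
    """Pick round(len(rows) * sampling_rate) rows at random, without replacement."""
    if not 0 < sampling_rate <= 1:
        raise ValueError("sampling_rate must be in the interval (0, 1]")
    n_keep = round(len(rows) * sampling_rate)
    rng = random.Random(seed)  # hypothetical seed, for reproducible examples
    chosen = sorted(rng.sample(range(len(rows)), n_keep))  # keep original order
    return [rows[i] for i in chosen]

rows = [[float(i), float(i) * 2] for i in range(10)]
subset = sample_rows(rows, 0.3, seed=0)
print(len(subset))                     # 3 of the 10 rows
print(sample_rows(rows, 1.0) == rows)  # a rate of 1 selects every row
```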

Applicability to RNA-Seq Derived Expression Data

As mentioned in the introduction, this module has limited applicability to expression data derived from RNA-Seq experiments. DNA microarrays have a limited dynamic range for detection due to high background levels arising from cross hybridization and signal saturation.  RNA-Seq data, on the other hand, have very low background signal and a higher dynamic range of expression levels.  Due to RNA-Seq's larger dynamic range, setting floor and ceiling values is unnecessary, as is sample-count threshold filtering.

Variation filtering is also of questionable value when working with expression data derived from RNA-Seq experiments.  Because RNA-Seq expression data is derived from read counts, researchers view gene or transcript expression measurements as legitimate regardless of their level of variability and may not want to eliminate genes or transcripts from consideration in the downstream analysis.  Unlike for microarray data, there are no default values for min fold change or min delta that would generally apply to RNA-Seq derived expression data; thus, rather than eliminate features whose variability falls below arbitrary thresholds, the current practice is to skip variation filtering and retain all features in RNA-Seq derived expression data.

In order to derive transcript or gene expression levels from RNA-Seq read counts, the counts must be normalized to remove biases arising from differences in transcript length and differences in sequencing depth between samples.  For example, longer transcripts will produce more sequencing fragments, and thus more counts, than shorter transcripts.  Similarly, differences in sequencing depth will be reflected in read counts.  FPKM normalization (fragments per kilobase of transcript per million mapped fragments) divides transcript counts by the transcript length and total read count to eliminate these inherent biases.  We assume that GCT-formatted expression data derived from RNA-Seq experiments is in units of FPKM (or RPKM for data derived from single-ended sequencing experiments) and has therefore undergone normalization and does not require PreprocessDataset normalization.
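
As a worked example of the FPKM definition above, the sketch below computes FPKM for a single transcript. The function and argument names are hypothetical; the factor of 1e9 combines the per-kilobase (1e3) and per-million (1e6) scalings.

```python
def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragment_count * 1e9 / (transcript_length_bp * total_mapped_fragments)

# Two transcripts in a sample with 10 million mapped fragments. The longer
# transcript yields twice the fragments but has the same expression level.
short_tx = fpkm(500, 2_000, 10_000_000)
long_tx = fpkm(1_000, 4_000, 10_000_000)
print(short_tx, long_tx)
```

Both calls give the same value, which is the point of dividing by transcript length: the raw counts differ by a factor of two only because one transcript is twice as long.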

For downstream analyses that employ correlation metrics (e.g., clustering, feature selection) it may be useful to log transform the data first.  Due to the wide dynamic range of RNA-Seq data, highly expressed outliers could dominate the calculated correlations, and log transforming the data would be one approach to working around this issue (see Adiconis et al., 2013).  However, if the expression data is to be log transformed, it would first be necessary to add a small number (e.g., 1) to each expression value.  When calculating correlation, the log transform would give more weight to genes with lower expression values.  An alternative approach to the outlier issue that does not require log transformation of the data would be to use a rank correlation metric such as Spearman correlation.
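
The effect described above can be illustrated with a small, made-up pair of expression profiles in which one highly expressed gene dominates. This is a sketch only: the helper names and the data are hypothetical, and the pseudocount of 1 follows the example value given in the text.

```python
import math

def log2_plus_one(values, pseudocount=1.0):
    """Add a small number before taking log2 so that zeros remain defined."""
    return [math.log2(v + pseudocount) for v in values]

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up FPKM-like values for five genes in two samples; the last gene is
# a highly expressed outlier.
sample_a = [0.0, 3.0, 7.0, 15.0, 10000.0]
sample_b = [1.0, 0.0, 15.0, 7.0, 9000.0]

print(pearson(sample_a, sample_b))                                # dominated by the outlier
print(pearson(log2_plus_one(sample_a), log2_plus_one(sample_b)))  # outlier damped
print(spearman(sample_a, sample_b))                               # depends on ranks only
```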

Setting Thresholds and Filters with Illumina Expression Data

While this module's default values pertain to Affymetrix expression data, it may also be used effectively with Illumina expression data, after first running that data through IlluminaNormalizer and changing the default values in this module to better suit Illumina data. Suggested values are as follows (with thanks to Yujin Hoshida of the Broad Institute):

  • Use a floor value a little above background signal.  You can determine the background signal by calculating the mean signal across all the negative control probes across all samples.  If you are using IlluminaExpressionFileCreator, these values can be found in the "controls" GCT along with the other controls.  If the negative control probe signals are not available, you can instead use the 20 or so gene probes with the lowest mean signal (across all samples).  This should be calculated using the expression levels *before* background correction.
  • No ceiling (i.e., set ceiling to 0).
  • Min fold change of 3 to 5.
  • Min delta of 300 to 500 after cubic spline normalization in IlluminaNormalizer, or probe filtering based on CV (coefficient of variation)*.  It is recommended that you turn off the PreprocessDataset normalization and variation filter options and use IlluminaNormalizer instead.

*There is currently no module in GenePattern for this last method, probe filtering based on CV; however, Yujin has plans to release his own module for this purpose to GPARC.
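
The background-signal estimate suggested in the list above can be sketched as follows. The function and argument names are hypothetical, the data are made up, and the 10% margin used to place the floor "a little above" background is an arbitrary illustrative choice, not a recommendation from the source.

```python
def mean(values):
    return sum(values) / len(values)

def background_signal(negative_control_rows=None, gene_rows=None, n_lowest=20):
    """Estimate background from expression levels before background correction.

    Preferred: mean signal over all negative control probes and all samples.
    Fallback: mean of the n_lowest gene probes with the lowest mean signal.
    """
    if negative_control_rows:
        return mean([v for row in negative_control_rows for v in row])
    if not gene_rows:
        raise ValueError("provide negative control rows or gene rows")
    lowest_means = sorted(mean(row) for row in gene_rows)[:n_lowest]
    return mean(lowest_means)

# Made-up negative control probe signals: one row per probe, one column per sample.
controls = [[90.0, 110.0], [100.0, 100.0]]
background = background_signal(negative_control_rows=controls)
floor = background * 1.1  # arbitrary margin above background, for illustration
print(background, floor)
```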


References

Kuehn, H., Liberzon, A., Reich, M. and Mesirov, J. P. 2008. Using GenePattern for Gene Expression Analysis. Current Protocols in Bioinformatics. 22:7.12.1–7.12.39.

Adiconis, X., Borges-Rivera, D., et al. 2013. Comparative analysis of RNA sequencing methods for degraded or low-input samples. Nature Methods. 10:623-629.


Parameters

Name  Description
input filename * Input filename - .res, .gct, .odf
threshold and filter  Flag controlling whether to apply thresholding and variation filter.  The default value is yes.
floor  Value for floor threshold. The default value is 20, but this only applies to Affymetrix microarray data; the value is not appropriate for expression data derived from RNA-seq experiments or alternative microarray platforms. For Illumina data this should be set to a value a little above the background signal.
ceiling  Value for ceiling threshold.  The default value is 20,000, but this only applies to Affymetrix microarray data; the value is not appropriate for expression data derived from RNA-seq experiments or alternative microarray platforms. For Illumina data this should be set to 0 (no ceiling).
min fold change  Minimum fold change for variation filter.  The default value is 3, but this only applies to Affymetrix microarray data; the value is not appropriate for expression data derived from RNA-seq experiments or alternative microarray platforms. For Illumina data this should be set between 3 and 5.
min delta  Minimum delta for variation filter.  The default value is 100, but this only applies to Affymetrix microarray data; the value is not appropriate for expression data derived from RNA-seq experiments or alternative microarray platforms. For Illumina data this should be set to between 300 and 500 (assuming you've run your data through IlluminaNormalizer and used cubic spline normalization).
num outliers to exclude  Number of outliers per row to ignore when calculating row min and max for variation filter.  If this value is set to n, then the n smallest and the n largest expression values will be ignored.
row normalization  Perform row normalization. Row normalization and log2 transform are mutually exclusive.
row sampling rate  Sample rows without replacement to obtain this fraction of the total number of rows.
threshold for removing rows  Expression threshold used for sample-count threshold filtering (see number of columns above threshold).
number of columns above threshold  Remove a row if fewer than this number of columns have values >= the threshold for removing rows.
log2 transform  Apply log2 transform after all other preprocessing steps.  
output file format  Output file format
output file * Output file name

* - required
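
The effect of the num outliers to exclude parameter on the variation filter can be sketched as below. The function name is hypothetical and the data are made up; the sketch assumes only what the parameter description states, namely that the n smallest and n largest values are ignored when computing the row min and max.

```python
def trimmed_min_max(row, num_outliers=0):
    """Row min and max after ignoring the n smallest and n largest values."""
    if num_outliers < 0 or 2 * num_outliers >= len(row):
        raise ValueError("num_outliers must leave at least one value in the row")
    ordered = sorted(row)
    trimmed = ordered[num_outliers:len(ordered) - num_outliers]
    return trimmed[0], trimmed[-1]

row = [5.0, 100.0, 120.0, 150.0, 9000.0]

lo, hi = trimmed_min_max(row, 0)
print(hi / lo, hi - lo)  # fold change and delta driven by the two extreme values

lo, hi = trimmed_min_max(row, 1)
print(hi / lo, hi - lo)  # fold change 1.5 and delta 50.0 once one outlier per end is ignored
```

With the Affymetrix defaults (min fold change 3, min delta 100), this row passes the variation filter when no outliers are excluded and is removed when one outlier per end is excluded.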

Input Files

  1. input filename
    GCT, RES, or ODF file containing expression data.

Output Files

  1. output file
    GCT or RES file containing the filtered, preprocessed expression data.

Platform Dependencies

Task Type:
Preprocess & Utilities

CPU Type:

Operating System:


Version Comments

Version Release Date Description
5 2013-12-02 Update to new html doc
4 2013-11-11 Adds support for Illumina; performs log transform; deprecates max sigma binning
3 2005-05-26 Changed default value of ceiling to 20000
2 2005-05-26 Added additional filtering options