Agglomerative hierarchical clustering of samples and/or genes
Author: Edwin Juarez, University of California San Diego
Contact:
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!forum/genepattern-help
ejuarez@ucsd.edu
Algorithm Version:
Cluster analysis is a means of discovering, within a body of data, groups whose members are similar for some property. Clustering of gene expression data is geared toward finding genes that are expressed or not expressed in similar ways under certain conditions.
Given a set of items to be clustered (items can be either genes or samples), agglomerative hierarchical clustering (HC) recursively merges items with other items or with the results of previous merges, according to the distance between each pair of items, with the closest pairs being merged first. As a result, it produces a tree structure, referred to as a dendrogram, whose nodes correspond to:
The HierarchicalClustering module produces a CDT file that contains the original data, reordered to reflect the clustering. Additionally, either one or two dendrogram files are created (one for clustering rows and one for clustering columns). The row dendrogram has the extension GTR, while the column dendrogram has the extension ATR. These files describe the order in which nodes were joined during the clustering.
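The merging procedure described above can be illustrated with a minimal pure-Python sketch. It is not the module's implementation: the function names (`euclidean`, `average_linkage`, `agglomerate`) are hypothetical, and Euclidean distance with average linkage are assumptions chosen only for illustration. The returned list of merges plays the role of the join order that a dendrogram file records.

```python
from itertools import combinations
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences (assumed measure)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(cluster_a, cluster_b, items):
    """Mean pairwise distance between the members of two clusters (assumed method)."""
    total = sum(euclidean(items[i], items[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerate(items):
    """Merge the closest pair of clusters until a single cluster remains.

    Returns the merges in order as (members_a, members_b, distance) tuples,
    where members are tuples of indices into `items`.
    """
    clusters = [(i,) for i in range(len(items))]  # every item starts alone
    merges = []
    while len(clusters) > 1:
        # Find the closest pair among the current clusters.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], items),
        )
        merges.append((a, b, average_linkage(a, b, items)))
        # Replace the two merged clusters with their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Four made-up samples with two features each.
samples = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 7.0]]
merges = agglomerate(samples)
for left, right, distance in merges:
    print(left, right, round(distance, 3))
```

In this toy data the two nearby pairs are joined first, and the two resulting clusters are joined last. The sketch recomputes distances on every pass, so it is only suitable for very small inputs.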
The module includes several preprocessing options. The order of the preprocessing operations is:
Note that all three of these parameters can be found in HierarchicalClustering V6. If you would like to see them added to HierarchicalClustering V7, feel free to request it on the GenePattern help forum: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!forum/genepattern-help
conda create --name GP_hierarchical_clustering_env pip
source activate GP_hierarchical_clustering_env
pip install -r requirements.txt
Name | Description |
---|---|
input filename * | input data file name - .gct, .res, .pcl |
column distance measure * | Distance measure for column (sample) clustering. Options include: |
row distance measure * | Distance measure for row (gene) clustering. Options include: |
clustering method * | Hierarchical clustering method to use. Options include: |
row center | Specifies whether to center each row (gene) in the data. Centering each row subtracts the row-wise mean or median from the values in each row of data, so that the mean or median value of each row is 0. Default: no |
row normalize | Specifies whether to normalize each row (gene) in the data. Normalizing each row multiplies all values in each row of data by a scale factor S so that the sum of the squares of the values in each row is 1.0 (a separate S is computed for each row). Default: no |
column center | Specifies whether to center each column (sample) in the data. Centering each column subtracts the column-wise mean or median from the values in each column of data, so that the mean or median value of each column is 0. Default: no |
column normalize | Specifies whether to normalize each column (sample) in the data. Normalizing each column multiplies all values in each column of data by a scale factor S so that the sum of the squares of the values in each column is 1.0 (a separate S is computed for each column). Default: no |
output base name * | Base name for the output files |
output distance matrix | Specifies whether to output the pairwise distance matrix. If true, the distance between each pair of columns will be computed, which can be very computationally intensive. If unsure, leave as False. Default: False |
* - required
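The centering and normalization options in the table above can be sketched in a few lines of pure Python. This is an illustration of the arithmetic only, not the module's code: the helper names `center_row` and `normalize_row` are hypothetical, leaving an all-zero row unchanged is an assumption, and applying centering before normalization here is an arbitrary choice for the demonstration.

```python
from statistics import mean, median


def center_row(row, method="mean"):
    """Subtract the row's mean or median so that statistic becomes 0."""
    offset = mean(row) if method == "mean" else median(row)
    return [value - offset for value in row]


def normalize_row(row):
    """Multiply a row by S = 1 / sqrt(sum of squares) so its squares sum to 1.0."""
    sum_of_squares = sum(value * value for value in row)
    if sum_of_squares == 0:
        # Assumption: an all-zero row cannot be scaled, so it is returned as-is.
        return list(row)
    scale = sum_of_squares ** -0.5
    return [value * scale for value in row]


row = [1.0, 2.0, 6.0]                      # made-up expression values for one gene
centered = center_row(row)                 # subtracts the mean, 3.0
median_centered = center_row(row, "median")  # subtracts the median, 2.0
normalized = normalize_row(centered)

print(centered)
print(median_centered)
print(sum(value * value for value in normalized))
```

The same helpers apply to columns if the matrix is transposed first, for example with `zip(*matrix)`.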
HierarchicalClustering is distributed under a modified BSD license available at https://raw.githubusercontent.com/genepattern/HierarchicalClustering/develop/LICENSE
Task Type: Clustering
CPU Type: any
Operating System: any
Language: Python 3.6
Version | Release Date | Description |
---|---|---|
8.1 | 2018-05-09 | Improving performance |
8 | 2018-02-12 | Updating the docs to meet automatic build requirements |
7 | 2018-02-01 | Ported to Python 3.6 |
6 | 2013-03-13 | Updated for Java 7 |
5 | 2009-02-10 | Row clustering turned off by default |
4 | 2008-08-20 | Report error when out of memory. Added 64-bit Linux support. Fixed bug that caused mean centering to be performed when median centering was selected. |
3 | 2007-03-08 | Changed default distance measure |
2 | 2005-12-16 | Fixes bugs in previous version |