GenePattern

MutPanning is designed to detect rare cancer driver genes from aggregated whole-exome sequencing data. Most approaches detect cancer genes based on their mutational excess, i.e. they search for genes with an increased number of nonsynonymous mutations above the background mutation rate. MutPanning further accounts for the nucleotide context around mutations and searches for genes with an excess of mutations in unusual sequence contexts that deviate from the characteristic sequence context around passenger mutations.

Author: Felix Dietlein et al., Dana-Farber Cancer Institute and Broad Institute (algorithm), John Liefeld (module)

Contact:

Felix Dietlein (dietlein@broadinstitute.org) for algorithm questions.
Ted Liefeld (jliefeld@cloud.ucsd.edu) for module questions.

Algorithm Version: v2

Introduction

MutPanning analyzes aggregated DNA sequencing data of tumor patients to identify genes that are likely to be functionally relevant, based on their abundance of nonsynonymous mutations or their increased number of mutations in unusual nucleotide contexts that deviate from the background mutational process.

The name MutPanning is inspired by the words "mutation" and "panning". The goal of the MutPanning algorithm is to discover new tumor genes in aggregated sequencing data, i.e. to "pan" the few tumor-relevant driver mutations from the abundance of functionally neutral passenger mutations in the background. Previous approaches for cancer gene discovery were mostly based on mutational recurrence, i.e. they detected cancer genes based on their excess of nonsynonymous mutations above the local background mutation rate. Further, they searched for mutations that occur in functionally important genomic positions, as predicted by bioinformatics scores. These approaches are highly effective in tumor types for which the average background mutation rate (i.e., the total mutational burden) is low or moderate.

The ability to detect driver genes can be increased by considering the nucleotide context around mutations in the statistical model. MutPanning utilizes the observation that most passenger mutations are surrounded by characteristic nucleotide sequence contexts, reflecting the background mutational process active in a given tumor. In contrast, driver mutations are localized towards functionally important positions, which are not necessarily surrounded by the same nucleotide contexts as passenger mutations. Hence, in addition to mutational excess, MutPanning searches for genes with an excess of mutations in unusual sequence contexts that deviate from the characteristic sequence context around passenger mutations. That way, MutPanning actively suppresses mutations in its test statistics that are likely to be passenger mutations based on their surrounding nucleotide contexts. Considering the nucleotide context is particularly useful in tumor types with high background mutation rates and high nucleotide context specificity (e.g., melanoma, bladder, endometrial, or colorectal cancer).
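
To make the idea of a "nucleotide context" concrete, the following minimal sketch tabulates mutations by the bases flanking them. It is only an illustration: the function name context_counts, the toy reference sequence, and the window of one flanking base on each side are assumptions made here, not the context model that MutPanning itself uses.

```python
from collections import Counter

def context_counts(reference, mutations, flank=1):
    """Count mutations by their flanking sequence context.

    reference: toy reference sequence (hypothetical input)
    mutations: iterable of (position, alt_base), positions are 0-based
    flank:     bases taken on each side (assumption, for illustration only)
    """
    counts = Counter()
    for pos, alt in mutations:
        # Skip positions that do not have a full context window.
        if pos - flank < 0 or pos + flank >= len(reference):
            continue
        context = reference[pos - flank:pos + flank + 1]
        counts[(context, alt)] += 1
    return counts

# Toy data: three C>T changes, two of them in the same TCG context.
reference = "ATCGATCGTCA"
mutations = [(2, "T"), (6, "T"), (9, "T"), (0, "G")]
counts = context_counts(reference, mutations)
for (context, alt), n in sorted(counts.items()):
    print(context, alt, n)
```

Contexts that collect many mutations in such a table correspond to the characteristic background process, whereas mutations in rarely hit contexts are the "unusual" ones described above.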

Algorithm

Most passenger mutations occur in characteristic nucleotide contexts that reflect the mutational process active in a given tumor. MutPanning searches for mutations in “unusual” nucleotide contexts that deviate from this background mutational process. In these positions, passenger mutations are rare and mutations are thus a strong indicator of the shift of driver mutations towards functionally important positions.

The main steps of MutPanning are as follows (adopted from Dietlein et al.):

(i) Model the mutation probability of each genomic position in the human exome depending on its surrounding nucleotide context and the regional background mutation rate.

(ii) Given a gene with n nonsynonymous mutations, use a Monte Carlo simulation approach to simulate a large number of random “scenarios” in which n or more nonsynonymous mutations are randomly distributed along the same gene.

(iii) Compare the number and positions of mutations in each random scenario with the observed mutations in the gene. Based on these comparisons, derive a p-value for the gene.

(iv) Combine this p-value with additional statistical components that account for insertions and deletions, the abundance of deleterious mutations, and mutational clustering.
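
Steps (ii) and (iii) can be illustrated with a toy Monte Carlo simulation. The sketch below is not the MutPanning test statistic: the function name monte_carlo_p_value, the "surprise" score (the sum of negative log probabilities of the mutated positions), and the simplification of drawing exactly n mutations per scenario are assumptions chosen to keep the example short.

```python
import math
import random

def monte_carlo_p_value(position_probs, observed_positions, n_sim=10000, seed=0):
    """Toy empirical p-value for mutations landing in unlikely positions.

    position_probs:     relative mutation probability of each position
                        (stands in for the context-dependent model of step i)
    observed_positions: indices of the observed mutations in the gene
    """
    rng = random.Random(seed)
    positions = list(range(len(position_probs)))
    # Assumed score: rare positions contribute a large "surprise".
    surprise = [-math.log(p) for p in position_probs]
    observed_score = sum(surprise[i] for i in observed_positions)
    n = len(observed_positions)
    hits = 0
    for _ in range(n_sim):
        # One random scenario: n mutations placed according to the model.
        scenario = rng.choices(positions, weights=position_probs, k=n)
        if sum(surprise[i] for i in scenario) >= observed_score:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_sim + 1)

probs = [0.3, 0.3, 0.3, 0.05, 0.05]
p_usual = monte_carlo_p_value(probs, [0, 1, 2])    # mutations in common positions
p_unusual = monte_carlo_p_value(probs, [3, 4, 3])  # mutations in rare positions
print(p_usual, p_unusual)
```

In this toy setting, the same number of mutations yields a much smaller p-value when the mutations fall into positions that the background model considers unlikely.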

The following figure (adopted from Dietlein et al.) illustrates how MutPanning works.

Usage

You can run the algorithm for multiple cancer types at the same time. All you need to run MutPanning is a mutation file (*.maf) and a sample file (*.txt):

- Mutation File (*.maf): mutations that you would like to include in the analysis (example mutation file)

- Sample File (*.txt): contains sample IDs and associates them with cancer types (example sample file)

If you are unsure about the file formats, please see the description of these files in the parameters or download the example files below. Unless you are familiar with the MutPanning algorithm, we recommend running MutPanning with standard parameters. MutPanning has only been tested with standard parameters.

References

Dietlein F, Weghorn D, Taylor-Weiner A, Richters A, et al. Identification of cancer driver genes based on nucleotide context. Under review. (preprint available on bioRxiv)

License

Distributed under the BSD-3-Clause open source license. A copy of the license text is available at https://github.com/genepattern/docker-mutpanning/blob/develop/LICENSE.txt

Parameters

Name	Description
mutation file*	The mutation file should be a tab-delimited file in standard .maf format and contain the following columns: Hugo_Symbol (gene name), Chromosome, Start_Position (according to Hg19), End_Position (according to Hg19), Strand (1 or -1), Variant_Classification (e.g. Missense_Mutation, Nonsense_Mutation, In_Frame_Del, Silent, etc.), Variant_Type (e.g. SNP, DEL, INS, etc.), Reference_Allele (reference nucleotide in Hg19), Tumor_Seq_Allele1 (allele A in tumor), Tumor_Seq_Allele2 (allele B in tumor), Tumor_Sample_Barcode. Names in the Tumor_Sample_Barcode column must exactly match the names in the sample file (case-sensitive). If you are unsure about the file format, you can download an example here. Make sure that Variant_Classification and Variant_Type use exactly the same terms as in the example mutation file (case-sensitive, including hyphenation and underscores).
sample annotation file*	The sample annotation file should be a tab-delimited .txt file that contains at least two columns labeled Sample and Cohort. The Sample column contains a unique sample name for each sample in the maf file. Duplicates are not allowed, and all samples in the mutation file must be listed in the sample file. The Cohort column contains the subcohort (e.g. cancer type) to which the sample belongs (case-sensitive). If you are unsure about the file format, you can download an example sample file here.
min samples per cluster*	Minimum number of samples needed per nucleotide context cluster. Unless you are familiar with the MutPanning algorithm, we recommend running MutPanning with standard parameters (default value 3).
min mutations per cluster*	Minimum number of mutations needed per nucleotide context cluster. Unless you are familiar with the MutPanning algorithm, we recommend running MutPanning with standard parameters (default value 1000).
min samples Bayesian model*	Minimum number of samples needed to calibrate the Bayesian background model. Unless you are familiar with the MutPanning algorithm, we recommend running MutPanning with standard parameters (default value 100).
min mutations Bayesian model*	Minimum number of mutations needed to calibrate the Bayesian background model. Unless you are familiar with the MutPanning algorithm, we recommend running MutPanning with standard parameters (default value 5000).

* - required

Input Files

mutation file (example mutation file)

This file lists the positions of all somatic mutations in your cohort. This file should follow the standard mutation annotation format (MAF). Several mutation callers (e.g. MuTect) report their somatic mutations in this file format. Each row corresponds to an individual mutation, which is annotated by the following columns.

Hugo_Symbol The gene symbol in HUGO nomenclature (genenames.org).

Chromosome The chromosome on which the mutation was found. Please use X and Y for the sex chromosomes and not 23 and 24.

Start_Position The start position of the mutation (Hg19).

End_Position The end position of the mutation (Hg19).

Strand The strand on which the mutation was detected (either 1/-1 or +/- notation is fine).

Variant_Classification The functional class of the mutation (e.g. Silent, Missense_Mutation etc.). Please use the standard nomenclature used in MAF files. If you are unsure about the file format, please see the example files below.

Variant_Type The type of the mutation, such as single base substitutions (SNP), insertions (INS) or deletions (DEL). Please use the standard nomenclature used in MAF files. If you are unsure about the file format, please see the example files below.

Reference_Allele The nucleotide expected in the Hg19 reference genome. Please note that if this nucleotide does not match the Hg19 reference genome, the mutation is ignored.

Tumor_Seq_Allele1 The alternative nucleotide 1 found in the reads.

Tumor_Seq_Allele2 The alternative nucleotide 2 found in the reads. Typically, one of the two Tumor_Seq_Allele columns contains the reference allele (allele A), whereas the other contains the alternative allele (allele B). Callers differ in how they assign these columns. MutPanning first checks whether Tumor_Seq_Allele1 equals the reference allele. If so, it takes the nucleotide in Tumor_Seq_Allele2 as the alternative allele.

Tumor_Sample_Barcode The name of the tumor sample. These identifiers have to be unique and exactly the same as in the sample annotation file (case-sensitive).
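
The rule for picking the alternative allele from the two Tumor_Seq_Allele columns can be sketched as follows. The function name resolve_alt_allele is hypothetical, and the fallback branch (using Tumor_Seq_Allele1 when it differs from the reference) is an assumption, since the description only covers the case in which Tumor_Seq_Allele1 equals the reference allele.

```python
def resolve_alt_allele(reference_allele, tumor_seq_allele1, tumor_seq_allele2):
    """Pick the alternative allele from the two tumor allele columns."""
    # Documented case: allele 1 is the reference, so allele 2 carries the change.
    if tumor_seq_allele1 == reference_allele:
        return tumor_seq_allele2
    # Assumed fallback: otherwise treat allele 1 as the alternative allele.
    return tumor_seq_allele1

print(resolve_alt_allele("C", "C", "T"))  # caller wrote the reference in column 1
print(resolve_alt_allele("C", "T", "C"))  # caller wrote the reference in column 2
```

Both calls resolve to the same C>T change, which is why the order in which a caller fills the two columns should not matter in the cases shown.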

sample annotation file (example sample annotation file)

This file organizes samples into subcohorts, e.g. cancer types. Samples in each subcohort are analyzed together for mutational significance. Each row in this file corresponds to an individual sample, which is annotated by the following columns.

Sample The same sample identifier as used in the mutation annotation file (case-sensitive). Note that these sample identifiers must be unique. Avoid special characters.

Cohort The name of the cohort in which the sample should be analyzed for significance. This is typically the cancer type, but you can also group your samples by other criteria (e.g., by subtypes of a cancer type, or by combining different cancer types).
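
Before submitting a job, it can help to check that the two input files agree with each other. The following sketch is not part of MutPanning. The helper names read_tsv and check_samples are hypothetical, and the inline toy data contains only the columns needed for the check (a real mutation file needs all columns listed above).

```python
import csv
import io
from collections import Counter

def read_tsv(handle):
    """Read a tab-delimited file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))

def check_samples(maf_rows, sample_rows):
    """Return (duplicate sample IDs, MAF samples missing from the sample file)."""
    sample_counts = Counter(row["Sample"] for row in sample_rows)
    duplicates = sorted(s for s, c in sample_counts.items() if c > 1)
    maf_samples = {row["Tumor_Sample_Barcode"] for row in maf_rows}
    # Comparison is case-sensitive, as required for the input files.
    missing = sorted(maf_samples - set(sample_counts))
    return duplicates, missing

# Toy data: "s3" in the mutation file differs in case from "S3" in the sample file.
maf_text = "Hugo_Symbol\tTumor_Sample_Barcode\nTP53\tS1\nKRAS\tS2\nBRAF\ts3\n"
sample_text = "Sample\tCohort\nS1\tSkin\nS2\tSkin\nS3\tLung\n"

duplicates, missing = check_samples(
    read_tsv(io.StringIO(maf_text)), read_tsv(io.StringIO(sample_text))
)
print("duplicates:", duplicates)
print("missing:", missing)
```

For real files, the StringIO objects would be replaced by file handles opened with open(path, newline="").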

Output Files

MutPanning.zip

The output of MutPanning is a zipped folder, which contains a mutational significance report for each subcohort (e.g. cancer type). If the cohort size is too small to calibrate the Bayesian background model (typically for cancer types with a very low background mutation rate, cf. parameters), the folder contains two analysis reports, based on the Bayesian model and a uniform background model.

Each significance report is a tab-delimited txt file, which lists genes in descending order according to their mutational significance. Each row corresponds to an individual gene, which is annotated by the following columns:

Name Gene name

TargetSize The nonsynonymous target size of the gene corrected for nucleotide context bias (sum of likelihood coefficients of nonsynonymous positions).

TargetSizeSyn The synonymous target size of the gene corrected for nucleotide context bias (sum of likelihood coefficients of synonymous positions).

Count Number of nonsynonymous mutations

CountSyn Number of synonymous mutations

Significance Mutational significance of the gene according to MutPanning. This p-value considers the excess of nonsynonymous mutations, the increased number of mutations in unusual nucleotide contexts, as well as insertions and deletions.

FDR The MutPanning p-value corrected for multiple hypothesis testing (q-value, Benjamini-Hochberg procedure).
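
For readers unfamiliar with the Benjamini-Hochberg procedure mentioned for the FDR column, the sketch below shows the standard textbook computation of q-values from a list of p-values. It is a generic illustration with a hypothetical function name and toy p-values. Implementation details inside MutPanning may differ.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg q-values in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values

p = [0.01, 0.04, 0.03, 0.5]
q = benjamini_hochberg(p)
for p_value, q_value in zip(p, q):
    print(f"p={p_value:.3f}  q={q_value:.4f}")
```

A gene is then typically called significant if its q-value falls below a chosen FDR threshold.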

Example Data

If you are unsure about the file formats, you can download the following example files:
Sample File
Mutation File

Requirements

MutPanning can be run entirely as a module on GenePattern. Alternatively, it can be run locally on macOS and Windows computers (8 GB memory, 2 GHz CPU, 2 cores, 400 MB free disk space required).

Platform Dependencies

Task Type:
Tutorial

CPU Type:
Any

Operating System:
Docker

Language:
Java, Python

Version Comments

Version	Release Date	Description
2	2019-08-27	Changing version number to 2.0 so that module and algorithm version numbers align

MutPanning (v2.0)