Accurate Genome-Wide Survival Analysis

Current implementations of the log-rank test (R survdiff, SAS LIFETEST, etc.) rely on an asymptotic approximation of the distribution of the log-rank statistic. This approximation is not appropriate when the two populations being compared are unbalanced, as is the case when testing the association of a mutation with survival in genomic studies. The resulting p-values can differ from the exact p-values by up to 7 orders of magnitude, and a large number of false discoveries are reported because of this difference. We have designed and implemented a method, now called ExaLT (Exact Log-rank Test), to compute a conservative approximation of the exact p-value. In particular, our method approximates the p-value from the exact permutational distribution, which is more appropriate for testing the association of mutations with survival.

  • ExaLT in C++ and R

  • A different p-value can be computed using a different null distribution (called conditional). We recommend computing the p-value from the permutational distribution (with the code above); however, efficient implementations for computing the exact p-value from the conditional distribution are not otherwise available, so we provide such an implementation in Matlab below:

  • exact conditional test in Matlab

  • For more information, contact Fabio Vandin at vandinfa [at]
    If you use our method in your research, please cite:

    F. Vandin, A. Papoutsaki, B.J. Raphael*, E. Upfal*. (2013) Genome-Wide Survival Analysis of Somatic Mutations in Cancer. 17th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2013). [Best Paper Award, RECOMB 2013] [Publisher Link]
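
To illustrate why the two kinds of p-value can disagree for unbalanced groups, the following is a minimal Python sketch, not the ExaLT implementation. It computes the log-rank statistic with its usual normal approximation and compares it with a Monte Carlo estimate of the permutational p-value obtained by shuffling group labels. The Monte Carlo sampling is a stand-in chosen for brevity (ExaLT itself computes a conservative approximation of the exact p-value); the function names and the toy data are assumptions.

```python
import math
import random


def logrank_parts(times, events, groups):
    """Return (O1 - E1, variance) of the log-rank statistic for group 1.

    times:  observed times; events: 1 = event, 0 = censored;
    groups: 1 = mutated sample, 0 = non-mutated sample.
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    n = len(times)       # samples still at risk
    n1 = sum(groups)     # samples of group 1 still at risk
    u = 0.0
    var = 0.0
    i = 0
    while i < len(order):
        t = times[order[i]]
        d = d1 = leaving = leaving1 = 0
        # Collect every sample whose observed time equals t.
        while i < len(order) and times[order[i]] == t:
            k = order[i]
            d += events[k]
            d1 += events[k] * groups[k]
            leaving += 1
            leaving1 += groups[k]
            i += 1
        if d > 0:
            u += d1 - d * n1 / n
            if n > 1:
                var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
        n -= leaving
        n1 -= leaving1
    return u, var


def asymptotic_pvalue(times, events, groups):
    """Two-sided p-value from the normal approximation."""
    u, var = logrank_parts(times, events, groups)
    if var == 0:
        return 1.0
    return math.erfc(abs(u) / math.sqrt(2 * var))


def permutation_pvalue(times, events, groups, n_perm=2000, seed=0):
    """Monte Carlo estimate of the two-sided permutational p-value."""
    rng = random.Random(seed)
    observed = abs(logrank_parts(times, events, groups)[0])
    labels = list(groups)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if abs(logrank_parts(times, events, labels)[0]) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Made-up, deliberately unbalanced data: one mutated sample out of four.
toy_times = [1, 2, 3, 4]
toy_events = [1, 1, 1, 1]
toy_groups = [1, 0, 0, 0]
print(asymptotic_pvalue(toy_times, toy_events, toy_groups))
print(permutation_pvalue(toy_times, toy_events, toy_groups))
```

On this toy input the asymptotic p-value comes out clearly smaller than the permutational estimate, which is the direction of error that leads to false discoveries.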

    HotNet2: Network Analysis of Mutation Data
    See also: HotNet project page

    HotNet2 is an algorithm for the discovery of significantly mutated subnetworks in a protein-protein interaction network. HotNet2 uses an insulated heat diffusion model to simultaneously analyze both the mutations in and local topology of sets of proteins. We describe HotNet2 in a paper in submission.
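
As a rough illustration of insulated heat diffusion on a network, the sketch below places heat on one node of a small graph and lets it spread along edges while a fraction `beta` is re-injected at the source at every step. This random-walk-with-restart formulation, the parameter `beta`, the function name, and the toy graph are assumptions made for illustration; it is not the HotNet2 code, and the model in the paper may differ in its details.

```python
def insulated_heat(graph, source, beta=0.5, tol=1e-12, max_iter=10000):
    """Stationary heat distribution for heat placed on `source`.

    Iterates the assumed update f <- beta * e_source + (1 - beta) * W f,
    where W spreads each node's heat equally over its neighbors.
    `graph` maps each node to a list of neighbors (undirected, no isolated nodes).
    """
    heat = {v: 0.0 for v in graph}
    heat[source] = 1.0
    for _ in range(max_iter):
        new = {v: 0.0 for v in graph}
        new[source] = beta  # heat re-injected at the source each step
        for v, neighbors in graph.items():
            if not neighbors:
                continue
            share = (1 - beta) * heat[v] / len(neighbors)
            for w in neighbors:
                new[w] += share
        change = sum(abs(new[v] - heat[v]) for v in graph)
        heat = new
        if change < tol:
            break
    return heat


# Toy path graph a - b - c - d with all heat starting on node "a".
toy_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
toy_heat = insulated_heat(toy_graph, "a")
for node in sorted(toy_heat):
    print(node, round(toy_heat[node], 4))
```

A larger `beta` keeps more heat near the source, so the result reflects local topology rather than the whole network.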

    The pre-release of HotNet2 will be available soon. For more information or to become a beta-tester, contact Max Leiserson at mdml [at]

    Multi-Dendrix: (Multiple Pathway De novo Driver Exclusivity)
    Multi-Dendrix project page

    Multi-Dendrix is an algorithm for the simultaneous discovery of multiple driver pathways using only somatic mutation data from a cohort of samples. Multi-Dendrix uses an integer linear program to identify pathway sets such that each pathway contains genes with approximately mutually exclusive mutations and high coverage of the sample set. We describe Multi-Dendrix in a paper in submission:

    M.D.M. Leiserson, D. Blokh, R. Sharan, B.J. Raphael. (2012) Simultaneous identification of multiple driver pathways in cancer. [In submission]

    We have released Multi-Dendrix as a Python package that includes functions for subtype and network analysis of Multi-Dendrix results.

    Download the release on GitHub: Multi-Dendrix (Version 1.0, January 28, 2013)
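
The integer linear program itself is beyond a short example, but the kind of objective it optimizes can be illustrated by exhaustive search on toy data. The sketch below scores a gene set by rewarding covered samples and penalizing samples mutated in more than one gene of the set, then picks the two disjoint gene sets with the highest total score. The scoring function, the function names, and the data are assumptions for illustration; brute force replaces the ILP here and would not scale to real cohorts.

```python
from itertools import combinations


def weight(gene_set, mutations):
    """Assumed score: covered samples minus overlap between genes.

    Equals 2 * |union of mutated samples| - sum of per-gene sample counts,
    so a perfectly exclusive set scores its full coverage.
    """
    covered = set().union(*(mutations[g] for g in gene_set))
    return 2 * len(covered) - sum(len(mutations[g]) for g in gene_set)


def best_pathway_pair(mutations, k):
    """Exhaustively find two disjoint gene sets of size k with maximal total weight."""
    genes = sorted(mutations)
    best_total, best_pair = None, None
    for first in combinations(genes, k):
        rest = [g for g in genes if g not in first]
        for second in combinations(rest, k):
            total = weight(first, mutations) + weight(second, mutations)
            if best_total is None or total > best_total:
                best_total, best_pair = total, (first, second)
    return best_total, best_pair


# Made-up mutation table: gene -> set of samples in which it is mutated.
toy_mutations = {
    "A": {1, 2},
    "B": {3, 4},
    "C": {5, 6, 7},
    "D": {8},
    "E": {1, 2, 3},
}
print(best_pathway_pair(toy_mutations, k=2))
```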

    Dendrix: (De novo Driver Exclusivity)
    Dendrix project page

    Dendrix web server

    Dendrix is an algorithm for discovery of mutated driver pathways in cancer using only mutation data. It finds sets of genes, domains, or nucleotides whose mutations exhibit both high coverage and high exclusivity in the analyzed samples. This algorithm is described in the paper:

    F. Vandin, E. Upfal, B.J. Raphael. (2012) De novo Discovery of Mutated Driver Pathways in Cancer. Genome Research. 22(2):375-85. Epub 2011 Jun 7. PDF Preprint Publisher Link [Preliminary version accepted at 15th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2011)]

    To download Dendrix see the Dendrix project page
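
As a small illustration of the two quantities Dendrix looks for, the following sketch computes the coverage and exclusivity of a candidate gene set from a made-up mutation table. The function name, the gene and sample labels, and the exact definitions (fraction of samples with at least one mutation, and fraction of covered samples with exactly one) are assumptions for illustration, not the scoring used by Dendrix.

```python
from collections import Counter


def coverage_and_exclusivity(gene_set, mutations, n_samples):
    """Return (coverage, exclusivity) of a gene set.

    coverage:    fraction of all samples mutated in at least one gene of the set
    exclusivity: fraction of covered samples mutated in exactly one gene of the set
    """
    hits = Counter(s for g in gene_set for s in mutations[g])
    covered = len(hits)
    exclusive = sum(1 for count in hits.values() if count == 1)
    coverage = covered / n_samples
    exclusivity = exclusive / covered if covered else 0.0
    return coverage, exclusivity


# Hypothetical data: gene -> samples in which it is mutated, 8 samples in total.
toy_mutations = {
    "geneA": {"s1", "s2", "s3"},
    "geneB": {"s4", "s5"},
    "geneC": {"s1", "s2", "s6"},
}
print(coverage_and_exclusivity(["geneA", "geneB"], toy_mutations, 8))
print(coverage_and_exclusivity(["geneA", "geneC"], toy_mutations, 8))
```

In this toy table, geneA and geneB never co-occur and so form a fully exclusive set, while geneA and geneC share two samples and score lower on exclusivity.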


    HotNet: Finding Altered Subnetworks
    HotNet project page

    HotNet is an algorithm for finding significantly altered subnetworks in a large gene interaction network. This algorithm is described in the paper:

    F. Vandin, E. Upfal, B.J. Raphael. (2011) Algorithms for Detecting Significantly Mutated Pathways in Cancer. Journal of Computational Biology. 18(3):507-22.

    [PDF] Publisher Link.

    [A preliminary version of the paper appeared in the Proceedings of the 14th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2010). [PDF] ]

    To download HotNet see the HotNet project page


    HotNet and Dendrix Visualization (Cytoscape plug-in)

    A Cytoscape plug-in for viewing HotNet and Dendrix results.


    NBC: Neighborhood Breakpoint Conservation

    This software finds recurrent rearrangement breakpoints in DNA copy number data. The algorithm is described in the paper:

    A. Ritz, P.L. Paris, M.M. Ittmann, C. Collins, and B.J. Raphael. (2011) Detection of Recurrent Rearrangement Breakpoints from Copy Number Data. BMC Bioinformatics. Publisher Link

    Gremlin: Genome Rearrangement Explorer with Multi-Scale, Linked Interactions:

    This is an interactive visualization model for the comparative analysis of structural variation in human and cancer genomes. The model is described in the following paper:

    T.M. O'Brien, A. Ritz, B.J. Raphael, and D.H. Laidlaw. (2010) Gremlin: An Interactive Visualization Model for Analyzing Genomic Rearrangements. IEEE Transactions on Visualization and Computer Graphics. vol.16, no.6, pp.918-926. Publisher Link

    Geometric Analysis of Structural Variants (GASV and GASVPro)

    Software for analysis of structural variation from paired-end sequencing and/or array-CGH data. This software has been tested and used to find structural variation in both normal and cancer genomes using data from a variety of next-generation sequencing platforms. It can be used to predict structural variants directly from aligned reads in SAM/BAM format.


    GASVPro is a probabilistic version of our original GASV algorithm. GASVPro combines two common signals of structural variation, read depth information and discordant paired-read mappings, into a single probabilistic model. When multiple alignments of a read are given, GASVPro uses a Markov Chain Monte Carlo procedure to sample over the space of possible alignments.

    GASVPro is available at the GASV GoogleCode site (Download). We also provide an Example Data Set for analysis with GASVPro.

    The GASVPro algorithm is described in the following paper:

    S. Sindi, S. Onal, L. Peng, H. Wu and B.J. Raphael. (2012) An Integrative Model for Identification of Structural Variation in Sequencing Data. Genome Biology (In Press)


    The original GASV method is described in the following paper:

    S. Sindi, E. Helman, A. Bashir, B.J. Raphael. (2009) A Geometric Approach for Classification and Comparison of Structural Variants. Bioinformatics. 25: i222-i230. (Special issue for the Joint 17th Annual International Conference on Intelligent Systems in Molecular Biology and 8th Annual International European Conference on Computational Biology (ISMB/ECCB 09)). Publisher Link

    Old versions. These are for archival purposes. It is recommended to download the latest version from the link above.

    • Version 1.4 (3/5/2010) . Download
    • Version 1.3 (1/19/2010) . Download
    • Example BAM file
    • Version 1.2 (11/30/2009) . Download: software
    • New in Version 1.4: Release notes.
    • New in Version 1.3: New output formats, streamlining of BAM file handling, bug fixes.
    • New in Version 1.2 (11/30/2009): Improved handling of SAM/BAM alignment files, speed improvements, maxCliqueSize option.
    • New in Version 1.1: a preprocessor for SAM/BAM files, aCGH comparison, fusion gene detection, and more.

    Motif Description Length (MoDL):

    MoDL finds multiple motifs in a set of phosphorylated peptides, and is described in the following paper:

    A. Ritz, G. Shakhnarovich, A.R. Salomon, and B. Raphael. (2009) Discovery of Phosphorylation Motif Mixtures in Phosphoproteomics Data. Bioinformatics. 25(1):14-21. Publisher Link

    Paired-End Reconstruction of Genome Organization (PREGO):
    Structural Variation Project Page

    This algorithm reconstructs a cancer genome as a rearrangement of segments, or intervals, from the reference genome using paired end sequencing data. The algorithm is described in the following paper:

    L. Oesper, A. Ritz, S.J. Aerni, R. Drebin, and B.J. Raphael. (2012) Reconstructing cancer genomes from paired-end sequencing data. BMC Bioinformatics. 13(Suppl 6):S10. Publisher Link.

    [Preliminary version accepted at 2nd Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMB-seq)]

    CURRENT RELEASE: PREGO Version 1.2 (5/29/2013) Download

    Old versions. These are for archival purposes. It is recommended to download the latest version from the link above.

    Tumor Heterogeneity Analysis (THetA)
    THetA Project Page

    This algorithm estimates tumor purity and clonal/subclonal copy number aberrations directly from high-throughput DNA sequencing data. We describe this algorithm in the following paper:

    L. Oesper, A. Mahmoody, and B.J. Raphael. (2013) THetA: Inferring intra-tumor heterogeneity from high-throughput DNA sequencing data. Genome Biology. 14:R80. [Publisher Link] [Supplemental Material]

    [Preliminary version accepted at 17th Annual International Conference on Research in Computational Molecular Biology (RECOMB). Extended Abstract]

    To download THetA please see the THetA project page.