#################################################################
# Copyright (C) 2011 Benjamin Raphael and Anna Ritz
# The files in this directory are part of NBC.
#
# NBC is free software: you can redistribute it and/or modify
# it under the terms of the GNU Lesser General Public License as
# published by the Free Software Foundation, either version 3 of the
# License, or (at your option) any later version.
# NBC is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Lesser General Public License for more details.
# You should have received a copy of the GNU Lesser General Public
# License along with NBC. If not, see .
# The copyrighted files are:
# - runNBC.jar
# - findBreakpointPairs.jar
# - findSingleBreakpoints.jar
# - ProbesToWindows.jar
#################################################################

This document describes how to run NBC (Neighborhood Breakpoint
Correlation) for aCGH data. Please cite the following reference:

Detection of Recurrent Rearrangement Breakpoints from Copy Number Data
BMC Bioinformatics, submitted
Anna Ritz, Pamela L. Paris, Michael M. Ittmann, Colin Collins,
Benjamin J. Raphael

Corresponding Authors:
Anna Ritz (aritz@cs.brown.edu)
Ben Raphael (braphael@brown.edu)

Version 1.0 (Oct. 1 2010)
Written by Anna Ritz

CONTENTS
I. INSTALLATION
II. PREPARING DATASETS
III. RUNNING NBC
   A) Running the segmentation algorithm
   B) Scoring breakpoints
IV. OUTPUT FILES
V. EXAMPLE RUN
VI. TROUBLESHOOTING

Note that this code is optimized to run in parallel on a cluster.

==================================================================
I. INSTALLATION

Download and extract NBCcode.tgz.
This directory contains the following files:
- README
- runNBC.jar (segmentation portion)
- ProbesToWindows.jar (segmentation converter)
- findSingleBreakpoints.jar
- findBreakpointPairs.jar
- runExample.bash
- example_input/ (input files for example)
- example_output/ (output files for example)

==================================================================
II. PREPARING DATASETS

To be able to run multiple jobs at the same time, the segmentation
algorithm (runNBC.jar) requires a particular format for each dataset.
First, each chromosomal arm is considered independently, so for each
patient there are 48 different files that can be run in parallel.

For a given chromosome and arm (say, Chr1 ArmP), NBC requires the
following files:

- Chr1ArmP.tsv: ordered list of '\t'.

Then, for each patient, we have

- _Chr1ArmP.tsv: vector of for the patient, ordered according to
  Chr1ArmP.tsv. Each patient is assumed to have some value for each
  probe.

Finally, we have a single set of genes or intervals of interest:

- genefile: '\t\t\t\t'

See the files in example_input/ for a reference.

==================================================================
III. RUNNING NBC

Running each program without any arguments produces the following
usage information.

A) Running the segmentation algorithm.

To segment a particular input file, use runNBC.jar.

USAGE: java -jar runNBC.jar
  INPUT : directory of input files
  INPUT : input .tsv file
  INPUT : directory for output file
  INPUT : number of standard deviations away that sig^2 and sig0^2
          are before segments are detected (optional: default = 3)
  OUTPUT: /_mat: matrix of sampled segmentations.

To convert the segmentation to window-based segmentations, use the
ProbesToWindows.jar program:

USAGE: java -jar ProbesToWindows.jar
  INPUT: : directory where matrix is
  INPUT: : directory where file of probes is
  INPUT: : chromosome
  INPUT: : chromosomal arm (P or Q)
  INPUT: file of RefSeq genes of interest.

B) Scoring breakpoints.
To find single breakpoints (either recurrent probe breakpoints or
recurrent interval breakpoints), use findSingleBreakpoints.jar.

USAGE: java -jar findSingleBreakpoints.jar
  INPUT: : input directory
  INPUT: : output directory
  INPUT: : chromosome
  INPUT: : chromosomal arm (P or Q)
  INPUT: : regular expression to filter TCGA patients.
  INPUT: : output file
  INPUT: : True for genes, False otherwise.
  OUTPUT: file of single breakpoint scores.

To find pairs of breakpoints (either recurrent probe breakpoints or
recurrent interval breakpoints), use findBreakpointPairs.jar.

USAGE: java -jar findBreakpointPairs.jar
  INPUT: : input directory
  INPUT: : output directory
  INPUT: : chromosome A
  INPUT: : chromosomal arm A (P or Q)
  INPUT: : chromosome B
  INPUT: : chromosomal arm B (P or Q)
  INPUT: : regular expression to filter TCGA patients.
  INPUT: : output file
  INPUT: : True for genes, False otherwise.
  OUTPUT: file of breakpoint pair scores.

==================================================================
IV. OUTPUT FILES

The output files for runNBC.jar are:

_mat: a binary file of 1000 segmentations sampled from the
  distribution of all segmentations given the data. This file is
  binary for speed.

_mat.output: a file of '\t\t\y' where each datapoint has a mean
  segmentation value, a signed probability, and an absolute
  probability. These files have the following columns:
  '\t\t

The output file for findSingleBreakpoints.jar and
findBreakpointPairs.jar is a tab-delimited file covering all
breakpoints or pairs of breakpoints. These scores have not been
corrected for multiple hypothesis testing.

==================================================================
V. EXAMPLE RUN

The example_input/ directory contains aCGH datasets (Chr7, ArmQ) from
6 patients that have a conserved breakpoint on PTPN12. These files are
from TCGA and are publicly available. They are trimmed to the first
2000 probes for speed considerations; thus, the results from this
example are different from the published results.
Additionally, there is a 'Chr7ArmQ.tsv' file and a 'Chr7ArmQ_win.tsv'
file that contain probe information and gene information,
respectively.

Note that segmented files are included in the example_output/
directory - these take a while (<10min) to generate, and if they are
not deleted the segmentation algorithm will exit automatically. Only
delete these if you want to resegment the data.

To run the example, simply type 'bash runExample.bash'.

==================================================================
VI. TROUBLESHOOTING

In preparing datasets:
- If there is a missing aCGH value for a probe, use 'NA' for this
  space.
- The program relies on the file convention '_Chr1ArmQ.tsv' for the
  input datasets, and 'Chr1ArmQ.tsv' for the input probe file.

In runNBC.jar:
- The program might run out of memory if a large chromosomal arm is
  being run. To prevent memory issues, use
  'java -Xms512m -Xmx2048m -jar runNBC.jar...'.
- If a filename with a '_mat' suffix exists, then the program
  automatically quits. This avoids overwriting files that have
  already finished.
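
The dataset notes above (one value per probe, 'NA' for missing values)
can be checked before submitting jobs to a cluster. The following is a
minimal Python sketch and is not part of NBC. It assumes that a patient
file holds one value per line and that the probe file holds one probe
per line; this README only says "vector" and "ordered list", so compare
against the files in example_input/ before relying on it. The function
names and the 'patient1' file prefix in the usage note are hypothetical.

```python
import math
from pathlib import Path


def format_value(value):
    """Return the text for one probe value; missing values become 'NA'."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return "NA"
    return repr(float(value))


def write_patient_file(path, values):
    """Write one value per line (assumed layout), using 'NA' for gaps."""
    text = "\n".join(format_value(v) for v in values) + "\n"
    Path(path).write_text(text)


def count_entries(path):
    """Count the non-empty lines in a file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())


def check_patient_file(probe_path, patient_path):
    """Return True if the patient file has exactly one entry per probe."""
    return count_entries(probe_path) == count_entries(patient_path)
```

For example, write_patient_file('patient1_Chr1ArmQ.tsv', values)
followed by check_patient_file('Chr1ArmQ.tsv', 'patient1_Chr1ArmQ.tsv')
would report whether the two files line up under these assumptions.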