/*
 * Copyright (c) 2012 The Broad Institute
 *
 * Permission is hereby granted, free of charge, to any person
 * obtaining a copy of this software and associated documentation
 * files (the "Software"), to deal in the Software without
 * restriction, including without limitation the rights to use,
 * copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the
 * Software is furnished to do so, subject to the following
 * conditions:
 *
 * The above copyright notice and this permission notice shall be
 * included in all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
 * OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
 * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
 * HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
 * WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
 * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR
 * THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 */

package org.broadinstitute.sting.utils;

import com.google.java.contract.Ensures;
import com.google.java.contract.Requires;
Replace DeBruijnAssembler with ReadThreadingAssembler

Problem
-------
The DeBruijn assembler was too slow. The cause of the slowness was the need to construct many kmer graphs (from the max read length in the interval down to 11-mers, in increments of 6 bp). Building many kmer graphs was necessary because the assembler (1) needed long kmers to assemble through regions where a shorter kmer was non-unique in the reference, as we couldn't split cycles in the reference, and (2) needed shorter kmers to be sensitive to differences from the reference near the edges of reads, which were often lost when a chain of longer kmers started before and ended after the variant.

Solution
--------
The read threading assembler uses fixed kmer sizes; this implementation builds two graphs by default, with kmer sizes 10 and 25. The algorithm operates as follows:

identify all non-unique kmers of size K among all reads and the reference
for each sequence (ref and read):
    find a unique starting position of the sequence in the graph by matching to a unique kmer, or start a new source node if none exists
    for each base in the sequence from the starting vertex kmer:
        look at the existing outgoing nodes of the current vertex V. If the base in the sequence matches the suffix of an outgoing vertex N, thread the sequence to N, and continue
        if no matching next vertex exists, find a unique vertex with kmer K. If one exists, merge the sequence into this vertex, and continue
        if a merge vertex cannot be found, create a new vertex (note this vertex may have a kmer identical to another in the graph, if it is not unique), thread the sequence to this vertex, and continue

This algorithm has a key property: it can robustly use a very short kmer without introducing cycles, because we create new paths through regions of the graph that aren't unique w.r.t. the sequence at the given kmer size. This allows us to assemble well even with very short kmers.
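The first step above, finding non-unique kmers, can be sketched as follows. This is a minimal illustration with hypothetical names, not the actual ReadThreadingAssembler code; the real implementation also tracks vertices and edges:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch of step one of the algorithm above: find every kmer of size K
// that occurs more than once across all input sequences (reads plus reference).
// These are the kmers that cannot serve as unique merge points in the graph.
public class ThreadingSketch {
    public static Set<String> nonUniqueKmers(final List<String> sequences, final int k) {
        final Set<String> seen = new HashSet<String>();
        final Set<String> nonUnique = new HashSet<String>();
        for (final String seq : sequences) {
            // kmers repeated within one sequence or across sequences both count
            for (int i = 0; i + k <= seq.length(); i++) {
                final String kmer = seq.substring(i, i + k);
                if (!seen.add(kmer))
                    nonUnique.add(kmer);
            }
        }
        return nonUnique;
    }

    public static void main(final String[] args) {
        final List<String> seqs = new ArrayList<String>();
        seqs.add("ACGTACGTAC"); // reference: ACGTA and CGTAC each occur twice
        seqs.add("ACGTAGGTAC"); // a read sharing the ACGTA prefix
        System.out.println(nonUniqueKmers(seqs, 5));
    }
}
```

During threading, any kmer in this set never becomes a merge target, so repeated sequence gets duplicated vertices instead of a cycle — which is what allows such short kmer sizes.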
This commit includes many critical changes to the HaplotypeCaller to make it fast, sensitive, and accurate on deep and shallow WGS and exomes. The key changes are highlighted below:
-- The ReadThreading assembler keeps track of the maximum edge multiplicity per sample in the graph, so that we prune per sample, not across all samples. This change is essential to operate effectively when there are many deep samples (i.e., 100 exomes)
-- A new pruning algorithm that will only prune linear paths where the maximum edge weight among all edges in the path is < pruningFactor. This makes pruning more robust when you have a long chain of bases that have high multiplicity at the start but only barely make it back into the main path in the graph.
-- We now do a global SmithWaterman to compute the cigar of a Path, instead of the previous bubble-based SmithWaterman optimization. This change is essential for us to get good variants from our paths when the kmer size is small. It also ensures that we produce a cigar from a path that depends only on the sequence of bases in the path, unlike the previous approach, which depended on both the bases and the way the path was decomposed into vertices, which in turn depended on the kmer size we used.
-- Removed MergeHeadlessIncomingSources, which was introducing problems in the graphs in some cases, and just isn't the safest operation. Since we build a kmer graph of size 10, this operation is no longer necessary, as it required a perfect match of 10 bp to merge anyway.
-- The old DeBruijnAssembler is still available with a command line option
-- The number of paths we take forward from each assembly graph is now capped at a factor per sample, so that we allow 128 paths for a single sample, up to 10 x nSamples as necessary. This is an essential change to make the system work well for large numbers of samples.
-- Added a global mismapping parameter to the HC likelihood calculation: phredScaledGlobalReadMismappingRate reflects the average global mismapping rate of all reads, regardless of their mapping quality. This term affects the probability that a read originated from the reference haplotype, regardless of its edit distance from the reference, in that the read could have originated from the reference haplotype but from another location in the genome. Suppose a read has many mismatches from the reference, say 5, but has a very high mapping quality of 60. Without this parameter, the read would contribute 5 * Q30 of evidence in favor of its 5-mismatch haplotype compared to the reference, potentially enough to make a call off that single read for all of these events. With this parameter set to Q30, though, the maximum evidence against the reference that this (and any) read could contribute is Q30. Controllable via a command line argument, defaulting to a Q60 rate. Results from 20:10-11 mb for this branch are consistent with the previous behavior, but this does help in cases where you have rare, very divergent haplotypes
-- Reduced ActiveRegionExtension from 200 bp to 100 bp, which is a performance win; the large extension is largely unnecessary with the short kmers used by the read threading assembler
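The capping arithmetic of the mismapping parameter can be sketched as below. Method names here are hypothetical, not the HC's actual likelihood code; it only illustrates the Q30 bound described above:

```java
// Sketch of the phredScaledGlobalReadMismappingRate cap described above.
// A read's log10 likelihood against the reference haplotype is floored at
// the read's best haplotype likelihood minus (rate / 10), so no single read
// can contribute more than that much log10 evidence against the reference.
public class MismappingCap {
    public static double cappedRefLog10Likelihood(final double refLog10Lik,
                                                  final double bestLog10Lik,
                                                  final int phredGlobalMismappingRate) {
        final double floor = bestLog10Lik - phredGlobalMismappingRate / 10.0;
        return Math.max(refLog10Lik, floor);
    }

    public static void main(final String[] args) {
        // 5 mismatches at Q30 would otherwise cost 5 * 3.0 = 15.0 log10 units
        final double uncapped = -5.0 - 15.0;
        // with the rate set to Q30, the penalty is bounded at 3.0 log10 units
        System.out.println(cappedRefLog10Likelihood(uncapped, -5.0, 30));
    }
}
```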
Infrastructure changes / improvements
-------------------------------------
-- Refactored BaseGraph to take a subclass of BaseEdge, so that we can use a MultiSampleEdge in the ReadThreadingAssembler
-- Refactored DeBruijnAssembler, moving common functionality into LocalAssemblyEngine, which now more directly manages the subclasses, requiring them only to implement an assemble() method that takes ref and reads and provides a List<SeqGraph>, which the LocalAssemblyEngine takes forward to compute haplotypes and other downstream operations. This leaves only a limited amount of code that differentiates the DeBruijn and ReadThreading assemblers
-- Refactored active region trimming code into an ActiveRegionTrimmer class
-- Cleaned up the arguments in HaplotypeCaller, reorganizing them and marking arguments @Hidden and @Advanced as appropriate. Renamed several arguments now that the read threading assembler is the default
-- LocalAssemblyEngineUnitTest reads in the reference sequence from b37 and assembles synthetic reads from intervals in the 10-11 mb range, with only the reference sequence as well as artificial SNPs, deletions, and insertions
-- Misc. updates to the Smith-Waterman code. Added a generic interface called, not surprisingly, SmithWaterman, making it easier to have alternative implementations
-- Many, many more unit tests throughout the entire assembler, and in random utilities

import net.sf.samtools.CigarOperator;
import net.sf.samtools.SAMFileHeader;
import net.sf.samtools.SAMProgramRecord;
import org.apache.log4j.Logger;
import org.broadinstitute.sting.gatk.GenomeAnalysisEngine;
import org.broadinstitute.sting.gatk.io.StingSAMFileWriter;
import org.broadinstitute.sting.utils.text.TextFormattingUtils;

import java.math.BigInteger;
import java.net.InetAddress;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.*;
/**
 * Created by IntelliJ IDEA.
 * User: depristo
 * Date: Feb 24, 2009
 * Time: 10:12:31 AM
 * To change this template use File | Settings | File Templates.
 */
public class Utils {

    /** our log, which we want to capture anything from this class */
    private static Logger logger = Logger.getLogger(Utils.class);

    public static final float JAVA_DEFAULT_HASH_LOAD_FACTOR = 0.75f;

    /**
     * Boolean xor operation. Only true if x != y.
     *
     * @param x a boolean
     * @param y a boolean
     * @return true if x != y
     */
    public static boolean xor(final boolean x, final boolean y) {
        return x != y;
    }

    /**
     * Calculates the optimum initial size for a hash table given the maximum number
     * of elements it will need to hold. The optimum size is the smallest size that
     * is guaranteed not to result in any rehash/table-resize operations.
     *
     * @param maxElements The maximum number of elements you expect the hash table
     *                    will need to hold
     * @return The optimum initial size for the table, given maxElements
     */
    public static int optimumHashSize ( int maxElements ) {
        return (int)(maxElements / JAVA_DEFAULT_HASH_LOAD_FACTOR) + 2;
    }
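The sizing arithmetic can be checked with a small standalone example. This mirrors the formula only, not HashMap's internal power-of-two capacity rounding:

```java
// Check of optimumHashSize's arithmetic: a table sized this way has a resize
// threshold (size * load factor) of at least maxElements, so inserting
// maxElements entries never pushes it past the threshold.
public class HashSizeDemo {
    public static final float LOAD_FACTOR = 0.75f;

    public static int optimumHashSize(final int maxElements) {
        return (int) (maxElements / LOAD_FACTOR) + 2;
    }

    public static void main(final String[] args) {
        final int n = 1000;
        final int size = optimumHashSize(n);
        // the threshold implied by the computed size covers all n elements
        assert n <= (int) (size * LOAD_FACTOR);
        System.out.println(size);
    }
}
```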

    /**
     * Compares two objects, either of which might be null.
     *
     * @param lhs One object to compare.
     * @param rhs The other object to compare.
     *
     * @return True if the two objects are equal, false otherwise.
     */
    public static boolean equals(Object lhs, Object rhs) {
        return lhs == null && rhs == null || lhs != null && lhs.equals(rhs);
    }

    public static <T> List<T> cons(final T elt, final List<T> l) {
        List<T> l2 = new ArrayList<T>();
        l2.add(elt);
        if (l != null) l2.addAll(l);
        return l2;
    }

    public static void warnUser(final String msg) {
        warnUser(logger, msg);
    }

    public static void warnUser(final Logger logger, final String msg) {
        logger.warn(String.format("********************************************************************************"));
        logger.warn(String.format("* WARNING:"));
        logger.warn(String.format("*"));
        prettyPrintWarningMessage(logger, msg);
        logger.warn(String.format("********************************************************************************"));
    }

    /**
     * pretty print the warning message supplied
     *
     * @param logger logger for the message
     * @param message the message
     */
    private static void prettyPrintWarningMessage(Logger logger, String message) {
        StringBuilder builder = new StringBuilder(message);
        while (builder.length() > 70) {
            int space = builder.lastIndexOf(" ", 70);
            if (space <= 0) space = 70;
            logger.warn(String.format("* %s", builder.substring(0, space)));
            builder.delete(0, space + 1);
        }
        logger.warn(String.format("* %s", builder));
    }

    /**
     * join the key value pairs of a map into one string, i.e. myMap = [A->1,B->2,C->3] with a call of:
     * joinMap("-","*",myMap) -> returns A-1*B-2*C-3
     *
     * Be forewarned: if you're not using a map that is aware of ordering (i.e. a HashMap instead of a LinkedHashMap),
     * the ordering of the string you get back might not be what you expect! (i.e. C-3*A-1*B-2 vs. A-1*B-2*C-3)
     *
     * @param keyValueSeperator the string to separate the key-value pairs
     * @param recordSeperator the string used to separate each key-value pair from other key-value pairs
     * @param map the map to draw from
     * @param <L> the map's key type
     * @param <R> the map's value type
     * @return a string representing the joined map
     */
    public static <L,R> String joinMap(String keyValueSeperator, String recordSeperator, Map<L,R> map) {
        if (map.size() < 1) { return null; }
        String joinedKeyValues[] = new String[map.size()];
        int index = 0;
        for (L key : map.keySet()) {
            joinedKeyValues[index++] = String.format("%s%s%s",key.toString(),keyValueSeperator,map.get(key).toString());
        }
        return join(recordSeperator,joinedKeyValues);
    }
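The ordering caveat can be demonstrated with a self-contained sketch. The join logic is re-implemented inline here so the example stands alone; it is not the Utils code itself:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Demonstrates the ordering caveat above: a LinkedHashMap iterates in
// insertion order, so the joined string is deterministic; with a plain
// HashMap the record order would be unspecified.
public class JoinMapDemo {
    public static String joinMap(final String kvSep, final String recSep,
                                 final Map<String, Integer> map) {
        final StringBuilder sb = new StringBuilder();
        for (final Map.Entry<String, Integer> e : map.entrySet()) {
            if (sb.length() > 0) sb.append(recSep);
            sb.append(e.getKey()).append(kvSep).append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(final String[] args) {
        final Map<String, Integer> m = new LinkedHashMap<String, Integer>();
        m.put("A", 1);
        m.put("B", 2);
        m.put("C", 3);
        System.out.println(joinMap("-", "*", m)); // A-1*B-2*C-3
    }
}
```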

    /**
     * Splits a String using indexOf instead of regex to speed things up.
     *
     * @param str the string to split.
     * @param delimiter the delimiter used to split the string.
     * @return an array of tokens.
     */
    public static ArrayList<String> split(String str, String delimiter) {
        return split(str, delimiter, 10);
    }

    /**
     * Splits a String using indexOf instead of regex to speed things up.
     *
     * @param str the string to split.
     * @param delimiter the delimiter used to split the string.
     * @param expectedNumTokens The number of tokens expected. This is used to initialize the ArrayList.
     * @return an array of tokens.
     */
    public static ArrayList<String> split(String str, String delimiter, int expectedNumTokens) {
        final ArrayList<String> result = new ArrayList<String>(expectedNumTokens);

        int delimiterIdx = -1;
        do {
            final int tokenStartIdx = delimiterIdx + 1;
            delimiterIdx = str.indexOf(delimiter, tokenStartIdx);
            final String token = (delimiterIdx != -1 ? str.substring(tokenStartIdx, delimiterIdx) : str.substring(tokenStartIdx) );
            result.add(token);
        } while( delimiterIdx != -1 );

        return result;
    }
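The empty-token behavior of this loop differs from String.split, which drops trailing empties. A self-contained copy of the loop shows it; note the scan advances one character past each match, so this sketch assumes single-character delimiters:

```java
import java.util.ArrayList;

// Self-contained copy of the indexOf-based split loop above. Unlike
// String.split, it keeps empty tokens, including the trailing one, because
// the do/while always emits the substring after the last delimiter.
public class SplitDemo {
    public static ArrayList<String> split(final String str, final String delimiter) {
        final ArrayList<String> result = new ArrayList<String>();
        int delimiterIdx = -1;
        do {
            final int tokenStartIdx = delimiterIdx + 1;
            delimiterIdx = str.indexOf(delimiter, tokenStartIdx);
            result.add(delimiterIdx != -1
                    ? str.substring(tokenStartIdx, delimiterIdx)
                    : str.substring(tokenStartIdx));
        } while (delimiterIdx != -1);
        return result;
    }

    public static void main(final String[] args) {
        System.out.println(split("a,,b,", ",")); // [a, , b, ]
    }
}
```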

    /**
     * join an array of strings given a separator
     * @param separator the string to insert between each array element
     * @param strings the array of strings
     * @return a string, which is the joining of all array values with the separator
     */
    public static String join(String separator, String[] strings) {
        return join(separator, strings, 0, strings.length);
    }

    public static String join(String separator, String[] strings, int start, int end) {
        if ((end - start) == 0) {
            return "";
        }
        StringBuilder ret = new StringBuilder(strings[start]);
        for (int i = start + 1; i < end; ++i) {
            ret.append(separator);
            ret.append(strings[i]);
        }
        return ret.toString();
    }

    public static String join(String separator, int[] ints) {
        if ( ints == null || ints.length == 0)
            return "";
        else {
Algorithmically faster version of DiffEngine
-- Now only includes leaf nodes in the summary, i.e., summaries of the form "*.*....*.X", which are really the most valuable to see. This calculation can be accomplished in linear time for N differences, rather than with the previous O(n^2) algorithm
-- Now computes the max number of elements to read correctly. It now counts the size of the entire element tree, not just the count of the roots, which was painful because the trees vary by orders of magnitude in size.
-- Because of this we can enforce a meaningful, useful value for the max elements in MD5 or 100K, and this works well.
-- Added integration tests for the new leaf and old pairwise calculations
-- Bugfix for Utils.join(sep, int[]) that was eating the first element of the AD, PL fields
            StringBuilder ret = new StringBuilder();
            ret.append(ints[0]);
            for (int i = 1; i < ints.length; ++i) {
                ret.append(separator);
                ret.append(ints[i]);
            }
            return ret.toString();
        }
    }

    /**
     * Create a new list that contains the elements of left along with the elements elts
     * @param left a non-null list of elements
     * @param elts a varargs vector of elts to append, in order, to left
     * @return a newly allocated linked list containing left followed by elts
     */
    public static <T> List<T> append(final List<T> left, T ... elts) {
        final List<T> l = new LinkedList<T>(left);
        l.addAll(Arrays.asList(elts));
        return l;
    }

    /**
     * Returns a string of the values in doubles joined by separator, such as A,B,C
     *
     * @param separator separator character
     * @param doubles the array with values
     * @return a string with the values separated by the separator
     */
    public static String join(String separator, double[] doubles) {
        if ( doubles == null || doubles.length == 0)
            return "";
        else {
            StringBuilder ret = new StringBuilder();
            ret.append(doubles[0]);
            for (int i = 1; i < doubles.length; ++i) {
                ret.append(separator);
                ret.append(doubles[i]);
            }
            return ret.toString();
        }
    }

    /**
     * Returns a string of the form elt1.toString() [sep elt2.toString() ... sep eltN.toString()] for a collection of
     * elti objects (note there's no actual space between sep and the elti elements). Returns
     * "" if the collection is empty. If the collection contains just elt, then returns elt.toString()
     *
     * @param separator the string to use to separate objects
     * @param objects a collection of objects. the element order is defined by the iterator over objects
     * @param <T> the type of the objects
     * @return a non-null string
     */
    public static <T> String join(final String separator, final Collection<T> objects) {
        if (objects.isEmpty()) { // fast path for empty collection
            return "";
        } else {
            final Iterator<T> iter = objects.iterator();
            final T first = iter.next();

            if ( ! iter.hasNext() ) // fast path for singleton collections
                return first.toString();
            else { // full path for 2+ collections that actually need a join
                final StringBuilder ret = new StringBuilder(first.toString());
                while (iter.hasNext()) {
                    ret.append(separator);
                    ret.append(iter.next().toString());
                }
                return ret.toString();
            }
        }
    }
NA12878 knowledge base backed by MongoDB
-- The idea is simply to create a persistent database of all TP/FP sites on chr20 in NA12878. Individual callsets can be imported, and a consensus algorithm is run over all callsets in the database to create a consensus collection, which can be used to assess NA12878 callsets for GATK and methods development
-- Framework for representing simple VariantContexts and Genotypes in MongoDB, querying for records, and iterating over them in the GATK
-- Not hooked up to Tribble, but that could be done reasonably easily now (future TODO)
-- Tools to import callsets, create consensus callsets, and import and export reviews
-- Scripts to reset the knowledge base and repopulate it with the standard data files (Eric will expand)
-- Actually scales to all of chr20; includes AssessNA12878, which reads a VCF and itemizes it against the truth data set
-- ImportCallset can load OMNI, HM3, CEU best practices, and Mills/Devine sites and genotypes, properly marking sites as poly/mono/unk as well as TP/FP/UNK based on command line parameters
-- Added shell scripts that start up a local mongo db, that connect to a local or BI-hosted mongo for NA12878.db for debugging, and a setupNA12878db script that can load OMNI, HM3, CEU best practices, and Mills/Devine into the db and then update the consensus
-- Reviewed sites can be exported to a VCF and imported again, as a mechanism to safely store the only non-recoverable data from the Mongo DB
-- Created a NA12878DBWalker that manages the outer DB interaction and that all MongoDB-interacting walkers inherit from. Added a NA12878DBArgumentCollection.java consolidating all of the common command line arguments (though strictly not necessary, as all of this occurs in the root walker)

UnitTests
-- Can connect to a test knowledge base for development and unit testing
-- PolymorphicStatus, TruthStatus, SiteIterator
-- NA12878KBUnitTestBase provides simple utilities for connecting to the test mongo db, getting calls, etc.
-- MongoVariantContext tests creation, matching, and encoding -> writing -> reading -> decoding from the mongodb

AssessNA12878
-- Generic tool for comparing a NA12878 callset against the knowledge base. See http://gatkforums.broadinstitute.org/discussion/1848/using-the-na12878-knowledge-base for detailed documentation
-- Performs trivial filtering on FS, MQ, and QD for SNPs and non-SNPs, to separate out variants likely to be filtered from those that are honest-to-goodness FPs

Misc
-- Ability to provide a Description for the simplified GATK report

    public static <T> String join(final String separator, final T ... objects) {
        return join(separator, Arrays.asList(objects));
    }
Final version of PairHMMs with correct edge conditions
-- Uses 1/N, for N potential start sites, as the probability of starting at any one of the potential start sites
-- Added a flag that says to use the original edge condition, respected by all subclasses. This brings the new code back to the original state, but with all of the cleanup I've done
-- Only test configurations where the read length <= haplotype length. I think this is actually the contract, but we'll talk about this tomorrow
-- Fixed an egregious bug with the myLog10SumLog10 function doing the exact opposite of the requested arguments, so that doExact really meant don't do exact
-- PairHMM now exposes computeReadLikelihoodGivenHaplotypeLog10, but subclasses must overload subComputeReadLikelihoodGivenHaplotypeLog10. This protected function does the work, and the public function does argument and result QC
-- Have to be more tolerant of the reference (approximate) HMM. All unit tests from the original HMM implementations pass now
-- Added lots of docs
-- Generalized unit tests with multiple equivalent matches of read to haplotype
-- Added runtime argument checking for initialize and computeReadLikelihoodGivenHaplotypeLog10
-- Functions to dumpMatrices for debugging
-- Fixed a nasty bug (without original unit tests) in LoglessPairHMM
-- Max read and haplotype lengths only worked in the previous code if they were exactly equal to the provided read and haplotype sizes. Fixed the bug. Added a unit test to ensure this doesn't break again
-- Added a dupString(string, n) method to Utils
-- Added TODOs for the next commit. Need to compute the number of potential start sites not in initialize but in the calc routine, since this number depends not on the max sizes but on the actual read sizes
-- Unit tests for the hapStartIndex functionality of PairHMM
-- Moved computeFirstDifferingPosition to PairHMM, and added unit tests
-- Added extensive unit tests for the hapStartIndex functionality of computeReadLikelihoodGivenHaplotypeLog10
-- Still TODOs left in the code that I'll fix up
-- Logless now computes constants, if they haven't yet been initialized, even if you forgot to say so
-- General: the likelihood penalty for potential start sites is now properly computed against the actual read and reference bases, not the maximum. This involved moving some initialize() code into the computeLikelihoods function. That's ok because all of the potential log10 functions are actually going to cached versions, so the slowdown is minimal
-- Added some unit tests to ensure that common errors (providing haplotypes too long, reads too long, not initializing the HMM) are captured as errors

    /**
     * Create a new string that is n duplicate copies of s
     * @param s the string to duplicate
     * @param nCopies how many copies?
     * @return a string
     */
    public static String dupString(final String s, int nCopies) {
        if ( s == null || s.equals("") ) throw new IllegalArgumentException("Bad s " + s);
        if ( nCopies < 0 ) throw new IllegalArgumentException("nCopies must be >= 0 but got " + nCopies);

        final StringBuilder b = new StringBuilder();
        for ( int i = 0; i < nCopies; i++ )
            b.append(s);
        return b.toString();
    }

    public static String dupString(char c, int nCopies) {
        char[] chars = new char[nCopies];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    public static byte[] dupBytes(byte b, int nCopies) {
        byte[] bytes = new byte[nCopies];
        Arrays.fill(bytes, b);
        return bytes;
    }

    // trim a string for the given character (i.e. not just whitespace)
    public static String trim(String str, char ch) {
        char[] array = str.toCharArray();

        int start = 0;
        while ( start < array.length && array[start] == ch )
            start++;

        int end = array.length - 1;
        while ( end > start && array[end] == ch )
            end--;

        return str.substring(start, end+1);
    }

    /**
     * Splits expressions in command args by spaces and returns the array of expressions.
     * Expressions may use single or double quotes to group any individual expression, but not both.
     * @param args Arguments to parse.
     * @return Parsed expressions.
     */
    public static String[] escapeExpressions(String args) {
        // special case for ' and " so we can allow expressions
        if (args.indexOf('\'') != -1)
            return escapeExpressions(args, "'");
        else if (args.indexOf('\"') != -1)
            return escapeExpressions(args, "\"");
        else
            return args.trim().split(" +");
    }
|
|
|
|
|
|
|
|
|
/**
|
|
|
|
|
* Splits expressions in command args by spaces and the supplied delimiter and returns the array of expressions.
|
|
|
|
|
* @param args Arguments to parse.
|
|
|
|
|
* @param delimiter Delimiter for grouping expressions.
|
|
|
|
|
* @return Parsed expressions.
|
|
|
|
|
*/
|
|
|
|
|
private static String[] escapeExpressions(String args, String delimiter) {
|
|
|
|
|
String[] command = {};
|
|
|
|
|
String[] split = args.split(delimiter);
|
2010-11-23 06:59:42 +08:00
|
|
|
String arg;
|
2010-11-13 04:14:28 +08:00
|
|
|
for (int i = 0; i < split.length - 1; i += 2) {
|
2010-11-23 06:59:42 +08:00
|
|
|
arg = split[i].trim();
|
|
|
|
|
if (arg.length() > 0) // if the unescaped arg has a size
|
2011-03-26 08:41:47 +08:00
|
|
|
command = Utils.concatArrays(command, arg.split(" +"));
|
2010-11-13 04:14:28 +08:00
|
|
|
command = Utils.concatArrays(command, new String[]{split[i + 1]});
|
|
|
|
|
}
|
2010-11-23 06:59:42 +08:00
|
|
|
arg = split[split.length - 1].trim();
|
|
|
|
|
if (split.length % 2 == 1) // if the command ends with a delimiter
|
|
|
|
|
if (arg.length() > 0) // if the last unescaped arg has a size
|
2011-03-26 08:41:47 +08:00
|
|
|
command = Utils.concatArrays(command, arg.split(" +"));
|
2010-11-23 06:59:42 +08:00
|
|
|
return command;
|
2010-11-13 04:14:28 +08:00
|
|
|
}

    /**
     * Concatenates two String arrays.
     * @param A First array.
     * @param B Second array.
     * @return Concatenation of A then B.
     */
    public static String[] concatArrays(String[] A, String[] B) {
        String[] C = new String[A.length + B.length];
        System.arraycopy(A, 0, C, 0, A.length);
        System.arraycopy(B, 0, C, A.length, B.length);
        return C;
    }

    /**
     * Concatenates byte arrays
     * @return a concat of all bytes in allBytes in order
     */
    public static byte[] concat(final byte[] ... allBytes) {
        int size = 0;
        for ( final byte[] bytes : allBytes ) size += bytes.length;

        final byte[] c = new byte[size];
        int offset = 0;
        for ( final byte[] bytes : allBytes ) {
            System.arraycopy(bytes, 0, c, offset, bytes.length);
            offset += bytes.length;
        }

        return c;
    }
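The varargs concatenation does two passes — one to size the result, one to copy — so no intermediate arrays are allocated. A minimal standalone sketch (`ConcatDemo` is a hypothetical wrapper, not part of `Utils`):

```java
class ConcatDemo {
    static byte[] concat(final byte[]... allBytes) {
        // pass 1: total size
        int size = 0;
        for (final byte[] bytes : allBytes) size += bytes.length;
        // pass 2: copy each source array into place
        final byte[] c = new byte[size];
        int offset = 0;
        for (final byte[] bytes : allBytes) {
            System.arraycopy(bytes, 0, c, offset, bytes.length);
            offset += bytes.length;
        }
        return c;
    }

    public static void main(String[] args) {
        byte[] joined = concat("AC".getBytes(), "GT".getBytes());
        System.out.println(new String(joined)); // ACGT
    }
}
```

Calling it with no arguments returns a zero-length array rather than null.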

    /**
     * Appends String(s) B to array A.
     * @param A First array.
     * @param B Strings to append.
     * @return A with B(s) appended.
     */
    public static String[] appendArray(String[] A, String... B) {
        return concatArrays(A, B);
    }

    public static <T extends Comparable<T>> List<T> sorted(Collection<T> c) {
        return sorted(c, false);
    }

    public static <T extends Comparable<T>> List<T> sorted(Collection<T> c, boolean reverse) {
        List<T> l = new ArrayList<T>(c);
        Collections.sort(l);
        if ( reverse ) Collections.reverse(l);
        return l;
    }

    public static <T extends Comparable<T>, V> List<V> sorted(Map<T,V> c) {
        return sorted(c, false);
    }

    public static <T extends Comparable<T>, V> List<V> sorted(Map<T,V> c, boolean reverse) {
        List<T> t = new ArrayList<T>(c.keySet());
        Collections.sort(t);
        if ( reverse ) Collections.reverse(t);

        List<V> l = new ArrayList<V>();
        for ( T k : t ) {
            l.add(c.get(k));
        }
        return l;
    }
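The `Map` overload of `sorted` returns the map's values ordered by their (Comparable) keys, which makes the result deterministic even for an unordered `HashMap`. A standalone sketch (renamed `sortedValues` here purely to avoid overload clutter in a one-class demo):

```java
import java.util.*;

class SortedMapDemo {
    static <T extends Comparable<T>, V> List<V> sortedValues(Map<T, V> c, boolean reverse) {
        List<T> keys = new ArrayList<T>(c.keySet());
        Collections.sort(keys);                    // order by key, not value
        if (reverse) Collections.reverse(keys);
        List<V> values = new ArrayList<V>();
        for (T k : keys)
            values.add(c.get(k));
        return values;
    }

    public static void main(String[] args) {
        Map<Integer, String> m = new HashMap<Integer, String>();
        m.put(2, "two"); m.put(1, "one"); m.put(3, "three");
        System.out.println(sortedValues(m, false)); // [one, two, three]
        System.out.println(sortedValues(m, true));  // [three, two, one]
    }
}
```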

    /**
     * Reverse a byte array of bases
     *
     * @param bases the byte array of bases
     * @return the reverse of the base byte array
     */
    static public byte[] reverse(byte[] bases) {
        byte[] rcbases = new byte[bases.length];

        for (int i = 0; i < bases.length; i++) {
            rcbases[i] = bases[bases.length - i - 1];
        }

        return rcbases;
    }

    static public <T> List<T> reverse(final List<T> l) {
        final List<T> newL = new ArrayList<T>(l);
        Collections.reverse(newL);
        return newL;
    }

    /**
     * Reverse an int array of bases
     *
     * @param bases the int array of bases
     * @return the reverse of the base int array
     */
    static public int[] reverse(int[] bases) {
        int[] rcbases = new int[bases.length];

        for (int i = 0; i < bases.length; i++) {
            rcbases[i] = bases[bases.length - i - 1];
        }

        return rcbases;
    }

    /**
     * Reverse (NOT reverse-complement!!) a string
     *
     * @param bases input string
     * @return the reversed string
     */
    static public String reverse(String bases) {
        return new String( reverse( bases.getBytes() ) );
    }

    public static boolean isFlagSet(int value, int flag) {
        return ((value & flag) == flag);
    }

    /**
     * Helper utility that calls into the InetAddress system to resolve the hostname.
     * If this fails, "unresolvable" is returned instead.
     */
    public static String resolveHostname() {
        try {
            return InetAddress.getLocalHost().getCanonicalHostName();
        } catch (java.net.UnknownHostException uhe) {
            return "unresolvable";
        }
    }

    public static byte[] arrayFromArrayWithLength(byte[] array, int length) {
        byte[] output = new byte[length];
        for (int j = 0; j < length; j++)
            output[j] = array[j % array.length];
        return output;
    }

    public static void fillArrayWithByte(byte[] array, byte value) {
        for (int i = 0; i < array.length; i++)
            array[i] = value;
    }

    /**
     * Creates a new header with the given program record added to the list of program records (@PG tags),
     * replacing any existing record whose ID matches.
     *
     * @param originalHeader original header
     * @param programRecord the program record for this program
     * @return the new header
     */
    public static SAMFileHeader setupWriter(final SAMFileHeader originalHeader, final SAMProgramRecord programRecord) {
        final SAMFileHeader header = originalHeader.clone();
        final List<SAMProgramRecord> oldRecords = header.getProgramRecords();
        final List<SAMProgramRecord> newRecords = new ArrayList<SAMProgramRecord>(oldRecords.size() + 1);
        for ( SAMProgramRecord record : oldRecords )
            if ( programRecord != null && !record.getId().startsWith(programRecord.getId()) )
                newRecords.add(record);

        if ( programRecord != null ) {
            newRecords.add(programRecord);
            header.setProgramRecords(newRecords);
        }
        return header;
    }

    /**
     * Creates a program record for the program, adds it to the list of program records (@PG tags) in the bam file and returns
     * the new header to be added to the BAM writer.
     *
     * @param toolkit the engine
     * @param originalHeader the original header to clone
     * @param walker the walker object (so we can extract the command line)
     * @param PROGRAM_RECORD_NAME the name for the PG tag
     * @return a pre-filled header for the bam writer
     */
    public static SAMFileHeader setupWriter(final GenomeAnalysisEngine toolkit, final SAMFileHeader originalHeader, final Object walker, final String PROGRAM_RECORD_NAME) {
        final SAMProgramRecord programRecord = createProgramRecord(toolkit, walker, PROGRAM_RECORD_NAME);
        return setupWriter(originalHeader, programRecord);
    }

    /**
     * Creates a program record for the program, adds it to the list of program records (@PG tags) in the bam file and sets
     * up the writer with the header and presorted status.
     *
     * @param writer BAM file writer
     * @param toolkit the engine
     * @param originalHeader the original header to clone
     * @param preSorted whether or not reads added to the writer are already sorted
     * @param walker the walker object (so we can extract the command line)
     * @param PROGRAM_RECORD_NAME the name for the PG tag
     */
    public static void setupWriter(StingSAMFileWriter writer, GenomeAnalysisEngine toolkit, SAMFileHeader originalHeader, boolean preSorted, Object walker, String PROGRAM_RECORD_NAME) {
        SAMFileHeader header = setupWriter(toolkit, originalHeader, walker, PROGRAM_RECORD_NAME);
        writer.writeHeader(header);
        writer.setPresorted(preSorted);
    }

    /**
     * Creates a program record (@PG) tag
     *
     * @param toolkit the engine
     * @param walker the walker object (so we can extract the command line)
     * @param PROGRAM_RECORD_NAME the name for the PG tag
     * @return a program record for the tool
     */
    public static SAMProgramRecord createProgramRecord(GenomeAnalysisEngine toolkit, Object walker, String PROGRAM_RECORD_NAME) {
        final SAMProgramRecord programRecord = new SAMProgramRecord(PROGRAM_RECORD_NAME);
        final ResourceBundle headerInfo = TextFormattingUtils.loadResourceBundle("StingText");
        try {
            final String version = headerInfo.getString("org.broadinstitute.sting.gatk.version");
            programRecord.setProgramVersion(version);
        } catch (MissingResourceException e) {
            // the version is simply left unset if the resource is missing
        }
        programRecord.setCommandLine(toolkit.createApproximateCommandLineArgumentString(toolkit, walker));
        return programRecord;
    }

    /**
     * Returns the number of combinations represented by this collection
     * of collection of options.
     *
     * For example, if this is [[A, B], [C, D], [E, F, G]] returns 2 * 2 * 3 = 12
     */
    @Requires("options != null")
    public static <T> int nCombinations(final Collection<T>[] options) {
        int nStates = 1;
        for ( Collection<T> states : options ) {
            nStates *= states.size();
        }
        return nStates;
    }

    @Requires("options != null")
    public static <T> int nCombinations(final List<List<T>> options) {
        if ( options.isEmpty() )
            return 0;
        else {
            int nStates = 1;
            for ( Collection<T> states : options ) {
                nStates *= states.size();
            }
            return nStates;
        }
    }

    /**
     * Make all combinations of N size of objects
     *
     * if objects = [A, B, C]
     * if N = 1 => [[A], [B], [C]]
     * if N = 2 => [[A, A], [B, A], [C, A], [A, B], [B, B], [C, B], [A, C], [B, C], [C, C]]
     *
     * @param objects list of objects
     * @param n size of each combination
     * @param withReplacement if false, the resulting permutations will only contain unique objects from objects
     * @return a list with all combinations with size n of objects.
     */
    public static <T> List<List<T>> makePermutations(final List<T> objects, final int n, final boolean withReplacement) {
        final List<List<T>> combinations = new ArrayList<List<T>>();

        if ( n == 1 ) {
            for ( final T o : objects )
                combinations.add(Collections.singletonList(o));
        } else if ( n > 1 ) {
            final List<List<T>> sub = makePermutations(objects, n - 1, withReplacement);
            for ( List<T> subI : sub ) {
                for ( final T a : objects ) {
                    if ( withReplacement || ! subI.contains(a) )
                        combinations.add(Utils.cons(a, subI));
                }
            }
        }

        return combinations;
    }

    /**
     * Convenience function that formats the novelty rate as a %.2f string
     *
     * @param known number of variants from all that are known
     * @param all number of all variants
     * @return a String novelty rate, or NA if all == 0
     */
    public static String formattedNoveltyRate(final int known, final int all) {
        return formattedPercent(all - known, all);
    }

    /**
     * Convenience function that formats the percent as a %.2f string
     *
     * @param x number of objects part of total that meet some criteria
     * @param total count of all objects, including x
     * @return a String percent rate, or NA if total == 0
     */
    public static String formattedPercent(final long x, final long total) {
        return total == 0 ? "NA" : String.format("%.2f", (100.0 * x) / total);
    }

    /**
     * Convenience function that formats a ratio as a %.2f string
     *
     * @param num number of observations in the numerator
     * @param denom number of observations in the denominator
     * @return a String formatted ratio, or NA if denom == 0
     */
    public static String formattedRatio(final long num, final long denom) {
        return denom == 0 ? "NA" : String.format("%.2f", num / (1.0 * denom));
    }
|
2012-04-11 21:41:45 +08:00
|
|
|
|
|
|
|
|
/**
|
|
|
|
|
* Create a constant map that maps each value in values to itself
|
|
|
|
|
*/
|
|
|
|
|
public static <T> Map<T, T> makeIdentityFunctionMap(Collection<T> values) {
|
|
|
|
|
Map<T,T> map = new HashMap<T, T>(values.size());
|
|
|
|
|
for ( final T value : values )
|
|
|
|
|
map.put(value, value);
|
|
|
|
|
return Collections.unmodifiableMap(map);
|
|
|
|
|
}

    /**
     * Divides the input list into a list of sublists, each of which contains groupSize elements (except potentially the last one)
     *
     * list = [A, B, C, D, E]
     * groupSize = 2
     * result = [[A, B], [C, D], [E]]
     */
    public static <T> List<List<T>> groupList(final List<T> list, final int groupSize) {
        if ( groupSize < 1 ) throw new IllegalArgumentException("groupSize must be >= 1");

        final List<List<T>> subLists = new LinkedList<List<T>>();
        final int n = list.size();
        for ( int i = 0; i < n; i += groupSize ) {
            subLists.add(list.subList(i, Math.min(i + groupSize, n)));
        }
        return subLists;
    }

    /**
     * @see #calcMD5(byte[])
     */
    public static String calcMD5(final String s) throws NoSuchAlgorithmException {
        return calcMD5(s.getBytes());
    }

    /**
     * Calculate the md5 for bytes, and return the result as a 32 character string
     *
     * @param bytes the bytes to calculate the md5 of
     * @return the md5 of bytes, as a 32-character long string
     * @throws NoSuchAlgorithmException if the MD5 digest algorithm is unavailable
     */
    @Ensures({"result != null", "result.length() == 32"})
    public static String calcMD5(final byte[] bytes) throws NoSuchAlgorithmException {
        if ( bytes == null ) throw new IllegalArgumentException("bytes cannot be null");
        final byte[] thedigest = MessageDigest.getInstance("MD5").digest(bytes);
        final BigInteger bigInt = new BigInteger(1, thedigest);

        String md5String = bigInt.toString(16);
        while (md5String.length() < 32) md5String = "0" + md5String; // pad to length 32
        return md5String;
    }

    /**
     * Does big end with the exact sequence of bytes in suffix?
     *
     * @param big a non-null byte[] to test whether it ends with suffix
     * @param suffix a non-null byte[] to test if it's a suffix of big
     * @return true if big ends with suffix
     */
    public static boolean endsWith(final byte[] big, final byte[] suffix) {
        if ( big == null ) throw new IllegalArgumentException("big cannot be null");
        if ( suffix == null ) throw new IllegalArgumentException("suffix cannot be null");
        return new String(big).endsWith(new String(suffix));
    }

    /**
     * Get the length of the longest common prefix of seq1 and seq2
     *
     * @param seq1 non-null byte array
     * @param seq2 non-null byte array
     * @param maxLength the maximum allowed length to return
     * @return the length of the longest common prefix of seq1 and seq2, >= 0
     */
    public static int longestCommonPrefix(final byte[] seq1, final byte[] seq2, final int maxLength) {
        if ( seq1 == null ) throw new IllegalArgumentException("seq1 is null");
        if ( seq2 == null ) throw new IllegalArgumentException("seq2 is null");
        if ( maxLength < 0 ) throw new IllegalArgumentException("maxLength < 0 " + maxLength);

        final int end = Math.min(seq1.length, Math.min(seq2.length, maxLength));
        for ( int i = 0; i < end; i++ ) {
            if ( seq1[i] != seq2[i] )
                return i;
        }
        return end;
    }

    /**
     * Get the length of the longest common suffix of seq1 and seq2
     *
     * @param seq1 non-null byte array
     * @param seq2 non-null byte array
     * @param maxLength the maximum allowed length to return
     * @return the length of the longest common suffix of seq1 and seq2, >= 0
     */
    public static int longestCommonSuffix(final byte[] seq1, final byte[] seq2, final int maxLength) {
        if ( seq1 == null ) throw new IllegalArgumentException("seq1 is null");
        if ( seq2 == null ) throw new IllegalArgumentException("seq2 is null");
        if ( maxLength < 0 ) throw new IllegalArgumentException("maxLength < 0 " + maxLength);

        final int end = Math.min(seq1.length, Math.min(seq2.length, maxLength));
        for ( int i = 0; i < end; i++ ) {
            if ( seq1[seq1.length - i - 1] != seq2[seq2.length - i - 1] )
                return i;
        }
        return end;
    }

    /**
     * Trim any number of bases from the front and/or back of an array
     *
     * @param seq the sequence to trim
     * @param trimFromFront how much to trim from the front
     * @param trimFromBack how much to trim from the back
     * @return a non-null array; can be the original array (i.e. not a copy)
     */
    public static byte[] trimArray(final byte[] seq, final int trimFromFront, final int trimFromBack) {
        if ( trimFromFront + trimFromBack > seq.length )
            throw new IllegalArgumentException("trimming total is larger than the original array");

        // don't perform array copies if we need to copy everything anyways
        return ( trimFromFront == 0 && trimFromBack == 0 ) ? seq : Arrays.copyOfRange(seq, trimFromFront, seq.length - trimFromBack);
    }
}