Changes to the build script that allow the external directory to have tests.
This means groups like CGA don't have to reinvent the wheel on testing, and
can instead use the GATK's unit and integration tests.
Signed-off-by: David Roazen <droazen@broadinstitute.org>
-- Print out the name / db of the SQL server, not a Python connection object
-- Print out the ID, not a Python object, of the XML record that fails to convert
-- Instead issue a warning when a large (>1MB) record is encountered
-- Optimized ref.getBytes()[i] => (byte)ref.charAt(i), which avoids an implicit O(n) array allocation on every iteration of the loop in computeReverseClipping()
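
A minimal sketch of why the getBytes() change matters. This is not the actual computeReverseClipping() code: the class and method names below are hypothetical, and the loop only counts how many trailing bases of an allele match the reference. It assumes the reference string is pure ASCII, which is what makes the (byte) cast equivalent to indexing into getBytes().

```java
import java.nio.charset.StandardCharsets;

final class ReverseClipSketch {
    // Slow variant: getBytes() encodes the whole string into a fresh
    // byte[] every time the loop condition is evaluated, so each
    // iteration costs O(ref.length()).
    static int trailingMatchSlow(String ref, byte[] allele) {
        int n = 0;
        while (n < allele.length && n < ref.length()
                && allele[allele.length - 1 - n]
                   == ref.getBytes(StandardCharsets.US_ASCII)[ref.length() - 1 - n]) {
            n++;
        }
        return n;
    }

    // Fast variant: charAt() reads a single character with no allocation.
    // The cast is only equivalent to the slow variant for ASCII input.
    static int trailingMatchFast(String ref, byte[] allele) {
        int n = 0;
        while (n < allele.length && n < ref.length()
                && allele[allele.length - 1 - n]
                   == (byte) ref.charAt(ref.length() - 1 - n)) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        byte[] allele = "TTACG".getBytes(StandardCharsets.US_ASCII);
        // Both variants return the same count; only the cost differs.
        System.out.println(trailingMatchSlow("GGACG", allele));
        System.out.println(trailingMatchFast("GGACG", allele));
    }
}
```

With a reference of length n and a loop that can run up to n times, the slow variant does O(n^2) work in the worst case, while the fast one stays O(n).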
When run without an explicit -onlyCoding option, as we've been doing up to
now, snpEff automatically sets -onlyCoding to "true" provided that there is
at least one transcript marked as "protein_coding", which will always be the
case for us in practice (and indeed, all pipeline runs so far with snpEff
2.0.5 have run with -onlyCoding auto-set to "true").
However, given the disastrous effect that setting "-onlyCoding false" has
on annotation quality, we wish to set this option explicitly rather than
rely on snpEff's auto-detection logic.
-- samtools can emit alleles where the ref is 42M Ns and this caused the GATK (via tribble) to hang in several places.
-- Tribble was updated so we actually could read the line properly (rev. to 51 here).
-- Still, the parsing algorithms in the GATK aren't happy with such a long allele. Instead of optimizing the code around an improper use case, I put in a limit of 2^16 bp for any allele, and a meaningful exception is thrown when a longer one is encountered.
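
A rough sketch of the kind of length guard described above. The class name, method names, exception type, and message are all hypothetical, not the GATK's actual code. It also assumes that an allele of exactly 2^16 bases is still accepted, which the note above does not specify.

```java
import java.util.Arrays;

final class AlleleLengthSketch {
    // 2^16 bp, the limit mentioned above.
    static final int MAX_ALLELE_LENGTH = 1 << 16;

    // Rejects over-long alleles up front instead of letting downstream
    // parsing grind on them. Assumption: exactly 2^16 bases is allowed.
    static void checkAlleleLength(String bases) {
        if (bases.length() > MAX_ALLELE_LENGTH) {
            throw new IllegalArgumentException(
                "Allele of length " + bases.length()
                + " exceeds the maximum allowed length of " + MAX_ALLELE_LENGTH + " bp");
        }
    }

    // Helper for building a run of Ns, like the samtools output above.
    static String repeatN(int count) {
        char[] chars = new char[count];
        Arrays.fill(chars, 'N');
        return new String(chars);
    }

    public static void main(String[] args) {
        checkAlleleLength(repeatN(MAX_ALLELE_LENGTH)); // accepted
        try {
            checkAlleleLength(repeatN(MAX_ALLELE_LENGTH + 1));
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```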
A few updates to the CalibrateGenotypeLikelihoods walker output
* Fixed ggplot2 issue with datasets that have poor coverage
* Added jitter as default geometry
* Dropped the cut by technology from the graphs
CalibrateGenotypeLikelihoods supports using an external VCF as input for genotype likelihoods. Currently this can be a per-sample VCF; the methods that would allow a read-group VCF to be used are not yet implemented.
Removed the old constrained genotyping code from UGE -- the trellis it calculated is exactly the same as the one computed for the MLE AC estimate, so we should just re-use that one.
* Assert that the user provided a VCF, not some other type of ROD
* Assert that the VCF has samples
* Assert that the samples in the BAM exist in the VCF
* Warn the user if not all samples in the BAM are present in the VCF
* Using filter() instead of map() makes for a cleaner walker.
* Renaming the unit tests so their names make more sense alongside the other unit and integration tests
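
One possible reading of the BAM/VCF sample checks in the list above, as a sketch only. The class and method names are hypothetical and this is not the walker's actual code. It assumes that having no BAM sample in the VCF is an error, while a partial overlap only produces a warning. The sample names in main() are purely illustrative.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.TreeSet;

final class SampleCheckSketch {
    // Returns the BAM samples that are missing from the VCF (possibly empty).
    // Throws if the VCF has no samples, or if it shares none with the BAM.
    static Set<String> missingSamples(Set<String> bamSamples, Set<String> vcfSamples) {
        if (vcfSamples.isEmpty()) {
            throw new IllegalArgumentException("The input VCF contains no samples");
        }
        Set<String> missing = new TreeSet<>(bamSamples);
        missing.removeAll(vcfSamples);
        // Assumption: zero overlap is fatal, partial overlap is not.
        if (!bamSamples.isEmpty() && missing.size() == bamSamples.size()) {
            throw new IllegalArgumentException("None of the BAM samples are present in the VCF");
        }
        if (!missing.isEmpty()) {
            System.err.println("WARNING: BAM samples not found in the VCF: " + missing);
        }
        return missing;
    }

    public static void main(String[] args) {
        Set<String> bam = new LinkedHashSet<>(Arrays.asList("NA12878", "NA12891"));
        Set<String> vcf = new LinkedHashSet<>(Arrays.asList("NA12878"));
        // Warns about NA12891 on stderr and returns it as the missing set.
        System.out.println(missingSamples(bam, vcf));
    }
}
```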