-- The previous code clipped adapters before reverting soft clips. Because we only clip the adapter when it's actually aligned (i.e., not in the soft clips), we were not removing adapter bases unless at least 1 bp of the adapter was aligned to the reference. Terrible.
-- Removed the broken logic for determining whether a read's adapter is too long.
-- No longer requires isProperPairFlag to be set for a read to be adapter clipped
-- Update integration tests for new adapter clipping code
-Explicitly state that -dcov does not produce an unbiased random sampling from all available reads
at each locus, and that instead it tries to maintain an even representation of reads from
all alignment start positions (which, of course, is a form of bias)
-Recommend -dfrac for users who want a true across-the-board unbiased random sampling
-Given a list of walkers and a pair of git commits, determines whether each of the
walkers has compile-time dependencies on the Java classes changed between the two
commits.
-Output is in the form of a Java properties file, and can be easily loaded via
the Properties class. Example output:
org.broadinstitute.sting.gatk.walkers.bed.MergeIntervalLists=true
org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper=false
org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller=false
org.broadinstitute.sting.gatk.walkers.na12878kb.NA12878DBWalker=true
org.broadinstitute.sting.gatk.walkers.readutils.PrintReads=false
"true" indicates that the walker does have compile-time dependencies on one or
more of the changed Java classes, "false" indicates no dependencies
-Considers classes within changed jar files as well, provided the jars are stored
in our git repository (as they are with tribble, picard, etc.)
-Ant-based solution with a shell script frontend. The previous Java-based solution
had several issues and introduced problematic dependencies into the GATK.
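
A minimal sketch of consuming the output described above, assuming the report is available as text. The class and method names (WalkerDependencyReport, affectedWalkers) are hypothetical and not part of the GATK; only java.util.Properties is the real API.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;

// Hypothetical helper: loads the dependency report and returns the walkers
// whose value is "true" (i.e., those that depend on a changed class).
class WalkerDependencyReport {
    static List<String> affectedWalkers(String reportText) {
        Properties report = new Properties();
        try {
            report.load(new StringReader(reportText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return report.stringPropertyNames().stream()
                .filter(name -> Boolean.parseBoolean(report.getProperty(name)))
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Three lines taken from the example output in the commit message.
        String reportText = String.join("\n",
                "org.broadinstitute.sting.gatk.walkers.bed.MergeIntervalLists=true",
                "org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper=false",
                "org.broadinstitute.sting.gatk.walkers.readutils.PrintReads=false");
        for (String walker : affectedWalkers(reportText)) {
            System.out.println(walker);
        }
    }
}
```

With the three sample lines, only MergeIntervalLists is reported as affected.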
-- LocusWalkers have multiple filtering streams, each counting filtered reads independently, and the close() function was calling setFilter on the global result rather than on the private counter, which is itself incorporated into the global result (thereby incrementing the counts of each filter).
-- [delivers #52667213]
There are a few pipeline test classes that do not run Queue, but are
classified as pipeline tests because they submit farm jobs. Make these
unconventional pipeline tests respect the pipeline test dry run setting.
This is used in conjunction with the -BAM argument in AssessNA12878 and is necessary for the
Jenkins assessment to work properly (Ryan's commit wasn't enough).
I "fixed" this once before but instead of testing with unit tests I used integration tests.
Bad decision.
The proper fix is in now, with a bona fide unit test included.
1. MergeIntervalLists should take the global interval padding into account when merging.
2. Update the name of the imported callsets in the setup script because of renaming for expanded intervals.
3. If there are too many intervals to process, MongoDB falls apart. Refactored the site selection code so
that in such cases we pull out all records from the DB and the GATK itself does the interval filtering.
4. Add isComplex to callset summary for the consensus summarizer.
5. Remove the check for out of order records in the SiteIterator since records now do come out of order
(since contigs are sorted lexicographically in MongoDB).
Results:
Iteration over the gencode intervals (90 MB) in AssessNA12878 now takes 90 seconds. I can't tell you how
much time it took before because it kept crashing Mongo (but it was a long, long time).
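
To illustrate why item 5 drops the out-of-order check: lexicographic sorting of contig names does not match the usual reference order. The sketch below uses made-up contig names and a hypothetical ContigOrderDemo class; it does not touch MongoDB or the SiteIterator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical demo: plain string sorting puts "10" before "2".
class ContigOrderDemo {
    static List<String> lexicographic(List<String> contigs) {
        List<String> sorted = new ArrayList<>(contigs);
        Collections.sort(sorted); // natural String order, i.e. lexicographic
        return sorted;
    }

    public static void main(String[] args) {
        List<String> referenceOrder = Arrays.asList("1", "2", "10", "X");
        System.out.println(lexicographic(referenceOrder)); // prints [1, 10, 2, X]
    }
}
```

A reader that expects records in reference order would see contig 10 arrive before contig 2 and flag it as out of order.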
Previous fixes and tests only covered trailing soft-clips. Now that up-front
hard-clipping is working properly, though, we were failing on those in the tool.
Added a patch for this as well as a separate test independent of the soft-clips
to make sure that it's working properly.
* Increase the memory limit for HTSLIB - BAM shuffling just eats up a ton of memory.
* Concurrent HTSLIB processes need unique temp files: the BAM shuffling step was messing up the temporary files and failing without returning zero. Fixed it by giving a unique name to each process.
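
The actual fix lives in the pipeline code that launches HTSLIB and is not shown here. As a sketch of the general idea only, the hypothetical Java helper below asks the JDK for a uniquely named temp file per invocation; the class name and the "bam-shuffle-" prefix are assumptions for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: every call yields a distinct temp file, so
// concurrent shuffling jobs never share (and clobber) the same path.
class ShuffleTempFiles {
    static Path newShuffleTempFile() {
        try {
            // createTempFile generates a unique name in the default temp directory.
            Path tmp = Files.createTempFile("bam-shuffle-", ".tmp");
            tmp.toFile().deleteOnExit();
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(newShuffleTempFile());
        System.out.println(newShuffleTempFile());
    }
}
```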
This time we don't accidentally drop reads (phew), but this bug does cause us not to
update the alignment start of the mate. Fixed and added unit test to cover it.
-- Added experimental LikelihoodRankSum, which required slightly more detailed access to the information managed by the base class, so added an overloaded getElementForRead that also provides access to the MostLikelyAllele class
-- Added base class default implementation of getElementForPileupElement() which returns null, indicating that the pileup version isn't supported.
-- Added @Override to many of the RankSum classes for safety's sake
-- Updates to GeneralCallingPipeline: annotate sites with dbSNP IDs,
-- R script to assess the value of annotations for VQSR
-- The VR, when the model is bad, may evaluate log10sumlog10 where some of the values in the vector are NaN. This case is now trapped in VR and handled as previously -- indicating that the model has failed and evaluation continues.
-- Currently we don't support writing a BAM file from the haplotype caller when nct is enabled. Check in initialize if this is the case, and throw a UserException
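
For the NaN case in the VR item above, a rough sketch of the trap-and-continue pattern. This is not the GATK's log10sumlog10 implementation; Log10SumGuard, log10SumLog10 and modelIsUsable are hypothetical names, and the use of IllegalStateException is an assumption.

```java
// Hypothetical sketch: reject NaN inputs to a log10-space sum and let the
// caller treat that as "model failed" instead of crashing.
class Log10SumGuard {
    // Computes log10(sum_i 10^v_i) using the usual max-shift for stability.
    static double log10SumLog10(double[] log10Values) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : log10Values) {
            if (Double.isNaN(v)) {
                throw new IllegalStateException("NaN in log10 vector: model has failed");
            }
            max = Math.max(max, v);
        }
        if (Double.isInfinite(max)) {
            return max; // empty or all -Infinity input, or a +Infinity entry
        }
        double sum = 0.0;
        for (double v : log10Values) {
            sum += Math.pow(10.0, v - max);
        }
        return max + Math.log10(sum);
    }

    // Caller-side trap: report failure and let evaluation continue.
    static boolean modelIsUsable(double[] log10Values) {
        try {
            log10SumLog10(log10Values);
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(modelIsUsable(new double[] {-1.0, -2.0}));
        System.out.println(modelIsUsable(new double[] {-1.0, Double.NaN}));
    }
}
```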
Github was intermittently rejecting large pushes that were in fact
fast-forward updates as being non-fast-forward. Try to prevent this
by ensuring that all refs are up-to-date and properly checked out
after branch filtering and before doing a source release.
-- Previous version emitted command lines that look like:
##HaplotypeCaller="analysis_type=HaplotypeCaller input_file=[private/testdata/reduced.readNotFullySpanningDeletion.bam] ..."
The new version provides additional information on when the GATK was run and the GATK version in a nicer format:
##GATKCommandLine=<ID=HaplotypeCaller,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:09:01 EDT 2013",Epoch=1371740941197,CommandLineOptions="analysis_type=HaplotypeCaller input_file=[private/testdata/reduced.readNotFullySpanningDeletion.bam] read_buffer_size=null phone_home=AWS ...">
-- Additionally, the command line options are emitted sequentially in the file, so you can see a running record of how a VCF was produced, such as this example from the integration test:
##GATKCommandLine=<ID=HaplotypeCaller,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:09:01 EDT 2013",Epoch=1371740941197,CommandLineOptions="lots of stuff">
##GATKCommandLine=<ID=SelectVariants,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:16:23 EDT 2013",Epoch=1371741383277,CommandLineOptions="lots of stuff">
-- Removed the ProtectedEngineFeaturesIntegrationTest
-- Actual unit tests for these features!
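
A sketch of how a downstream script might read the new ##GATKCommandLine header lines shown above. The parser is hypothetical (GatkCommandLineHeader is not a GATK class) and assumes quoted values contain no embedded double quotes.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for lines of the form ##GATKCommandLine=<key=value,...>.
class GatkCommandLineHeader {
    private static final String PREFIX = "##GATKCommandLine=<";
    // A value is either a double-quoted string or a run without ',' or '>'.
    private static final Pattern FIELD = Pattern.compile("(\\w+)=(\"[^\"]*\"|[^,>]*)");

    static Map<String, String> parse(String line) {
        if (!line.startsWith(PREFIX) || !line.endsWith(">")) {
            throw new IllegalArgumentException("Not a GATKCommandLine header: " + line);
        }
        String body = line.substring(PREFIX.length(), line.length() - 1);
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(body);
        while (m.find()) {
            String value = m.group(2);
            if (value.length() >= 2 && value.startsWith("\"")) {
                value = value.substring(1, value.length() - 1); // strip quotes
            }
            fields.put(m.group(1), value);
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "##GATKCommandLine=<ID=SelectVariants,Version=2.5-206-gbc7be2b,"
                + "Date=\"Thu Jun 20 11:16:23 EDT 2013\",Epoch=1371741383277,"
                + "CommandLineOptions=\"lots of stuff\">";
        parse(line).forEach((key, value) -> System.out.println(key + " -> " + value));
    }
}
```

Because the fields are kept in insertion order and the header lines are emitted sequentially, parsing each line in turn reconstructs the running record of tools applied to the VCF.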