Commit Graph

2207 Commits (b4f482a4212984b9b2927063261917f50b76634f)

Author SHA1 Message Date
Mark DePristo b4f482a421 NanoScheduled ActiveRegionTraversal and HaplotypeCaller
-- Made CountReadsInActiveRegions Nano schedulable, confirming identical results for linear and nano results
-- Made Haplotype NanoScheduled, requiring misc. changes in the map/reduce type so that the map() function returns a List<VariantContext> and reduce actually prints out the results to disk
-- Tests for NanoScheduling
  -- CountReadsInActiveRegionsIntegrationTest now does NCT 1, 2, 4 with CountReadsInActiveRegions
  -- HaplotypeCallerParallelIntegrationTest does NCT 1,2,4 calling on 100kb of PCR free data
-- Some misc. code cleanup of HaplotypeCaller
-- Analysis scripts to assess performance of nano scheduled HC
-- In order to make the haplotype caller thread safe we needed to use an AtomicInteger for the class-specific static ID counter in SeqVertex and MultiDebrujinVertex, avoiding a race condition where multiple new Vertex() could end up with the same id.
2013-05-13 11:09:02 -04:00
Mark DePristo fa8a47ceef Replace DeBruijnAssembler with ReadThreadingAssembler
Problem
-------
The DeBruijn assembler was too slow.  The cause of the slowness was the need to construct many kmer graphs (from max read length in the interval to 11 kmer, in increments of 6 bp).  This need to build many kmer graphs was because the assembler (1) needed long kmers to assemble through regions where a shorter kmer was non-unique in the reference, as we couldn't split cycles in the reference (2) shorter kmers were needed to be sensitive to differences from the reference near the edge of reads, which would be lost often when there was chain of kmers of longer length that started before and after the variant.

Solution
--------
The read threading assembler uses a fixed kmer, in this implementation by default two graphs with 10 and 25 kmers.  The algorithm operates as follows:

identify all non-unique kmers of size K among all reads and the reference
for each sequence (ref and read):
  find a unique starting position of the sequence in the graph by matching to a unique kmer, or starting a new source node if non exist
  for each base in the sequence from the starting vertex kmer:
    look at the existing outgoing nodes of current vertex V.  If the base in sequence matches the suffix of outgoing vertex N, read the sequence to N, and continue
    If no matching next vertex exists, find a unique vertex with kmer K.  If one exists, merge the sequence into this vertex, and continue
    If a merge vertex cannot be found, create a new vertex (note this vertex may have a kmer identical to another in the graph, if it is not unique) and thread the sequence to this vertex, and continue

This algorithm has a key property: it can robustly use a very short kmer without introducing cycles, as we will create paths through the graph through regions that aren't unique w.r.t. the sequence at the given kmer size.  This allows us to assemble well with even very short kmers.

This commit includes many critical changes to the haplotype caller to make it fast, sensitive, and accurate on deep and shallow WGS and exomes, the key changes are highlighted below:

-- The ReadThreading assembler keeps track of the maximum edge multiplicity per sample in the graph, so that we prune per sample, not across all samples.  This change is essential to operate effectively when there are many deep samples (i.e., 100 exomes)
-- A new pruning algorithm that will only prune linear paths where the maximum edge weight among all edges in the path have < pruningFactor.  This makes pruning more robust when you have a long chain of bases that have high multiplicity at the start but only barely make it back into the main path in the graph.
-- We now do a global SmithWaterman to compute the cigar of a Path, instead of the previous bubble-based SmithWaterman optimization.  This change is essential for us to get good variants from our paths when the kmer size is small.  It also ensures that we produce a cigar from a path that only depends only the sequence of bases in the path, unlike the previous approach which would depend on both the bases and the way the path was decomposed into vertices, which depended on the kmer size we used.
-- Removed MergeHeadlessIncomingSources, which was introducing problems in the graphs in some cases, and just isn't the safest operation.  Since we build a kmer graph of size 10, this operation is no longer necessary as it required a perfect match of 10 bp to merge anyway.
-- The old DebruijnAssembler is still available with a command line option
-- The number of paths we take forward from the each assembly graph is now capped at a factor per sample, so that we allow 128 paths for a single sample up to 10 x nSamples as necessary.  This is an essential change to make the system work well for large numbers of samples.
-- Add a global mismapping parameter to the HC likelihood calculation: The phredScaledGlobalReadMismappingRate reflects the average global mismapping rate of all reads, regardless of their mapping quality. This term effects the probability that a read originated from the reference haploytype, regardless of its edit distance from the reference, in that the read could have originated from the reference haplotype but from another location in the genome. Suppose a read has many mismatches from the reference, say like 5, but has a very high mapping quality of 60. Without this parameter, the read would contribute 5 * Q30 evidence in favor of its 5 mismatch haplotype compared to reference, potentially enough to make a call off that single read for all of these events. With this parameter set to Q30, though, the maximum evidence against the reference that this (and any) read could contribute against reference is Q30. -- Controllable via a command line argument, defaulting to Q60 rate. Results from 20:10-11 mb for branch are consistent with the previous behavior, but this does help in cases where you have rare very divergent haplotypes
-- Reduced ActiveRegionExtension from 200 bp to 100 bp, which is a performance win and the large extension is largely unnecessary with the short kmers used with the read threading assembler

Infrastructure changes / improvements
-------------------------------------
-- Refactored BaseGraph to take a subclass of BaseEdge, so that we can use a MultiSampleEdge in the ReadThreadingAssembler
-- Refactored DeBruijnAssembler, moving common functionality into LocalAssemblyEngine, which now more directly manages the subclasses, requiring them to only implement a assemble() method that takes ref and reads and provides a List<SeqGraph>, which the LocalAssemblyEngine takes forward to compute haplotypes and other downstream operations.  This allows us to have only a limited amount of code that differentiates the Debruijn and ReadThreading assemblers
-- Refactored active region trimming code into ActiveRegionTrimmer class
-- Cleaned up the arguments in HaplotypeCaller, reorganizing them and making arguments @Hidden and @Advanced as appropriate.  Renamed several arguments now that the read threading assembler is the default
-- LocalAssemblyEngineUnitTest reads in the reference sequence from b37, and assembles with synthetic reads intervals from 10-11 mbs with only the reference sequence as well as artificial snps, deletions, and insertions.
-- Misc. updates to Smith Waterman code. Added generic interface to called not surpisingly SmithWaterman, making it easier to have alternative implementations.
-- Many many more unit tests throughout the entire assembler, and in random utilities
2013-05-08 21:41:42 -04:00
Yossi Farjoun 4b8b411b92 - Fixed a small bug in the printout of molten data in GenotypeConcordance
Output didn't "mix-up" the genotypes, it outputed the same HET vs HET (e.g.) 3 times rather than the combinations of HET vs {HET, HOM, HOM_REF}, etc.
This was only a problem in the text, _not_ the actual numbers, which were outputted correctly.

- Updated MD5's after looking at diffs to verify that the change is what I expected.
2013-05-02 09:16:07 -04:00
Mark DePristo 73fcacbf1b Change Long to long 2013-04-30 09:21:10 -04:00
Yossi Farjoun 0e7e6d35d8 GATKBAMIndex calls buffer.length() on every read. This is causing much pain.
Optimized by getting the read of the file upon opening the index-file and using that instead.
2013-04-29 12:49:02 -04:00
Mark DePristo 759c531d1b Merge pull request #197 from broadinstitute/dr_disable_snpeff_version_check
Add support for snpEff "GATK compatibility mode" (-o gatk)
2013-04-26 13:55:14 -07:00
David Roazen 7d90bbab08 Add support for snpEff "GATK compatibility mode" (-o gatk)
-Do not throw an exception when parsing snpEff output files
 generated by not-officially-supported versions of snpEff,
 PROVIDED that snpEff was run with -o gatk

-Requested by the snpEff author

-Relevant integration tests updated/expanded
2013-04-26 15:47:15 -04:00
Mark DePristo 071fd67d55 Merge pull request #193 from broadinstitute/eb_contamination_fixing_for_reduced_reads
Eb contamination fixing for reduced reads
2013-04-26 09:48:45 -07:00
Mark DePristo 92a6c7b561 Merge pull request #195 from broadinstitute/eb_exclude_sample_file_bug_in_select_variants
Fixed bug reported on the forum where using the --exclude_sample_file ar...
2013-04-26 09:47:38 -07:00
Eric Banks 360e2ba87e Fixed bug reported on the forum where using the --exclude_sample_file argument in SV was giving bad results.
Added integration test.
https://www.pivotaltracker.com/s/projects/793457/stories/47399245
2013-04-26 12:23:11 -04:00
Eric Banks ba2c3b57ed Extended the allele-biased down-sampling functionality to handle reduced reads.
Note that this works only in the case of pileups (i.e. coming from UG);
allele-biased down-sampling for RR just cannot work for haplotypes.

Added lots of unit tests for new functionality.
2013-04-26 11:23:17 -04:00
David Roazen 4d56142163 Detect stuck lock-acquisition calls, and disable file locking for tests
-Acquire file locks in a background thread with a timeout of 30 seconds,
 and throw a UserException if a lock acquisition call times out

    * should solve the locking issue for most people provided they
      RETRY failed farm jobs

    * since we use NON-BLOCKING lock acquisition calls, any call that
      takes longer than a second or two indicates a problem with the
      underlying OS file lock support

    * use daemon threads so that stuck lock acquisition tasks don't
      prevent the JVM from exiting

-Disable both auto-index creation and file locking for integration tests
 via a hidden GATK argument --disable_auto_index_creation_and_locking_when_reading_rods

    * argument not safe for general use, since it allows reading from
      an index file without first acquiring a lock

    * this is fine for the test suite, since all index files already
      exist for test files (or if they don't, they should!)

-Added missing indices for files in private/testdata

-Had to delete most of RMDTrackBuilderUnitTest, since it mostly tested auto-index
 creation, which we can't test with locking disabled, but I replaced the deleted
 tests with some tests of my own.

-Unit test for FSLockWithShared to test the timeout feature
2013-04-24 22:49:02 -04:00
Guillermo del Angel a971e7ab6d Several improvements to ReadAdaptorTrimmer so that it can be incorporated into ancient DNA processing pipelines (for which it was developed):
-- Add pair cleaning feature. Reads in query-name sorted order are required and pairs need to appear consecutively, but if -cleanPairs option is set, a malformed pair where second read is missing is just skipped instead of erroring out.
-- Add integration tests
-- Move walker to public
2013-04-13 13:41:36 -04:00
Mauricio Carneiro 3960733c88 Fix PrintReads out of space issue
Problem:
--------
Print Reads was running out of disk space when using the -BQSR option even for small bam files

Solution:
---------
Configure setupWriter to expect pre sorted reads
2013-04-09 08:19:52 -04:00
Mark DePristo 1b36db8940 Make ActiveRegionTraversal robust to excessive coverage
-- Add a maximum per sample and overall maximum number of reads held in memory by the ART at any one time.  Does this in a new TAROrderedReadCache data structure that uses a reservior downsampler to limit the total number of reads to a constant amount.  This constant is set to be by default 3000 reads * nSamples to a global maximum of 1M reads, all controlled via the ActiveRegionTraversalParameters annotation.
-- Added an integration test and associated excessively covered BAM excessiveCoverage.1.121484835.bam (private/testdata) that checks that the system is operating correctly.
-- #resolves GSA-921
2013-04-08 15:48:19 -04:00
Mark DePristo 317dc4c323 Add size() method to Downsampler interface
-- This method provides client with the current number of elements, without having to retreive the underlying list<T>.  Added unit tests for LevelingDownsampler and ReservoirDownsampler as these are the only two complex ones.  All of the others are trivially obviously correct.
2013-04-08 15:48:13 -04:00
Mark DePristo 167cd49e71 Added -forceActive argument to ActiveRegionWalkers
-- Causes the ART tool to treat all bases as active.   Useful for debugging
2013-04-08 12:47:48 -04:00
Eric Banks 6253ba164e Using --keepOriginalAC in SelectVariants was causing it to emit bad VCFs
* This occurred when one or more alleles were lost from the record after selection
  * Discussed here: http://gatkforums.broadinstitute.org/discussion/comment/4718#Comment_4718
  * Added some integration tests for --keepOriginalAC (there were none before)
2013-04-05 00:53:28 -04:00
Eric Banks 7897d52f32 Don't allow users to specify keys and IDs that contain angle brackets or equals signs (not allowed in VCF spec).
* As reported here: http://gatkforums.broadinstitute.org/discussion/comment/4270#Comment_4270
  * This was a commit into the variant.jar; the changes here are a rev of that jar and handling of errors in VF
  * Added integration test to confirm failure with User Error
  * Removed illegal header line in KB test VCF that was causing related tests to fail.
2013-04-05 00:52:32 -04:00
Ryan Poplin 8a93bb687b Critical bug fix for the case of duplicate map calls in ActiveRegionWalkers with exome interval lists.
-- When consecutive intervals were within the bandpass filter size the ActiveRegion traversal engine would create
duplicate active regions.
-- Now when flushing the activity profile after we jump to a new interval we remove the extra states which are outside
of the current interval.
-- Added integration test which ensures that the output VCF contains no duplicate records. Was failing test before this commit.
2013-04-03 13:15:30 -04:00
David Roazen 2eac97a76c Remove auto-creation of fai/dict files for fasta references
-A UserException is now thrown if either the fai or dict file for the
 reference does not exist, with pointers to instructions for creating
 these files.

-Gets rid of problematic file locking that was causing intermittent
 errors on our farm.

-Integration tests to verify that correct exceptions are thrown in
 the case of a missing fai / dict file.

GSA-866 #resolve
2013-04-02 18:34:08 -04:00
Mark DePristo 7c83efc1b9 Merge pull request #135 from broadinstitute/mc_pgtag_fix
Fixing @PG tag uniqueness issue
2013-03-31 11:36:40 -07:00
Guillermo del Angel 9686e91a51 Added small feature to VariantFiltration to filter sites outside of a given mask:
-- Sometimes it's desireable to specify a set of "good" regions and filter out other stuff (like say an alignability mask or a "good regions" mask). But by default, the -mask argument in VF will only filter sites inside a particular mask. New argument -filterNotInMask will reverse default logic and filter outside of a given mask.
-- Added integration test, and made sure we also test with a BED rod.
2013-03-31 08:48:16 -04:00
Mauricio Carneiro ec475a46b1 Fixing @PG tag uniqueness issue
The Problem:
------------
the SAM spec does not allow multiple @PG tags with the same id. Our @PG tag writing routines were allowing that to happen with the boolean parameter "keep_all_pg_records".

How this fixes it:
------------------
This commit removes that option from all the utility functions and cleans up the code around the classes that used these methods off-spec.

Summarized changes:
-------------------
* Remove keep_all_pg_records option from setupWriter utility methos in Util
* Update all walkers to now replace the last @PG tag of the same walker (if it already exists)
* Cleanup NWaySamFileWriter now that it doesn't need to keep track of the keep_all_pg_records variable
* Simplify the multiple implementations to setupWriter

Bamboo:
-------
http://gsabamboo.broadinstitute.org/browse/GSAUNSTABLE-PARALLEL31

Issue Tracker:
--------------
[fixes 47100885]
2013-03-30 20:31:33 -04:00
Guillermo del Angel 6b8bed34d0 Big bad bug fix: feature added to LeftAlignAndTrimVariants to left align multiallelic records didn't work.
-- Corrected logic to pick biallelic vc to left align.
-- Added integration test to make sure this feature is tested and feature to trim bases is also tested.
2013-03-30 19:31:28 -04:00
Guillermo del Angel 8fbf9c947f Upgrades and changes to LeftAlignVariants, motivated by 1000G consensus indel production:
-- Added ability to trim common bases in front of indels before left-aligning. Otherwise, records may not be left-aligned if they have common bases, as they will be mistaken by complext records.
-- Added ability to split multiallelic records and then left align them, otherwise we miss a lot of good left-aligneable indels.
-- Motivated by this, renamed walker to LeftAlignAndTrimVariants.
-- Code refactoring, cleanup and bring up to latest coding standards.
-- Added unit testing to make sure left alignment is performed correctly for all offsets.
-- Changed phase 3 HC script to new syntax. Add command line options, more memory and reduce alt alleles because jobs keep crashing.
2013-03-29 10:02:06 -04:00
Chris Hartl 73d1c319bf Rarely-occurring logic bugfix for GenotypeConcordance, streamlining and testing of MathUtils
Currently, the multi-allelic test is covering the following case:

Eval   A   T,C
Comp   A   C

reciprocate this so that the reverse can be covered.

Eval   A   C
Comp   A   T,C

And furthermore, modify ConcordanceMetrics to more properly handle the situation where multiple alternate alleles are available in the comp. It was possible for an eval C/C sample to match a comp T/T sample, so long as the C allele were also present in at least one other comp sample.

This comes from the fact that "truth" reference alleles can be paired with *any* allele also present in the truth VCF, while truth het/hom var sites are restricted to having to match only the alleles present in the genotype. The reason that truth ref alleles are special case is as follows, imagine:

Eval:   A  G,T      0/0   2/0   2/2   1/1
Comp:   A  C,T      0/0   1/0   0/0   0/0

Even though the alt allele of the comp is a C, the assessment of genotypes should be as follows:

Sample1: ref called ref
Sample2: alleles don't match (the alt allele of the comp was not assessed in eval)
Sample3: ref called hom-var
Sample4: alleles don't match (the alt allele of the eval was not assessed in comp)

Before this change, Sample2 was evaluated as "het called het" (as the T allele in eval happens to also be in the comp record, just not in the comp sample). Thus: apply current
logic to comp hom-refs, and the more restrictive logic ("you have to match an allele in the comp genotype") when the comp is not reference.

Also in this commit,major refactoring and testing for MathUtils. A large number of methods were not used at all in the codebase, these methods were removed:
 - dotProduct(several types). logDotProduct is used extensively, but not the real-space version.
 - vectorSum
 - array shuffle, random subset
 - countOccurances (general forms, the char form is used in the codebase)
 - getNMaxElements
 - array permutation
 - sorted array permutation
 - compare floats
 - sum() (for integer arrays and lists).

Final keyword was extensively added to MathUtils.

The ratio() and percentage() methods were revised to error out with non-positive denominators, except in the case of 0/0 (which returns 0.0 (ratio), or 0.0% (percentage)). Random sampling code was updated to make use of the cleaner implementations of generating permutations in MathUtils (allowing the array permutation code to be retired).

The PaperGenotyper still made use of one of these array methods, since it was the only walker it was migrated into the genotyper itself.

In addition, more extensive tests were added for
 - logBinomialCoefficient (Newton's identity should always hold)
 - logFactorial
 - log10sumlog10 and its approximation

All unit tests pass
2013-03-28 23:25:28 -04:00
Mark DePristo 12475cc027 Display the active MappingQualityFilter if mmq > 0 in the HaplotypeCaller 2013-03-26 14:27:18 -04:00
Mauricio Carneiro eb33da6820 Added support to reduce reads to Callable Loci
-- added calls to representativeCount() of the pileup instead of using ++
-- renamed CallableLoci integration test
-- added integration test for reduce read support on callable loci
2013-03-21 15:53:04 -04:00
Mark DePristo 7ae15dadbe HC now by default only uses reads with MAPQ >= 20 for assembly and calling
-- Previously we tried to include lots of these low mapping quality reads in the assembly and calling, but we effectively were just filtering them out anyway while generating an enormous amount of computational expense to handle them, as well as much larger memory requirements.  The new version simply uses a read filter to remove them upfront.  This causes no major problems -- at least, none that don't have other underlying causes -- compared to 10-11mb of the KB
-- Update MD5s to reflect changes due to no longer including mmq < 20 by default
2013-03-21 13:10:50 -04:00
Mark DePristo a8fb26bf01 A generic downsampler that reduces coverage for a bunch of reads
-- Exposed the underlying minElementsPerStack parameter for LevelingDownsampler
2013-03-20 22:54:35 -04:00
Geraldine Van der Auwera d70bf64737 Created new DeprecatedToolChecks class
--Based on existing code in GenomeAnalysisEngine
	--Hashmaps hold mapping of deprecated tool name to version number and recommended replacement (if any)
	--Using FastUtils for maps; specifically Object2ObjectMap but there could be a better type for Strings...
	--Added user exception for deprecated annotations
	--Added deprecation check to AnnotationInterfaceManager.validateAnnotations
	--Run when annotations are initialized
	--Made annotation sets instead of lists
2013-03-20 06:46:02 -04:00
Geraldine Van der Auwera 6b4d88ebe9 Created ListAnnotations utility (extends CommandLineProgram)
--Refactored listAnnotations basic method out of VA into HelpUtils
	--HelpUtils.listAnnotations() is now called by both VA and the new ListAnnotations utility (lives in sting.tools)
	--This way we keep the VA --list option but we also offer a way to list annotations without a full valid VA command-line, which was a pain users continually complained about
	--We could get rid of the VA --list option altogether ...?
2013-03-20 06:15:27 -04:00
Geraldine Van der Auwera 95a9ed853d Made some documentation updates & fixes
--Mostly doc block tweaks
	--Added @DocumentedGATKFeature to some walkers that were undocumented because they were ending up in "uncategorized". Very important for GSA: if a walker is in public or protected, it HAS to be properly tagged-in. If it's not ready for the public, it should be in private.
2013-03-20 06:15:20 -04:00
Ami Levy-Moonshine 0e9c1913ff fix typos in argument docs and in printed output in CoveredByNSamplesSites and rewrite an unaccurate comment 2013-03-18 13:54:21 -04:00
Eric Banks 232afdcbea Added check in the MalformedReadFilter for reads without stored bases (i.e. that use '*').
* We now throw a User Error for such reads
  * User can override this to filter instead with --filter_bases_not_stored
  * Added appropriate unit test
2013-03-14 17:17:26 -04:00
droazen 0fd9f0e77c Merge pull request #104 from broadinstitute/eb_fix_output_annotation_GSA-837
Fixed the logic of the @Output annotation and its interaction with 'required'
2013-03-14 12:52:00 -07:00
Eric Banks 6d6264b108 Merge pull request #105 from broadinstitute/gg_annotations_cleanup_45802765
Cleaned up annotations
2013-03-14 11:35:00 -07:00
Geraldine Van der Auwera 61349ecefa Cleaned up annotations
- Moved AverageAltAlleleLength, MappingQualityZeroFraction and TechnologyComposition to Private
  - VariantType, TransmissionDisequilibriumTest, MVLikelihoodRatio and GCContent are no longer Experimental
  - AlleleBalanceBySample, HardyWeinberg and HomopolymerRun are Experimental and available to users with a big bold caveat message
  - Refactored getMeanAltAlleleLength() out of AverageAltAlleleLength into GATKVariantContextUtils in order to make QualByDepth independent of where AverageAltAlleleLength lives
  - Unrelated change, bundled in for convenience: made HC argument includeUnmappedreads @Hidden
  - Removed unnecessary check in AverageAltAlleleLength
2013-03-14 14:26:48 -04:00
Eric Banks 7cab709a88 Fixed the logic of the @Output annotation and its interaction with 'required'.
ALL GATK DEVELOPERS PLEASE READ NOTES BELOW:

I have updated the @Output annotation to behave differently and to include a 'defaultToStdout' tag.
  * The 'defaultToStdout' tags lets walkers specify whether to default to stdout if -o is not provided.
  * The logic for @Output is now:
    * if required==true then -o MUST be provided or a User Error is generated.
    * if required==false and defaultToStdout==true then the output is assigned to stdout if no -o is provided.
      * this is the default behavior (i.e. @Output with no modifiers).
    * if required==false and defaultToStdout==false then the output object is null.
      * use this combination for truly optional outputs (e.g. the -badSites option in AssessNA12878).

  * I have updated walkers so that previous behavior has been maintained (as best I could).
    * In general, all @Outputs with default long/short names have required=false.
    * Walkers with nWayOut options must have required==false and defaultToStdout==false (I added checks for this)
  * I added unit tests for @Output changes with David's help (thanks!).
  * #resolve GSA-837
2013-03-14 11:58:51 -04:00
Eric Banks ff87b62fe3 Fixed bug in SelectVariants where maxIndelSize argument wasn't getting applied to deletions.
Added unit tests and docs.
2013-03-13 15:11:34 -04:00
Mark DePristo b3f67899b5 Merge pull request #101 from broadinstitute/dr_fix_failing_parallel_tests
Fix more tests that fail when run in parallel on the farm
2013-03-12 14:11:02 -07:00
David Roazen cdb1fa1105 Fix more tests that fail when run in parallel on the farm
-Allow the default S3 put timeout of 30 seconds for GATKRunReports
 to be overridden via a constructor argument, and use a timeout
 of 300 seconds for tests. The timeout remains 30 seconds in all
 other cases.

-Change integration tests that themselves dispatch farm jobs
 into pipeline tests. Necessary because some farm nodes are
 not set up as submit hosts. Pipeline tests are still run
 directly on gsa4.

-Bump up the timeout for the MaxRuntimeIntegrationTest even more
 (was still occasionally failing on the farm!)
2013-03-12 16:53:30 -04:00
Geraldine Van der Auwera f972963918 Fixed issues raised by Appistry QA (mostly small fixes, corrections & clarifications to GATKDocs)
GATK-73 updated docs for bqsr args
GATK-9 differentiate CountRODs from CountRODsByRef
GATK-76 generate GATKDoc for CatVariants
GATK-4 made resource arg required
GATK-10 added -o, some docs to CountMales; some docs to CountLoci
GATK-11 fixed by MC's -o change; straightened out the docs.
GATK-77 fixed references to wiki
GATK-76 Added Ami's doc block
GATK-14 Added note that these annotations can only be used with VariantAnnotator
GATK-15 specified required=false for two arguments
GATK-23 Added documentation block
GATK-33 Added documentation
GATK-34 Added documentation
GATK-32 Corrected arg name and docstring in DiffObjects
GATK-32 Added note to DO doc about reference (required but unused)
GATK-29 Added doc block to CountIntervals
GATK-31 Added @Output PrintStream to enable -o
GATK-35 Touched up docs
GATK-36 Touched up docs, specified verbosity is optional
GATK-60 Corrected GContent annot module location in gatkdocs
GATK-68 touched up docs and arg docstrings
GATK-16 Added note of caution about calling RODRequiringAnnotations as a group
GATK-61 Added run requirements (num samples, min genotype quality)
Tweaked template and generic doc block formatting (h2 to h3 titles)
GATK-62 Added a caveat to HR annot
Made experimental annotation hidden
GATK-75 Added setup info regarding BWA
GATK-22 Clarified some argument requirements
GATK-48 Clarified -G doc comments
GATK-67 Added arg requirement
GATK-58 Added annotation and usage docs
GSATDG-96 Corrected doc
Updated MD5 for DiffObjectsIntegrationTests (only change is link in table title)
2013-03-12 10:57:14 -04:00
Guillermo del Angel 695723ba43 Two features useful for ancient DNA processing.
Ancient DNA sequencing data is in many ways different from modern data, and methods to analyze it need to be adapted accordingly.
Feature 1: Read adaptor trimming. Ancient DNA libraries typically have very short inserts (in the order of 50 bp), so typical Illumina libraries sequenced in, say, 100bp HiSeq will have a large adaptor component being read after the insert.
If this adaptor is not removed, data will not be aligneable. There are third party tools that remove adaptor and potentially merge read pairs, but are cumbersome to use and require precise knowledge of the library construction and adaptor sequence.
-- New walker ReadAdaptorTrimmer walks through paired end data, computes pair overlap and trims auto-detected adaptor sequence.
-- Unit tests added for trimming operation.
-- Utility walker (may be retired later) DetailedReadLengthDistribution computes insert size or read length distribution stratified by read group and mapping status and outputs a GATKReport with data.
-- Renamed MaxReadLengthFilter to ReadLengthFilter and added ability to specify minimum read length as a filter (may be useful if, as a consequence of adaptor trimming, we're left with a lot of very short reads which will map poorly and will just clutter output BAMs).

Feature 2: Unbiased site QUAL estimation: many times ancestral allele status is not known and VCF fields like QUAL, QD, GQ, etc. are affected by the pop. gen. prior at a site. This might introduce subtle biases in studies where a species is aligned against the reference of another species, so an option for UG and HC not to apply such prior is introduced.
-- Added -noPrior argument to StandardCallerArgumentCollection.
-- Added option not to fill priors is such argument is set.
-- Added an integration test.
2013-03-09 18:18:13 -05:00
Yossi Farjoun baad965a57 - Changed loadContaminationFile file parser to delimit by tab only. This allows spaces in sampleIDs, which apparently are allowed.
- This was needed since samples with spaces in their names are regularly found in the picard pipeline.
- Modified the tests to account for this (removed spaces from the good tests, and changed the failing tests accordingly)
- Cleaned up the unit tests using a @DataProvider (I'm in love...).
- Moved AlleleBiasedDownsamplingUtilsUnitTest to public to match location of class it is testing (due to the way bamboo operates)
2013-03-07 13:04:24 -05:00
Eric Banks 3759d9dd67 Added the functionality to impose a relative ordering on ReadTransformers in the GATK engine.
* ReadTransformers can say they must be first, must be last, or don't care.
  * By default, none of the existing ones care about ordering except BQSR (must be first).
    * This addresses a bug reported on the forum where BAQ is incorrectly applied before BQSR.
  * The engine now orders the read transformers up front before applying iterators.
  * The engine checks for enabled RTs that are not compatible (e.g. both must be first) and blows up (gracefully).
  * Added unit tests.
2013-03-06 12:38:59 -05:00
Eric Banks 78721ee09b Added new walker to split MNPs into their allelic primitives (SNPs).
* Can be extended to complex alleles at some point.
  * Currently only works for bi-allelics (documented).
  * Added unit and integration tests.
2013-03-05 23:16:42 -05:00
Eric Banks 2be57fbcfb Merged bug fix from Stable into Unstable 2013-03-05 13:28:46 -05:00
Eric Banks 5e89f01e10 Don't allow the use of compressed (.gz) references in the GATK. 2013-03-05 13:28:19 -05:00