## Sting to GATK renaming http://gatkforums.broadinstitute.org/gatk/discussion/4173/sting-to-gatk-renaming
The GATK 3.2 source code uses new Java package names, directory paths, and executable jars. Post GATK 3.2, any patches submitted via pull requests should also include classes moved to the appropriate artifact.
Note that the document includes references to the private module, which is part of our internal development codebase but is not available to the general public.
A long-term goal of the GATK is to separate out reusable parts and eventually make them available as compiled libraries via centralized binary repositories. Ahead of publishing, a number of steps must be completed. One of the larger steps has been completed for GATK 3.2, in which all references to Sting in the code base were rebranded to GATK.
Currently implemented changes include:
As of May 16, 2014, remaining TODOs ahead of publishing to central include:
Now that the new package names and Maven artifacts are available, any pull request should include ensuring that updated classes are also moved into the correct GATK Maven artifact. While there are a significant number of classes, cleaning up as we go along will allow the larger task to be completed in a distributed fashion.
The full lists of new Maven artifacts and renamed packages are below under [Renamed Artifact Directories]. For those developers in the middle of a git rebase around commits before and after 3.2, here is an abridged mapping of renamed directories for those trying to locate files:
| Old Maven Artifact | New Maven Artifact |
|---|---|
| public/sting-root | public/gatk-root |
| public/sting-utils | public/gatk-utils |
| public/gatk-framework | public/gatk-tools-public |
| public/queue-framework | public/gatk-queue |
| protected/gatk-protected | protected/gatk-tools-protected |
| private/gatk-private | private/gatk-tools-private |
| private/queue-private | private/gatk-queue-private |
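For anyone locating files by hand during a rebase, the abridged mapping above can be sketched as a small shell function. This is only an illustration: `map_artifact_dir` is a hypothetical helper, not part of the GATK build, and it rewrites only the leading artifact directory. It does not rewrite the Java package portion of a path, and it does not know about the handful of files that moved to other artifacts.

```shell
# Hypothetical helper (not part of GATK): rewrite the leading artifact
# directory of a pre-3.2 path using the abridged mapping table above.
map_artifact_dir() {
  case "$1" in
    public/sting-root/*)        echo "public/gatk-root/${1#public/sting-root/}" ;;
    public/sting-utils/*)       echo "public/gatk-utils/${1#public/sting-utils/}" ;;
    public/gatk-framework/*)    echo "public/gatk-tools-public/${1#public/gatk-framework/}" ;;
    public/queue-framework/*)   echo "public/gatk-queue/${1#public/queue-framework/}" ;;
    protected/gatk-protected/*) echo "protected/gatk-tools-protected/${1#protected/gatk-protected/}" ;;
    private/gatk-private/*)     echo "private/gatk-tools-private/${1#private/gatk-private/}" ;;
    private/queue-private/*)    echo "private/gatk-queue-private/${1#private/queue-private/}" ;;
    *)                          echo "$1" ;;  # unknown prefix: leave the path unchanged
  esac
}

# Example: prints protected/gatk-tools-protected/src/main/java
map_artifact_dir protected/gatk-protected/src/main/java
```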
QScripts are no longer located with the Queue engine, and instead are now located with the GATK wrappers implemented as Queue extensions. See [Separated Queue Extensions] for more info.
Starting with GATK 3.2, separate Maven utility artifacts split the reusable portions of the GATK engine from tool-specific implementations. The biggest impact this will have on developers is the separation of the walkers packages.
In GATK versions <= 3.1 there was one package for both the base classes and the implementations of walkers:
In GATK versions >= 3.2 there are two packages. The first contains the base interfaces, annotations, etc. The latter package is for the concrete tools implemented as walkers:
org.broadinstitute.gatk.engine.walkers
Previously, depending on how the source code was compiled, the executable gatk-package-3.1.jar and queue-package-3.1.jar (aka GenomeAnalysisTK.jar and Queue.jar) contained various mixes of public/protected/private code. For example, depending on whether the private directory was present when the source code was compiled, the same artifact named gatk-package-3.1.jar might or might not contain private code.
Starting with 3.2, there are two versions of the jar created, each with specific file contents.
| New Maven Artifact | Alias in the /target folder | Packaged contents |
|---|---|---|
| gatk-package-distribution-3.2.jar | GenomeAnalysisTK.jar | public,protected |
| gatk-package-internal-3.2.jar | GenomeAnalysisTK-internal.jar | public,protected,private |
| gatk-queue-package-distribution-3.2.jar | Queue.jar | public,protected |
| gatk-queue-package-internal-3.2.jar | Queue-internal.jar | public,protected,private |
When creating a packaged version of Queue, the GATKExtensionsGenerator builds Queue engine compatible command line wrappers around each GATK walker. Previously, the wrappers were generated during the compilation of the Queue framework. Similar to the binary packages, depending on who built the source code, queue-framework-3.1.jar would contain various mixes of public/protected/private wrappers.
Starting with GATK 3.2, the gatk-queue-3.2.jar only contains code for the Queue engine. Generated and manually created extensions for wrapping any other command line programs are all included in separate artifacts. Due to a current limitation regarding how the generator uses reflection, the generator cannot build wrappers for just private classes without also generating protected and public classes. Thus, three different Maven artifacts are generated, containing different mixes of public, protected and private wrappers.
| Extensions Artifact | Generated wrappers for GATK tools |
|---|---|
| gatk-queue-extensions-public-3.2.jar | public only |
| gatk-queue-extensions-distribution-3.2.jar | public,protected |
| gatk-queue-extensions-internal-3.2.jar | public,protected,private |
As for QScripts that used to be located with the framework, they are now located with the generated wrappers.
| Old QScripts Artifact Directory | New QScripts Artifact Directory |
|---|---|
| public/queue-framework/src/main/qscripts | public/gatk-queue-extensions-public/src/main/qscripts |
| private/queue-private/src/main/qscripts | private/gatk-queue-extensions-internal/src/main/qscripts |
The following list shows the mapping of artifact names pre and post GATK 3.2. In addition to the engine changes, the packaging updates and extensions changes above also affected Maven artifact refactoring. The packaging artifacts have split from a single public to protected and private versions, and new queue extensions artifacts have been added as well.
| Maven Artifact <= GATK 3.1 | Maven Artifact >= GATK 3.2 |
|---|---|
| /pom.xml (sting-aggregator) | /pom.xml (gatk-aggregator) |
| public/sting-root | public/gatk-root |
| public/sting-utils | public/gatk-utils |
| none | public/gatk-engine |
| public/gatk-framework | public/gatk-tools-public |
| public/queue-framework | public/gatk-queue |
| public/gatk-queue-extgen | public/gatk-queue-extensions-generator |
| protected/gatk-protected | protected/gatk-tools-protected |
| private/gatk-private | private/gatk-tools-private |
| private/queue-private | private/gatk-queue-private |
| public/gatk-package | protected/gatk-package-distribution |
| public/queue-package | protected/gatk-queue-package-distribution |
| none | private/gatk-package-internal |
| none | private/gatk-queue-package-internal |
| none | public/gatk-queue-extensions-public |
| none | protected/gatk-queue-extensions-distribution |
| none | private/gatk-queue-extensions-internal |
A note regarding the aggregator:
The aggregator is the pom.xml in the top directory level of the GATK source code. When someone clones the GATK source code and runs mvn in the top level directory, the aggregator is the pom.xml that is executed.
The root is a pom.xml that contains all common Maven configuration. There are a couple dependent pom.xml files that inherit configuration from the root, but are NOT aggregated during normal source compilation.
As of GATK 3.2, these un-aggregated child artifacts are VectorPairHMM and picard-maven. They should not run by default with each instance of mvn run on the GATK source code.
For more clarification on Maven Inheritance vs. Aggregation, see the Maven introduction to the pom.
In GATK 3.2, except for classes with Sting in the name, all file names are still the same. To locate migrated files under the new Java package names, developers should either use IntelliJ IDEA Navigation or /bin/find to locate the same file they used previously.
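As a minimal sketch of the find approach, the function below searches a checkout for a file by its unchanged name. `locate_file` is a hypothetical helper name; only the standard `find` options `-type f` and `-name` are used.

```shell
# Hypothetical helper: print every path under a checkout whose file name matches.
# $1 = file name, e.g. GenotypingEngine.java
# $2 = root of the checkout (defaults to the current directory)
locate_file() {
  find "${2:-.}" -type f -name "$1"
}

# Usage from the top of a GATK checkout:
#   locate_file GenotypingEngine.java
```

Because file names were kept, the match that sits under one of the new artifact directories is the migrated copy.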
The biggest change most developers will face is the new package names for GATK classes. Code entanglement does not permit simply moving the classes into the correct Maven artifacts, as a small number of lines of code must be edited inside a large number of files. So, after the renaming, only a very small number of classes were moved out of the incorrect Maven artifacts as examples.
As of May 16, 2014, the migrated GATK package distribution is as follows. This list includes only main classes. The table excludes all tests, renamed files such as StingException, certain private Queue wrappers, and qscripts renamed to end in *.scala.
| Scope | Type | <= 3.1 Artifact | <= 3.1 Package | >= GATK 3.2 Artifact | >= 3.2 GATK Package | Files |
|---|---|---|---|---|---|---|
| public | java | gatk-framework | o.b.s | gatk-utils | o.b.g | 4 |
| public | java | gatk-framework | o.b.s.gatk | gatk-engine | o.b.g.engine | 2 |
| public | java | gatk-framework | o.b.s | gatk-tools-public | o.b.g | 202 |
| public | java | gatk-framework | o.b.s | gatk-tools-public | o.b.g.utils | 49 |
| public | java | gatk-framework | o.b.s | gatk-tools-public | o.b.g.engine | 34 |
| public | java | gatk-framework | o.b.s.gatk | gatk-tools-public | o.b.g.engine | 244 |
| public | java | gatk-framework | o.b.s.gatk | gatk-tools-public | o.b.g.tools | 134 |
| public | java | gatk-framework | o.b.s.gatk | gatk-tools-public | o.b.g.tools.walkers | 2 |
| protected | java | gatk-protected | o.b.s | gatk-tools-protected | o.b.g | 44 |
| protected | java | gatk-protected | o.b.s.gatk | gatk-tools-protected | o.b.g.engine | 1 |
| protected | java | gatk-protected | o.b.s.gatk | gatk-tools-protected | o.b.g.tools | 209 |
| private | java | gatk-private | o.b.s | gatk-tools-private | o.b.g | 23 |
| private | java | gatk-private | o.b.s | gatk-tools-private | o.b.g.utils | 7 |
| private | java | gatk-private | o.b.s.gatk | gatk-tools-private | o.b.g.engine | 5 |
| private | java | gatk-private | o.b.s.gatk | gatk-tools-private | o.b.g.tools | 133 |
| public | java | queue-framework | o.b.s | gatk-queue | o.b.g | 2 |
| public | scala | queue-framework | o.b.s | gatk-queue | o.b.g | 72 |
| public | scala | queue-framework | o.b.s | gatk-queue-extensions-public | o.b.g | 31 |
| public | qscripts | queue-framework | o.b.s | gatk-queue-extensions-public | o.b.g | 12 |
| private | scala | queue-private | o.b.s | gatk-queue-private | o.b.g | 2 |
| private | qscripts | queue-private | o.b.s | gatk-queue-extensions-internal | o.b.g | 118 |
During all future code modifications and pull requests, classes should be refactored into the correct artifacts and packages as follows.
All non-engine tools should be in the tools artifacts, with appropriate sub-package names.
| Scope | Type | Artifact | Package(s) |
|---|---|---|---|
| public | java | gatk-utils | o.b.g.utils |
| public | java | gatk-engine | o.b.g.engine |
| public | java | gatk-tools-public | o.b.g.tools.walkers |
| public | java | gatk-tools-public | o.b.g.tools.* |
| protected | java | gatk-tools-protected | o.b.g.tools.walkers |
| protected | java | gatk-tools-protected | o.b.g.tools.* |
| private | java | gatk-tools-private | o.b.g.tools.walkers |
| private | java | gatk-tools-private | o.b.g.tools.* |
| public | java | gatk-queue | o.b.g.queue |
| public | scala | gatk-queue | o.b.g.queue |
| public | scala | gatk-queue-extensions-public | o.b.g.queue.extensions |
| public | qscripts | gatk-queue-extensions-public | o.b.g.queue.qscripts |
| private | scala | gatk-queue-private | o.b.g.queue |
| private | qscripts | gatk-queue-extensions-internal | o.b.g.queue.qscripts |
The following class names were updated to replace Sting with GATK.
| Old Sting class | New GATK class |
|---|---|
| ArtificialStingSAMFileWriter | ArtificialGATKSAMFileWriter |
| ReviewedStingException | ReviewedGATKException |
| StingException | GATKException |
| StingSAMFileWriter | GATKSAMFileWriter |
| StingSAMIterator | GATKSAMIterator |
| StingSAMIteratorAdapter | GATKSAMIteratorAdapter |
| StingSAMRecordIterator | GATKSAMRecordIterator |
| StingTextReporter | GATKTextReporter |
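When carrying old code forward, the class renames above can be applied mechanically. The following is a sketch only: `rename_sting_classes` is a hypothetical filter, and it does plain substring replacement, so it would also rewrite any unrelated identifier that happens to contain one of these names. Review the result before committing.

```shell
# Hypothetical filter: rewrite the renamed class names in text read from stdin.
# Five substitutions cover all eight table rows, because the longer names
# (e.g. ReviewedStingException) contain the shorter ones (StingException).
rename_sting_classes() {
  sed -e 's/StingException/GATKException/g' \
      -e 's/StingSAMFileWriter/GATKSAMFileWriter/g' \
      -e 's/StingSAMIterator/GATKSAMIterator/g' \
      -e 's/StingSAMRecordIterator/GATKSAMRecordIterator/g' \
      -e 's/StingTextReporter/GATKTextReporter/g'
}

# Example: prints  throw new ReviewedGATKException("bad state");
echo 'throw new ReviewedStingException("bad state");' | rename_sting_classes
```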
The 3.2 renaming patch is actually split into two commits. The first commit renames the files without making any content changes, while the second changes the contents of the files without changing any file paths.
When dealing with renamed files, it is best to work with a clean directory during rebasing. It will be easier for you to track files that you may not have added to git.
After running a git rebase or merge, you may first run into problems with files that you renamed and that were also moved during the GATK 3.2 package renaming. As a general rule, the renaming only changes directory names. The exceptions to this rule are classes such as StingException that are renamed to GATKException, and are listed under [Renamed Classes]. The workflow for resolving these merge issues is to find the list of your renamed files, put your content in the correct location, then register the changes with git.
To obtain the list of renamed directories and files:

- git status to get a list of affected files

Then, to resolve the issue for each file:

- git rm the old paths as appropriate
- git add the new path

Upon first rebasing you will see a lot of text. At this moment, you can ignore most of it, and use git status instead.
For the purposes of illustration, while running git rebase it is perfectly normal to see something similar to:
```
$ git rebase master
First, rewinding head to replay your work on top of it...
Applying: <<< Your first commit message here >>>
Using index info to reconstruct a base tree...
A protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/GenotypingEngine.java
A protected/gatk-protected/src/test/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/GenotypingEngineUnitTest.java
<<<Other files that you renamed.>>>
warning: squelched 12 whitespace errors
warning: 34 lines add whitespace errors.
Falling back to patching base and 3-way merge...
CONFLICT (rename/rename): Rename "protected/gatk-protected/src/test/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/GenotypingEngineUnitTest.java"->"protected/gatk-tools-protected/src/test/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/GenotypingEngineUnitTest.java" in branch "HEAD" rename "protected/gatk-protected/src/test/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/GenotypingEngineUnitTest.java"->"protected/gatk-protected/src/test/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCallerGenotypingEngineUnitTest.java" in "<<< Your first commit message here >>>"
CONFLICT (rename/rename): Rename "protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/GenotypingEngine.java"->"protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/GenotypingEngine.java" in branch "HEAD" rename "protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/GenotypingEngine.java"->"protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCallerGenotypingEngine.java" in "<<< Your first commit message here >>>"
Failed to merge in the changes.
Patch failed at 0001 Example conflict.
The copy of the patch that failed is found in:
/Users/zzuser/src/gsa-unstable/.git/rebase-apply/patch
When you have resolved this problem, run "git rebase --continue".
If you prefer to skip this patch, run "git rebase --skip" instead.
To check out the original branch and stop rebasing, run "git rebase --abort".
$
```
While everything you need to resolve the issue is technically in the message above, it may be much easier to track what's going on using git status.
```
$ git status
rebase in progress; onto cba4321
You are currently rebasing branch 'zz_renaming_haplotypecallergenotypingengine' on 'cba4321'.
(fix conflicts and then run "git rebase --continue")
(use "git rebase --skip" to skip this patch)
(use "git rebase --abort" to check out the original branch)
Unmerged paths:
(use "git reset HEAD <file>..." to unstage)
(use "git add/rm <file>..." as appropriate to mark resolution)
added by them: protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCallerGenotypingEngine.java
both deleted: protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/GenotypingEngine.java
added by them: protected/gatk-protected/src/test/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCallerGenotypingEngineUnitTest.java
both deleted: protected/gatk-protected/src/test/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/GenotypingEngineUnitTest.java
added by us: protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/GenotypingEngine.java
added by us: protected/gatk-tools-protected/src/test/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/GenotypingEngineUnitTest.java
Untracked files:
(use "git add <file>..." to include in what will be committed)
<<< possible untracked files if your working directory is not clean>>>
no changes added to commit (use "git add" and/or "git commit -a")
$
```
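If the list of unmerged paths is long, it may help to extract it mechanically. The sketch below filters the machine-readable output of `git status --porcelain`, keeping only entries whose two-letter status code marks an unmerged path. `unmerged_paths` is a hypothetical helper name, and the filter assumes paths without spaces or special characters, which git would otherwise print quoted.

```shell
# Hypothetical filter: read `git status --porcelain` output on stdin and print
# only the unmerged paths. The two-letter codes are git's unmerged states,
# e.g. DD = both deleted, AU = added by us, UA = added by them.
unmerged_paths() {
  awk '{
    code = substr($0, 1, 2)
    if (code ~ /^(DD|AU|UD|UA|DU|AA|UU)$/) print substr($0, 4)
  }'
}

# Usage inside a checkout that is mid-rebase:
#   git status --porcelain | unmerged_paths
```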
Let's look at the main java file as an example. If you are having issues figuring out the new directory and new file name, they are all listed in the output.
Path in the common ancestor branch (old source directory, old package name, old file name):
protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/GenotypingEngine.java
Path in the new master branch before merge (new source directory, new package name, old file name):
protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/GenotypingEngine.java
Path in your branch before merge (old source directory, old package name, new file name):
protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCallerGenotypingEngine.java
Path in your branch post merge (new source directory, new package name, new file name):
protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/HaplotypeCallerGenotypingEngine.java
After identifying the new paths for use post merge, use the following workflow for each file:

- git rm the three old file paths: common ancestor, old directory with new file name, and new directory with old file name
- git add the new file name in the new directory

After you process all files correctly, the output of git status should show "all conflicts fixed" and all your files renamed.
```
$ git status
rebase in progress; onto cba4321
You are currently rebasing branch 'zz_renaming_haplotypecallergenotypingengine' on 'cba4321'.
(all conflicts fixed: run "git rebase --continue")
Changes to be committed:
(use "git reset HEAD <file>..." to unstage)
renamed: protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/GenotypingEngine.java -> protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/HaplotypeCallerGenotypingEngine.java
renamed: protected/gatk-tools-protected/src/test/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/GenotypingEngineUnitTest.java -> protected/gatk-tools-protected/src/test/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/HaplotypeCallerGenotypingEngineUnitTest.java
Untracked files:
(use "git add <file>..." to include in what will be committed)
<<< possible untracked files if your working directory is not clean>>>
$
```
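The per-file workflow can be sketched as a helper that prints the commands for one renamed file, so they can be reviewed before being run. `plan_rename_resolution` is a hypothetical name. The leading `mv` step is an assumption: it presumes your edited copy still sits in the old directory under its new file name, as in the git status example earlier.

```shell
# Hypothetical helper: print (do not run) the commands for one renamed file.
# $1 = old directory   $2 = new directory
# $3 = old file name   $4 = new file name
plan_rename_resolution() {
  # Assumption: your copy is still in the old directory under the new name.
  printf "mv '%s' '%s'\n" "$1/$4" "$2/$4"
  # The three old paths: common ancestor, old directory with new file name,
  # and new directory with old file name.
  printf "git rm -- '%s' '%s' '%s'\n" "$1/$3" "$1/$4" "$2/$3"
  # Register the new file name in the new directory.
  printf "git add -- '%s'\n" "$2/$4"
}

# Usage with shortened example directories:
plan_rename_resolution old/dir new/dir GenotypingEngine.java HaplotypeCallerGenotypingEngine.java
```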
Continue your rebase, handling other merges as normal.
```
$ git rebase --continue
```
Because all the package names are different in 3.2, while rebasing you may run into conflicts due to imports you also changed. Use your favorite editor to fix the imports within the files. Then try recompiling, and repeat as necessary until your code works.
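A rough first pass over import statements can be scripted. The rules below are a heuristic inferred from the package tables in this document, not an official mapping: walker sub-packages are assumed to land under org.broadinstitute.gatk.tools.walkers, the remaining org.broadinstitute.sting.gatk classes under org.broadinstitute.gatk.engine, and everything else directly under org.broadinstitute.gatk. `rewrite_sting_packages` is a hypothetical name, and every rewritten import still has to be checked by recompiling.

```shell
# Heuristic filter (an assumption, not an official mapping): rewrite
# Sting-era package names in text read from stdin.
rewrite_sting_packages() {
  # LC_ALL=C keeps [a-z] strictly lowercase, so class names are not mistaken
  # for sub-packages. The most specific rule must run first.
  LC_ALL=C sed \
    -e 's/org\.broadinstitute\.sting\.gatk\.walkers\.\([a-z][a-z0-9_]*\)\./org.broadinstitute.gatk.tools.walkers.\1./g' \
    -e 's/org\.broadinstitute\.sting\.gatk\./org.broadinstitute.gatk.engine./g' \
    -e 's/org\.broadinstitute\.sting\./org.broadinstitute.gatk./g'
}

# Usage on a single file (writes the result to stdout for review):
#   rewrite_sting_packages < MyWalker.java
```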
While editing the files with conflicts with a basic text editor may work, IntelliJ IDEA also offers a special merge tool that may help via the menu:
VCS > Git > Resolve Conflicts...
For each file, click on the "Merge" button in the first dialog. Use the various buttons in the Conflict Resolution Tool to automatically accept any changes that are not in conflict. Then find and edit any remaining conflicts that require further manual intervention.
Once you begin editing the import statements in the three way merge tool, another IntelliJ IDEA 13.1 feature that may speed up repairing blocks of import statements is Multiple Selections. Find a block of import lines that need the same changes. Hold down the option key as you drag your cursor vertically down the edit point on each import line. Then begin typing or deleting text from the multiple lines.
Even after a successful merge, you may still run into stale GATK code or links from modifications before and after the 3.2 package renaming. To significantly reduce these chances, run mvn clean before and then again after switching branches.
If this doesn't work, run mvn clean && git status, looking for any directories that shouldn't be in the current branch. It is possible that some files were not correctly moved, including classes or test resources. Find the files still in the old directories via a command such as find public/gatk-framework -type f. Then move them to the correct new directories and commit them into git.
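To check all of the old artifact directories at once, the find command can be wrapped in a loop. This is a sketch: `list_stale_files` is a hypothetical helper, and its directory list is taken from the left-hand column of the abridged mapping table, so extend it if your branch touched other renamed artifacts.

```shell
# Hypothetical helper: list files still sitting in artifact directories that
# the 3.2 renaming replaced.
# $1 = root of the checkout (defaults to the current directory)
list_stale_files() {
  root="${1:-.}"
  for dir in public/sting-root public/sting-utils public/gatk-framework \
             public/queue-framework protected/gatk-protected \
             private/gatk-private private/queue-private; do
    # Skip old directories that are already gone.
    if [ -d "$root/$dir" ]; then
      find "$root/$dir" -type f
    fi
  done
}

# Usage from the top of a checkout, after mvn clean:
#   list_stale_files
```

Empty output means none of the listed old directories still hold files.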
Due to the [Renamed Binary Packages], the separate artifacts including and excluding private code are now packaged during the Maven package build lifecycle.
When building packages, to significantly speed up the default packaging time, if you only require the GATK tools, run mvn verify -P\!queue.
Alternatively, if you do not require building private source, then disable private compiling via mvn verify -P\!private.
The two may be combined as well via: mvn verify -P\!queue,\!private.
The exclamation mark is a special character in many shells (bash uses it for history expansion), so it must be escaped, in the above case with a backslash. Shell quotes may also be used: mvn verify -P'!queue,!private'.
Alternatively, developers with access to private may often want to disable packaging the protected distributions. In this case, use the gsadev profile. This may be done via mvn verify -Pgsadev or, excluding Queue, mvn verify -Pgsadev,\!queue.
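As a quick check of the two escaping styles, the snippet below shows that the backslash form and the single-quoted form produce the same argument. It only prints the strings and does not invoke Maven; the variable names are illustrative.

```shell
# Both spellings expand to the same single argument.
escaped=-P\!queue,\!private     # backslash-escaped exclamation marks
quoted=-P'!queue,!private'      # single-quoted exclamation marks

# Prints -P!queue,!private twice.
printf '%s\n' "$escaped" "$quoted"
```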
Users see errors from Maven when an unclean repo in git is updated.
Because BaseTest.java currently hardcodes relative paths to "public/testdata", Maven creates these symbolic links all over the file system to help the various tests in different modules find that relative path.
However, our Maven support has evolved from 2.8, to 3.0, to now the 3.2 renaming, and each step has changed the symbolic link's target directory. Whenever a stale symbolic link to an old testdata directory remains in the user's folder, Maven refuses to remove the link, because Maven doesn't know why the link is pointing to the wrong folder (the answer: the link is from an old git checkout) and treats it as a bug in the build.
If you don't have a stale / unclean Maven repo when updating git via merge/rebase/checkout, you will never see this issue.
The script that can remove the stale symlinks, public/src/main/scripts/shell/delete_maven_links.sh, should run automatically during a mvn test-compile or mvn verify.
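To inspect leftover links by hand, something like the following may help. It is a hypothetical sketch and not the contents of delete_maven_links.sh. It assumes the links are named testdata and reports only those whose target no longer exists, so a stale link that points at an old directory which is still present would not be listed.

```shell
# Hypothetical sketch (not delete_maven_links.sh): print symbolic links named
# "testdata" whose target no longer exists.
# $1 = root of the checkout (defaults to the current directory)
list_dangling_testdata_links() {
  # `test -e` follows the link, so it fails when the target is missing.
  find "${1:-.}" -type l -name testdata ! -exec test -e {} \; -print
}

# Usage from the top of a checkout:
#   list_dangling_testdata_links
```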