BQSR: restore ability to specify custom covariates and to disable standard covariates. (Take 2) #9206

takutosato · 2025-06-12T20:27:46Z

This PR continues @droazen's prior PR #4688 from circa 2018, where he restored the ability to add custom covariates, which I've been told was a feature in GATK3 but was removed from GATK4 for performance reasons. (@lbergelson please correct me if I'm wrong.) He also added a sample custom covariate, RepeatLengthCovariate.

I rebased the branch, added comments where needed, and added tests.

gatk-bot · 2025-06-12T20:53:05Z

Github actions tests reported job failures from actions build 15620149812
Failures in the following jobs:

Test Type	JDK	Job ID	Logs
cloud	17.0.6+10	15620149812.10	logs
integration	17.0.6+10	15620149812.11	logs
integration	17.0.6+10	15620149812.0	logs

…ndard covariates Added a quick-and-dirty temporary port of RepeatLengthCovariate to test this functionality. Resolves #4568

cmnbroad

Checkpointing what I have for now. More next week. A couple of meta-comments:

There is no BQSRIntegration test (only an Apply integration test - or maybe thats the only way ?)
The Spark tools/pipelines don’t handle the new args added to the arg collection. It may not be worth integrating them there, but we should at least reject them if/when they’re specified.

cmnbroad · 2025-07-21T18:58:44Z

...java/org/broadinstitute/hellbender/utils/recalibration/covariates/RepeatLengthCovariate.java

+     * @param value             Covariate value
+     * @return                  Split pair
+     */
+    public static Pair<String,Integer> getRUandNRfromCovariate(final String value) {


This method is unused. Can it be removed ?

cmnbroad · 2025-07-21T20:18:12Z

src/main/java/org/broadinstitute/hellbender/utils/recalibration/covariates/Covariate.java

+     * @param RAC the recalibration argument collection
+     * @param readGroups (only used by the ReadGroup covariate --- consider refactoring)
+     */
+    public void initialize(final RecalibrationArgumentCollection RAC, final List<String> readGroups);


I'm not a big fan of adding this method, since it requires many implementing classes which are currently immutable to now use mutable variables. We could just use a constructor as was done previously (instead of assuming a no-arg constructor, you could find and call a constructor with the expected signature (final RecalibrationArgumentCollection RAC, final List<String> readGroups), which can you do via reflection.

Alternatively, if you want to keep this method, then I would mark it as "default", and add an empty implementation. Then at least the implementations that don't use these args won't have to implement awkward boilerplate code that does nothing.

Also, public is redundant in interfaces (although it looks like the existing code already uses it).

cmnbroad · 2025-07-21T20:25:46Z

...java/org/broadinstitute/hellbender/utils/recalibration/covariates/RepeatLengthCovariate.java

+    @Override
+    public void initialize(final RecalibrationArgumentCollection RAC, final List<String> readGroups) {
+        MAX_STR_UNIT_LENGTH = 8;
+        MAX_REPEAT_LENGTH = 20;


This is a bit weird; these are constants that are initialized to static values that have nothing to do with the arguments. These values should be set where the variables are declared, with the variables made final. If you make a default no-op implementation of this method in the interface definiton (see my comment elsewhere), this override can go be eliminated entirely.

cmnbroad · 2025-07-21T20:38:23Z

...java/org/broadinstitute/hellbender/utils/recalibration/covariates/RepeatLengthCovariate.java

+    @Override
+    public void recordValues( final GATKRead read, final SAMFileHeader header, final PerReadCovariateMatrix values, final boolean recordIndelValues) {
+        // store the original bases and then write Ns over low quality ones
+        final byte[] originalBases = Arrays.copyOf(read.getBases(), read.getBases().length);


There is a lot of unnecessary/redundant allocation/copying of the read bases in this method (which is probably contributing to the perf problems you've mentioned). GATKRead.readBases is documented as making a defensive copy to start with. Thats 3 copies just for this line, plus the additional call below, without even considering the conditional reverse complementing.

cmnbroad · 2025-07-21T20:45:13Z

...java/org/broadinstitute/hellbender/utils/recalibration/covariates/RepeatLengthCovariate.java

+            //
+            // backward repeat unit *includes* the offset base (unlike the forward repeat unit)
+            final byte[] backwardRepeatUnit = Arrays.copyOfRange(readBases, offset - strSize + 1, offset + 1);
+            String backwardRepeatUnitStr = new String(backwardRepeatUnit, StandardCharsets.UTF_8);


This string is unused.

cmnbroad · 2025-07-21T20:48:14Z

...java/org/broadinstitute/hellbender/utils/recalibration/covariates/RepeatLengthCovariate.java

+
+                // get forward repeat unit and # repeats (offset + 1 .... so we don't include the base at offset...)
+                byte[] forwardRepeatUnit = Arrays.copyOfRange(readBases, offset + 1, offset + strLength + 1);
+                String forwardRepeatUnitStr = new String(forwardRepeatUnit, StandardCharsets.UTF_8);


This string is unused.

cmnbroad · 2025-07-21T20:48:44Z

...java/org/broadinstitute/hellbender/utils/recalibration/covariates/RepeatLengthCovariate.java

+        final boolean negativeStrand = read.isReverseStrand();
+        byte[] bases = read.getBases();
+        if (negativeStrand)
+            bases = BaseUtils.simpleReverseComplement(bases);


Brackets for all these conditionals please.

Also, it seems like you don't really need ANY copies for the forward strand case.

cmnbroad · 2025-07-21T20:51:47Z

...java/org/broadinstitute/hellbender/utils/recalibration/covariates/RepeatLengthCovariate.java

+
+        // don't record reads with N's
+        if (!BaseUtils.isAllRegularBases(bases))
+            return;


Given the perf issues you've described, it might be worth trying to do this short circuit before any copying happens.

cmnbroad · 2025-07-21T21:04:08Z

...java/org/broadinstitute/hellbender/utils/recalibration/covariates/RepeatLengthCovariate.java

+    }
+
+    protected String getCovariateValueFromUnitAndLength(final byte[] repeatFromUnitAndLength, final int repeatLength) {
+        return String.format("%d", repeatLength);


repeatFromUnitAndLength is unused.

cmnbroad · 2025-07-21T21:08:41Z

...java/org/broadinstitute/hellbender/utils/recalibration/covariates/RepeatLengthCovariate.java

+            final Pair<byte[], Integer> res = findTandemRepeatUnits(bases, i);
+            // to merge repeat unit and repeat length to get covariate value:
+            final String repeatID =  getCovariateValueFromUnitAndLength(res.getLeft(),  res.getRight());
+            final int key = keyForRepeat(repeatID);


Lines 49-50 take an int and convert it to a String to use as a map key. It seems like storing the repeatID in the map as an Integer might eliminate a lot of these String conversions. Not sure if that causes other problems, as I'm still understanding how these covariates work, but it might be better to keep ints internally, and then only convert to string when required by the covariate interface.

cmnbroad

Quite a bit more here to address, and a bunch more tests required. I still haven't even looked at the actual recal tables yet.

cmnbroad · 2025-07-28T14:38:08Z

...ain/java/org/broadinstitute/hellbender/utils/recalibration/covariates/BQSRCovariateList.java

+    private final Map<Class<? extends Covariate>, Integer> indexByClass;
+
+    private static final List<String> REQUIRED_COVARIATE_NAMES =
+            Collections.unmodifiableList(Arrays.asList("ReadGroupCovariate", "QualityScoreCovariate"));


Since these strings are/must be the names of classes, they should really be initialized from the class names themselves (i.e., ReadGroupCovariate.class.getSimpleName() - there are a few name methods so you might have to verify that getSimpleName is the correct one to use).

cmnbroad · 2025-07-28T14:39:16Z

...ain/java/org/broadinstitute/hellbender/utils/recalibration/covariates/BQSRCovariateList.java

+
+    public static boolean isStandardCovariate(final String covariateName) {
+        return STANDARD_COVARIATE_NAMES.contains(covariateName);
+    }


These two methods (isRequiredCovariate and isStandardCovariate) are public but are only used internal to this class. Unless there is some very compelling reason to keep them, they should either be private or eliminated entirely.

cmnbroad · 2025-07-28T14:41:55Z

...ain/java/org/broadinstitute/hellbender/utils/recalibration/covariates/BQSRCovariateList.java

+
+        for ( final String covariatePackage : COVARIATE_PACKAGES ) {
+            classFinder.find(covariatePackage, Covariate.class);
+        }


I'm not sure if this is documented elsewhere, but somewhere it should be stated that in order for a covariate to be discoverable, it needs to be in the covariates package.

cmnbroad · 2025-07-28T14:43:13Z

...ain/java/org/broadinstitute/hellbender/utils/recalibration/covariates/BQSRCovariateList.java

+            Collections.unmodifiableList(Arrays.asList("ContextCovariate", "CycleCovariate"));
+
+    public static final List<String> COVARIATE_PACKAGES =
+            Collections.unmodifiableList(Arrays.asList("org.broadinstitute.hellbender.utils.recalibration.covariates"));


COVARIATE_PACKAGES -> COVARIATE_PACKAGE ? Its a bit weird to have this as a list, and have a plural name, when it appears that only a single package is supported. COVARIATE_PACKAGE.

cmnbroad · 2025-07-28T14:45:24Z

...ain/java/org/broadinstitute/hellbender/utils/recalibration/covariates/BQSRCovariateList.java

+    public static final int BASE_QUALITY_COVARIATE_DEFAULT_INDEX = 1;
+    public static final int CONTEXT_COVARIATE_DEFAULT_INDEX = 2;
+    public static final int CYCLE_COVARIATE_DEFAULT_INDEX = 3;
+    public static final int NUM_REQUIRED_COVARITES = 2;


This is misspelled (NUM_REQUIRED_COVARITES -> NUM_REQUIRED_COVARIATES), but more importantly, its redundant with the similarly named variable in RecalUtils (which is the one used by the recal engine).

cmnbroad · 2025-07-28T20:44:09Z

...ain/java/org/broadinstitute/hellbender/utils/recalibration/covariates/BQSRCovariateList.java

+                try {
+                    @SuppressWarnings("unchecked")
+                    final Covariate covariate = ((Class<? extends Covariate>)covariateClass).getDeclaredConstructor().newInstance();
+                    covariate.initialize(rac, allReadGroups);


As mentioned elsewhere, this COULD just look for a constructor with the right args.

cmnbroad · 2025-07-28T20:51:31Z

...ain/java/org/broadinstitute/hellbender/utils/recalibration/covariates/BQSRCovariateList.java

+     */
+    public static int numberOfRequiredCovariates() {
+        return REQUIRED_COVARIATE_NAMES.size();
+    }


This method is unused. Ideally it would be removed, but if you need to keep it, it should rely on the static named constant that means the same thing.

cmnbroad · 2025-07-28T20:56:54Z

.../java/org/broadinstitute/hellbender/utils/recalibration/RecalibrationArgumentCollection.java

+     *
+     */
+    @Argument(fullName = COVARIATES_LONG_NAME, shortName = COVARIATES_SHORT_NAME, doc = "One or more covariates to be used in the recalibration. Can be specified multiple times", optional = true)
+    public List<String> COVARIATES = new ArrayList<>();


Isn't this really only "custom covariates" ? If so, it would be clearer if the variable's name reflected that.

cmnbroad · 2025-07-28T21:00:23Z

.../java/org/broadinstitute/hellbender/utils/recalibration/RecalibrationArgumentCollection.java

+     * Also, unless --no-standard-covariates is specified, the Cycle and Context covariates are standard and are included by default.
+     * Use the --list argument to see the available covariates.
+     *
+     */


Its good that the doc for this references the default and required stuff, but would be clearer if it stated that it's for "custom" or "additional" covariates that are not hte default or required ones, i.e.:
/**
Custom or additional covariates. ReadGroup and QualityScore covariates are required and do not need to be specified. Also, unless --no-standard-covariates is specified, the Cycle and Context covariates are standard and are included by default.
*/
Or something like that.

cmnbroad · 2025-07-28T21:01:50Z

.../java/org/broadinstitute/hellbender/utils/recalibration/RecalibrationArgumentCollection.java

+                                               final String thisRole, final String otherRole) {
+
+        final Set<String> beforeNames = new HashSet<>(this.COVARIATES.size());
+        final Set<String> afterNames = new HashSet<>(other.COVARIATES.size());


So this is only comparing the "custom" covariates (since its using the COVARIATES variable) ? Is that correct ?

droazen and others added 9 commits July 21, 2025 12:52

BQSR: restore ability to specify custom covariates and to disable sta…

03b0478

…ndard covariates Added a quick-and-dirty temporary port of RepeatLengthCovariate to test this functionality. Resolves #4568

Fix tests

b21d32e

rebase clean up

255fb76

unit tests passing

3fcc21e

getting closer to review

759d454

cleaning up

8362803

more cleaning up

82f33ca

remove experiments

b92f28d

proofreading

014145c

takutosato force-pushed the ts_david_roazen branch from a20d9b6 to 014145c Compare July 21, 2025 16:53

cmnbroad requested changes Jul 21, 2025

View reviewed changes

cmnbroad reviewed Jul 28, 2025

View reviewed changes

BQSR: restore ability to specify custom covariates and to disable standard covariates. (Take 2) #9206

Are you sure you want to change the base?

BQSR: restore ability to specify custom covariates and to disable standard covariates. (Take 2) #9206

Uh oh!

Conversation

takutosato commented Jun 12, 2025

Uh oh!

gatk-bot commented Jun 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cmnbroad left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cmnbroad left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

gatk-bot commented Jun 12, 2025 •

edited

Loading

cmnbroad left a comment •

edited

Loading