Skip to content

Conversation

takutosato
Copy link
Contributor

This PR continues @droazen's prior PR #4688 from circa 2018, where he restored the ability to add custom covariates, which I've been told was a feature in GATK3 but was removed from GATK4 for performance reasons. (@lbergelson please correct me if I'm wrong.) He also added a sample custom covariate, RepeatLengthCovariate.

I rebased the branch, added comments where needed, and added tests.

@gatk-bot
Copy link

gatk-bot commented Jun 12, 2025

Github actions tests reported job failures from actions build 15620149812
Failures in the following jobs:

Test Type JDK Job ID Logs
cloud 17.0.6+10 15620149812.10 logs
integration 17.0.6+10 15620149812.11 logs
integration 17.0.6+10 15620149812.0 logs

Copy link
Collaborator

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checkpointing what I have for now. More next week. A couple of meta-comments:

  • There is no BQSRIntegration test (only an Apply integration test - or maybe thats the only way ?)
  • The Spark tools/pipelines don’t handle the new args added to the arg collection. It may not be worth integrating them there, but we should at least reject them if/when they’re specified.

* @param value Covariate value
* @return Split pair
*/
public static Pair<String,Integer> getRUandNRfromCovariate(final String value) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is unused. Can it be removed ?

* @param RAC the recalibration argument collection
* @param readGroups (only used by the ReadGroup covariate --- consider refactoring)
*/
public void initialize(final RecalibrationArgumentCollection RAC, final List<String> readGroups);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a big fan of adding this method, since it requires many implementing classes which are currently immutable to now use mutable variables. We could just use a constructor as was done previously (instead of assuming a no-arg constructor, you could find and call a constructor with the expected signature (final RecalibrationArgumentCollection RAC, final List<String> readGroups), which can you do via reflection.

Alternatively, if you want to keep this method, then I would mark it as "default", and add an empty implementation. Then at least the implementations that don't use these args won't have to implement awkward boilerplate code that does nothing.

Also, public is redundant in interfaces (although it looks like the existing code already uses it).

@Override
public void initialize(final RecalibrationArgumentCollection RAC, final List<String> readGroups) {
MAX_STR_UNIT_LENGTH = 8;
MAX_REPEAT_LENGTH = 20;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit weird; these are constants that are initialized to static values that have nothing to do with the arguments. These values should be set where the variables are declared, with the variables made final. If you make a default no-op implementation of this method in the interface definiton (see my comment elsewhere), this override can go be eliminated entirely.

@Override
public void recordValues( final GATKRead read, final SAMFileHeader header, final PerReadCovariateMatrix values, final boolean recordIndelValues) {
// store the original bases and then write Ns over low quality ones
final byte[] originalBases = Arrays.copyOf(read.getBases(), read.getBases().length);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a lot of unnecessary/redundant allocation/copying of the read bases in this method (which is probably contributing to the perf problems you've mentioned). GATKRead.readBases is documented as making a defensive copy to start with. Thats 3 copies just for this line, plus the additional call below, without even considering the conditional reverse complementing.

//
// backward repeat unit *includes* the offset base (unlike the forward repeat unit)
final byte[] backwardRepeatUnit = Arrays.copyOfRange(readBases, offset - strSize + 1, offset + 1);
String backwardRepeatUnitStr = new String(backwardRepeatUnit, StandardCharsets.UTF_8);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This string is unused.


// get forward repeat unit and # repeats (offset + 1 .... so we don't include the base at offset...)
byte[] forwardRepeatUnit = Arrays.copyOfRange(readBases, offset + 1, offset + strLength + 1);
String forwardRepeatUnitStr = new String(forwardRepeatUnit, StandardCharsets.UTF_8);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This string is unused.

final boolean negativeStrand = read.isReverseStrand();
byte[] bases = read.getBases();
if (negativeStrand)
bases = BaseUtils.simpleReverseComplement(bases);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Brackets for all these conditionals please.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it seems like you don't really need ANY copies for the forward strand case.


// don't record reads with N's
if (!BaseUtils.isAllRegularBases(bases))
return;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the perf issues you've described, it might be worth trying to do this short circuit before any copying happens.

}

protected String getCovariateValueFromUnitAndLength(final byte[] repeatFromUnitAndLength, final int repeatLength) {
return String.format("%d", repeatLength);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

repeatFromUnitAndLength is unused.

final Pair<byte[], Integer> res = findTandemRepeatUnits(bases, i);
// to merge repeat unit and repeat length to get covariate value:
final String repeatID = getCovariateValueFromUnitAndLength(res.getLeft(), res.getRight());
final int key = keyForRepeat(repeatID);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lines 49-50 take an int and convert it to a String to use as a map key. It seems like storing the repeatID in the map as an Integer might eliminate a lot of these String conversions. Not sure if that causes other problems, as I'm still understanding how these covariates work, but it might be better to keep ints internally, and then only convert to string when required by the covariate interface.

Copy link
Collaborator

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quite a bit more here to address, and a bunch more tests required. I still haven't even looked at the actual recal tables yet.

private final Map<Class<? extends Covariate>, Integer> indexByClass;

private static final List<String> REQUIRED_COVARIATE_NAMES =
Collections.unmodifiableList(Arrays.asList("ReadGroupCovariate", "QualityScoreCovariate"));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since these strings are/must be the names of classes, they should really be initialized from the class names themselves (i.e., ReadGroupCovariate.class.getSimpleName() - there are a few name methods so you might have to verify that getSimpleName is the correct one to use).


public static boolean isStandardCovariate(final String covariateName) {
return STANDARD_COVARIATE_NAMES.contains(covariateName);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two methods (isRequiredCovariate and isStandardCovariate) are public but are only used internal to this class. Unless there is some very compelling reason to keep them, they should either be private or eliminated entirely.


for ( final String covariatePackage : COVARIATE_PACKAGES ) {
classFinder.find(covariatePackage, Covariate.class);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is documented elsewhere, but somewhere it should be stated that in order for a covariate to be discoverable, it needs to be in the covariates package.

Collections.unmodifiableList(Arrays.asList("ContextCovariate", "CycleCovariate"));

public static final List<String> COVARIATE_PACKAGES =
Collections.unmodifiableList(Arrays.asList("org.broadinstitute.hellbender.utils.recalibration.covariates"));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

COVARIATE_PACKAGES -> COVARIATE_PACKAGE ? Its a bit weird to have this as a list, and have a plural name, when it appears that only a single package is supported. COVARIATE_PACKAGE.

public static final int BASE_QUALITY_COVARIATE_DEFAULT_INDEX = 1;
public static final int CONTEXT_COVARIATE_DEFAULT_INDEX = 2;
public static final int CYCLE_COVARIATE_DEFAULT_INDEX = 3;
public static final int NUM_REQUIRED_COVARITES = 2;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is misspelled (NUM_REQUIRED_COVARITES -> NUM_REQUIRED_COVARIATES), but more importantly, its redundant with the similarly named variable in RecalUtils (which is the one used by the recal engine).

try {
@SuppressWarnings("unchecked")
final Covariate covariate = ((Class<? extends Covariate>)covariateClass).getDeclaredConstructor().newInstance();
covariate.initialize(rac, allReadGroups);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned elsewhere, this COULD just look for a constructor with the right args.

*/
public static int numberOfRequiredCovariates() {
return REQUIRED_COVARIATE_NAMES.size();
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is unused. Ideally it would be removed, but if you need to keep it, it should rely on the static named constant that means the same thing.

*
*/
@Argument(fullName = COVARIATES_LONG_NAME, shortName = COVARIATES_SHORT_NAME, doc = "One or more covariates to be used in the recalibration. Can be specified multiple times", optional = true)
public List<String> COVARIATES = new ArrayList<>();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this really only "custom covariates" ? If so, it would be clearer if the variable's name reflected that.

* Also, unless --no-standard-covariates is specified, the Cycle and Context covariates are standard and are included by default.
* Use the --list argument to see the available covariates.
*
*/
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its good that the doc for this references the default and required stuff, but would be clearer if it stated that it's for "custom" or "additional" covariates that are not hte default or required ones, i.e.:
/**
Custom or additional covariates. ReadGroup and QualityScore covariates are required and do not need to be specified. Also, unless --no-standard-covariates is specified, the Cycle and Context covariates are standard and are included by default.
*/
Or something like that.

final String thisRole, final String otherRole) {

final Set<String> beforeNames = new HashSet<>(this.COVARIATES.size());
final Set<String> afterNames = new HashSet<>(other.COVARIATES.size());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is only comparing the "custom" covariates (since its using the COVARIATES variable) ? Is that correct ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants