-
Notifications
You must be signed in to change notification settings - Fork 614
BQSR: restore ability to specify custom covariates and to disable standard covariates. (Take 2) #9206
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Github actions tests reported job failures from actions build 15620149812
|
…ndard covariates Added a quick-and-dirty temporary port of RepeatLengthCovariate to test this functionality. Resolves #4568
a20d9b6
to
014145c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checkpointing what I have for now. More next week. A couple of meta-comments:
- There is no BQSRIntegration test (only an Apply integration test - or maybe thats the only way ?)
- The Spark tools/pipelines don’t handle the new args added to the arg collection. It may not be worth integrating them there, but we should at least reject them if/when they’re specified.
* @param value Covariate value | ||
* @return Split pair | ||
*/ | ||
public static Pair<String,Integer> getRUandNRfromCovariate(final String value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is unused. Can it be removed ?
* @param RAC the recalibration argument collection | ||
* @param readGroups (only used by the ReadGroup covariate --- consider refactoring) | ||
*/ | ||
public void initialize(final RecalibrationArgumentCollection RAC, final List<String> readGroups); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a big fan of adding this method, since it requires many implementing classes which are currently immutable to now use mutable variables. We could just use a constructor as was done previously (instead of assuming a no-arg constructor, you could find and call a constructor with the expected signature (final RecalibrationArgumentCollection RAC, final List<String> readGroups
), which can you do via reflection.
Alternatively, if you want to keep this method, then I would mark it as "default", and add an empty implementation. Then at least the implementations that don't use these args won't have to implement awkward boilerplate code that does nothing.
Also, public is redundant in interfaces (although it looks like the existing code already uses it).
@Override | ||
public void initialize(final RecalibrationArgumentCollection RAC, final List<String> readGroups) { | ||
MAX_STR_UNIT_LENGTH = 8; | ||
MAX_REPEAT_LENGTH = 20; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit weird; these are constants that are initialized to static values that have nothing to do with the arguments. These values should be set where the variables are declared, with the variables made final. If you make a default no-op implementation of this method in the interface definiton (see my comment elsewhere), this override can go be eliminated entirely.
@Override | ||
public void recordValues( final GATKRead read, final SAMFileHeader header, final PerReadCovariateMatrix values, final boolean recordIndelValues) { | ||
// store the original bases and then write Ns over low quality ones | ||
final byte[] originalBases = Arrays.copyOf(read.getBases(), read.getBases().length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a lot of unnecessary/redundant allocation/copying of the read bases in this method (which is probably contributing to the perf problems you've mentioned). GATKRead.readBases is documented as making a defensive copy to start with. Thats 3 copies just for this line, plus the additional call below, without even considering the conditional reverse complementing.
// | ||
// backward repeat unit *includes* the offset base (unlike the forward repeat unit) | ||
final byte[] backwardRepeatUnit = Arrays.copyOfRange(readBases, offset - strSize + 1, offset + 1); | ||
String backwardRepeatUnitStr = new String(backwardRepeatUnit, StandardCharsets.UTF_8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This string is unused.
|
||
// get forward repeat unit and # repeats (offset + 1 .... so we don't include the base at offset...) | ||
byte[] forwardRepeatUnit = Arrays.copyOfRange(readBases, offset + 1, offset + strLength + 1); | ||
String forwardRepeatUnitStr = new String(forwardRepeatUnit, StandardCharsets.UTF_8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This string is unused.
final boolean negativeStrand = read.isReverseStrand(); | ||
byte[] bases = read.getBases(); | ||
if (negativeStrand) | ||
bases = BaseUtils.simpleReverseComplement(bases); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Brackets for all these conditionals please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, it seems like you don't really need ANY copies for the forward strand case.
|
||
// don't record reads with N's | ||
if (!BaseUtils.isAllRegularBases(bases)) | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the perf issues you've described, it might be worth trying to do this short circuit before any copying happens.
} | ||
|
||
protected String getCovariateValueFromUnitAndLength(final byte[] repeatFromUnitAndLength, final int repeatLength) { | ||
return String.format("%d", repeatLength); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
repeatFromUnitAndLength is unused.
final Pair<byte[], Integer> res = findTandemRepeatUnits(bases, i); | ||
// to merge repeat unit and repeat length to get covariate value: | ||
final String repeatID = getCovariateValueFromUnitAndLength(res.getLeft(), res.getRight()); | ||
final int key = keyForRepeat(repeatID); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lines 49-50 take an int and convert it to a String to use as a map key. It seems like storing the repeatID in the map as an Integer might eliminate a lot of these String conversions. Not sure if that causes other problems, as I'm still understanding how these covariates work, but it might be better to keep ints internally, and then only convert to string when required by the covariate interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quite a bit more here to address, and a bunch more tests required. I still haven't even looked at the actual recal tables yet.
private final Map<Class<? extends Covariate>, Integer> indexByClass; | ||
|
||
private static final List<String> REQUIRED_COVARIATE_NAMES = | ||
Collections.unmodifiableList(Arrays.asList("ReadGroupCovariate", "QualityScoreCovariate")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since these strings are/must be the names of classes, they should really be initialized from the class names themselves (i.e., ReadGroupCovariate.class.getSimpleName()
- there are a few name methods so you might have to verify that getSimpleName
is the correct one to use).
|
||
public static boolean isStandardCovariate(final String covariateName) { | ||
return STANDARD_COVARIATE_NAMES.contains(covariateName); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two methods (isRequiredCovariate
and isStandardCovariate
) are public but are only used internal to this class. Unless there is some very compelling reason to keep them, they should either be private or eliminated entirely.
|
||
for ( final String covariatePackage : COVARIATE_PACKAGES ) { | ||
classFinder.find(covariatePackage, Covariate.class); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if this is documented elsewhere, but somewhere it should be stated that in order for a covariate to be discoverable, it needs to be in the covariates package.
Collections.unmodifiableList(Arrays.asList("ContextCovariate", "CycleCovariate")); | ||
|
||
public static final List<String> COVARIATE_PACKAGES = | ||
Collections.unmodifiableList(Arrays.asList("org.broadinstitute.hellbender.utils.recalibration.covariates")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
COVARIATE_PACKAGES
-> COVARIATE_PACKAGE
? Its a bit weird to have this as a list, and have a plural name, when it appears that only a single package is supported. COVARIATE_PACKAGE
.
public static final int BASE_QUALITY_COVARIATE_DEFAULT_INDEX = 1; | ||
public static final int CONTEXT_COVARIATE_DEFAULT_INDEX = 2; | ||
public static final int CYCLE_COVARIATE_DEFAULT_INDEX = 3; | ||
public static final int NUM_REQUIRED_COVARITES = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is misspelled (NUM_REQUIRED_COVARITES
-> NUM_REQUIRED_COVARIATES
), but more importantly, its redundant with the similarly named variable in RecalUtils (which is the one used by the recal engine).
try { | ||
@SuppressWarnings("unchecked") | ||
final Covariate covariate = ((Class<? extends Covariate>)covariateClass).getDeclaredConstructor().newInstance(); | ||
covariate.initialize(rac, allReadGroups); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned elsewhere, this COULD just look for a constructor with the right args.
*/ | ||
public static int numberOfRequiredCovariates() { | ||
return REQUIRED_COVARIATE_NAMES.size(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is unused. Ideally it would be removed, but if you need to keep it, it should rely on the static named constant that means the same thing.
* | ||
*/ | ||
@Argument(fullName = COVARIATES_LONG_NAME, shortName = COVARIATES_SHORT_NAME, doc = "One or more covariates to be used in the recalibration. Can be specified multiple times", optional = true) | ||
public List<String> COVARIATES = new ArrayList<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this really only "custom covariates" ? If so, it would be clearer if the variable's name reflected that.
* Also, unless --no-standard-covariates is specified, the Cycle and Context covariates are standard and are included by default. | ||
* Use the --list argument to see the available covariates. | ||
* | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its good that the doc for this references the default and required stuff, but would be clearer if it stated that it's for "custom" or "additional" covariates that are not hte default or required ones, i.e.:
/**
Custom or additional covariates. ReadGroup and QualityScore covariates are required and do not need to be specified. Also, unless --no-standard-covariates is specified, the Cycle and Context covariates are standard and are included by default.
*/
Or something like that.
final String thisRole, final String otherRole) { | ||
|
||
final Set<String> beforeNames = new HashSet<>(this.COVARIATES.size()); | ||
final Set<String> afterNames = new HashSet<>(other.COVARIATES.size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is only comparing the "custom" covariates (since its using the COVARIATES variable) ? Is that correct ?
This PR continues @droazen's prior PR #4688 from circa 2018, where he restored the ability to add custom covariates, which I've been told was a feature in GATK3 but was removed from GATK4 for performance reasons. (@lbergelson please correct me if I'm wrong.) He also added a sample custom covariate,
RepeatLengthCovariate
.I rebased the branch, added comments where needed, and added tests.