Skip to content

Conversation

ForeverAngry
Copy link
Contributor

@ForeverAngry ForeverAngry commented Aug 22, 2025

GitHub Copilot

Rationale for this change

Add richer snapshot expiration retention capabilities:

  • retain_last_n(n): keep N most recent snapshots regardless of age.
  • older_than_with_retention(...): combine max age filter with retain_last_n and a minimum snapshots floor.
  • with_retention_policy(...): unified API that also consumes table property defaults:
    • history.expire.max-snapshot-age-ms
    • history.expire.min-snapshots-to-keep
    • history.expire.max-ref-age-ms (plumbed for future ref expiration logic)
      Internal helpers:
  • _get_expiration_properties: parses table properties.
  • _get_snapshots_to_expire_with_retention: core selection logic factoring protected refs (branches/tags), age, recency, and minimum keep guardrail.
    Motivation: safer, configurable space reclamation while preventing accidental over‑expiration.
  • Fixed the return type here to be consistent with the return type declaration for the rest of the class ExpireSnapshot -> "ExpireSnapshots"

Are these changes tested?

Yes. Extended test_expire_snapshots.py with new cases:

  • test_retain_last_n_with_protection
  • test_older_than_with_retention_combination
  • test_with_retention_policy_defaults
  • test_get_expiration_properties
  • test_get_snapshots_to_expire_with_retention_respects_protection
    Tests cover interaction of all parameters, property default fallback, protection of branch/tag snapshots, and minimum keep enforcement.

Are there any user-facing changes?

Yes :

  • New public methods: retain_last_n, older_than_with_retention, with_retention_policy on ExpireSnapshots.
  • Table properties history.expire.max-snapshot-age-ms and history.expire.min-snapshots-to-keep now auto-applied by with_retention_policy.
    No breaking changes to existing APIs; older_than and by_id/by_ids behavior unchanged. Documentation/changelog should be updated to reflect new retention strategy APIs and property usage.

@ForeverAngry ForeverAngry marked this pull request as ready for review August 31, 2025 13:18
Copy link
Contributor

@Anton-Tarazi Anton-Tarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job! Left some comments

Comment on lines 1109 to 1127
protected_ids = self._get_protected_snapshot_ids()

# Sort snapshots by timestamp (most recent first)
sorted_snapshots = sorted(self._transaction.table_metadata.snapshots, key=lambda s: s.timestamp_ms, reverse=True)

# Keep the last N snapshots and all protected ones
snapshots_to_keep = set()
snapshots_to_keep.update(protected_ids)

# Add the N most recent snapshots
for i, snapshot in enumerate(sorted_snapshots):
if i < n:
snapshots_to_keep.add(snapshot.snapshot_id)

# Find snapshots to expire
snapshots_to_expire = []
for snapshot in self._transaction.table_metadata.snapshots:
if snapshot.snapshot_id not in snapshots_to_keep:
snapshots_to_expire.append(snapshot.snapshot_id)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
protected_ids = self._get_protected_snapshot_ids()
# Sort snapshots by timestamp (most recent first)
sorted_snapshots = sorted(self._transaction.table_metadata.snapshots, key=lambda s: s.timestamp_ms, reverse=True)
# Keep the last N snapshots and all protected ones
snapshots_to_keep = set()
snapshots_to_keep.update(protected_ids)
# Add the N most recent snapshots
for i, snapshot in enumerate(sorted_snapshots):
if i < n:
snapshots_to_keep.add(snapshot.snapshot_id)
# Find snapshots to expire
snapshots_to_expire = []
for snapshot in self._transaction.table_metadata.snapshots:
if snapshot.snapshot_id not in snapshots_to_keep:
snapshots_to_expire.append(snapshot.snapshot_id)
snapshots_to_keep = self._get_protected_snapshot_ids()
# Sort snapshots by timestamp (most recent first), and get most recent N
sorted_snapshots = sorted(self._transaction.table_metadata.snapshots, key=lambda s: s.timestamp_ms, reverse=True)
snapshots_to_keep.update(snapshot.snapshot_id for snapshot in sorted_snapshots[:n])
snapshots_to_expire = [id for snapshot in self._transaction.table_metadata.snapshots if (id := snapshot.snapshot_id) not in snapshots_to_keep]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small syntax change to make more pythonic :)

"""
properties = self._transaction.table_metadata.properties

max_snapshot_age_ms = properties.get("history.expire.max-snapshot-age-ms")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this string and the default value be a named constant somewhere? What do you think about using property_as_int from properties.py to be consistent with how properties are handled elsewhere?

}

def by_id(self, snapshot_id: int) -> ExpireSnapshots:
def by_id(self, snapshot_id: int) -> "ExpireSnapshots":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwiw since we have from __future__ import annotations at the top of the file I think its cleaner to make things consistent to not have quotes. Probably outside of the scope of this PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thats a great point. I can log an issue to address these all at one.

Comment on lines 1145 to 1158
protected_ids = self._get_protected_snapshot_ids()

# Sort snapshots by timestamp (most recent first)
sorted_snapshots = sorted(self._transaction.table_metadata.snapshots, key=lambda s: s.timestamp_ms, reverse=True)

# Start with all snapshots that could be expired
candidates_for_expiration = []
snapshots_to_keep = set(protected_ids)

# Apply retain_last_n constraint
if retain_last_n is not None:
for i, snapshot in enumerate(sorted_snapshots):
if i < retain_last_n:
snapshots_to_keep.add(snapshot.snapshot_id)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code is the same as in retain_last_n, can we refactor to its own function? I think we also need to handle branches and take the last n of each branch

Comment on lines +1161 to +1163
for snapshot in self._transaction.table_metadata.snapshots:
if snapshot.snapshot_id not in snapshots_to_keep and (timestamp_ms is None or snapshot.timestamp_ms < timestamp_ms):
candidates_for_expiration.append(snapshot)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make more pythonic with comprehension?

Comment on lines 1165 to 1180
# Sort candidates by timestamp (oldest first) for potential expiration
candidates_for_expiration.sort(key=lambda s: s.timestamp_ms)

# Apply min_snapshots_to_keep constraint
total_snapshots = len(self._transaction.table_metadata.snapshots)
snapshots_to_expire: List[int] = []

for candidate in candidates_for_expiration:
# Check if expiring this snapshot would violate min_snapshots_to_keep
remaining_after_expiration = total_snapshots - len(snapshots_to_expire) - 1

if min_snapshots_to_keep is None or remaining_after_expiration >= min_snapshots_to_keep:
snapshots_to_expire.append(candidate.snapshot_id)
else:
# Stop expiring to maintain minimum count
break
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Sort candidates by timestamp (oldest first) for potential expiration
candidates_for_expiration.sort(key=lambda s: s.timestamp_ms)
# Apply min_snapshots_to_keep constraint
total_snapshots = len(self._transaction.table_metadata.snapshots)
snapshots_to_expire: List[int] = []
for candidate in candidates_for_expiration:
# Check if expiring this snapshot would violate min_snapshots_to_keep
remaining_after_expiration = total_snapshots - len(snapshots_to_expire) - 1
if min_snapshots_to_keep is None or remaining_after_expiration >= min_snapshots_to_keep:
snapshots_to_expire.append(candidate.snapshot_id)
else:
# Stop expiring to maintain minimum count
break
# Sort candidates by timestamp (newest first) for potential expiration
candidates_for_expiration.sort(key=lambda s: s.timestamp_ms, reverse=True)
snapshots_to_expire = candidates_for_expiration[min_snapshots_to_keep:]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

double check that I didn't make an off-by-one error here but I believe this is a more concise way to express things :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants