Skip to content

Compaction Strategies

Compaction is the physical cleanup phase of Papyra persistence. Where retention defines what is no longer valid, compaction defines when and how invalid data is physically removed.

This document explains compaction deeply: why it exists, how it works across backends, and how to operate it safely in production.


1. Retention vs Compaction (Critical Distinction)

Papyra intentionally separates logical retention from physical compaction.

Concept What it does When it runs
Retention Decides which records are expired During reads / scans
Compaction Rewrites storage to remove expired data Explicit operation

Important

Retention alone does not shrink disk usage. Only compaction does.

This separation guarantees:

  • Deterministic reads
  • Crash-safe storage
  • No hidden I/O work during normal operation

2. Why Compaction Is Explicit

Many systems compact automatically. Papyra does not, by design.

Reasons:

  • Compaction is I/O heavy
  • Compaction may lock files or streams
  • Operators must control when it happens
  • Predictability > convenience

Instead, Papyra exposes compaction via:

  • CLI
  • API
  • Operator scheduling (cron, systemd, Kubernetes jobs)

3. Backend-Specific Compaction Behavior

3.1 JSON File Persistence

Mechanism

  • Reads the source .ndjson
  • Applies retention rules
  • Writes a new compacted file
  • Atomically replaces the original

Guarantees

  • Crash-safe (temp file + rename)
  • No partial writes
  • Original file preserved until success

Disk Impact

  • Temporary double disk usage during compaction

Typical Use

papyra persistence compact

3.2 Rotating File Persistence

Mechanism

  • Iterates rotated segments
  • Drops fully expired segments
  • Rewrites partially expired segments
  • Renames safely

Advantages

  • Faster compaction than monolithic files
  • Natural disk locality
  • Easier recovery

Best Practice

  • Pair with size-based rotation
  • Compact during off-peak hours

3.3 Redis Streams

Redis does not support true compaction in the filesystem sense.

Papyra uses:

  • XTRIM with retention-derived bounds
  • Optional approximate trimming (~)

Trade-offs

Mode Behavior
Exact Strong guarantees, slower
Approximate Faster, slightly lossy

Compaction here means:

“Advance the stream head to forget expired history”


3.4 In-Memory Persistence

  • No-op
  • Python garbage collection handles cleanup
  • Compaction exists for API symmetry only

4. Compaction Triggers

Papyra never auto-compacts.

You trigger compaction via:

CLI

papyra persistence compact

API

await system.persistence.compact()

Automation

  • Cron
  • systemd timer
  • Kubernetes Job / CronJob

5. Safe Compaction Windows

Recommended times:

  • Low traffic periods
  • After large retention drops
  • Before backups
  • After incident recovery

Avoid:

  • During heavy write bursts
  • During recovery operations
  • While disk is near full capacity

6. Observability During Compaction

If metrics are enabled, compaction emits:

  • Records scanned
  • Records dropped
  • Files rewritten
  • Errors encountered

CLI example:

papyra metrics persistence

7. Failure Modes & Guarantees

Crash During Compaction

✔ Original data preserved ✔ Temporary files discarded ✔ Safe to retry

Disk Full

  • Compaction aborts
  • No data loss
  • Operator must free space

Partial Backend Support

  • Backends may return None
  • CLI reports best-effort completion

8. Compaction vs Recovery

Operation Purpose
Recovery Fix corruption
Compaction Reduce size

Never confuse the two.

Recovery may rewrite, but its goal is correctness — not size.


9. Real-World Compaction Strategies

High-Volume Event Systems

  • Daily retention
  • Weekly compaction

Compliance Systems

  • Long retention
  • Monthly compaction + archive

Embedded / Edge Systems

  • Aggressive retention
  • Frequent compaction

10. Design Philosophy Recap

Papyra compaction is:

  • Explicit
  • Predictable
  • Crash-safe
  • Backend-aware

You control when storage is rewritten — not the framework.