Health Checks & Doctor CLI¶
Papyra treats persistence health as a first‑class operational concern, not an afterthought. Corruption, partial writes, truncated files, or incompatible formats must be detected before an actor system starts processing messages.
This document explains:
- What “health” means in Papyra
- How scans, recovery, and startup checks work
- How to use the doctor CLI safely in real systems
- When to automate vs when to intervene manually
What Is Persistence Health?¶
Persistence health answers one simple question:
Can this persistence backend be trusted to load and append data safely?
A backend is considered healthy if:
- All persisted records are structurally valid
- No truncated or malformed entries exist
- The backend can guarantee forward‑only appends
- Retention and compaction rules can be applied safely
Health checks are read‑only by default.
The Scan Phase¶
A scan inspects persistence storage without modifying it.
What a Scan Detects¶
Depending on backend type, a scan may detect:
- Truncated JSON lines
- Invalid JSON payloads
- Missing required fields
- Corrupted Redis stream entries
- Inconsistent metadata
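As an illustration of the first three checks, here is a minimal sketch of a read-only pass over a JSON-lines payload. It is not Papyra's scanner: `Anomaly`, `scan_lines`, `scan_file`, and the `REQUIRED_FIELDS` names are hypothetical, and the rule that an unparseable final line without a trailing newline counts as truncated is an assumption made for this example.

```python
import json
from dataclasses import dataclass
from pathlib import Path

# Hypothetical required fields; the real schema depends on the backend.
REQUIRED_FIELDS = ("kind", "timestamp")


@dataclass(frozen=True)
class Anomaly:
    line_no: int
    reason: str


def scan_lines(data: bytes) -> list:
    """Classify every line of a JSON-lines payload without changing it."""
    lines = data.split(b"\n")
    terminated = lines[-1] == b""  # payload ended with a newline
    if terminated:
        lines.pop()
    anomalies = []
    for line_no, raw in enumerate(lines, start=1):
        unterminated_tail = line_no == len(lines) and not terminated
        try:
            record = json.loads(raw)
        except ValueError:  # covers JSONDecodeError and undecodable bytes
            reason = "truncated line" if unterminated_tail else "invalid JSON"
            anomalies.append(Anomaly(line_no, reason))
            continue
        missing = [name for name in REQUIRED_FIELDS
                   if not isinstance(record, dict) or name not in record]
        if missing:
            anomalies.append(Anomaly(line_no, "missing fields: " + ", ".join(missing)))
    return anomalies


def scan_file(path: Path) -> list:
    """Open the file read-only, so the scan cannot delete or rewrite data."""
    with open(path, "rb") as handle:
        return scan_lines(handle.read())
```

The sketch only collects findings; deciding what to do with them is left to the caller, which matches the split between scanning and recovery described in this document.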
What a Scan Never Does¶
- It does not delete data
- It does not repair corruption
- It does not rewrite files
Scans are safe to run at any time.
The Doctor CLI¶
The doctor command is a standalone pre‑flight tool.
It runs the same health logic used during system startup, but with explicit CLI control.
papyra doctor run
By default, Doctor runs in FAIL_ON_ANOMALY mode.
Doctor Modes¶
IGNORE¶
papyra doctor run --mode ignore
- Scans persistence
- Reports anomalies
- Always exits with code 0
Use cases
- Diagnostics
- Monitoring
- Non‑blocking CI checks
FAIL_ON_ANOMALY (default)¶
papyra doctor run --mode fail_on_anomaly
- Scans persistence
- If anomalies exist → exits immediately with non‑zero status
Use cases
- Production startup gates
- Kubernetes initContainers
- CI/CD deployment checks
This mode prevents unsafe startup.
RECOVER¶
papyra doctor run --mode recover --recovery-mode repair
- Scans persistence
- Attempts recovery
- Re‑scans after recovery
- Fails if anomalies remain
Recovery is explicit — nothing is repaired unless you ask.
Recovery Modes¶
REPAIR¶
--recovery-mode repair
- Removes corrupted records in place
- Preserves valid data
- May rewrite files or trim streams
Use this mode when the corrupted records can safely be discarded.
QUARANTINE¶
--recovery-mode quarantine --quarantine-dir ./quarantine
- Moves corrupted records aside
- Preserves original data for inspection
- Safest option for production incidents
If --quarantine-dir is missing, Doctor fails immediately.
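When Doctor is launched from a script, it can help to validate the flag combination before invoking it. The sketch below assembles the argument list using only the flags shown in this document. The helper name `doctor_argv` is hypothetical, and rejecting a quarantine request without a directory up front is a convenience of this sketch that mirrors Doctor's own immediate failure.

```python
from typing import List, Optional

MODES = ("ignore", "fail_on_anomaly", "recover")
RECOVERY_MODES = ("repair", "quarantine")


def doctor_argv(mode: str = "fail_on_anomaly",
                recovery_mode: Optional[str] = None,
                quarantine_dir: Optional[str] = None) -> List[str]:
    """Build the `papyra doctor run` argument list (hypothetical helper)."""
    if mode not in MODES:
        raise ValueError("unknown mode: " + mode)
    argv = ["papyra", "doctor", "run", "--mode", mode]
    if recovery_mode is not None:
        if recovery_mode not in RECOVERY_MODES:
            raise ValueError("unknown recovery mode: " + recovery_mode)
        argv += ["--recovery-mode", recovery_mode]
        if recovery_mode == "quarantine":
            # Doctor fails immediately without a quarantine directory,
            # so refuse to build the command at all.
            if not quarantine_dir:
                raise ValueError("quarantine requires --quarantine-dir")
            argv += ["--quarantine-dir", quarantine_dir]
    return argv
```

For example, `doctor_argv("recover", "quarantine", "./quarantine")` yields the quarantine invocation shown above as a list that can be passed to `subprocess.run`.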
Exit Codes¶
Doctor uses meaningful exit codes for automation:
| Code | Meaning |
|---|---|
| 0 | Healthy or recovery successful |
| 1 | Anomalies detected (FAIL_ON_ANOMALY) |
| 2 | Recovery attempted but anomalies remain |
| non‑numeric | Invalid configuration |
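For automation, the numeric codes in the table can be mapped to messages in a small wrapper. This is a sketch under assumptions: `describe_exit` and `run_doctor` are hypothetical names, and treating every status outside 0, 1, and 2 as a configuration or invocation problem is this example's interpretation of the last table row, not documented behavior.

```python
import subprocess
from typing import List, Tuple

EXIT_MEANINGS = {
    0: "healthy or recovery successful",
    1: "anomalies detected (FAIL_ON_ANOMALY)",
    2: "recovery attempted but anomalies remain",
}


def describe_exit(code: int) -> str:
    """Translate a Doctor exit status into a human-readable message."""
    # Assumption: anything outside the table's numeric codes is reported
    # as a configuration or invocation problem.
    return EXIT_MEANINGS.get(code, "unexpected status; check the configuration")


def run_doctor(argv: List[str]) -> Tuple[int, str]:
    """Run Doctor and return its exit status with a description."""
    completed = subprocess.run(argv, check=False)
    return completed.returncode, describe_exit(completed.returncode)
```

A caller would typically log the description and then exit with the same status, so that an orchestrator sees Doctor's original code.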
Relationship to Startup Checks¶
The Doctor CLI mirrors the internal startup logic used by ActorSystem.
Internally, Papyra runs:
- scan()
- Optional recover()
- Verification scan
Doctor allows you to run the same logic manually, before starting actors.
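The scan, optional recovery, and verification scan sequence can be sketched as a small function. This is an illustration only: `startup_check` and its callable parameters are hypothetical stand-ins, not the `ActorSystem` API, and the return values simply reuse the numeric codes from the Exit Codes table.

```python
from typing import Callable, Optional, Sequence


def startup_check(scan: Callable[[], Sequence[str]],
                  recover: Optional[Callable[[], None]] = None,
                  fail_on_anomaly: bool = True) -> int:
    """Return 0, 1, or 2 following the exit-code table."""
    anomalies = scan()
    if not anomalies:
        return 0  # healthy, nothing else to do
    if recover is None:
        # No recovery requested: either block startup or only report.
        return 1 if fail_on_anomaly else 0
    recover()
    remaining = scan()  # verification scan after recovery
    return 2 if remaining else 0
```

Passing no `recover` callable with `fail_on_anomaly=False` corresponds to the IGNORE mode, the default arguments correspond to FAIL_ON_ANOMALY, and supplying `recover` corresponds to RECOVER.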
When to Use Doctor¶
Recommended¶
- Before deploying new versions
- Before migrating persistence formats
- As a Kubernetes initContainer
- After crashes or power loss
- Before enabling retention or compaction
Not Required¶
- For in‑memory persistence
- For test environments (unless debugging corruption)
Example: Safe Production Startup¶
papyra doctor run --mode fail_on_anomaly
papyra persistence compact
papyra start
This guarantees:
- No corrupted data is loaded
- Storage is compacted
- Actors only start on trusted data
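These guarantees hold only if a failing step aborts the sequence, for example by chaining the commands with `&&` or running them under `set -e` in a shell. The same gate can be sketched in Python; `gated_startup` and its injectable `run` parameter are hypothetical names for this example, while the three commands are the ones listed above.

```python
import subprocess
from typing import Callable, Sequence, Tuple

STEPS: Tuple[Tuple[str, ...], ...] = (
    ("papyra", "doctor", "run", "--mode", "fail_on_anomaly"),
    ("papyra", "persistence", "compact"),
    ("papyra", "start"),
)


def _run(argv: Sequence[str]) -> int:
    """Run one command and return its exit status."""
    return subprocess.run(list(argv), check=False).returncode


def gated_startup(run: Callable[[Sequence[str]], int] = _run,
                  steps: Sequence[Sequence[str]] = STEPS) -> int:
    """Run the steps in order and stop at the first non-zero exit status."""
    for argv in steps:
        code = run(argv)
        if code != 0:
            return code  # later steps never run on untrusted data
    return 0
```

Making the runner injectable keeps the gate testable without a Papyra installation: a fake `run` can simulate a failing Doctor step and confirm that compaction and startup are skipped.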
Design Philosophy¶
Doctor exists because silent corruption is worse than downtime.
Papyra always chooses:
- Explicit failure over silent recovery
- Human‑visible output over magic
- Deterministic exits over best‑effort guesses
If Doctor fails, it is telling you something important.
Listen to it.