Health Checks & Doctor CLI

Papyra treats persistence health as a first‑class operational concern, not an afterthought. Corruption, partial writes, truncated files, or incompatible formats must be detected before an actor system starts processing messages.

This document explains:

  • What “health” means in Papyra
  • How scans, recovery, and startup checks work
  • How to use the doctor CLI safely in real systems
  • When to automate vs when to intervene manually

What Is Persistence Health?

Persistence health answers one simple question:

Can this persistence backend be trusted to load and append data safely?

A backend is considered healthy if:

  • All persisted records are structurally valid
  • No truncated or malformed entries exist
  • The backend can guarantee forward‑only appends
  • Retention and compaction rules can be applied safely

Health checks are read‑only by default.


The Scan Phase

A scan inspects persistence storage without modifying it.

What a Scan Detects

Depending on backend type, a scan may detect:

  • Truncated JSON lines
  • Invalid JSON payloads
  • Missing required fields
  • Corrupted Redis stream entries
  • Inconsistent metadata

What a Scan Never Does

  • It does not delete data
  • It does not repair corruption
  • It does not rewrite files

Scans are safe to run at any time.
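
As an illustration of what a read‑only scan involves, here is a minimal sketch for a JSON‑lines file. It is not Papyra's implementation: the `Anomaly` type, the `scan_jsonl` function, and the required field names are hypothetical, and it assumes every complete record ends with a newline.

```python
import json
from dataclasses import dataclass
from pathlib import Path

# Hypothetical field names; the real record schema is not specified here.
REQUIRED_FIELDS = ("kind", "timestamp")


@dataclass(frozen=True)
class Anomaly:
    line_number: int
    reason: str


def scan_jsonl(path: Path) -> list[Anomaly]:
    """Inspect a JSON-lines file without modifying it and list problems."""
    anomalies: list[Anomaly] = []
    with path.open("rb") as handle:  # opened for reading only
        for number, raw in enumerate(handle, start=1):
            if not raw.endswith(b"\n"):
                # Assumption: a line without a trailing newline is a partial write.
                anomalies.append(Anomaly(number, "truncated line"))
                continue
            try:
                record = json.loads(raw)
            except ValueError:  # covers JSON and encoding errors
                anomalies.append(Anomaly(number, "invalid JSON"))
                continue
            if not isinstance(record, dict):
                anomalies.append(Anomaly(number, "record is not an object"))
                continue
            missing = [name for name in REQUIRED_FIELDS if name not in record]
            if missing:
                anomalies.append(
                    Anomaly(number, "missing fields: " + ", ".join(missing))
                )
    return anomalies
```

The function only reads and reports. Deciding what to do with the anomalies is left to the caller, which is the same separation the scan phase keeps from recovery.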


The Doctor CLI

The doctor command is a standalone pre‑flight tool. It runs the same health logic used during system startup, but with explicit CLI control.

papyra doctor run

By default, Doctor runs in FAIL_ON_ANOMALY mode.


Doctor Modes

IGNORE

papyra doctor run --mode ignore
  • Scans persistence
  • Reports anomalies
  • Always exits with code 0

Use cases

  • Diagnostics
  • Monitoring
  • Non‑blocking CI checks

FAIL_ON_ANOMALY (default)

papyra doctor run --mode fail_on_anomaly
  • Scans persistence
  • If anomalies exist → exits immediately with a non‑zero status (code 1)

Use cases

  • Production startup gates
  • Kubernetes initContainers
  • CI/CD deployment checks

This mode prevents unsafe startup.


RECOVER

papyra doctor run --mode recover --recovery-mode repair
  • Scans persistence
  • Attempts recovery
  • Re‑scans after recovery
  • Fails if anomalies remain

Recovery is explicit — nothing is repaired unless you ask.


Recovery Modes

REPAIR

--recovery-mode repair
  • Removes corrupted records in place
  • Preserves valid data
  • May rewrite files or trim streams

Use this mode when the corrupted records are acceptable to discard.


QUARANTINE

--recovery-mode quarantine --quarantine-dir ./quarantine
  • Moves corrupted records aside
  • Preserves original data for inspection
  • Safest option for production incidents

If --quarantine-dir is missing, Doctor fails immediately.
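
The idea behind quarantine can be sketched for a JSON‑lines file as follows. This is an illustrative assumption about how such a step might work, not Papyra's code: `quarantine_jsonl` and the `.quarantined` and `.tmp` file suffixes are hypothetical names.

```python
import json
import os
from pathlib import Path


def quarantine_jsonl(source: Path, quarantine_dir: Path) -> int:
    """Move unparseable lines out of `source`; return how many were moved."""
    kept: list[bytes] = []
    bad: list[bytes] = []
    with source.open("rb") as handle:
        for raw in handle:
            try:
                json.loads(raw)
                complete = raw.endswith(b"\n")  # no newline means a partial write
            except ValueError:
                complete = False
            (kept if complete else bad).append(raw)
    if not bad:
        return 0  # nothing to move, leave the file untouched

    # Save the corrupted bytes first so they are never lost.
    quarantine_dir.mkdir(parents=True, exist_ok=True)
    target = quarantine_dir / (source.name + ".quarantined")
    with target.open("ab") as out:  # append, so earlier incidents are kept
        out.write(b"".join(bad))

    # Then swap in the cleaned file in a single rename.
    tmp = source.with_name(source.name + ".tmp")
    tmp.write_bytes(b"".join(kept))
    os.replace(tmp, source)
    return len(bad)
```

Writing the quarantined bytes before replacing the source file means a crash between the two steps leaves the corrupted data duplicated rather than lost.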


Exit Codes

Doctor uses meaningful exit codes for automation:

Code             Meaning
0                Healthy, or recovery successful (IGNORE mode always exits 0)
1                Anomalies detected (FAIL_ON_ANOMALY)
2                Recovery attempted but anomalies remain
other non‑zero   Invalid configuration
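
For automation written in Python, a thin wrapper around the CLI might look like the sketch below. The `describe_exit` and `run_doctor` helpers are hypothetical, and treating every code other than 0, 1, and 2 as a configuration problem is an assumption of this example.

```python
import subprocess

# Meanings taken from the table above. Any other code is treated as a
# configuration or usage error, which is an assumption of this sketch.
EXIT_MEANINGS = {
    0: "healthy or recovery successful",
    1: "anomalies detected",
    2: "recovery attempted but anomalies remain",
}


def describe_exit(code: int) -> str:
    """Translate a Doctor exit code into a short human-readable message."""
    return EXIT_MEANINGS.get(code, "invalid configuration or other failure")


def run_doctor(args: list[str]) -> tuple[int, str]:
    """Run `papyra doctor run` with extra arguments and describe the outcome."""
    completed = subprocess.run(["papyra", "doctor", "run", *args], check=False)
    return completed.returncode, describe_exit(completed.returncode)
```

A deployment script could call `run_doctor(["--mode", "fail_on_anomaly"])` and refuse to continue unless the returned code is 0.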

Relationship to Startup Checks

The Doctor CLI mirrors the internal startup logic used by ActorSystem.

Internally, Papyra runs:

  • scan()
  • Optional recover()
  • Verification scan

Doctor allows you to run the same logic manually, before starting actors.
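
The scan, recover, and verify sequence, combined with the Doctor modes and exit codes described earlier, can be sketched as a small decision function. This is a simplified model under stated assumptions, not the ActorSystem internals: `Mode`, `startup_check`, and the callable‑based `scan` and `recover` parameters are hypothetical.

```python
from enum import Enum
from typing import Callable, Optional, Sequence


class Mode(Enum):
    IGNORE = "ignore"
    FAIL_ON_ANOMALY = "fail_on_anomaly"
    RECOVER = "recover"


def startup_check(
    mode: Mode,
    scan: Callable[[], Sequence[str]],
    recover: Optional[Callable[[], None]] = None,
) -> int:
    """Return an exit code following the scan / recover / verify sequence."""
    anomalies = scan()
    if not anomalies or mode is Mode.IGNORE:
        return 0  # healthy, or anomalies are reported but not enforced
    if mode is Mode.FAIL_ON_ANOMALY:
        return 1  # block startup, change nothing
    if recover is None:
        raise ValueError("RECOVER mode needs a recover callable")
    recover()
    # Verification scan: recovery only counts if the second scan is clean.
    return 2 if scan() else 0
```

Note that recovery is reached only in RECOVER mode, which matches the rule that nothing is repaired unless it is requested explicitly.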


When to Use Doctor

  • Before deploying new versions
  • Before migrating persistence formats
  • As a Kubernetes initContainer
  • After crashes or power loss
  • Before enabling retention or compaction

Not Required

  • For in‑memory persistence
  • For test environments (unless debugging corruption)

Example: Safe Production Startup

papyra doctor run --mode fail_on_anomaly && \
papyra persistence compact && \
papyra start

This guarantees:

  • No corrupted data is loaded
  • Storage is compacted
  • Actors only start on trusted data


Design Philosophy

Doctor exists because silent corruption is worse than downtime.

Papyra always chooses:

  • Explicit failure over silent recovery
  • Human‑visible output over magic
  • Deterministic exits over best‑effort guesses

If Doctor fails, it is telling you something important.

Listen to it.