
# Deterministic Time: Don’t Break Ordering and Replay in Distributed Systems

DEV Community
kanaria007

In distributed systems, what we actually want is not “the correct time.” We want two things:

- **Ordering**: don’t lose which came first.
- **Replay**: later, explain the same decision with the same grounds and procedure.

But in real systems we casually lean on `created_at` (wall clock), and then:

- NTP adjustments, VM migration, and suspend/resume make clocks jump backward or forward
- Prometheus, logs, and the warehouse don’t align to the “same 5-minute window”
- log order collapses → postmortems become “guessing + meetings”

If you have TrueTime-like “strong time,” life is easier. Most of us don’t. So don’t treat time as truth. Treat an ordering authority as truth — and pin it to logs. That’s what I mean by deterministic time.

(Target reader: engineers/SREs who have touched Kafka/Prometheus/etc. Code examples are minimal and portable.)

## Three kinds of time

Treat these as different things:

- `event_time`: when the event happened (client/device/source)
- `ingest_time`: when the system received/ingested it (server/pipeline boundary)
- `decision_time`: the authority used to decide ordering (e.g., `seq`, watermark conditions, a promotion gate) — not necessarily a wall-clock timestamp

Wall clocks (`created_at`) are mostly for display and search.

## The ordering authority (`seq`)

The strongest, simplest option is:

> Pin ordering to a monotonically increasing `seq` issued by a single authority (e.g., a DB sequence). Numbering is truth.

**`seq` is the truth of order:**

```json
{
  "seq": 10498231,
  "ingest_time": "2026-02-20T00:12:01+09:00",
  "event_time": "2026-02-20T00:11:57+09:00",
  "kind": "rollout_decision",
  "run_id": "exp-42:B:2026-02-18",
  "stage": "CANARY_25",
  "decision": {
    "verdict": "DEGRADE",
    "reason_codes": ["observation_stale:watermark_skew"]
  }
}
```

- Ordering is defined by `seq` ascending.
- `event_time` may be wrong or skewed — that’s fine; it’s reference info.
- `ingest_time` is for measuring delay.
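To see why `seq` beats `created_at` as an ordering key, here is a minimal, self-contained sketch (the event shapes are illustrative): sorting by `seq` reproduces the authority’s order even when a node’s wall clock has jumped backward.

```python
# Minimal sketch: replaying events by seq vs. by wall clock.
# Event shapes are illustrative, not a fixed schema.
events = [
    {"seq": 3, "created_at": "2026-02-20T00:12:05+09:00", "kind": "promote"},
    # A node whose clock jumped backward stamps a *later* event
    # with an *earlier* wall-clock time:
    {"seq": 4, "created_at": "2026-02-20T00:11:58+09:00", "kind": "degrade"},
    {"seq": 2, "created_at": "2026-02-20T00:12:01+09:00", "kind": "observe"},
]

by_seq = sorted(events, key=lambda e: e["seq"])
by_clock = sorted(events, key=lambda e: e["created_at"])

# seq preserves the order the authority assigned; the wall clock does not.
print([e["kind"] for e in by_seq])    # observe, promote, degrade
print([e["kind"] for e in by_clock])  # degrade, observe, promote
```

Replaying `by_clock` would conclude the rollout was degraded before it was promoted, which is exactly backwards.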
Why this works:

- node clocks can drift; the sequencer still serializes
- you can replay in the same order later
- “which time is correct?” debates disappear — `seq` is the authority
- `seq` doesn’t have to be gapless; treat it as an ordering key, not a count

Where `seq` comes from:

- PostgreSQL: `BIGSERIAL`, `GENERATED AS IDENTITY`, sequences
- MySQL: `AUTO_INCREMENT`
- Kafka: partition offset (very strong within a partition, not global)
- SQLite (not recommended as a distributed sequencer): fine for local tooling; audit-grade monotonicity has edge cases (rebuilds / reuse) unless you design for append-only

## Scaling the sequencer

A single sequencer can become:

- a throughput bottleneck
- a single point of failure (ordering stops)

The move is not “abandon determinism.” It’s: decide how global your ordering must be. Practical levels:

- **(A) Key-scoped total order**: `tenant_id` / `run_id` / `aggregate_id` each has its own ordering lane (this is enough for many operational decisions)
- **(B) Partition order**: multiple lanes (Kafka partitions) with strong lane-local order
- **(C) Final consolidation (only when needed)**: create a final ordering key downstream for audits/reports

This post focuses on:

- total order within a key
- strong order within a partition
- final ordering for replay/audit

If you require real-time external consistency across partitions (global total order + distributed commit semantics), you’re now in distributed transaction territory (2PC / consensus / Spanner-class designs). This post gives you a replay/audit foundation, not a replacement.

## Replay is a contract, not a feeling

Even if order is pinned by `seq`, ops still fails if logs are not replayable. Replay is not “we read logs and feel convinced.” It is:

> Run the same evaluator with the same inputs in the same order, and reproduce the same verdict.

If you plan to pin `input_digest` / `snapshot_digest`, you must decide how bytes become a digest. A common failure mode is “we hashed JSON,” and then:

- key order changes
- floats render differently
- `null`/empty fields drift
- pretty-print vs minified changes the bytes

…and your digests become meaningless.
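Key-scoped ordering (level A) can be sketched as a per-key counter at the ingestion boundary. This is an in-memory illustration only; a real deployment would back each lane with a durable sequence (a DB sequence or a Kafka partition per key), and the class name is mine:

```python
from collections import defaultdict
from threading import Lock

class KeyScopedSequencer:
    """Issues a monotonically increasing seq per key (ordering lane).

    In-memory sketch only: a production version persists the counters
    (e.g., one DB sequence or one Kafka partition per lane), so that
    a restart cannot reissue or regress seq values.
    """

    def __init__(self) -> None:
        self._lock = Lock()
        self._next = defaultdict(int)

    def assign(self, key: str) -> int:
        with self._lock:  # serialize per-process; the lane is the real lock
            self._next[key] += 1
            return self._next[key]

seqr = KeyScopedSequencer()
a1 = seqr.assign("run:exp-42")
b1 = seqr.assign("run:exp-43")
a2 = seqr.assign("run:exp-42")
# Each lane has its own total order; lanes are independent of each other.
print(a1, a2, b1)  # 1 2 1
```

Ordering questions then become per-lane questions, which is exactly why this is enough for many operational decisions.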
## Canonicalization: how bytes become a digest

Minimal rules that already prevent most drift:

- **Canonical JSON**: stable key ordering, no insignificant whitespace
- **Number contract**: avoid floats where you can; if you must use them, define rounding/formatting
- **Explicit null policy**: define whether missing vs `null` are equivalent (usually: they are not)
- **Stable encoding**: UTF-8 everywhere

If you want one practical baseline, use a strict JSON Canonicalization Scheme (JCS-like) and treat “digest mismatch” as a deterministic classification (`snapshot_mismatch`), not a debate.

## What to pin for replay

**A) Same input**

- `input_digest` (digest of the canonicalized input)
- `schema_version`
- `policy_version` / `policy_digest`

**B) Same order**

- `seq` (final order authority)
- HLC may be useful as provisional order, but final audit order is best pinned by `seq`

**C) Same decision**

- `evaluator_version` (git sha / build id)
- `decision_fn_digest` (e.g., hash of `{policy_digest + evaluator_version}`)
- keep nondeterministic dependencies out of the evaluator (see below)

```json
{
  "seq": 10498231,
  "kind": "rollout_decision",
  "run_id": "exp-42:B:2026-02-18",
  "stage": "CANARY_25",
  "schema_version": "v1",
  "policy_version": "2026-02-18",
  "policy_digest": "sha256:...",
  "evaluator_version": "git:3f2c9a1",
  "decision_fn_digest": "sha256:...",
  "input_digest": "sha256:...",
  "snapshot_digest": "sha256:...",
  "decision": {
    "verdict": "DEGRADE",
    "reason_codes": ["observation_stale:watermark_skew"]
  }
}
```

Two pins matter most:

- `snapshot_digest`: fingerprint of the observation snapshot (metrics/log aggregates) used to decide
- `decision_fn_digest`: proves “the rulebook didn’t change”

Store the snapshot itself immutably (object storage / warehouse) and fetch it by digest during replay.

Practical note: you don’t need full snapshots for every request. Replayability dies when snapshots disappear — or when keeping them is too expensive, so teams quietly stop.
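As a rough illustration of the canonicalization rules above, the sketch below hashes JSON with sorted keys, minimal separators, and UTF-8. This approximates, but is not, a full JCS (RFC 8785) implementation; JCS additionally fixes number serialization, which is one more reason to avoid floats in digested inputs:

```python
import hashlib
import json

def input_digest(obj) -> str:
    """Digest of a canonicalized JSON value.

    Approximation only: sorted keys + minimal separators + UTF-8.
    A full JCS (RFC 8785) implementation also pins number
    serialization; keep floats out of digested inputs if you can.
    """
    canonical = json.dumps(
        obj,
        sort_keys=True,          # stable key ordering (recursive)
        separators=(",", ":"),   # no insignificant whitespace
        ensure_ascii=False,      # UTF-8 everywhere
    ).encode("utf-8")
    return "sha256:" + hashlib.sha256(canonical).hexdigest()

# Key order and formatting no longer change the digest:
a = input_digest({"b": 1, "a": {"y": 2, "x": 3}})
b = input_digest({"a": {"x": 3, "y": 2}, "b": 1})
assert a == b
```

With this in place, a digest mismatch during replay is a signal about the input, not about who pretty-printed the JSON.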
A simple retention split keeps it realistic:

- **Hot (hours–days)**: full snapshots for active runs (fast iteration, quick incident response)
- **Cold (weeks–months)**: compressed snapshots or rollups for audits and “why did this gate fire?”
- **Long-term (months–years, selective)**: only for high-stakes domains (RML-3-ish decisions), or for representative incidents

The key is to make retention a policy decision (with costs), not an accident.

## The replay procedure

1. read events in `seq` order
2. fetch observation snapshots by `snapshot_digest`
3. run the evaluator pinned to `policy_version` + `evaluator_version`
4. verify that the verdict + reason codes match

If a mismatch occurs, classify it deterministically:

- `policy_mismatch`
- `evaluator_mismatch`
- `snapshot_mismatch`
- `nondeterminism_leak`

## Common nondeterminism leaks

- using `now()` in the decision path → use `seq` / watermark conditions as decision anchors; the wall clock is display-only
- randomness (sampling) → derive the seed from `run_id` and log it
- float rounding → contract the rounding rules (and treat `NaN` as DEGRADE)
- calling external services inside decisions → snapshot first; the evaluator reads only the snapshot via digest

Replayability is not about heroics. It’s mostly log design.

## Multi-source evaluation and watermarks

The next pain point is multi-source evaluation:

- primary metrics (logs/warehouse) and guardrails (Prometheus) don’t align
- you either miss violations or DEGRADE forever

Don’t chase perfect synchronization. Define “evaluation is valid up to here” as a watermark contract.
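A watermark contract can be sketched as a pure admissibility check over the snapshot. The field names follow the snapshot shape used in this post; the budget constants and the `observation_stale:freshness_budget` reason code are illustrative, not recommendations:

```python
# Minimal sketch of a watermark contract check.
# Budgets are illustrative; set them per pipeline.
FRESHNESS_BUDGET_S = 120
SKEW_BUDGET_S = 60

def admissible(snapshot: dict) -> tuple[str, list[str]]:
    """Return (verdict, reason_codes) for an observation snapshot.

    Missing fields are treated as infinitely stale, so an incomplete
    snapshot deterministically degrades instead of silently passing.
    """
    reasons = []
    if snapshot.get("freshness_seconds", float("inf")) > FRESHNESS_BUDGET_S:
        reasons.append("observation_stale:freshness_budget")
    if snapshot.get("watermark_skew_seconds", float("inf")) > SKEW_BUDGET_S:
        reasons.append("observation_stale:watermark_skew")
    return ("DEGRADE", reasons) if reasons else ("OK", [])

verdict, codes = admissible({"freshness_seconds": 72, "watermark_skew_seconds": 30})
print(verdict)  # OK
```

Because the check is pure (snapshot in, verdict out), it replays: the same snapshot digest always yields the same admissibility decision.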
A snapshot that carries its own watermarks:

```json
{
  "snapshot_id": "snap-2026-02-18T10:15:00+09:00",
  "collected_at": "2026-02-18T10:15:12+09:00",
  "window": "5m",
  "watermark_event_time": "2026-02-18T10:09:30+09:00",
  "sources": [
    {"name": "prometheus", "watermark_event_time": "2026-02-18T10:10:00+09:00"},
    {"name": "bigquery", "watermark_event_time": "2026-02-18T10:09:30+09:00"}
  ],
  "freshness_seconds": 72,
  "watermark_skew_seconds": 30,
  "metrics": {
    "conversion_rate": 0.131,
    "error_rate_5m": 0.013,
    "p95_latency_ms_5m": 420
  }
}
```

Example rules:

- `freshness_seconds` must stay within a freshness budget
- `watermark_skew_seconds` must stay within a skew budget
- if a budget is exceeded, the snapshot is inadmissible → DEGRADE with a reason code (e.g., `observation_stale:watermark_skew`)

## HLC: provisional order under partition

When nodes can’t reach the sequencer (network partition, offline operation), a Hybrid Logical Clock gives you a provisional order you can finalize with `seq` later:

```python
import time
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class HLCTimestamp:
    physical_ns: int
    logical: int

    def sort_key(self) -> Tuple[int, int]:
        return (self.physical_ns, self.logical)

    def to_string(self) -> str:
        # Fixed-width hex to avoid lexicographic surprises.
        return f"{self.physical_ns:016x}-{self.logical:016x}"

class HLC:
    """
    Hybrid Logical Clock (minimal).
    - now(): local tick (clamp + logical counter)
    - observe(remote): merge remote timestamp to preserve order-ish causality
    """

    def __init__(self) -> None:
        self._p = 0
        self._l = 0

    def now(self) -> HLCTimestamp:
        pt = time.time_ns()  # wall clock; may jump
        if pt > self._p:
            self._p = pt
            self._l = 0
        else:
            self._l += 1
        return HLCTimestamp(self._p, self._l)

    def observe(self, remote: HLCTimestamp) -> HLCTimestamp:
        pt = time.time_ns()
        rp, rl = remote.physical_ns, remote.logical
        p = max(pt, self._p, rp)
        if p == self._p == rp:
            l = max(self._l, rl) + 1
        elif p == self._p:
            l = self._l + 1
        elif p == rp:
            l = rl + 1
        else:
            l = 0
        self._p, self._l = p, l
        return HLCTimestamp(self._p, self._l)
```

Caveats for this minimal implementation:

- it stores state in memory; a reboot can regress → persist the last HLC, or sync on startup by observing peers/aggregator
- a clock-broken node can inject “far future” timestamps → enforce `max_clock_skew` and quarantine remote timestamps beyond it (log them with reason codes like `clock_anomaly:remote_future_exceeded`)
- compare HLC as `(physical_ns, logical)` (tuple order)
- if you need a stable total order, add a tie-breaker (`node_id`) so the sort key is `(p, l, node_id)`
- use HLC as provisional order, then assign the final `seq` at ingestion if possible

Practical pattern:
- Local: HLC for provisional order
- Final: `seq` for audit/replay order

You don’t need Spanner for this level of determinism.

## Long-running operations: state machines, not loops

Even with `seq` / watermarks / HLC, long-running drivers (A/B rollouts, canaries, jobs) die if they are written as a single loop. Processes restart. Deploys happen. Workers crash. So treat long-running operations as state transitions and persist the state.

Split responsibilities:

- **Evaluator (pure-ish)**: `snapshot + policy -> decision`. Same input → same output (replayable).
- **Orchestrator (state machine)**: persists `run_id` state + resume conditions.

This solves:

- re-run: state retains stage + digest pins
- duplicates: enforce `run_id` + `stage` uniqueness / idempotency
- interrupted progress: write-ahead transitions
- resume: DEGRADE becomes a re-enterable hold (wait for snapshot/approval/time)

```json
{
  "run_id": "exp-42:B:2026-02-18",
  "stage": "CANARY_25",
  "policy_version": "2026-02-18",
  "last_snapshot_digest": "sha256:...",
  "last_decision": {
    "verdict": "DEGRADE",
    "reason_codes": ["observation_stale:watermark_skew"],
    "missing": ["p95_latency_ms_5m"]
  },
  "resume": {
    "resume_token": "opaque:rsn_...",
    "requested_actions": [
      {"name": "collect_metric", "params": {"metric": "p95_latency_ms_5m"}},
      {"name": "rerun_gate_check", "params": {"stage": "CANARY_25"}}
    ],
    "next_check_at": "2026-02-18T10:45:00+09:00"
  }
}
```

`next_check_at` is operational scheduling time, not ordering authority.

## Dual writes: orchestrator state vs the seq log

If your orchestrator state and your `seq` event log are in different stores, you get classic dual-write failure modes:

- state updated but event not logged → “it happened” without replay proof
- event logged but state not updated → duplicates and confusion

Two standard fixes:

- **A) Event sourcing**: write the state-transition event into the `seq` log as the source of truth; derive orchestrator state as a projection.
- **B) Outbox**: write the state update + a “planned event” in the same DB transaction; forward it via the outbox to the `seq` log.
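The outbox idea (option B) can be sketched with SQLite standing in for the orchestrator’s database; the schema and names are illustrative. The point is that the state update and the planned event commit in one transaction, so neither dual-write failure mode can occur:

```python
import sqlite3

# Illustrative schema: orchestrator state and outbox in the same DB,
# so a single transaction covers both writes (no dual-write gap).
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orchestrator_runs (
        run_id TEXT PRIMARY KEY,
        stage  TEXT NOT NULL
    );
    CREATE TABLE outbox (
        event_id TEXT PRIMARY KEY,  -- idempotency key for the receiver
        run_id   TEXT NOT NULL,
        payload  TEXT NOT NULL,
        sent     INTEGER NOT NULL DEFAULT 0
    );
    INSERT INTO orchestrator_runs VALUES ('exp-42:B:2026-02-18', 'CANARY_25');
""")

def transition(run_id: str, from_stage: str, to_stage: str, event_id: str) -> bool:
    """CAS-style stage transition + planned event, committed atomically."""
    with db:  # one transaction: both rows commit together, or neither
        cur = db.execute(
            "UPDATE orchestrator_runs SET stage = ? WHERE run_id = ? AND stage = ?",
            (to_stage, run_id, from_stage),
        )
        if cur.rowcount != 1:
            return False  # stale transition; nothing was changed
        db.execute(
            "INSERT INTO outbox (event_id, run_id, payload) VALUES (?, ?, ?)",
            (event_id, run_id, f'{{"stage":"{to_stage}"}}'),
        )
    return True

ok = transition("exp-42:B:2026-02-18", "CANARY_25", "CANARY_50", "evt-001")
dup = transition("exp-42:B:2026-02-18", "CANARY_25", "CANARY_50", "evt-001")
print(ok, dup)  # True False
```

The duplicate call fails the CAS (the stage already moved), so no second outbox row is planned; a separate forwarder then drains unsent rows into the `seq` log.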
If you use the outbox, receiver-side idempotency is mandatory:

- define an `event_id` (idempotency key)
- enforce `UNIQUE(event_id)` on the receiving log
- `INSERT ... ON CONFLICT DO NOTHING`

And keep orchestrator transitions CAS-style:

```sql
UPDATE orchestrator_runs
SET stage = :to_stage,
    last_event_id = :event_id,
    updated_at = now()
WHERE run_id = :run_id
  AND stage = :from_stage;
```

If downstream uses the log frontier as a watermark, outbox lag can stall the watermark → more DEGRADE. So treat outbox transfer lag as a first-class SLO signal:

- `outbox_oldest_age_seconds`
- `outbox_backlog_count`
- `outbox_to_seq_p95_seconds`

Reason codes: `pipeline_stale:outbox_transfer_lag`

Policy example (conceptual):

```json
{
  "pipeline_policy": {
    "max_outbox_oldest_age_seconds": 30,
    "on_over_budget": "DEGRADE"
  }
}
```

Also consider catch-up safety:

- rate-limit catch-up
- optionally cap the watermark advance rate (a catch-up budget)
- emit a reason code on any budget exceedance, for deterministic behavior

## Recap

- wall clocks drift and lie; don’t use them as truth
- define truth as an ordering authority:
  - `seq` (single-writer) as the baseline
  - watermark contracts to decide “evaluation admissibility” across sources
  - HLC for provisional ordering under partition/offline, then finalize with `seq` if possible
- make replay real with:
  - digests (input/snapshot/policy)
  - evaluator version pins
  - append-only logs + correction events
- make long-running “time” operable with:
  - state machines (orchestrator)
  - idempotency + outbox/event sourcing
  - SLOs + reason codes

Time in distributed systems is not physics; it’s agreement and contracts. TrueTime is powerful, but you don’t need it to stop your ops from turning into guesswork. You need a pinned authority for ordering — and the discipline to treat “stale/missing” as first-class (DEGRADE), measured by SLOs.