Fallback Routing & High Availability

An IoT platform does not usually lose data because InfluxDB is down. It loses data because the primary endpoint became slow — a rolling restart, a saturated load balancer, a warm bucket under compaction pressure — and the ingestion service kept firing synchronous writes into a socket that would eventually time out, filling its thread pool and shedding telemetry it had already accepted from the edge. Fallback routing is the discipline that turns that failure from a data-loss event into a routing decision: detect degradation deterministically, divert writes to a healthy secondary target that respects the same retention semantics, buffer anything that still cannot land, and replay the buffer in temporal order once the primary recovers. This page sits under InfluxDB Data Lifecycle & Architecture Fundamentals and covers how to build that resilient write path end to end — the circuit breaker, the tier-aware fallback target, the durable queue, and the reconciliation task that merges the two paths back together without duplicating a single point.

The failure scenario this solves

A regional gateway ingests 25,000 points per second from a fleet of industrial sensors and writes them synchronously to a single InfluxDB endpoint. At 02:14 the endpoint enters a maintenance window: writes do not fail outright, they hang. The client’s default socket timeout is 30 seconds, so every ingestion worker blocks for half a minute before raising. Within ninety seconds the worker pool is exhausted, the gateway’s inbound MQTT buffer overflows, and back-pressure propagates to the edge, where devices with no local storage simply drop readings on the floor. When the endpoint returns at 02:31, the platform has a seventeen-minute hole in its telemetry that no downstream downsampling task can reconstruct, because the raw points never existed anywhere.

Three separate mistakes compound here. The first is treating the primary endpoint as an absolute dependency rather than a probabilistic target — there is no path for a write to go anywhere else. The second is coupling write acknowledgement to synchronous success, so a slow primary directly throttles ingestion instead of being routed around. The third is the absence of any durable buffer, so a write that cannot land immediately is lost rather than deferred. The remainder of this page fixes all three: a circuit breaker that trips on latency as well as errors, a fallback endpoint that inherits the source tier’s retention contract, a crash-safe dead-letter buffer for the case where both endpoints are unreachable, and a scheduled task that backfills the buffer without violating temporal ordering.

Prerequisites

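InfluxDB 2.7+ reachable at a primary and at least one secondary endpoint, with the native task engine enabled. InfluxDB 3.x Cloud Dedicated can receive the same writes, but it has no Flux task engine, so the reconciliation task in step 5 would need an external scheduler there.
Flux 0.x query language (bundled with InfluxDB 2.x) for the reconciliation task.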
Python 3.9+ with influxdb-client 1.36+ and tenacity 8.x for the routing layer shown below.
A fallback bucket per hot tier that mirrors the primary’s retention window and shard-group duration — provision these using bucket architecture & tiering boundaries.
Separate write tokens for the primary and secondary endpoints, each scoped to write on only the buckets it serves, following data ingestion security frameworks.
A durable local store (SQLite with WAL, Redis, or a Kafka topic) available to the ingestion host for dead-letter buffering.

Core concept: circuit breakers and tier-preserving routing

The control structure at the heart of a resilient write path is the circuit breaker, a three-state machine that sits between the ingestion worker and the network. In the CLOSED state, writes flow to the primary and the breaker samples latency and error rate over a sliding window. When failures cross a threshold it flips to OPEN: every subsequent write short-circuits straight to the fallback path without even attempting the primary, so a dead endpoint stops consuming connection-pool slots and timeout budget. After a cooldown the breaker moves to HALF-OPEN, admits a single trial write, and either closes again on success or reopens on failure.

The crucial detail that distinguishes a correct fallback from a naive one is tier preservation. Telemetry in a time-series platform lives in buckets whose retention window and shard sizing are chosen to match the data’s temperature — the boundaries defined in bucket architecture & tiering boundaries. If the primary hot bucket carries a 7-day retention and the breaker diverts its 1-second telemetry into a warm bucket with a 90-day window, the fallback has silently rewritten the data’s lifecycle: points that should expire in a week now survive for months, inflating storage and violating the expiry contract. A fallback target must therefore be a tier-equivalent bucket — same retention window, same shard-group duration — so a diverted write inherits identical expiration semantics to the write it replaced. The retention side of that contract is the subject of retention policy design; fallback routing simply must not break it.

Latency, not just errors, must trip the breaker. A common estimator for the failure signal is an exponentially weighted error rate over a window: for the \(n\)-th observation with failure indicator \(x_n \in \{0,1\}\),

\[ e_n = \alpha\, x_n + (1-\alpha)\, e_{n-1} \]

with a smoothing factor \(\alpha\) around \(0.2\). The breaker opens when \(e_n\) exceeds a configured threshold, which reacts to a burst of slow-but-not-yet-failed writes far sooner than a raw “N consecutive errors” counter would.
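
To make the recurrence concrete, the sketch below counts how many consecutive failed observations it takes for the estimate to reach a threshold. The function name `failures_to_trip` and the starting value of zero are assumptions made for illustration; the comparison uses "reaches or exceeds", matching the breaker code later on this page.

```python
def failures_to_trip(alpha: float, threshold: float, start: float = 0.0) -> int:
    """Number of consecutive failures (x_n = 1) before the EWMA reaches threshold."""
    if not 0.0 < alpha <= 1.0 or not 0.0 < threshold < 1.0:
        raise ValueError("alpha must be in (0, 1] and threshold in (0, 1)")
    rate, n = start, 0
    while rate < threshold:
        rate = alpha * 1.0 + (1.0 - alpha) * rate   # e_n = alpha*x_n + (1-alpha)*e_{n-1}
        n += 1
    return n


if __name__ == "__main__":
    # e_1 = 0.2, e_2 = 0.36, e_3 = 0.488, e_4 = 0.5904 -> trips on the 4th failure
    print(failures_to_trip(alpha=0.2, threshold=0.5))   # 4
    print(failures_to_trip(alpha=0.5, threshold=0.5))   # 1
```

With the defaults used on this page, four slow or failed writes in a row are enough to open the breaker; raising \(\alpha\) shortens that run at the cost of more sensitivity to noise.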

Step-by-step implementation

1. Provision a tier-equivalent fallback bucket

Before any code routes a write, the destination must exist with matching retention semantics. Create the fallback bucket on the secondary endpoint with the same retention window and shard-group duration as the primary hot bucket it stands in for.

bash

# Primary hot tier (for reference): 7-day retention, 1-day shards.
# Fallback MUST mirror both so diverted writes keep identical expiry.
influx bucket create \
  --name sensor_1s_hot_fallback \
  --retention 7d \
  --shard-group-duration 1d \
  --host "$INFLUX_SECONDARY_HOST" \
  --org "$INFLUX_ORG"

The load-bearing parameters are --retention and --shard-group-duration: if either differs from the primary, the fallback silently changes the data’s lifecycle. Creating the bucket requires a token that is allowed to manage buckets in the org; the token the router later writes with should be scoped to write on this bucket alone.
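
The tier-equivalence rule can also be checked mechanically. The following is a minimal sketch, assuming the retention window and shard-group duration of each bucket have already been read (for example from the bucket metadata the API returns) and converted to seconds; `TierSpec` and `tier_mismatches` are hypothetical helper names, not part of any InfluxDB client.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TierSpec:
    """Lifecycle settings of one bucket (hypothetical helper type)."""
    name: str
    retention_s: int      # retention window, in seconds
    shard_group_s: int    # shard-group duration, in seconds


def tier_mismatches(primary: TierSpec, fallback: TierSpec) -> List[str]:
    """Return a description of every lifecycle setting that differs."""
    problems = []
    if primary.retention_s != fallback.retention_s:
        problems.append(
            f"retention: {primary.name}={primary.retention_s}s, "
            f"{fallback.name}={fallback.retention_s}s"
        )
    if primary.shard_group_s != fallback.shard_group_s:
        problems.append(
            f"shard-group duration: {primary.name}={primary.shard_group_s}s, "
            f"{fallback.name}={fallback.shard_group_s}s"
        )
    return problems


DAY = 86_400
hot = TierSpec("sensor_1s_hot", retention_s=7 * DAY, shard_group_s=DAY)
mirror = TierSpec("sensor_1s_hot_fallback", retention_s=7 * DAY, shard_group_s=DAY)
warm = TierSpec("sensor_1m_warm", retention_s=90 * DAY, shard_group_s=7 * DAY)

if __name__ == "__main__":
    print(tier_mismatches(hot, mirror))   # [] -> safe fallback target
    for line in tier_mismatches(hot, warm):
        print("MISMATCH", line)           # both settings differ
```

Running a check like this in CI and failing the build on a non-empty result keeps a mis-tiered fallback from ever receiving diverted writes.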

2. Build the circuit breaker

Encapsulate the state machine so routing logic never inspects raw counters. The breaker tracks an EWMA error rate, trips on either a latency budget or the error threshold, and self-resets after a cooldown.

python

import time
import logging

class CircuitBreaker:
    def __init__(self, error_threshold=0.5, latency_budget_s=2.0,
                 cooldown_s=60, alpha=0.2):
        self.error_threshold = error_threshold   # trip when EWMA error rate reaches this
        self.latency_budget_s = latency_budget_s  # a slow write counts as a failure
        self.cooldown_s = cooldown_s              # OPEN -> HALF_OPEN wait
        self.alpha = alpha                        # EWMA smoothing factor
        self._error_rate = 0.0
        self.state = "CLOSED"
        self._opened_at = None

    def allow_primary(self) -> bool:
        """Return True if the primary should be attempted."""
        if self.state == "OPEN":
            # monotonic() is unaffected by wall-clock adjustments such as NTP steps.
            if time.monotonic() - self._opened_at >= self.cooldown_s:
                # Admit a trial write. Not thread-safe: guard with a lock if
                # several workers share one breaker.
                self.state = "HALF_OPEN"
                logging.info("circuit HALF_OPEN: trialing primary")
                return True
            return False                       # still OPEN -> straight to fallback
        return True                            # CLOSED or HALF_OPEN

    def _trip(self):
        self.state = "OPEN"
        self._opened_at = time.monotonic()
        logging.warning("circuit OPEN: error_rate=%.2f", self._error_rate)

    def record(self, ok: bool, latency_s: float):
        failure = 0.0 if (ok and latency_s <= self.latency_budget_s) else 1.0
        self._error_rate = self.alpha * failure + (1 - self.alpha) * self._error_rate
        if self.state == "HALF_OPEN":
            if failure == 0.0:
                self.state = "CLOSED"
                self._opened_at = None
                # Reset the EWMA on close. After several failed trials the rate
                # is still far above the threshold and would otherwise re-trip
                # the breaker on a later successful write.
                self._error_rate = 0.0
            else:
                self._trip()                   # trial failed -> back to OPEN
        elif self.state == "CLOSED" and self._error_rate >= self.error_threshold:
            self._trip()

3. Wrap the primary write with bounded retries

Transient blips should be retried a small, bounded number of times with exponential backoff and jitter so a recovering endpoint is not hit by a thundering herd. Use tenacity and keep the attempt count low — the breaker, not the retry loop, is what handles a sustained outage.

python

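from tenacity import (retry, stop_after_attempt, wait_exponential_jitter,
                      retry_if_exception_type)
from urllib3.exceptions import ProtocolError, TimeoutError as Urllib3Timeout

class WriteTimeout(Exception):
    """Primary write hit the client's connect or read timeout."""

@retry(
    stop=stop_after_attempt(2),                       # 2 attempts total (1 retry); breaker owns sustained failure
    wait=wait_exponential_jitter(initial=0.5, max=4),  # backoff + jitter avoids herd on recovery
    retry=retry_if_exception_type((WriteTimeout, ConnectionError)),
    reraise=True,
)
def _write_primary(write_api, bucket, record):
    # influxdb-client sends requests through urllib3, whose errors are not the
    # exception types named in the retry policy, so translate them here.
    try:
        write_api.write(bucket=bucket, record=record)
    except Urllib3Timeout as exc:      # connect or read timeout
        raise WriteTimeout(str(exc)) from exc
    except ProtocolError as exc:       # connection dropped mid-request
        raise ConnectionError(str(exc)) from exc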
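

To see why jitter matters, the sketch below generates an exponential-plus-jitter delay schedule. It is a from-scratch approximation written for illustration, not tenacity's code: the function name `backoff_delays`, the doubling base, and the one-second jitter width are all assumptions.

```python
import random
from typing import List, Optional


def backoff_delays(retries: int, initial: float = 0.5, max_s: float = 4.0,
                   jitter: float = 1.0, exp_base: float = 2.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Delay before each retry: exponential growth plus uniform jitter, capped at max_s."""
    rng = rng or random.Random()
    delays = []
    for k in range(retries):
        raw = initial * exp_base ** k + rng.uniform(0.0, jitter)
        delays.append(min(raw, max_s))
    return delays


if __name__ == "__main__":
    # Two clients that failed at the same instant draw different delays,
    # so their retries do not arrive at the recovering endpoint in lockstep.
    for client_seed in (1, 2):
        schedule = backoff_delays(4, rng=random.Random(client_seed))
        print([round(d, 2) for d in schedule])
```

With only two attempts per write, as configured above, just the first delay in the schedule is ever used; the longer schedule is shown to make the cap and the spread visible.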

4. Route the write through breaker, fallback, and buffer

The router ties the pieces together: consult the breaker, attempt the primary, divert to the tier-equivalent fallback on failure, and only if both endpoints are unreachable serialize the point to a durable dead-letter buffer for later replay. Point.tag() and Point.field() each take a single key/value pair, so the router builds each point by looping over the tag and field dicts.

python

import logging
import time
from influxdb_client import InfluxDBClient, Point, WritePrecision
from influxdb_client.client.write_api import SYNCHRONOUS

class TelemetryRouter:
    def __init__(self, primary, fallback, breaker, dlq):
        self.primary_api = InfluxDBClient(**primary["conn"]).write_api(write_options=SYNCHRONOUS)
        self.fallback_api = InfluxDBClient(**fallback["conn"]).write_api(write_options=SYNCHRONOUS)
        self.fallback_bucket = fallback["bucket"]   # tier-equivalent to the primary bucket
        self.breaker = breaker
        self.dlq = dlq                              # durable buffer (SQLite/Redis/Kafka)

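    def _build_point(self, measurement, tags, fields, ts=None):
        p = Point(measurement)
        for k, v in tags.items():
            p.tag(k, v)
        for k, v in fields.items():
            p.field(k, v)
        # Always stamp the point on the client. A point without a timestamp is
        # given the server's receive time, so a buffered replay would land at
        # replay time and could not overwrite (or be ordered with) the original.
        p.time(ts if ts is not None else time.time_ns(), WritePrecision.NS)
        return p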

    def route(self, measurement, tags, fields, ts=None, bucket="sensor_1s_hot"):
        point = self._build_point(measurement, tags, fields, ts)

        if self.breaker.allow_primary():
            start = time.time()
            try:
                _write_primary(self.primary_api, bucket, point)
                self.breaker.record(ok=True, latency_s=time.time() - start)
                return "primary"
            except Exception as exc:
                self.breaker.record(ok=False, latency_s=time.time() - start)
                logging.warning("primary write failed: %s", exc)

        # Breaker OPEN or primary failed -> tier-equivalent fallback.
        try:
            self.fallback_api.write(bucket=self.fallback_bucket, record=point)
            return "fallback"
        except Exception as exc:
            logging.error("fallback unreachable: %s -> DLQ", exc)
            self.dlq.enqueue(point.to_line_protocol(), bucket)   # never drop the point
            return "buffered"

The dead-letter buffer must be durable across process restarts — a SQLite database in WAL mode or a Kafka topic, never an in-memory list. The full crash-safe buffering and replay implementation is developed in implementing fallback write routing during network partitions.
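
As a rough illustration of what "durable" means here, the sketch below backs the buffer with a SQLite file in WAL mode. It is a minimal, single-threaded sketch under stated assumptions, not the full implementation referenced above: the class name `SqliteDLQ`, the table layout, and the `send` callback (any callable that writes a batch of `(bucket, line)` pairs to InfluxDB and raises on failure) are hypothetical. Only the `enqueue(line, bucket)` signature is taken from the router code.

```python
import sqlite3
import time
from typing import Callable, List, Tuple

Row = Tuple[str, str]   # (bucket, line protocol)


class SqliteDLQ:
    """Dead-letter queue persisted in a WAL-mode SQLite file (single-threaded use)."""

    def __init__(self, path: str):
        self._db = sqlite3.connect(path)
        self._db.execute("PRAGMA journal_mode=WAL")   # needs a real file, not :memory:
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS dlq ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "bucket TEXT NOT NULL, line TEXT NOT NULL)"
        )
        self._db.commit()

    def enqueue(self, line: str, bucket: str) -> None:
        with self._db:   # commits on success, rolls back on error
            self._db.execute(
                "INSERT INTO dlq (bucket, line) VALUES (?, ?)", (bucket, line)
            )

    def pending(self) -> int:
        return self._db.execute("SELECT COUNT(*) FROM dlq").fetchone()[0]

    def drain(self, send: Callable[[List[Row]], None],
              batch_size: int = 500, pause_s: float = 0.0) -> int:
        """Replay oldest-first in batches; a batch is deleted only after send() returns."""
        sent = 0
        while True:
            rows = self._db.execute(
                "SELECT id, bucket, line FROM dlq ORDER BY id LIMIT ?", (batch_size,)
            ).fetchall()
            if not rows:
                return sent
            send([(bucket, line) for _, bucket, line in rows])   # raises -> batch stays queued
            with self._db:
                self._db.execute("DELETE FROM dlq WHERE id <= ?", (rows[-1][0],))
            sent += len(rows)
            time.sleep(pause_s)   # throttle so a recovering endpoint is not flooded

    def close(self) -> None:
        self._db.close()
```

Deleting only after `send()` returns gives at-least-once delivery: a crash between the send and the delete replays that batch, which is harmless as long as every buffered line carries its original timestamp and therefore overwrites rather than duplicates.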

5. Reconcile buffered writes with a scheduled task

Once the primary recovers, deferred data in the fallback bucket (or the DLQ, after it drains back to InfluxDB) must be merged into the primary’s aggregation path without double-counting. A Flux task reads the fallback window and writes the points into the primary bucket with their original timestamps; because the measurement, tag set, field key, and timestamp are identical, a replayed point overwrites rather than appends. The task as written assumes both buckets are visible to the instance that runs it; when the fallback bucket lives on a separate secondary endpoint, run the task there and pass the primary's host and token to to(). Anchor the read to the last successful run using the cron & interval scheduling logic that governs every lifecycle task.

flux

import "influxdata/influxdb/tasks"

option task = {
    name: "reconcile_fallback_to_primary",
    every: 10m,
    offset: 2m,          // let in-flight fallback writes settle first
    concurrency: 1,      // never overlap -> no duplicated backfill
}

from(bucket: "sensor_1s_hot_fallback")
    |> range(start: tasks.lastSuccess(orTime: -1h))
    |> filter(fn: (r) => r._measurement == "sensor_readings")
    |> to(bucket: "sensor_1s_hot", org: "iot-platform")

The retry-safe discipline that makes this backfill idempotent — deterministic ranges, fixed boundaries, no wall-clock now() in the window — is covered in Flux scripting for task automation. Consult the official InfluxDB write-data documentation for the write and timeout semantics these tasks depend on.
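
The overwrite-by-key behaviour that reconciliation relies on can be pictured with a toy model. The sketch below is an in-memory illustration only, not InfluxDB's storage engine: a dict keyed by measurement, tag set, field key, and timestamp stands in for the bucket, and `write_points` is a hypothetical helper.

```python
from typing import Dict, Iterable, Optional, Tuple

Key = Tuple[str, Tuple[Tuple[str, str], ...], str, int]
PointT = Tuple[str, Dict[str, str], str, float, Optional[int]]


def write_points(store: Dict[Key, float], points: Iterable[PointT], now_ns: int) -> None:
    """Write points into the toy store; unstamped points take the receive time."""
    for measurement, tags, field, value, ts in points:
        stamp = ts if ts is not None else now_ns
        store[(measurement, tuple(sorted(tags.items())), field, stamp)] = value


stamped = [
    ("sensor_readings", {"dev": "t1"}, "v", 1.0, 1_700_000_000_000_000_000),
    ("sensor_readings", {"dev": "t1"}, "v", 1.5, 1_700_000_001_000_000_000),
]
unstamped = [("sensor_readings", {"dev": "t1"}, "v", 1.0, None)]

if __name__ == "__main__":
    bucket: Dict[Key, float] = {}
    write_points(bucket, stamped, now_ns=10)
    write_points(bucket, stamped, now_ns=20)     # replay of the same batch
    print(len(bucket))                           # 2 -> replay overwrote in place

    leaky: Dict[Key, float] = {}
    write_points(leaky, unstamped, now_ns=10)
    write_points(leaky, unstamped, now_ns=20)    # replay lands at a new timestamp
    print(len(leaky))                            # 2 -> one reading, counted twice
```

The second case is the double-counting failure: a point that reaches the store without its original timestamp cannot be matched to its earlier copy, so every replay adds a row.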

Configuration reference

Setting	Accepted values	Default	Effect
`error_threshold`	float `0.0`–`1.0`	`0.5`	EWMA error rate at which the breaker trips to `OPEN`. Lower values fail over sooner but are more sensitive to noise.
`latency_budget_s`	float (seconds)	`2.0`	A write slower than this counts as a failure, so a degraded-but-alive primary trips the breaker before it hard-fails.
`cooldown_s`	integer (seconds)	`60`	Time the breaker stays `OPEN` before admitting a `HALF_OPEN` trial write. Too short re-hammers a recovering endpoint.
`alpha`	float `0.0`–`1.0`	`0.2`	EWMA smoothing factor. Higher reacts faster to recent failures; lower is steadier.
`stop_after_attempt`	integer	`2`	Total primary attempts per write, counting the first (so `2` means one retry). Keep small: the breaker, not the retry loop, absorbs sustained outages.
`wait_exponential_jitter`	`initial`, `max` durations	`0.5s` / `4s`	Backoff with jitter between retries; the jitter prevents a thundering herd when an endpoint recovers.
fallback `retention`	duration literal	must equal primary	Fallback bucket retention window. Must match the primary tier or diverted writes change lifecycle.
fallback `shard-group-duration`	duration literal	must equal primary	Fallback shard sizing. Mismatches alter expiry granularity for diverted data.

Common failure modes and fixes

1. Fallback bucket in the wrong tier. Symptom: storage on the secondary grows without bound, or diverted data expires far earlier or later than the primary’s. Root cause: the fallback bucket’s retention window or shard-group duration does not match the primary hot tier. Fix: create fallback buckets with identical retention and shard sizing, and assert it in CI before trusting the path.

bash

influx bucket list --host "$INFLUX_SECONDARY_HOST" --org "$INFLUX_ORG" \
  | grep sensor_1s_hot_fallback   # confirm retention == primary

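2. Breaker trips on errors but not on latency. Symptom: the primary hangs for 30 seconds per write, the worker pool saturates, yet the breaker stays CLOSED because nothing has technically errored. Root cause: the failure signal only counts exceptions, not slow successes. Fix: count any write exceeding latency_budget_s as a failure (as the record() method above does) so a degraded endpoint trips the breaker before it exhausts threads. Latency is only recorded once the call returns, so also set the client's request timeout (the timeout argument of InfluxDBClient, in milliseconds) close to the latency budget; otherwise the first few writes still block for the full socket timeout before the breaker can open.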

3. Retry storm on recovery. Symptom: the moment the primary returns, a synchronized flood of buffered writes knocks it back down. Root cause: fixed backoff with no jitter, or an unbounded retry count, causes every client to retry in lockstep. Fix: use wait_exponential_jitter, cap stop_after_attempt low, and drain the DLQ at a throttled rate rather than all at once.

4. Double-counted points after backfill. Symptom: aggregates spike after a fallback window is reconciled. Root cause: the reconciliation task appends rather than overwrites because timestamps were not snapped to a fixed boundary, or concurrency allowed overlapping runs. Fix: keep concurrency: 1, anchor reads with tasks.lastSuccess(), and ensure the series/timestamp key is identical so replayed points overwrite in place. This is the idempotency contract shared with fallback chains for missing data.

5. In-memory dead-letter buffer lost on restart. Symptom: a process crash during a dual-endpoint outage silently loses everything the DLQ was holding. Root cause: the buffer is a Python list, not durable storage. Fix: back the DLQ with SQLite in WAL mode, Redis with persistence, or a Kafka topic so buffered points survive a restart — the durable pattern detailed in implementing fallback write routing during network partitions.

Verification and testing

Confirm the resilient path works by exercising three invariants: the breaker actually opens under induced failure, diverted writes land in the tier-equivalent fallback, and a stalled reconciliation raises an alert rather than failing silently.

First, force a failure in a test harness and assert the breaker transitions. Point the primary at a dead port and verify writes route to fallback:

python

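# DEAD and LIVE are test fixtures of the shape TelemetryRouter expects
# ({"conn": {...}, "bucket": ...}); dlq is any object with enqueue(line, bucket).
cb = CircuitBreaker(error_threshold=0.5, cooldown_s=5)
router = TelemetryRouter(primary=DEAD, fallback=LIVE, breaker=cb, dlq=dlq)
outcomes = [router.route("sensor_readings", {"dev": "t1"}, {"v": 1.0}) for _ in range(10)]
# With alpha=0.2 the EWMA reaches 0.5 on the 4th consecutive failure.
assert cb.state == "OPEN"
assert outcomes[-1] == "fallback"   # once OPEN, writes short-circuit to fallback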

Second, confirm the fallback bucket is actually receiving diverted telemetry during an outage:

flux

from(bucket: "sensor_1s_hot_fallback")
    |> range(start: -15m)
    |> filter(fn: (r) => r._measurement == "sensor_readings")
    |> count()

Third, add a deadman health check so a stalled reconciliation — the primary is back but backfill has stopped — pages an operator instead of leaving a silent gap:

flux

import "influxdata/influxdb/monitor"
import "experimental"

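// The range must reach back further than the deadman threshold: with a 20m
// range and a 20m threshold no row could be older than t, so nothing is flagged.
from(bucket: "sensor_1s_hot")
    |> range(start: -1h)
    |> filter(fn: (r) => r._measurement == "sensor_readings")
    |> monitor.deadman(t: experimental.subDuration(from: now(), d: 20m))
    |> filter(fn: (r) => r.dead == true)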

A lightweight Python probe can assert the same invariant in CI or a scheduled check, failing loudly if the fallback path has drained but the primary is still empty:

python

import os
from influxdb_client import InfluxDBClient

client = InfluxDBClient(url=os.environ["INFLUX_URL"], token=os.environ["INFLUX_TOKEN"], org=os.environ["INFLUX_ORG"])
q = 'from(bucket: "sensor_1s_hot") |> range(start: -20m) |> count()'
tables = client.query_api().query(q, org=os.environ["INFLUX_ORG"])
total = sum(rec.get_value() for t in tables for rec in t.records)
if total == 0:
    raise RuntimeError("primary empty after recovery: reconciliation may be stalled")
print(f"primary healthy: {total} points in last 20m")
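

The first check above only covers the CLOSED to OPEN transition. The recovery half of the cycle (OPEN, cooldown, HALF_OPEN trial, CLOSED) can be exercised without sleeping by injecting a clock. The sketch below uses `MiniBreaker`, a condensed stand-in written for this test with a hypothetical `clock` parameter; it is not the `CircuitBreaker` class from step 2, which reads the system clock directly.

```python
class FakeClock:
    """Manually advanced clock so the test never sleeps."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


class MiniBreaker:
    """Condensed EWMA breaker with an injectable clock (illustrative only)."""

    def __init__(self, clock, threshold=0.5, alpha=0.2, cooldown_s=5.0):
        self.clock, self.threshold = clock, threshold
        self.alpha, self.cooldown_s = alpha, cooldown_s
        self.rate, self.state, self.opened_at = 0.0, "CLOSED", None

    def allow_primary(self) -> bool:
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.cooldown_s:
                return False              # still cooling down
            self.state = "HALF_OPEN"      # cooldown elapsed: admit a trial
        return True

    def record(self, ok: bool) -> None:
        self.rate = self.alpha * (0.0 if ok else 1.0) + (1 - self.alpha) * self.rate
        if self.state == "HALF_OPEN" and ok:
            self.state, self.rate = "CLOSED", 0.0      # trial passed: start fresh
        elif self.state == "HALF_OPEN" or (
            self.state == "CLOSED" and self.rate >= self.threshold
        ):
            self.state, self.opened_at = "OPEN", self.clock()


def run_cycle():
    clock = FakeClock()
    cb = MiniBreaker(clock)
    seen = [cb.state]
    for _ in range(4):                    # four failures push the EWMA past 0.5
        cb.record(ok=False)
    seen.append(cb.state)
    assert not cb.allow_primary()         # OPEN: primary is skipped
    clock.advance(5.0)                    # cooldown elapses instantly
    assert cb.allow_primary()
    seen.append(cb.state)
    cb.record(ok=True)                    # successful trial write
    seen.append(cb.state)
    return seen


if __name__ == "__main__":
    print(run_cycle())   # ['CLOSED', 'OPEN', 'HALF_OPEN', 'CLOSED']
```

The same idea applies to the real breaker if its time source is made injectable or patched in the test.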

Integration points

Fallback routing is the reliability layer that protects everything else in the lifecycle. Its fallback buckets are provisioned as part of bucket architecture & tiering boundaries, and the requirement that a fallback preserve the source tier’s expiry ties it directly to retention policy design: a diverted write must inherit the same window it would have had on the primary. The reconciliation task is scheduled with the same cadence primitives as the rest of the platform and written with the retry-safe habits of Flux scripting for task automation, while the routing client itself is an application of the broader Python client orchestration patterns used across the ingestion tier. When a reconciliation depends on other tasks having completed first — draining the DLQ before recomputing a rollup, for example — that ordering belongs to dependency mapping & DAG construction. Every fallback token is scoped with the least-privilege approach of data ingestion security frameworks, and the narrow, partition-specific mechanics live in implementing fallback write routing during network partitions.

FAQ

Should the fallback write to a different InfluxDB cluster or just a different bucket?

Either can be right, depending on the failure domain you are protecting against. A different bucket on the same endpoint protects against per-bucket problems (a compaction stall, a full shard) but not against endpoint or host failure. A tier-equivalent bucket on a geographically separate secondary endpoint protects against the whole-node outages that cause most data-loss incidents. Production platforms usually route to a secondary endpoint and fall back to a local durable buffer only when both are unreachable.

Why trip the circuit breaker on latency and not only on errors?

Because the most damaging outages are slow, not dead. An endpoint that returns HTTP 200 after 30 seconds never raises an error, yet it exhausts the worker pool and back-pressures the edge just as thoroughly as a hard failure. Counting any write slower than a latency budget as a failure lets the breaker open before threads are consumed, which is the difference between a routing decision and a cascading collapse.

How do I stop reconciliation from double-counting replayed points?

Make the backfill idempotent. Keep the reconciliation task at concurrency: 1, anchor its read window with tasks.lastSuccess(), and ensure replayed points carry the identical measurement, tag set, and timestamp as the originals so InfluxDB overwrites the existing series/timestamp key rather than appending a new point. Snapping timestamps to a fixed boundary before writing guarantees that key equality.

What belongs in the dead-letter buffer versus the fallback bucket?

The fallback bucket holds writes that landed successfully in InfluxDB, just on the secondary path — they are real, queryable data awaiting reconciliation. The dead-letter buffer holds only the writes that could reach neither endpoint, serialized as line protocol on durable local storage. The DLQ is the last line of defence for a total outage; it should drain back into InfluxDB (throttled) as soon as either endpoint recovers.

Does fallback routing replace InfluxDB clustering or replication?

No. Native clustering and replication provide storage-layer redundancy; fallback routing provides write-path redundancy in the client and orchestration tier. They are complementary — replication protects data already committed, fallback routing protects data still in flight when an endpoint degrades. A platform with strict durability requirements uses both.

Related

Implementing fallback write routing during network partitions: crash-safe local buffering and asynchronous replay in depth.
Bucket architecture & tiering boundaries: provision the tier-equivalent buckets a fallback writes into.
Retention policy design: keep diverted writes on the same expiry contract as the primary.
Data ingestion security frameworks: scope primary and fallback write tokens to least privilege.
Fallback chains for missing data: the aggregation-side counterpart to routing-layer resilience.

