Best practices for bucket partitioning in IoT telemetry

A fleet of 40,000 devices writing temperature, a device heartbeat, and a debug log line every second into one telemetry bucket looks harmless for a quarter and then fails three ways at once: dashboards that read the last 15 minutes slow to seconds because every query opens Time-Structured Merge (TSM) files across dozens of shard groups, the device_id tag pushes the series index past the point where it pages to disk, and low-value heartbeats inflate the storage cost of the high-value sensor stream they share a bucket with. Bucket partitioning is the deliberate act of splitting that firehose along the dimensions your queries and retention actually follow — tenant, environment, and metric class — so each query scans only relevant blocks and each data class carries its own retention and shard sizing. This page turns the tiering theory from Bucket Architecture & Tiering Boundaries into a runnable partitioning scheme, keeps the series index bounded, and routes writes into the right bucket before they ever hit the storage engine.

Prerequisites

InfluxDB 2.7+ (OSS or Cloud) with the native task engine enabled — the option task and bucket APIs below are v2-native.
Flux 0.x (bundled with the versions above) for the cardinality-bounding rollup task.
Python 3.9+ and influxdb-client 1.36+ for programmatic bucket provisioning and dynamic write routing.
An operator or all-access token scoped to create buckets and to read/write the partition buckets.
A naming convention agreed up front — this page uses <tenant>_<metric-class>_<resolution> (e.g. acme_sensor_1s, acme_heartbeat_1m).
A rough per-fleet cardinality estimate (device count × measurements × fields) so partition boundaries can be sized before ingestion, not after the index bloats.

How bucket cardinality actually scales

Before choosing partition boundaries, name the quantity that governs everything downstream. The worst-case series count in a bucket is the product of the distinct values of its tag keys, summed across measurements:

$$ C = \sum_{m}\ \prod_{k \in \text{tags}(m)} |k| $$

A single measurement carrying a 40,000-value device_id tag and a 3-value region tag can reach 120,000 series if every device reports under every region (40,000 if each device stays in one region), and every additional field multiplies the TSM footprint against that base. The practical operating rule is to keep each bucket below roughly 100,000 active series; past that the storage engine allocates excessive RAM for the in-memory index and write throughput degrades. Partitioning is how you keep $C$ bounded per bucket without discarding the granularity your device fleet needs, and it works along three orthogonal dimensions:

Tenant / environment isolation — one bucket per customer, facility, or deployment stage (acme_sensor_1s, staging_sensor_1s). This kills noisy-neighbour effects during a single tenant’s ingestion spike, lets you scope tokens per tenant for role-based access control, and lets each tenant carry its own retention window.
Metric-class segmentation — group by telemetry type, not by device. Heartbeats want long retention at low resolution; raw sensor readings want short retention at high resolution. Splitting them stops cheap heartbeats from inheriting the expensive sensor bucket’s storage profile.
Resolution tier — raw versus rolled-up, which is the temperature gradient covered in the tiering guide above; a partitioning scheme should name the resolution in the bucket so the downsampling aggregation pipeline has an unambiguous destination.
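The cardinality formula above can be evaluated before any data is written. The following is a minimal sketch, assuming a hand-maintained schema dictionary; the function name estimate_series, the heartbeat entry, and the tag counts are illustrative, not part of any InfluxDB API.

```python
from math import prod

def estimate_series(schema: dict[str, dict[str, int]]) -> int:
    """Worst-case series count: sum over measurements of the product of tag cardinalities."""
    return sum(prod(tags.values()) for tags in schema.values())

# Hypothetical schema: measurement -> {tag key: number of distinct values}
schema = {
    "environment": {"device_id": 40_000, "region": 3},
    "heartbeat": {"device_id": 40_000},
}

total = estimate_series(schema)  # 120_000 for environment plus 40_000 for heartbeat
```

Running the estimate per candidate bucket, rather than per fleet, shows which partition boundary brings each bucket back under the series budget.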

Solution walkthrough

Step 1 — Define partition dimensions and a naming scheme

Encode the partition dimensions directly in the bucket name so routing, retention, and dashboards can all parse a bucket’s purpose without a lookup table. Avoid folding a high-cardinality identifier such as device_id into the name (that recreates the shard-explosion problem as a bucket-explosion problem); partition on the low-cardinality dimensions — tenant, metric class, resolution — and let the tag index handle per-device separation within a bucket.

python

# Partition dimensions are low-cardinality and known at design time.
TENANTS = ["acme", "globex"]
METRIC_CLASSES = {
    "sensor":    {"resolution": "1s", "retention_days": 14},   # hot, high-value
    "heartbeat": {"resolution": "1m", "retention_days": 365},  # long-lived, low-res
    "log":       {"resolution": "raw", "retention_days": 7},   # cheap, short-lived
}

def bucket_name(tenant: str, metric_class: str) -> str:
    res = METRIC_CLASSES[metric_class]["resolution"]
    return f"{tenant}_{metric_class}_{res}"   # e.g. acme_sensor_1s

Each (tenant, metric_class) pair becomes one bucket, so the total bucket count is len(TENANTS) * len(METRIC_CLASSES) — a number you can reason about — instead of one bucket per device. The resolution string is the contract the downsampling stage reads: a task rolling acme_sensor_1s into a minute tier knows its destination is acme_sensor_1m by substitution.
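The substitution described above can be made explicit so tasks and provisioning code derive the rollup destination the same way. This is a sketch under the assumption that every bucket follows the <tenant>_<metric-class>_<resolution> convention and that the resolution segment never contains an underscore; rollup_destination is a hypothetical helper name.

```python
def rollup_destination(source_bucket: str, target_resolution: str) -> str:
    """Swap the trailing resolution segment of a partition bucket name."""
    prefix, sep, _resolution = source_bucket.rpartition("_")
    if not sep or not prefix:
        # No underscore, or nothing before it: the name does not follow the convention.
        raise ValueError(f"not a partition bucket name: {source_bucket!r}")
    return f"{prefix}_{target_resolution}"

destination = rollup_destination("acme_sensor_1s", "1m")
```

Here destination evaluates to acme_sensor_1m, matching the example in the paragraph above.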

Step 2 — Provision buckets with shard groups sized to retention

Create each bucket with a shard group duration sized to its retention window, not left at the default. The rule of thumb is a shard group one to two orders of magnitude smaller than the retention window: a 14-day hot bucket wants a 1-day shard (1/14 of the window), a 1-year heartbeat bucket a 7-day shard (about 1/52). Shards that are too short multiply file handles and compaction work; shards that are too long make retention reclaim storage in coarse blocks and force queries to scan more than a tight range() needs.

python

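import os
from datetime import timedelta
from influxdb_client import InfluxDBClient, BucketRetentionRules

# Assumption: credentials come from the environment; these variable names are illustrative.
TOKEN = os.environ["INFLUX_TOKEN"]
ORG = os.environ["INFLUX_ORG"]

client = InfluxDBClient(url="http://localhost:8086", token=TOKEN, org=ORG)
buckets_api = client.buckets_api()

SHARD_FOR_RETENTION = {  # retention_days -> shard group seconds (1/7 to 1/52 of the window)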
    7:   int(timedelta(days=1).total_seconds()),
    14:  int(timedelta(days=1).total_seconds()),
    365: int(timedelta(days=7).total_seconds()),
}

for tenant in TENANTS:
    for mclass, cfg in METRIC_CLASSES.items():
        days = cfg["retention_days"]
        rules = BucketRetentionRules(
            type="expire",
            every_seconds=int(timedelta(days=days).total_seconds()),
            shard_group_duration_seconds=SHARD_FOR_RETENTION[days],
        )
        buckets_api.create_bucket(
            bucket_name=bucket_name(tenant, mclass),
            retention_rules=rules,
            org=ORG,
        )

every_seconds is the retention window (data older than this expires); shard_group_duration_seconds sets the reclaim granularity. Because retention drops whole shard groups, matching the shard duration to the window is what makes expiry precise instead of clumsy. Provisioning this in code — rather than clicking through the UI — means the same scheme reproduces identically across staging and production, and pairs naturally with the retention policy automation that maintains these windows over time.
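The lookup table in Step 2 raises a KeyError for any retention window it does not list. One way to generalize it is a small function that picks the shard group duration from the retention length. This is a sketch only: the cut-offs (under 2 days, up to 180 days, longer) are assumptions chosen to reproduce the table's three entries, and shard_group_seconds is a hypothetical name.

```python
from datetime import timedelta

def shard_group_seconds(retention_days: int) -> int:
    """Pick a shard group duration (in seconds) from the retention window."""
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")
    if retention_days < 2:
        duration = timedelta(hours=1)   # very short windows: hourly shard groups
    elif retention_days <= 180:
        duration = timedelta(days=1)    # days to months: daily shard groups
    else:
        duration = timedelta(days=7)    # long-lived data: weekly shard groups
    return int(duration.total_seconds())
```

Passing shard_group_seconds(days) in place of SHARD_FOR_RETENTION[days] keeps the provisioning loop working when a new metric class introduces a retention value the table did not anticipate.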

Step 3 — Route writes to the right partition before they land

Enforce the partitioning at the ingestion layer so a mislabelled payload never lands in the wrong bucket. Inspect the payload metadata (tenant, metric_class), resolve the destination through the same naming function, and batch-write asynchronously. Never hard-code bucket names in business logic — a routing table keeps the partition scheme in one place.

python

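from influxdb_client import Point, WritePrecision, WriteOptions

# Batching mode buffers points and flushes them from a background thread,
# so write() returns without waiting on the HTTP round trip.
# (The ASYNCHRONOUS preset is non-blocking but does not batch.)
write_api = client.write_api(
    write_options=WriteOptions(batch_size=5_000, flush_interval=1_000)  # flush_interval is in ms
)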

def route_and_write(payload: dict) -> None:
    tenant = payload["tenant"]
    mclass = payload["metric_class"]
    if tenant not in TENANTS or mclass not in METRIC_CLASSES:
        raise ValueError(f"unroutable payload: {tenant}/{mclass}")   # fail closed
    point = (
        Point(payload["measurement"])
        .tag("device_id", payload["device_id"])   # per-device stays a TAG, in-bucket
        .field("value", payload["value"])
        .time(payload["ts"], WritePrecision.NS)
    )
    write_api.write(bucket=bucket_name(tenant, mclass), org=ORG, record=point)

The if ... raise guard fails closed: an unknown tenant or metric class is rejected at the edge with a clear error, rather than resolving to a bucket that was never provisioned or polluting a valid one. For high-velocity fleets, tune the write options (batch size around 5,000, a flush interval near 1s) and add exponential backoff on HTTP 429/503; the full non-blocking pattern, including a local buffer for backpressure, is covered in Python client orchestration patterns.
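The backoff on HTTP 429/503 mentioned above can be sketched as a small wrapper around a synchronous write call. This is an illustration under stated assumptions, not the client's own retry mechanism: write_with_backoff, its parameters, and the delay values are hypothetical, and the wrapper only assumes the raised exception carries a numeric status attribute, as the client's ApiException does.

```python
import random
import time
from typing import Callable

RETRYABLE_STATUS = {429, 503}  # too many requests, service unavailable

def write_with_backoff(
    send: Callable[[], None],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send(); on a retryable status wait and retry. Returns the attempts used."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    attempt = 0
    while True:
        attempt += 1
        try:
            send()
            return attempt
        except Exception as exc:
            status = getattr(exc, "status", None)
            # Give up on non-retryable errors or once the attempt budget is spent.
            if status not in RETRYABLE_STATUS or attempt >= max_attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 10))  # small jitter spreads retries
```

A caller would pass a zero-argument callable, for example a lambda that wraps a synchronous write. In batching mode the client handles retries itself through WriteOptions settings such as retry_interval, max_retries, and exponential_base, so a wrapper like this is mainly useful for synchronous writes.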

Step 4 — Bound cardinality inside the hot bucket

Even a well-partitioned acme_sensor_1s bucket will drift over its series budget if raw device granularity accumulates. Route the raw high-cardinality stream through a short retention window and use a scheduled rollup to write a lower-cardinality aggregate into the minute tier, collapsing per-second detail while preserving the trend. Grouping on the same low-cardinality keys before aggregation keeps the destination bounded. The destination bucket (acme_sensor_1m below) must already exist: the Step 2 loop provisions one bucket per metric class, so the minute tier needs its own create_bucket call.

flux

option task = {name: "acme_sensor_rollup_1m", every: 1m, offset: 15s}

from(bucket: "acme_sensor_1s")
    |> range(start: -task.every, stop: now())
    |> filter(fn: (r) => r._measurement == "environment")
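    // Group on bounded keys so the rollup's series count stays predictable.
    // _field stays in the group key: without it, different fields are averaged
    // together and to() fails because the _field column is dropped.
    |> group(columns: ["_measurement", "_field", "region"])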
    |> aggregateWindow(every: 1m, fn: mean, createEmpty: false)
    |> to(bucket: "acme_sensor_1m")

The offset: 15s lets late-arriving edge batches close their window before the task fires; group() on region rather than device_id is the cardinality lever — the minute tier tracks per-region trend, not per-device, so its series count collapses from tens of thousands to a handful. Where a tag such as serial_number is only ever read back and never used to filter or group, promote it to a field so it bypasses the series index entirely. Robust, idempotent versions of these rollups — anchoring the window and surviving retries — are covered in writing robust Flux scripts for automated data rollups.
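The effect of grouping on region instead of device_id can be checked with plain arithmetic on series keys. The sketch below uses synthetic points, with invented device IDs and region names, and counts distinct key combinations before and after dropping device_id; it models the idea only and does not query InfluxDB.

```python
REGIONS = ["eu", "us", "apac"]  # illustrative region values

# Synthetic points: 1,000 devices, each pinned to one region.
points = [
    {
        "_measurement": "environment",
        "_field": "temperature",
        "device_id": f"dev-{i}",
        "region": REGIONS[i % len(REGIONS)],
    }
    for i in range(1_000)
]

def series_count(rows: list[dict], keys: list[str]) -> int:
    """Number of distinct combinations of the given key columns."""
    return len({tuple(row[k] for k in keys) for row in rows})

raw_series = series_count(points, ["_measurement", "_field", "device_id", "region"])
rolled_series = series_count(points, ["_measurement", "_field", "region"])
```

With these inputs raw_series is 1,000 and rolled_series is 3, the same collapse the minute tier gets when device_id is left out of the group key.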

Gotchas and edge cases

Over-partitioning trades one bloat for another. Splitting into a bucket per device, per day, or per firmware version replaces index bloat with thousands of tiny shard groups, exhausting file handles and starving the compactor. Partition only on dimensions your queries filter on or your retention differs on; if two candidate buckets always share the same retention and are always queried together, they should be one bucket with a tag, not two buckets.
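The cost of over-partitioning can be estimated with simple arithmetic: each bucket holds roughly one live shard group per shard duration within its retention window. The sketch below is an approximation that assumes all buckets share one retention window and shard duration; live_shard_groups is a hypothetical helper.

```python
from math import ceil

def live_shard_groups(bucket_count: int, retention_days: int, shard_days: int) -> int:
    """Approximate number of shard groups alive at once across all buckets."""
    return bucket_count * ceil(retention_days / shard_days)

# 2 tenants x 3 metric classes, 14-day retention, 1-day shard groups
per_class = live_shard_groups(6, 14, 1)

# One bucket per device for a 40,000-device fleet, same retention and shard size
per_device = live_shard_groups(40_000, 14, 1)
```

The per-class scheme yields 84 shard groups and the per-device scheme 560,000, which is the difference between a manageable number of files and exhausted file handles.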

Time-windowed buckets fragment analytical queries. Explicit per-epoch buckets like telemetry_2026_q1 simplify bulk deletion but force any cross-quarter trend query to union() across buckets, and a join() across high-cardinality time-windowed buckets can exhaust memory in production. Prefer retention-driven expiry within a stable bucket over manual epoch buckets; reserve epoch partitioning for compliance data that is written once and deleted wholesale.

createEmpty: true silently multiplies rollup cardinality. For sparse sensors that go quiet, createEmpty: true in the Step 4 rollup emits a point for every group in every window even when no data arrived, inflating the minute tier with null-filled series. Keep createEmpty: false unless a downstream consumer genuinely needs a dense series — and if it does, handle the gaps explicitly rather than materializing them across the whole fleet.

Verification

Confirm each partition bucket is inside its series budget by querying the built-in cardinality function per bucket — this is the number that must stay under ~100k:

flux

import "influxdata/influxdb"

influxdb.cardinality(bucket: "acme_sensor_1s", start: -30d)
    |> yield(name: "series_count")

Cross-check the same figure from the CLI for scripted alerting, and confirm the rollup destination is materially smaller than its source — proof the partitioning and grouping actually bounded cardinality rather than just moving it:

bash

# Compare hot vs. rolled-up series counts; the 1m tier should be far smaller.
influx query 'import "influxdata/influxdb"
influxdb.cardinality(bucket: "acme_sensor_1s", start: -30d) |> yield(name: "hot")'
influx query 'import "influxdata/influxdb"
influxdb.cardinality(bucket: "acme_sensor_1m", start: -30d) |> yield(name: "warm")'

A hot count well under the budget and a warm count an order of magnitude below it confirms the scheme is holding; a hot count creeping toward 100k is the signal to split a further dimension or promote a tag to a field before write throughput degrades.
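The pass/fail reading described above can be encoded for scripted alerting once the two counts have been fetched. This is a sketch that assumes the counts arrive as plain integers; the assess function and the 80% warning ratio are assumptions, while the 100,000 budget and the order-of-magnitude comparison come from this page.

```python
SERIES_BUDGET = 100_000  # per-bucket series budget used on this page

def assess(hot: int, warm: int, budget: int = SERIES_BUDGET, warn_ratio: float = 0.8) -> list[str]:
    """Return human-readable findings; an empty list means the scheme is holding."""
    findings = []
    if hot >= budget:
        findings.append("hot bucket over series budget")
    elif hot >= warn_ratio * budget:
        findings.append("hot bucket approaching series budget")
    # The rollup should be at least an order of magnitude smaller than its source.
    if warm * 10 > hot:
        findings.append("rollup is not an order of magnitude smaller than its source")
    return findings
```

An empty result corresponds to the healthy case; any finding is a prompt to split a further dimension or promote a tag to a field.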

Related

Bucket Architecture & Tiering Boundaries: the hot/warm/cold tiering model and shard-group theory this partitioning scheme implements.
How to Configure Retention Policies in InfluxDB 2.x: setting and maintaining the per-bucket expiry windows each partition depends on.
Optimizing Aggregation Precision for High-Frequency Sensor Data: preserving numeric fidelity when the Step 4 rollup collapses per-second detail.
