
Async Batch Processing for Velocity Recalculation

Velocity-driven slotting needs deterministic, high-throughput computation that a naive real-time stream cannot deliver without injecting WMS latency or compute contention. Async batch processing decouples I/O-bound data retrieval from the CPU-bound velocity math so a full recalculation over hundreds of thousands of SKUs completes inside a fixed maintenance window while active shifts keep sub-second pick-path responsiveness. This guide is part of the Velocity Data Ingestion & WMS Sync Pipelines system, and it owns the execution layer specifically: how you chunk the SKU set, bound concurrency so you never overwhelm a legacy ERP, absorb rate limits with backoff, and commit results idempotently so a mid-run crash resumes instead of double-counting.

What Async Batch Velocity Processing Is

Async batch velocity processing is the pattern of computing velocity coefficients for a large SKU population by dispatching many independent, non-blocking network calls concurrently under a fixed concurrency ceiling, then reducing the results into a single scored dataset that the assignment layer consumes. It is batch because it runs over a bounded snapshot on a schedule rather than reacting to every transaction, and it is async because the work is dominated by waiting on I/O — ERP reads, velocity-compute API calls, WMS writes — not by CPU.

Three properties separate a production processor from a for-loop wrapped in asyncio.gather:

Bounded concurrency, not unbounded fan-out. An unbounded asyncio.gather over 200k SKUs opens as many sockets as the event loop will allow, exhausts the connection pool, and trips the ERP’s rate limiter within seconds. An asyncio.Semaphore caps in-flight work so throughput saturates available bandwidth without a thundering herd.
Idempotent chunks with checkpoints, not one monolithic run. Each chunk is fully independent and carries an idempotency key derived from the run id, the window, and the chunk’s ordinal, so a failed chunk can be retried, or the whole job resumed from the last committed checkpoint, without producing duplicate velocity updates that would skew tier boundaries.
Backpressure, not best-effort. The processor watches event-loop lag and queue depth and self-throttles when the downstream can’t keep up, rather than letting the task queue grow unbounded during a receiving surge.
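To make the first property concrete, here is a minimal sketch that uses only the standard library. It simulates the network call with a short sleep and records the peak number of calls in flight; `run_bounded` and `fake_call` are illustrative names, not part of the processor.

```python
import asyncio


async def run_bounded(n_tasks: int, limit: int) -> int:
    """Run n_tasks simulated calls under a semaphore; return the peak in-flight count."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def fake_call() -> None:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.001)  # stand-in for a network round trip
            in_flight -= 1

    # All 200 coroutines are scheduled at once, but only `limit` hold a slot at a time.
    await asyncio.gather(*(fake_call() for _ in range(n_tasks)))
    return peak


peak = asyncio.run(run_bounded(n_tasks=200, limit=12))
print(f"peak in-flight calls: {peak}")
```

Without the `async with semaphore` line, the same 200 tasks would all be in flight together, which is the fan-out the semaphore exists to prevent.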

Two execution shapes you will meet in the field: a gather-and-reduce model, where all chunks are dispatched and awaited together (simplest, highest peak memory), and an as-completed streaming model, where results are consumed and freed as each chunk finishes (lower memory, better for 1M+ SKU runs). The machinery below is the gather-and-reduce form; the streaming variant swaps asyncio.gather for asyncio.as_completed and is covered in Python async batch jobs for SKU tracking.
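As a rough illustration of the streaming shape, the sketch below consumes results with `asyncio.as_completed` and checkpoints each one as it lands. The `score` and `persist` callables are hypothetical stand-ins for the real chunk worker and checkpoint writer, and the sketch does no retry or error handling.

```python
import asyncio
from typing import Awaitable, Callable


async def stream_chunks(
    chunks: list[tuple[str, list[str]]],
    score: Callable[[str, list[str]], Awaitable[dict]],
    persist: Callable[[dict], Awaitable[None]],
    max_concurrency: int = 12,
) -> int:
    """Score chunks concurrently and persist each result as soon as it finishes."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(key: str, skus: list[str]) -> dict:
        async with semaphore:
            return await score(key, skus)

    tasks = [asyncio.create_task(bounded(key, skus)) for key, skus in chunks]
    committed = 0
    for next_done in asyncio.as_completed(tasks):
        result = await next_done
        await persist(result)  # checkpoint immediately; no reference to the result is kept
        committed += 1
    return committed


async def _demo() -> list[str]:
    seen: list[str] = []

    async def fake_score(key: str, skus: list[str]) -> dict:
        await asyncio.sleep(0.001 * len(skus))  # larger chunks take longer
        return {"key": key, "n": len(skus)}

    async def fake_persist(result: dict) -> None:
        seen.append(result["key"])

    chunks = [("a", ["s1", "s2", "s3"]), ("b", ["s4"]), ("c", ["s5", "s6"])]
    await stream_chunks(chunks, fake_score, fake_persist, max_concurrency=3)
    return seen


completion_order = asyncio.run(_demo())
```

The task objects are still created up front, so the memory saving comes from not holding every result at once. Because each chunk is persisted when it completes, a crash mid-run loses only the chunks still in flight.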

Figure: the async batch processor. The semaphore caps in-flight work, and the backpressure monitor sheds concurrency when the event loop falls behind.

Input Data Requirements

The processor consumes a flat list of SKU identifiers plus the connection parameters for the velocity-compute endpoint. The SKU set should already be normalized and validated upstream — this layer assumes the feed passed the contract enforced by Schema Validation for Inventory Feeds, so it does not re-validate field types, only handles transport-level faults. Each unit of work needs enough context to be independently retryable.

Field	Type	Precondition
`sku_id`	`str`	Non-null, unique within the snapshot; retired aliases resolved upstream
`window`	`str`	Rolling window label the compute endpoint understands (e.g. `rolling_30d`)
`run_id`	`str`	Stable per recalculation run; part of every idempotency key
`batch_size`	`int`	`500–20_000`; sized to compute-endpoint payload limits and worker memory
`max_concurrency`	`int`	`4–24`; must stay under the ERP/WMS documented TPS ceiling

from __future__ import annotations

import logging
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

logger = logging.getLogger("velocity.batch")


@dataclass(frozen=True)
class BatchConfig:
    """Immutable run parameters shared by every chunk in a recalculation."""
    run_id: str
    endpoint: str = "/api/v1/velocity/compute"  # relative path: give the session a base_url or use an absolute URL
    window: str = "rolling_30d"
    batch_size: int = 5_000
    max_concurrency: int = 12
    max_retries: int = 3


@dataclass
class ChunkResult:
    """Outcome of one chunk — the atomic, idempotent unit of commit."""
    idempotency_key: str
    status: str                       # "ok" | "failed"
    profiles: list[dict[str, Any]] = field(default_factory=list)
    error: str | None = None
    computed_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
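The preconditions in the table can be checked before a run starts. The following is a small sketch, assuming the ranges listed above are the ones you want to enforce; `check_run_params` is a hypothetical helper and is kept separate from `BatchConfig` so unit tests can still use small batch sizes.

```python
def check_run_params(batch_size: int, max_concurrency: int, sku_ids: list[str]) -> list[str]:
    """Return a list of human-readable precondition violations (empty means OK)."""
    problems: list[str] = []
    if not 500 <= batch_size <= 20_000:
        problems.append(f"batch_size {batch_size} outside 500-20000")
    if not 4 <= max_concurrency <= 24:
        problems.append(f"max_concurrency {max_concurrency} outside 4-24")
    if any(not sku for sku in sku_ids):
        problems.append("snapshot contains a null or empty sku_id")
    if len(set(sku_ids)) != len(sku_ids):
        problems.append("snapshot contains duplicate sku_id values")
    return problems


print(check_run_params(5_000, 12, ["SKU1", "SKU2"]))
print(check_run_params(100, 40, ["SKU1", "SKU1", ""]))
```

Returning every violation at once, rather than raising on the first, lets the scheduler log the full list before it aborts the run.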

The quality gate that matters here is transport, not schema. A run polluted by silent 429 throttling (chunks that never complete) or by duplicate dispatch after a resume (chunks committed twice) produces confidently wrong velocity coefficients. Confirm the compute endpoint returns a stable response for a replayed idempotency key, and confirm the checkpoint store is durable, before scaling concurrency up.

Step-by-Step Implementation

The processor runs in four passes: split the SKU set into fixed-size chunks, dispatch them through a semaphore-bounded worker pool, absorb rate limits and transient faults with backoff inside each worker, then gather results and checkpoint each committed chunk by its idempotency key. Every pass is isolated so a single chunk failure never poisons the run.

1. Chunk the SKU Set into Bounded Work Units

Collapse the flat SKU list into fixed-size chunks. The chunk is the unit of retry, commit, and idempotency, so its size trades peak memory against per-request overhead: larger chunks amortize connection setup but raise the memory floor and lengthen the retry blast radius. Derive a deterministic idempotency key from the run, the window, and the chunk’s ordinal so a resumed run produces identical keys.

import hashlib


def build_chunks(sku_ids: list[str], cfg: BatchConfig) -> list[tuple[str, list[str]]]:
    """Split SKUs into fixed-size chunks, each tagged with a stable idempotency key."""
    chunks: list[tuple[str, list[str]]] = []
    for ordinal, start in enumerate(range(0, len(sku_ids), cfg.batch_size)):
        chunk = sku_ids[start:start + cfg.batch_size]
        raw = f"{cfg.run_id}:{cfg.window}:{ordinal}:{len(chunk)}"
        key = hashlib.sha256(raw.encode()).hexdigest()[:16]
        chunks.append((key, chunk))
    logger.info("run %s split into %d chunks of <=%d SKUs", cfg.run_id, len(chunks), cfg.batch_size)
    return chunks
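An ordinal-based key is only stable if a resumed run loads the snapshot in the same order with the same `batch_size`. Where that cannot be guaranteed, one alternative is to key each chunk by its membership instead. The sketch below is that variant, not the method used by `build_chunks`; `content_key` is a hypothetical name.

```python
import hashlib


def content_key(run_id: str, window: str, chunk: list[str]) -> str:
    """Key a chunk by the set of SKUs it contains rather than by its position."""
    digest = hashlib.sha256()
    digest.update(f"{run_id}:{window}".encode())
    for sku in sorted(chunk):      # sorting makes the key independent of order
        digest.update(b"\x00")     # separator so "AB"+"C" differs from "A"+"BC"
        digest.update(sku.encode())
    return digest.hexdigest()[:16]


same = content_key("r1", "rolling_30d", ["SKU2", "SKU1"])
also_same = content_key("r1", "rolling_30d", ["SKU1", "SKU2"])
different = content_key("r1", "rolling_30d", ["SKU1", "SKU3"])
print(same == also_same, same == different)
```

The trade-off is cost: hashing every SKU id is slower than hashing one short string, and a content key changes whenever chunk boundaries shift, so it still needs a fixed `batch_size`.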

2. Fetch and Score Chunks Under a Concurrency Ceiling

Each worker acquires a semaphore slot before touching the network, so no more than max_concurrency requests are ever in flight. The shared aiohttp.ClientSession reuses connections across chunks; its connector limits are the second guardrail behind the semaphore. Skipping a chunk whose idempotency key is already committed makes the whole pass safe to resume.

import aiohttp


async def score_chunk(
    session: aiohttp.ClientSession,
    semaphore: "asyncio.Semaphore",
    key: str,
    chunk: list[str],
    cfg: BatchConfig,
    committed: set[str],
) -> ChunkResult:
    """Score one chunk; skip if its idempotency key is already committed."""
    if key in committed:
        logger.debug("chunk %s already committed; skipping", key)
        return ChunkResult(idempotency_key=key, status="ok")
    async with semaphore:
        payload = {"skus": chunk, "window": cfg.window, "idempotency_key": key}
        return await _post_with_backoff(session, key, chunk, payload, cfg)

3. Absorb Rate Limits and Transient Faults with Backoff

Legacy ERP and WMS endpoints throttle bursty callers and drop idle sockets. The worker treats 429 as an explicit back-off signal and retries transient network faults with exponential backoff plus jitter, capping total attempts so a permanently failing chunk is routed to a dead-letter result rather than looping forever.

import asyncio
import random


async def _post_with_backoff(
    session: aiohttp.ClientSession,
    key: str,
    chunk: list[str],
    payload: dict[str, Any],
    cfg: BatchConfig,
) -> ChunkResult:
    """POST a chunk with capped exponential backoff on 429 and transient errors."""
    for attempt in range(cfg.max_retries + 1):
        try:
            async with session.post(cfg.endpoint, json=payload) as resp:
                if resp.status == 429:
                    if attempt == cfg.max_retries:
                        break  # no attempts left; skip the pointless final sleep
                    delay = min(2 ** attempt * 1.5, 30) + random.uniform(0, 1)
                    logger.warning("chunk %s rate-limited; backing off %.1fs", key, delay)
                    await asyncio.sleep(delay)
                    continue
                resp.raise_for_status()
                body = await resp.json()
                logger.info("chunk %s scored %d SKUs", key, len(chunk))
                return ChunkResult(key, "ok", profiles=body["profiles"])
        except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
            if attempt == cfg.max_retries:
                logger.error("chunk %s failed permanently: %s", key, exc)
                return ChunkResult(key, "failed", error=str(exc))
            delay = 2 ** attempt + random.uniform(0, 0.5)
            logger.warning("chunk %s transient error (%s); retry %d in %.1fs",
                           key, exc, attempt + 1, delay)
            await asyncio.sleep(delay)
    return ChunkResult(key, "failed", error="retries exhausted")
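The delay arithmetic is easier to reason about in isolation. The sketch below reproduces the 429 schedule used above (1.5 s doubling per attempt, capped at 30 s, plus up to 1 s of jitter). As an extension, it prefers a numeric `Retry-After` value when one is supplied; whether your endpoint sends that header is an assumption, and the HTTP-date form is deliberately not parsed here.

```python
from __future__ import annotations

import random
from typing import Callable


def backoff_delay(
    attempt: int,
    retry_after: str | None = None,
    base: float = 1.5,
    cap: float = 30.0,
    jitter: float = 1.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Seconds to wait before the next attempt after a 429."""
    if retry_after is not None:
        try:
            seconds = float(retry_after)
        except ValueError:
            seconds = None  # HTTP-date form: fall back to exponential backoff
        if seconds is not None and seconds >= 0:
            return min(seconds, cap)
    return min(base * 2 ** attempt, cap) + rng() * jitter


# Deterministic part of the schedule for attempts 0..3 (jitter disabled).
schedule = [backoff_delay(a, rng=lambda: 0.0) for a in range(4)]
print(schedule)  # [1.5, 3.0, 6.0, 12.0]
```

With `max_retries=3`, a chunk that is throttled on every attempt waits roughly 10.5 s plus jitter across its three sleeps before it is dead-lettered.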

4. Gather Results and Checkpoint Idempotently

The orchestrator wires the passes together: build chunks, open a pooled session, dispatch every chunk through the semaphore, and gather results. Successful chunks are persisted to the checkpoint store keyed by idempotency key so a re-run skips them; failed chunks are surfaced for the dead-letter queue. The connector limits are tuned so total sockets stay under the endpoint’s ceiling even at full concurrency.

from typing import Callable, Awaitable


async def run_recalculation(
    sku_ids: list[str],
    cfg: BatchConfig,
    load_committed: Callable[[str], Awaitable[set[str]]],
    persist: Callable[[ChunkResult], Awaitable[None]],
) -> list[ChunkResult]:
    """Recalculate velocity for every SKU under bounded concurrency, checkpointing each chunk."""
    committed = await load_committed(cfg.run_id)
    chunks = build_chunks(sku_ids, cfg)
    semaphore = asyncio.Semaphore(cfg.max_concurrency)
    timeout = aiohttp.ClientTimeout(total=30, connect=10)
    connector = aiohttp.TCPConnector(limit=cfg.max_concurrency * 2,
                                     limit_per_host=cfg.max_concurrency)

    async with aiohttp.ClientSession(timeout=timeout, connector=connector) as session:
        tasks = [
            score_chunk(session, semaphore, key, chunk, cfg, committed)
            for key, chunk in chunks
        ]
        results = await asyncio.gather(*tasks, return_exceptions=True)

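    final: list[ChunkResult] = []
    # gather() preserves input order, so each result lines up with its chunk key.
    for (key, _), res in zip(chunks, results):
        if isinstance(res, BaseException):
            # Surface crashes as failed chunks so they reach the dead-letter path.
            logger.error("chunk %s task crashed: %r", key, res)
            final.append(ChunkResult(key, "failed", error=repr(res)))
            continue
        if res.status == "ok" and res.idempotency_key not in committed:
            await persist(res)
        final.append(res)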

    ok = sum(1 for r in final if r.status == "ok")
    logger.info("run %s complete: %d/%d chunks ok", cfg.run_id, ok, len(chunks))
    return final
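The orchestrator takes `load_committed` and `persist` as injected callables and does not say what backs them. One possible backing store is SQLite, sketched below with the standard library only. The class name, table layout, and the `velocity` field in the demo payload are all hypothetical; blocking database calls are pushed to a worker thread so they do not stall the event loop.

```python
import asyncio
import json
import sqlite3
import tempfile
from contextlib import closing
from pathlib import Path


class SqliteCheckpointStore:
    """Hypothetical durable checkpoint store keyed by (run_id, idempotency_key)."""

    def __init__(self, path: str) -> None:
        self._path = path
        with closing(sqlite3.connect(self._path)) as conn, conn:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS checkpoints ("
                " run_id TEXT NOT NULL,"
                " idempotency_key TEXT NOT NULL,"
                " profiles TEXT NOT NULL,"
                " PRIMARY KEY (run_id, idempotency_key))"
            )

    def _load(self, run_id: str) -> set[str]:
        with closing(sqlite3.connect(self._path)) as conn:
            rows = conn.execute(
                "SELECT idempotency_key FROM checkpoints WHERE run_id = ?", (run_id,)
            ).fetchall()
        return {row[0] for row in rows}

    def _save(self, run_id: str, key: str, profiles: list[dict]) -> None:
        with closing(sqlite3.connect(self._path)) as conn, conn:
            # INSERT OR IGNORE makes a replayed key a no-op instead of a duplicate row.
            conn.execute(
                "INSERT OR IGNORE INTO checkpoints VALUES (?, ?, ?)",
                (run_id, key, json.dumps(profiles)),
            )

    async def load_committed(self, run_id: str) -> set[str]:
        return await asyncio.to_thread(self._load, run_id)

    async def persist(self, run_id: str, key: str, profiles: list[dict]) -> None:
        await asyncio.to_thread(self._save, run_id, key, profiles)


async def _demo(db_path: str) -> set[str]:
    store = SqliteCheckpointStore(db_path)
    payload = [{"sku_id": "SKU1", "velocity": 4.2}]  # illustrative profile shape
    await store.persist("r1", "abc123", payload)
    await store.persist("r1", "abc123", payload)  # replay: ignored by the primary key
    return await store.load_committed("r1")


with tempfile.TemporaryDirectory() as tmp:
    committed_keys = asyncio.run(_demo(str(Path(tmp) / "checkpoints.db")))
print(committed_keys)
```

To plug this into `run_recalculation`, pass `store.load_committed` directly and wrap the writer so it matches the single-argument signature, for example `persist=lambda res: store.persist(cfg.run_id, res.idempotency_key, res.profiles)`.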

Scored profiles leave this runner and enter the assignment layer described in Location Assignment & ABC Classification Algorithms, where each SKU is ranked against candidate locations under weight, volume, and zone constraints before any physical move is authorized.

Tuning & Calibration

Every knob above is a lever, and the two that move outcomes most are max_concurrency and batch_size. Concurrency too high trips the ERP rate limiter and inflates retry storms; too low leaves the window under-utilized and the recalculation runs long. Batch size too large spikes worker memory and lengthens the retry blast radius; too small drowns the run in per-request overhead. Set them against the compute endpoint’s measured TPS ceiling and the worker’s memory budget, then hold them stable — these are infrastructure parameters, not per-run tuning.

# async_velocity.yaml — one profile per facility / endpoint
run:
  window: rolling_30d        # demand window the compute endpoint scores over
  batch_size: 5000           # SKUs per chunk; raise for fewer, heavier requests
  max_concurrency: 12        # asyncio.Semaphore ceiling; keep under endpoint TPS
  max_retries: 3             # attempts per chunk before dead-lettering
schedule:
  full_recalc_cron: "0 2 * * *"   # nightly full run inside the maintenance window
  hot_recalc_hours: 4             # micro-batch cadence for top-velocity SKUs
  hot_sku_pct: 0.20               # fraction of catalog treated as "hot"
backpressure:
  loop_lag_ms_ceiling: 50    # shed concurrency when event-loop lag exceeds this
  max_queue_depth: 40        # pending chunks before the dispatcher pauses

# Equivalent Python config dict consumed by the orchestrator
ASYNC_VELOCITY = {
    "run": {
        "window": "rolling_30d", "batch_size": 5000,
        "max_concurrency": 12, "max_retries": 3,
    },
    "schedule": {
        "full_recalc_cron": "0 2 * * *", "hot_recalc_hours": 4, "hot_sku_pct": 0.20,
    },
    "backpressure": {"loop_lag_ms_ceiling": 50, "max_queue_depth": 40},
}
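The `backpressure` thresholds need something to measure against. One common way to estimate event-loop lag is to sleep for a known interval and see how late the wake-up arrives. The sketch below pairs that with a simple halve-on-overload, add-one-on-recovery rule. This is an additive-increase/multiplicative-decrease policy chosen for illustration, not necessarily what the processor uses, and the function names are hypothetical.

```python
import asyncio
import time


async def measure_loop_lag_ms(interval_s: float = 0.05) -> float:
    """Sleep for interval_s and report how late the wake-up was, in milliseconds."""
    start = time.perf_counter()
    await asyncio.sleep(interval_s)
    return max(0.0, (time.perf_counter() - start - interval_s) * 1000)


def next_concurrency(current: int, lag_ms: float, ceiling_ms: float = 50.0,
                     floor: int = 4, cap: int = 12) -> int:
    """Halve concurrency when lag exceeds the ceiling; otherwise recover one slot."""
    if lag_ms > ceiling_ms:
        return max(floor, current // 2)
    return min(cap, current + 1)


async def _demo() -> float:
    monitor = asyncio.create_task(measure_loop_lag_ms(0.01))
    await asyncio.sleep(0)  # let the monitor start its timed sleep
    time.sleep(0.12)        # deliberately block the loop, as a sync DB call would
    return await monitor


lag = asyncio.run(_demo())
print(f"measured lag: {lag:.0f} ms -> next concurrency: {next_concurrency(12, lag)}")
```

`asyncio.Semaphore` cannot be resized in place, so applying the new limit in practice means either a custom gate or pausing the dispatcher until lag recovers; that wiring is left out of this sketch.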

Two scheduling patterns pay off. Run a full-catalog recalculation nightly inside the maintenance window (02:00–04:00 local is the common slot, chosen so it never overlaps a picking wave or a receiving surge), and run a micro-batch every four hours over just the top 20% of SKUs by pick count. That two-tier cadence keeps high-impact tier assignments fresh without paying the full-catalog cost for the slow-mover tail. Align the cron to operational rhythm, not an arbitrary interval — recalculating tiers sub-hourly causes slotting thrash, where SKUs oscillate across boundaries and generate move tasks that cost more labor than the travel savings return.
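Selecting the hot set for the micro-batch is a ranking problem. A minimal sketch, assuming pick counts are available as a mapping from SKU id to count over the scoring window (`select_hot_skus` is a hypothetical helper):

```python
import math


def select_hot_skus(pick_counts: dict[str, int], hot_pct: float = 0.20) -> list[str]:
    """Return the top hot_pct of SKUs by pick count, ties broken by SKU id."""
    if not pick_counts:
        return []
    n_hot = max(1, math.ceil(len(pick_counts) * hot_pct))
    ranked = sorted(pick_counts, key=lambda sku: (-pick_counts[sku], sku))
    return ranked[:n_hot]


counts = {"A": 50, "B": 900, "C": 10, "D": 900, "E": 5,
          "F": 1, "G": 0, "H": 300, "I": 2, "J": 3}
hot = select_hot_skus(counts)
print(hot)  # ['B', 'D']
```

The deterministic tie-break matters: without it, two SKUs with equal pick counts could swap in and out of the hot set between runs for no operational reason.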

Validation & Testing

Never let a run ship without asserting its invariants. Three properties must hold: chunking covers every SKU exactly once, a replayed idempotency key does not double-commit, and a 429 response actually triggers a retry rather than a hard failure. The following pytest checks cover the first two, plus the stability of idempotency keys across rebuilds; the 429 path needs a stubbed session and is not covered by these three tests. They run in the recalculation job’s CI gate.

import asyncio

import pytest


def test_chunks_cover_every_sku_once() -> None:
    cfg = BatchConfig(run_id="r1", batch_size=100)
    skus = [f"SKU{i}" for i in range(250)]
    chunks = build_chunks(skus, cfg)
    flat = [s for _, chunk in chunks for s in chunk]
    assert flat == skus, "chunking must preserve order and coverage"
    assert len({k for k, _ in chunks}) == len(chunks), "idempotency keys must be unique"


def test_committed_chunk_is_skipped() -> None:
    cfg = BatchConfig(run_id="r1", batch_size=100)
    (key, chunk), *_ = build_chunks([f"SKU{i}" for i in range(100)], cfg)

    async def _run() -> ChunkResult:
        sem = asyncio.Semaphore(1)
        # session unused because the key is pre-committed and short-circuits.
        return await score_chunk(None, sem, key, chunk, cfg, committed={key})

    result = asyncio.run(_run())
    assert result.status == "ok" and result.profiles == []


def test_idempotency_key_is_stable_across_runs() -> None:
    cfg = BatchConfig(run_id="r1", batch_size=50)
    skus = [f"SKU{i}" for i in range(120)]
    assert [k for k, _ in build_chunks(skus, cfg)] == [k for k, _ in build_chunks(skus, cfg)]

A sample expected result for a healthy run: test_chunks_cover_every_sku_once passes with three chunks (100/100/50), test_committed_chunk_is_skipped returns an empty-profile ok result, and the orchestrator logs run r1 complete: N/N chunks ok. If coverage or key stability drifts, fix chunking before scaling concurrency — a duplicated or dropped chunk corrupts velocity silently.
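The 429 invariant can be checked without a network by scripting responses through a fake session. The sketch below is self-contained: `FakeSession`, `FakeResponse`, and `post_with_retry` are hypothetical stand-ins that mirror the shape of the worker loop with the sleep removed, so the retry logic is what is being exercised rather than the real `_post_with_backoff`.

```python
from __future__ import annotations

import asyncio


class FakeResponse:
    """Minimal async-context-manager response with a canned status and body."""

    def __init__(self, status: int, body: dict | None = None) -> None:
        self.status = status
        self._body = body or {}

    async def json(self) -> dict:
        return self._body

    def raise_for_status(self) -> None:
        if self.status >= 400:
            raise RuntimeError(f"HTTP {self.status}")

    async def __aenter__(self) -> FakeResponse:
        return self

    async def __aexit__(self, *exc_info: object) -> bool:
        return False


class FakeSession:
    """Returns one scripted response per post() call and counts the calls."""

    def __init__(self, responses: list[FakeResponse]) -> None:
        self._responses = list(responses)
        self.calls = 0

    def post(self, url: str, json: dict | None = None) -> FakeResponse:
        self.calls += 1
        return self._responses.pop(0)


async def post_with_retry(session: FakeSession, url: str, payload: dict,
                          max_retries: int = 3) -> dict | None:
    """Stand-in retry loop: retry on 429, return the body on success."""
    for _ in range(max_retries + 1):
        async with session.post(url, json=payload) as resp:
            if resp.status == 429:
                continue  # a real worker would sleep with backoff here
            resp.raise_for_status()
            return await resp.json()
    return None


def test_429_triggers_retry() -> None:
    ok_body = {"profiles": [{"sku_id": "SKU1"}]}
    session = FakeSession([FakeResponse(429), FakeResponse(429), FakeResponse(200, ok_body)])
    body = asyncio.run(post_with_retry(session, "/compute", {"skus": ["SKU1"]}))
    assert body == ok_body
    assert session.calls == 3


def test_429_exhausts_retries() -> None:
    session = FakeSession([FakeResponse(429)] * 4)
    assert asyncio.run(post_with_retry(session, "/compute", {}, max_retries=3)) is None
    assert session.calls == 4
```

To test the real worker instead, the same fake could be passed as `session` with `asyncio.sleep` patched to a no-op. Note that this fake raises `RuntimeError` rather than an `aiohttp` exception, so it would need adjusting to exercise the transient-error branch.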

Integration Points

The async processor is a throughput layer, not a data source or a decision-maker. It sits between three sibling systems, and each imposes a contract:

Upstream extraction. The SKU snapshot and its freshness come from the extractors described in WMS & ERP Polling Strategies. Trigger a recalculation on watermark advance rather than a blind cron so the batch scores current demand, not a stale window; a frozen extractor makes yesterday’s velocity masquerade as today’s.
Historical baselines. The rolling_30d window the compute endpoint scores against is only meaningful if sales history is normalized across channels, pack sizes, and promotions by Sales History Data Mapping. Unresolved unit-of-measure conversions upstream turn a single pallet SKU into a phantom hyper-mover no amount of concurrency will fix.
Contract enforcement. This layer assumes clean payloads because Schema Validation for Inventory Feeds already quarantined malformed records to a dead-letter queue at the ingestion boundary. Failed chunks here are transport failures, and they belong in a separate dead-letter path from schema failures so the two are diagnosed independently.

Downstream, the scored profiles feed the tier logic in ABC Classification Tuning for Warehouse Slotting, which converts the coefficients this processor computes into committed A/B/C classes under a hysteresis gate.

Failure Modes & Edge Cases

Unbounded fan-out exhausting the connection pool. A plain asyncio.gather over the full SKU set opens more sockets than the pool allows and trips the ERP rate limiter. Remediation: gate every worker behind the asyncio.Semaphore and cap TCPConnector(limit=...) so total sockets stay under the endpoint TPS ceiling even at peak.
Duplicate velocity updates after a resume. Re-running a partially failed job without idempotency re-commits already-processed chunks and skews tier boundaries. Remediation: derive a stable idempotency key per chunk and skip keys already in the checkpoint store, as score_chunk does.
Silent 429 throttling read as slow demand. Chunks that quietly fail to complete drop SKUs from the run, and those SKUs read as zero-velocity downstream. Remediation: treat 429 as an explicit back-off-and-retry signal, and alert when a run’s ok chunk count is below 100%.
Event-loop starvation from a blocking call. A synchronous DB driver or CPU-heavy transform inside a coroutine stalls the whole loop and inflates every chunk’s latency. Remediation: keep workers pure-async, push CPU-bound scoring to the compute endpoint or a process pool, and watch loop_lag_ms_ceiling.
Unbounded task-list memory on 1M+ SKU runs. Accumulating every task and result in memory before the reduce spikes RSS and triggers GC pauses that stall the loop. Remediation: switch to the asyncio.as_completed streaming variant that consumes and frees each result as it lands.
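Several of these failure modes surface as SKUs that silently go missing from the output. A completeness check after the reduce catches that before the assignment layer reads a gap as zero velocity. This is a sketch only: it assumes each returned profile carries a `sku_id` field, which the response schema shown here does not confirm.

```python
def unscored_skus(expected: list[str], profiles: list[dict],
                  id_field: str = "sku_id") -> list[str]:
    """Return expected SKUs that have no profile in the run output, in input order."""
    scored = {profile[id_field] for profile in profiles}
    return [sku for sku in expected if sku not in scored]


expected = ["SKU1", "SKU2", "SKU3", "SKU4"]
profiles = [{"sku_id": "SKU1"}, {"sku_id": "SKU3"}]
missing = unscored_skus(expected, profiles)
if missing:
    print(f"ALERT: {len(missing)} of {len(expected)} SKUs unscored: {missing}")
```

Running this against the flattened profiles of all `ok` chunks gives a per-SKU view that complements the per-chunk `ok` count in the orchestrator log.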

FAQ

How many concurrent workers should the processor run?

Start well below the compute endpoint’s documented TPS ceiling — 8 to 12 in-flight chunks is a safe default for most enterprise WMS APIs — and raise it only while watching the 429 rate and event-loop lag. The asyncio.Semaphore bound and the TCPConnector limit are two independent guardrails; keep both set so neither alone can flood the endpoint. If you see sustained rate limiting, lower max_concurrency before you touch anything else.

How do I make a recalculation safe to resume after a crash?

Chunk the run and give each chunk a stable idempotency key derived from the run id, window, and chunk ordinal. Persist each successful chunk to a durable checkpoint store keyed by that key, and load the committed set at the start of every run so score_chunk skips work that already landed. Because the key is deterministic, a resumed run produces identical keys and never double-commits velocity.

Should I use gather-and-reduce or as-completed streaming?

Use gather-and-reduce for runs that fit comfortably in memory — it is simpler and the whole result set is available at once for the downstream reduce. Switch to asyncio.as_completed streaming when the SKU population is large enough (roughly 1M+ SKUs) that holding every chunk result in memory spikes RSS and provokes GC pauses. The chunking, backoff, and idempotency logic is identical; only the reduce changes.

How often should velocity be recalculated?

Run a full-catalog recalculation nightly inside the maintenance window, and a micro-batch every four hours over the top 20% of SKUs by pick count. That keeps high-impact tiers fresh without paying full-catalog cost for slow movers. Avoid sub-hourly full recalculation entirely — it churns tier boundaries and floods material handlers with move tasks whose labor exceeds the travel savings.

Why not just run velocity computation in real time per transaction?

Because per-transaction recomputation triggers a flood of relocation tasks and continuously hammers rate-limited WMS APIs. Batch processing aligns naturally with operational rhythm: scoring runs in off-peak windows and consolidated directives ship at shift changes. Pair a nightly batch with near-real-time delta ingestion for genuine exceptions, and let the batch own the heavy taxonomy recalculation.

Python async batch jobs for SKU tracking — the as-completed streaming variant with tenacity retries and per-record schema validation.
WMS & ERP Polling Strategies — watermark extraction that produces and triggers the snapshot this processor scores.
Sales History Data Mapping — the normalized baselines the rolling window depends on.
Schema Validation for Inventory Feeds — the contract that guarantees clean payloads before this layer runs.
Velocity Data Ingestion & WMS Sync Pipelines — the parent architecture this execution layer feeds.