Distributed Execution Layer:
Partitioned Log & Consumer Groups
Scaling a single-node thread pool is not enough once the workload outgrows a monolithic design. Moving to a distributed queue architecture (similar to Kafka or RabbitMQ) introduces complexity in three areas: node coordination, message ordering, and idempotency guarantees.
By shifting from synchronous monolithic API calls to distributed partitioned queueing, the system buffers traffic spikes in the queue instead of dropping payloads: producers enqueue at burst rate while consumers drain the backlog at their own pace.
If a consumer worker crashes mid-process, its unacknowledged (un-ACKed) messages return to the queue and are reassigned for retry, which gives At-Least-Once delivery.
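The redelivery behaviour can be sketched with a small in-memory queue. This is only an illustration under stated assumptions: `AckQueue`, `receive`, `ack`, and `requeue_unacked` are hypothetical names, not the API of Kafka, RabbitMQ, or this system, and a real broker would detect a dead consumer through heartbeats or connection loss rather than an explicit call.

```python
from collections import deque


class AckQueue:
    """Toy single-process queue that redelivers messages that were never ACKed."""

    def __init__(self):
        self._ready = deque()    # messages waiting for a consumer
        self._in_flight = {}     # delivery_id -> payload, handed out but not ACKed
        self._next_id = 0

    def publish(self, payload):
        self._ready.append((self._next_id, payload))
        self._next_id += 1

    def receive(self):
        """Hand the next message to a consumer and track it as in flight."""
        if not self._ready:
            return None
        delivery_id, payload = self._ready.popleft()
        self._in_flight[delivery_id] = payload
        return delivery_id, payload

    def ack(self, delivery_id):
        """The consumer finished the work; forget the message."""
        self._in_flight.pop(delivery_id, None)

    def requeue_unacked(self):
        """Simulate the broker noticing a dead consumer: return its messages."""
        # Push back in reverse id order so the oldest message ends up first.
        for delivery_id in sorted(self._in_flight, reverse=True):
            self._ready.appendleft((delivery_id, self._in_flight.pop(delivery_id)))
```

A message is removed for good only after `ack`. Anything still in flight when the consumer dies is delivered again, which is why a payload can be executed more than once.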
Architecture: The Partitioned Log
Critical Concurrency Concepts
1. Partitioning for Scale vs. Order
In a single FIFO queue, maintaining strict order implies globally single-threaded execution, which kills throughput. To scale, we partition the queue. By hashing a routing key (e.g., user_id), we guarantee that all events for a specific user land in the same partition. This retains strict chronological ordering per entity while allowing N distinct partitions to be processed entirely in parallel. Ordering across different partitions is not preserved.
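A minimal sketch of key-based routing, assuming a fixed partition count and events shaped as dictionaries with a `user_id` field. `NUM_PARTITIONS`, `partition_for`, and `route` are hypothetical names chosen for illustration, and SHA-256 stands in for whatever hash the real producer uses.

```python
import hashlib

NUM_PARTITIONS = 8  # assumed value; the real partition count is not stated


def partition_for(routing_key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a routing key to a partition index with a stable hash."""
    # Python's built-in hash() is salted per process for strings, so it would
    # route the same key differently on different producers. Use a digest.
    digest = hashlib.sha256(routing_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def route(events, num_partitions: int = NUM_PARTITIONS):
    """Append each event to the partition chosen by its user_id."""
    partitions = [[] for _ in range(num_partitions)]
    for event in events:
        partitions[partition_for(event["user_id"], num_partitions)].append(event)
    return partitions
```

Because the mapping depends on the partition count, changing that count moves keys to different partitions. A scheme like this keeps per-user ordering only while the count stays fixed.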
2. Idempotency Guarantees
Distributed queues of this kind typically provide At-Least-Once delivery by default. Network partitions or worker OOMs mean that a worker will eventually execute the same payload twice. The execution logic must therefore be idempotent: applying the same message twice must leave the same state as applying it once. We implemented logical idempotency keys (tx_hash) that are checked against a distributed cache (e.g., Redis) before any state-mutating database write is initiated.
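The check-before-write step could look roughly like the sketch below. To stay runnable without a server, it uses an in-memory stand-in for the distributed cache. `InMemoryKeyStore`, `set_if_absent`, and `apply_once` are hypothetical names, and the message shape (`tx_hash`, `account`, `amount`) is assumed. With Redis, the equivalent atomic primitive would be a SET with the NX option.

```python
import threading


class InMemoryKeyStore:
    """Stand-in for a distributed cache with an atomic set-if-absent operation."""

    def __init__(self):
        self._seen = set()
        self._lock = threading.Lock()

    def set_if_absent(self, key: str) -> bool:
        """Return True if the key was newly claimed, False if it already existed."""
        with self._lock:
            if key in self._seen:
                return False
            self._seen.add(key)
            return True


def apply_once(message: dict, store: InMemoryKeyStore, ledger: dict) -> bool:
    """Apply a state-mutating message only if its tx_hash has not been seen."""
    if not store.set_if_absent(message["tx_hash"]):
        return False  # duplicate delivery: skip the write
    account = message["account"]
    ledger[account] = ledger.get(account, 0) + message["amount"]
    return True
```

One caveat with this ordering: if the worker crashes after claiming the key but before the write completes, the retry is skipped and the write is lost. A fuller design would need to cover that window, for example by expiring the claim or by recording the key in the same transaction as the write.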
3. Offset Management & The Two-Phase Problem
If a worker writes to the database and then crashes before committing its partition offset back to the broker, a new worker will inherit the partition and replay the transaction. We handle this with strict transactional boundaries: where possible, the offset update and the database write occur within the same transaction, so either both are persisted or neither is. Where that is not possible, we rely wholly on the idempotency layer above.
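One way to keep the offset and the write in a single transactional context is to store the offset in the same database as the business data. The sketch below uses SQLite purely for illustration. The table names (`ledger`, `offsets`) and the helpers `open_store`, `committed_offset`, and `process` are assumptions, not the system's actual schema.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) business table and the offsets table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ledger (tx_hash TEXT PRIMARY KEY, amount INTEGER)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS offsets "
        "(partition_id INTEGER PRIMARY KEY, next_offset INTEGER)"
    )
    conn.commit()
    return conn


def committed_offset(conn: sqlite3.Connection, partition_id: int) -> int:
    """Return the next offset to process for a partition (0 if none stored)."""
    row = conn.execute(
        "SELECT next_offset FROM offsets WHERE partition_id = ?", (partition_id,)
    ).fetchone()
    return row[0] if row else 0


def process(conn: sqlite3.Connection, partition_id: int, offset: int, message: dict) -> bool:
    """Apply the message and advance the offset atomically; skip replays."""
    if offset < committed_offset(conn, partition_id):
        return False  # already applied before a crash or rebalance
    # "with conn" commits both statements together, or rolls both back on error.
    with conn:
        conn.execute(
            "INSERT INTO ledger (tx_hash, amount) VALUES (?, ?)",
            (message["tx_hash"], message["amount"]),
        )
        conn.execute(
            "INSERT OR REPLACE INTO offsets (partition_id, next_offset) VALUES (?, ?)",
            (partition_id, offset + 1),
        )
    return True
```

A worker that inherits the partition would start from `committed_offset` rather than from the broker's view, so a crash between the write and the broker commit cannot cause a second application of the same message.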