Kanishk Agarwal

Legacy Benchmark

7.5s

P99 Latency under Load

Polling (1 req/sec per client) generated thousands of short-lived TCP handshakes, crippling the load balancer via ephemeral port exhaustion.

Optimized Benchmark

0.8s

P99 Event Propagation

Transitioning to persistent WebSockets eliminated per-update HTTP request parsing on the server, trading CPU cycles for the RAM held by long-lived connections.

1. The Problem Space

In a high-throughput booking engine, clients need immediate confirmation of their order. When 10,000 Concurrent Users (CCU) wait for a background worker to finish processing, they inherently demand real-time state.

2. Why Polling Mathematically Fails

A naive implementation uses short polling (HTTP GET every 1s).

Load Calculation:
- 10,000 CCU * 1 req/sec = 10,000 RPS
- TCP Overhead: a SYN, SYN-ACK, ACK handshake for every new connection (assuming connections are not reused via keep-alive).
- Result: 99% of requests hit the database just to verify status == "PENDING". This creates a Thundering Herd that exhausts DB connection pools, while closed sockets lingering in TIME_WAIT tie up ephemeral ports.
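
The load calculation above can be reproduced as a short back-of-the-envelope script. This is only a sketch: the function name polling_load and its constants are illustrative, and it assumes one fresh TCP connection per poll (no keep-alive).

```python
# Back-of-the-envelope load for short polling, using the figures quoted above.
CCU = 10_000               # concurrent users (from the text)
POLLS_PER_SEC = 1          # requests per second per client (from the text)
HANDSHAKE_SEGMENTS = 3     # SYN, SYN-ACK, ACK
PENDING_PERCENT = 99       # share of polls that only see "PENDING" (from the text)


def polling_load(ccu: int, polls_per_sec: int) -> dict:
    """Return the aggregate per-second cost of short polling."""
    rps = ccu * polls_per_sec
    return {
        "requests_per_sec": rps,
        # Assumption: every poll opens a new TCP connection.
        "handshake_segments_per_sec": rps * HANDSHAKE_SEGMENTS,
        # Polls that reach the database only to learn nothing has changed.
        "wasted_db_reads_per_sec": rps * PENDING_PERCENT // 100,
    }


load = polling_load(CCU, POLLS_PER_SEC)
print(load)
```

Under these assumptions, 10,000 RPS translates into 30,000 handshake segments and 9,900 no-op database reads every second.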

3. Design Decision & Alternatives

Long Polling: Cheaper on TCP, but vulnerable to connection drops and dependent on complicated timeout logic.
Server-Sent Events (SSE): Well suited to unidirectional (Server → Client) streams, but often runs into browser limits (browsers typically cap HTTP/1.1 at 6 connections per domain).
WebSockets (Chosen): Full-duplex persistent connection. Replaces roughly 800 bytes of HTTP headers per request with a ~2-10 byte frame header on server-to-client frames (client-to-server frames add a 4-byte masking key).
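
To show where the 2-10 byte figure comes from, here is a minimal sketch of how an unmasked server-to-client frame header is laid out under RFC 6455. The function name server_frame_header is illustrative, not part of any library.

```python
import struct


def server_frame_header(payload_len: int, opcode: int = 0x1) -> bytes:
    """Build the header of an unmasked (server-to-client) WebSocket frame."""
    first = 0x80 | opcode  # FIN bit set, opcode 0x1 = text frame
    if payload_len < 126:
        # Length fits in the 7-bit field: 2-byte header.
        return struct.pack("!BB", first, payload_len)
    if payload_len < 65536:
        # Marker 126 followed by a 16-bit length: 4-byte header.
        return struct.pack("!BBH", first, 126, payload_len)
    # Marker 127 followed by a 64-bit length: 10-byte header.
    return struct.pack("!BBQ", first, 127, payload_len)


for size in (20, 1_000, 70_000):
    print(size, len(server_frame_header(size)))
```

A small status update such as {"status": "CONFIRMED"} fits in the first branch, so it travels with only 2 bytes of framing.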

4. System Architecture

Shifting from stateless REST to a Stateful architecture requires an isolation layer. You cannot route WebSocket traffic directly into your backend API without compromising deployments: every API release would then sever the open sockets held by those processes.

5. Core Challenges

State Consistency & The Split-Brain

If Node 1 holds a user's WebSocket, but the background worker commits the booking on Node 2, Node 1 doesn't know to push the update.
Solution: A Redis Pub/Sub Backplane. The worker publishes the event to Redis. All WS nodes subscribe to Redis. Node 1 sees the event, detects it holds the socket for that user, and fires the frame.
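
The routing logic can be sketched without a real Redis instance. In the example below, Backplane is a hypothetical in-memory stand-in for a Redis Pub/Sub channel, and WsNode, connect, and on_event are illustrative names; a list of strings stands in for frames written to a socket.

```python
class Backplane:
    """In-memory stand-in for a Redis Pub/Sub channel (assumption, not Redis itself)."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, event):
        # Like Pub/Sub, every subscriber receives every event.
        for callback in self.subscribers:
            callback(event)


class WsNode:
    """A WebSocket node that only pushes events for sockets it holds."""

    def __init__(self, name, backplane):
        self.name = name
        self.sockets = {}  # user_id -> list of frames "sent" to that user
        backplane.subscribe(self.on_event)

    def connect(self, user_id):
        self.sockets[user_id] = []

    def on_event(self, event):
        outbox = self.sockets.get(event["user_id"])
        if outbox is not None:  # ignore events for users on other nodes
            outbox.append(event["status"])


backplane = Backplane()
node1 = WsNode("node-1", backplane)
node2 = WsNode("node-2", backplane)
node1.connect("user-42")

# The worker (wherever it runs) publishes once; only node-1 delivers.
backplane.publish({"user_id": "user-42", "status": "CONFIRMED"})
print(node1.sockets, node2.sockets)
```

Every node still receives the event, which is the fan-out cost that matters at larger node counts.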

Backpressure Handling

What if a client connects from a 3G network and cannot drain the TCP buffer fast enough? The server's OS memory starts filling up buffering outbound frames.
Solution: Application-level dropping. We implement a ring-buffer per socket. If the user's outbound queue exceeds 5MB, we drop non-critical "intermediate" status updates and only send the final state, effectively discarding stale ticks.
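
A minimal sketch of that shedding policy, assuming each queued frame is tagged as intermediate or final. The class name OutboundQueue and its fields are hypothetical, and the queue is modeled as a byte-budgeted deque rather than a fixed-slot ring buffer.

```python
from collections import deque


class OutboundQueue:
    """Per-socket outbound queue that sheds stale updates when over budget."""

    def __init__(self, max_bytes: int = 5 * 1024 * 1024):
        self.max_bytes = max_bytes
        self.frames = deque()  # items are (payload, is_final)
        self.size = 0          # total queued payload bytes

    def enqueue(self, payload: bytes, is_final: bool = False) -> None:
        self.frames.append((payload, is_final))
        self.size += len(payload)
        if self.size > self.max_bytes:
            self._shed_intermediate()

    def _shed_intermediate(self) -> None:
        # Keep only final-state frames; intermediate ticks are stale by now.
        self.frames = deque(f for f in self.frames if f[1])
        self.size = sum(len(payload) for payload, _ in self.frames)


# Tiny budget so the behavior is visible without queuing megabytes.
q = OutboundQueue(max_bytes=10)
q.enqueue(b"tick1")
q.enqueue(b"tick2")
q.enqueue(b"DONE", is_final=True)  # pushes the queue over budget
print(list(q.frames), q.size)
```

A real server would also pop frames from the left as the socket becomes writable; that drain loop is omitted here.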

6. Trade-offs Embraced

Engineering is about pain selection. By picking WebSockets, I chose the pain of Stateful Deployments. When we push a new Docker image, terminating a stateless Node.js REST server is easy. Terminating a WebSocket node severs every connection it holds at once (up to 10,000 here), causing a reconnection tsunami (Thundering Herd 2.0).

We traded easy horizontal scalability for lower latency, and mitigated the reconnection storm by implementing Jittered Backoff on the client side, so that disconnected clients retry at randomized, growing intervals instead of all at once.
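
One common way to implement jittered backoff is the "full jitter" variant, sketched below. The text does not say which variant was used, so the variant and the base and cap values are assumptions.

```python
import random


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  rng=random) -> float:
    """Full-jitter delay in seconds: uniform in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


# Each client draws its own delay, so reconnects spread out over the window.
rng = random.Random(0)  # seeded only to make the demo repeatable
for attempt in range(5):
    print(attempt, round(backoff_delay(attempt, rng=rng), 3))
```

Because every client samples independently, 10,000 dropped connections turn into a spread of retries across the window rather than a single spike.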

7. Future Improvements

As the system scales beyond 1M CCU, Redis Pub/Sub becomes a bottleneck (it broadcasts every event to every subscribed node, so fan-out traffic grows O(N) with the node count). The logical next step is migrating the backplane from Redis Pub/Sub to a partitioned Apache Kafka topology, where region-keyed partitions and Consumer Groups let each WS node consume only the traffic for its geography instead of filtering everything itself.
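
The difference between broadcast and partitioned delivery can be illustrated with a small routing sketch. Everything here is hypothetical: the region keys, node names, and one-partition-per-node assignment are made up, clients are assumed to be pinned to nodes by region, and crc32 stands in for Kafka's default key hashing (murmur2 in the Java client).

```python
import zlib

NUM_PARTITIONS = 4
NODES = ["ws-0", "ws-1", "ws-2", "ws-3"]

# Assumption: each WS node is assigned exactly one partition.
PARTITION_OWNER = {p: NODES[p] for p in range(NUM_PARTITIONS)}


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a message key to a partition (crc32 used as a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def receivers_broadcast() -> list:
    # Pub/Sub style: every node receives every event.
    return list(NODES)


def receivers_partitioned(region: str) -> list:
    # Partitioned style: only the owner of the key's partition receives it.
    return [PARTITION_OWNER[partition_for(region)]]


print(len(receivers_broadcast()), len(receivers_partitioned("eu-west")))
```

With N nodes, a broadcast event costs N deliveries while a keyed event costs one, which is the reduction the migration is after.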

Context

System Design: Scaling State
(Polling vs WebSockets)
