Kanishk

Technical Expertise & Systems Experience

Focused on building high-throughput, low-latency infrastructure and distributed systems. Expertise in concurrency, memory management, and scalable backend architectures.

Systems I've Built / Worked On

Architectural challenges and high-scale implementations.

High-Throughput Order Matching Engine

Simulated/Benchmarked Locally (32-core AMD)

Problem & Constraints

Real-time order matching requires deterministic execution and minimal latency. Standard concurrent queues introduced unacceptable lock contention.

Required sub-millisecond P99 latency while processing 10k transactions per second (TPS), with no pauses for garbage collection.

Architecture Decisions

Built entirely in C++20. Chose a single-threaded execution loop for the core matching engine, offloading I/O and risk checks to a custom lock-free work-stealing thread pool.

Trade-Offs

Sacrificed horizontal scalability of the core matching engine for the lowest possible single-core latency. Scaling beyond single-core limits requires sharding by trading pair.

Failure Handling & Debugging

Failure Mode: Implemented a journaling system that writes each order to an append-only log on disk before acking to the client, allowing full state reconstruction after a crash without distributed consensus overhead.
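The journal-before-ack ordering can be sketched roughly as follows. This is an illustration in Python rather than the engine's C++; the `Journal` class, the JSON-lines record format, and the file name are all assumptions made for the example, not the actual implementation.

```python
import json
import os
import tempfile


class Journal:
    """Append-only log: a record is on disk before the caller acks."""

    def __init__(self, path):
        self.path = path
        # Append mode never overwrites earlier records.
        self._file = open(path, "a", encoding="utf-8")

    def append(self, record):
        self._file.write(json.dumps(record) + "\n")
        self._file.flush()
        # fsync asks the OS to push the write through to the storage device.
        os.fsync(self._file.fileno())

    def close(self):
        self._file.close()

    @staticmethod
    def replay(path):
        """Rebuild state after a crash by reading every record in order."""
        with open(path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


def handle_order(journal, order):
    journal.append(order)            # 1. persist first
    return {"ack": order["id"]}      # 2. only then acknowledge the client


path = os.path.join(tempfile.mkdtemp(), "orders.journal")
journal = Journal(path)
acks = [handle_order(journal, {"id": i, "side": "buy", "qty": 10}) for i in range(3)]
journal.close()
recovered = Journal.replay(path)
```

Because the ack is only returned after `append` completes, every acknowledged order is present in the log that `replay` reads back.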

Insights: Used perf and flamegraphs to identify cache-line bouncing (false sharing) between worker threads. Padded critical atomic structs to align with 64-byte cache lines.

Measured Outcome & Impact

Technical Outcome: Achieved sustained 10k TPS with P99 latency of 0.8ms in local benchmarks.

Business Impact: Enabled highly competitive, sub-millisecond market execution capable of handling extreme trading volume spikes without degradation.

Distributed Task Scheduler

Deployed to limited AWS EC2 Cluster

Problem & Constraints

Background jobs were being dropped or duplicated during worker node deployments or unpredictable traffic spikes.

Must guarantee at-least-once delivery for 500k+ jobs/day. Needed to support job retries with exponential backoff without overloading the DB.

Architecture Decisions

Re-architected the pipeline using Python and a Redis-backed queue for fast ingestion, with a PostgreSQL persistent store for job metadata and audit logs.

Trade-Offs

Chose at-least-once delivery over exactly-once, forcing all downstream job consumers to implement strict idempotency. This increased consumer complexity but vastly simplified the scheduler's scaling.

Failure Handling & Debugging

Failure Mode: Integrated a circuit breaker pattern on outgoing webhook calls. If an external API degraded, the scheduler applied backpressure and temporarily halted dispatching specific job types.
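A minimal sketch of the circuit breaker idea, assuming one breaker instance per job type. The class name, the thresholds, and the `dispatch` helper are hypothetical; the injected `clock` only exists so the behaviour can be exercised without real waiting.

```python
import time


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures; allows calls again after `reset_after` seconds."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # After the cool-down, trial calls are allowed again (half-open).
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            # (Re)start the cool-down window.
            self.opened_at = self.clock()


def dispatch(breaker, send, job):
    """Return 'sent', 'failed', or 'deferred' (breaker open, job stays queued)."""
    if not breaker.allow():
        return "deferred"
    try:
        send(job)
    except Exception:
        breaker.record_failure()
        return "failed"
    breaker.record_success()
    return "sent"
```

Returning "deferred" instead of calling the degraded API is what produces the backpressure: jobs of that type stay queued until the cool-down elapses and a trial call succeeds.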

Insights: Diagnosed a recurring Redis OOM issue by identifying unbounded retry loops. Enforced a hard limit on max retries and moved dead jobs to a persistent Dead Letter Queue (DLQ).

Measured Outcome & Impact

Technical Outcome: Eliminated dropped tasks and stabilized worker node CPU utilization during deployments.

Business Impact: Ensured strict SLA compliance by preventing the loss of critical background jobs during severe infrastructure and downstream API outages.

Agentic AI Orchestration Platform

Cloud Deployment (Internal Tooling)

Problem & Constraints

LLM agent workflows frequently failed mid-execution due to external API timeouts or model hallucinations, forcing users to restart complex tasks.

LLM API latency is highly unpredictable (2s to 30s). Workflows involved up to 10 sequential tool calls.

Architecture Decisions

Adopted LangGraph to model the multi-agent workflow as a persistent state machine. Used the Model Context Protocol (MCP) to sandbox tool execution.

Trade-Offs

Increased the complexity of the Python backend by introducing a graph-based state machine, sacrificing the simplicity of linear scripts for robustness.

Failure Handling & Debugging

Failure Mode: Implemented granular state checkpointing. If an LLM hallucinated a malformed JSON response, the system caught the parse error, injected a correction prompt, and retried only that specific node.
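The catch, correct, and retry step for a single node might look like the sketch below. It deliberately uses no LangGraph API; `run_node_with_correction`, the correction wording, and the stand-in model are assumptions for illustration only.

```python
import json


def run_node_with_correction(call_model, prompt, max_attempts=3):
    """Call the model; on malformed JSON, re-prompt with the parse error and retry."""
    current_prompt = prompt
    for attempt in range(1, max_attempts + 1):
        raw = call_model(current_prompt)
        try:
            return json.loads(raw), attempt
        except json.JSONDecodeError as err:
            # Feed the parser's complaint back so the model can correct itself.
            current_prompt = (
                f"{prompt}\n\nYour previous reply was not valid JSON ({err.msg}). "
                "Reply with valid JSON only."
            )
    raise ValueError(f"no valid JSON after {max_attempts} attempts")


# Stand-in model: the first reply is truncated, the second is valid.
replies = iter(['{"tool": "search", "query": ', '{"tool": "search", "query": "rates"}'])
result, attempts = run_node_with_correction(lambda p: next(replies), "Pick a tool.")
```

Only this node is re-run; earlier checkpointed steps in the workflow would not be repeated.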

Insights: Traced workflow stalls to long-running synchronous tool calls blocking the async event loop. Refactored tool execution into separate worker threads.
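One way to keep a blocking tool call off the event loop is `asyncio.to_thread` (Python 3.9+), sketched here. `slow_tool` and `heartbeat` are hypothetical stand-ins: the first simulates a synchronous tool, the second shows that other coroutines keep running while it executes.

```python
import asyncio
import time


def slow_tool(name, seconds):
    """Stand-in for a blocking tool call (e.g. a synchronous HTTP client)."""
    time.sleep(seconds)
    return f"{name}: done"


async def heartbeat(ticks, interval=0.01, count=5):
    """Records one tick per interval; it would stall if the event loop were blocked."""
    for _ in range(count):
        await asyncio.sleep(interval)
        ticks.append(time.monotonic())


async def main():
    ticks = []
    # to_thread runs the blocking call in a worker thread, so the loop stays free.
    tool, _ = await asyncio.gather(
        asyncio.to_thread(slow_tool, "lookup", 0.1),
        heartbeat(ticks),
    )
    return tool, len(ticks)


tool_result, tick_count = asyncio.run(main())
```

Calling `slow_tool` directly inside a coroutine would instead freeze every other task for the full duration of the call.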

Measured Outcome & Impact

Technical Outcome: Enabled resumption of failed AI tasks, reducing API token waste by 40% and drastically improving UX reliability.

Business Impact: Reduced expensive third-party LLM API costs by 40% and prevented workflow abandonment by seamlessly recovering from mid-task hallucinations.

Custom STL-Compatible Vector

Core Library Component (Local)

Problem & Constraints

Needed a deeper understanding of C++ memory semantics, allocator models, and exception safety beyond just using std::vector.

Must provide zero-overhead abstractions, support custom allocators, and strictly adhere to the Rule of 5 and strong exception guarantees.

Architecture Decisions

Implemented dynamic array growth using geometric expansion. Utilized placement new and explicit destructor calls to manage object lifetimes manually, bypassing default initialization overhead.

Trade-Offs

Manual memory management increases code verbosity and risk of leaks, but is essential for bypassing standard library overhead in critical paths.

Failure Handling & Debugging

Failure Mode: Implemented strong exception safety for operations like `push_back`. If a reallocation throws during element copying/moving, the vector state is rolled back completely to prevent memory corruption.

Insights: Used Valgrind and AddressSanitizer extensively to catch memory leaks caused by incorrect move semantics during reallocation.

Measured Outcome & Impact

Technical Outcome: Built a fully functional, STL-compliant vector that matched std::vector performance in benchmarks, proving deep systems-level competence.

Business Impact: Provided a zero-overhead core library component that allows applications to bypass standard memory allocation bottlenecks in latency-critical paths.

Core Technologies

Curated, high-signal tools actively used in production or deep personal work.

Systems & Low-Level

Memory management, concurrent execution, and hardware-sympathetic code.

C++20

Primary systems language

"Used for writing custom allocators, lock-free data structures, and SIMD optimizations."

Concurrency

Multi-threaded execution

"Implemented work-stealing thread pools using memory barriers for correct ordering and cache-line padding to avoid false sharing."

Memory Management

RAII & Smart Pointers

"Eliminated heap allocations on hot paths via pre-allocated memory pools."

SIMD

Vectorized operations

"Accelerated cosine similarity calculations for custom vector engine."

Backend & Distributed Systems

Network communication, asynchronous boundaries, and decoupled architectures.

Event-Driven Arch

Asynchronous boundaries

"Designed decoupled architectures for resilience."

Kafka

Event streaming & messaging

"Decoupled heavy background processing from critical path."

REST / gRPC

Service communication

"Designed high-throughput internal RPCs."

WebSockets

Bi-directional streaming

"Powered low-latency real-time data feeds."

RBAC

Security & authorization

"Implemented fine-grained role-based access control for internal services."

Idempotency

Idempotent API design

"Prevented duplicate processing during network retries."

AI Infrastructure

Building the infrastructure to reliably execute and orchestrate LLMs.

Python

AI Orchestration

"Built reliable multi-agent workflows using LangGraph and MCP for tool execution."

Vector Search

Semantic search pipelines

"Powered high-accuracy Retrieval-Augmented Generation (RAG)."

LangGraph

State machine orchestration

"Built reliable multi-agent workflows with retries."

Data Layer

Schema design, consistency guarantees, and access optimization.

MySQL

Primary relational datastore

"Designed normalized schemas and optimized multi-join queries via B-Tree indexing."

PostgreSQL

Secondary datastore

"Managed read-replicas for heavy aggregation reporting."

Redis

Distributed caching

"Implemented cache-aside strategies to shield the primary database during traffic spikes."

SQLite

Embedded datastore

"Provided lightweight, persistent local storage for edge-deployed agents."

Infrastructure & Observability

Deployment, telemetry, and keeping the system alive.

Docker

Containerization

"Ensured strictly reproducible builds across dev, CI, and production environments."

Prometheus

Metrics collection

"Instrumented critical paths to track P95/P99 latency and error rates."

Nginx

Reverse proxy

"Configured TLS termination, rate limiting, and L7 load balancing."

AWS

Cloud deployment

"Deployed highly-available architectures utilizing EC2 and basic VPC networking."

Key Engineering Decisions

Cross-system architectural choices and trade-offs.

Idempotency over Exactly-Once Delivery

CONTEXT

In the Distributed Task Scheduler, guaranteeing exactly-once delivery across network boundaries would have required complex distributed transactions such as two-phase commit (2PC).

DECISION

Mandated that all task consumers must be idempotent (e.g., using UPSERTs or tracking processed message IDs). The scheduler only guaranteed at-least-once delivery.

CONSEQUENCE

Shifted complexity to the downstream consumers, but allowed the scheduler itself to scale linearly and handle network partitions without deadlocking.
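The processed-ID approach to idempotent consumers can be sketched as below. The `IdempotentConsumer` class and the message shape are hypothetical, and an in-memory set stands in for whatever durable store a real consumer would use.

```python
class IdempotentConsumer:
    """Applies each message at most once, even if the scheduler redelivers it."""

    def __init__(self):
        # Assumption: in a real consumer this would be durable, e.g. a unique DB column.
        self.processed_ids = set()
        self.balance = 0

    def handle(self, message):
        if message["id"] in self.processed_ids:
            return False  # duplicate delivery: skip the side effect
        self.balance += message["amount"]
        self.processed_ids.add(message["id"])
        return True


consumer = IdempotentConsumer()
deliveries = [
    {"id": "job-1", "amount": 50},
    {"id": "job-2", "amount": 25},
    {"id": "job-1", "amount": 50},  # redelivered after a lost ack
]
applied = [consumer.handle(m) for m in deliveries]
```

In a durable version, recording the ID and applying the side effect would need to happen in the same transaction; otherwise a crash between the two steps reintroduces duplicates or drops the job.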

Eventual Consistency for Performance

CONTEXT

User session and caching layers required high throughput, but the primary MySQL database was becoming a bottleneck for read-heavy operations.

DECISION

Implemented a Redis cache-aside pattern. Accepted that reads might be stale by up to 5 seconds during heavy mutation loads.

CONSEQUENCE

Drastically reduced load on the primary DB, preventing connection pool exhaustion. Required careful UI design to mask eventual consistency from end-users.
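A minimal sketch of the cache-aside read path with a bounded staleness window. A plain dictionary stands in for Redis, and `CacheAside`, `load_from_db`, and the injected `clock` are assumptions for illustration.

```python
import time


class CacheAside:
    """Cache-aside reads; an entry may be stale for up to `ttl` seconds."""

    def __init__(self, load_from_db, ttl=5.0, clock=time.monotonic):
        self.load_from_db = load_from_db
        self.ttl = ttl
        self.clock = clock
        self._entries = {}  # key -> (value, expires_at); stands in for Redis

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > self.clock():
            return entry[0]                     # hit: no database work
        value = self.load_from_db(key)          # miss or expired: go to the DB
        self._entries[key] = (value, self.clock() + self.ttl)
        return value
```

A write to the database during the TTL window is not visible to readers until the entry expires, which is the staleness the UI had to mask.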

Failures & Lessons

Real mistakes and the fixes that resolved them.

The Thundering Herd Cache Stampede

WHAT BROKE

When a popular, computationally expensive query result expired in Redis, hundreds of concurrent requests hit the database simultaneously to recalculate it, causing connection timeouts.

IMPACT

A cascading failure brought down the reporting service for 15 minutes.

THE FIX

Implemented jittered TTLs (adding random +/- 10% to expiration times) and a caching mutex (only letting one request recalculate while others wait for the new cache value).
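Both parts of the fix can be sketched in a few lines. The names `jittered_ttl` and `SingleFlightCache` are hypothetical, the cache is an in-process dictionary rather than Redis, and expiry is left out to keep the locking logic visible.

```python
import random
import threading


def jittered_ttl(base_seconds, spread=0.10, rng=random):
    """Return the base TTL shifted by a random amount within +/- `spread`."""
    return base_seconds * (1 + rng.uniform(-spread, spread))


class SingleFlightCache:
    """On a miss, one thread recomputes a key while the others wait for it."""

    def __init__(self, compute):
        self.compute = compute
        self._values = {}
        self._locks = {}
        self._guard = threading.Lock()

    def get(self, key):
        if key in self._values:
            return self._values[key]
        with self._guard:
            # One lock per key, so unrelated keys do not block each other.
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have filled the cache while we waited.
            if key not in self._values:
                self._values[key] = self.compute(key)
            return self._values[key]
```

The jitter spreads expirations so popular keys do not all lapse at the same instant, and the per-key lock ensures that when one does lapse, the expensive query runs once instead of once per waiting request.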

Unbounded Retries and Resource Exhaustion

WHAT BROKE

A third-party webhook endpoint went down permanently. Our scheduler kept retrying the failed jobs indefinitely and at high frequency.

IMPACT

Exhausted connection pools and filled the Redis memory limit, stalling healthy jobs.

THE FIX

Enforced strict exponential backoff, a hard cap on retry attempts (max 5), and implemented a Dead Letter Queue (DLQ) for manual inspection of permanently failed jobs.
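The retry policy described above might be sketched like this. The base delay of one second, the 60-second ceiling, and the function names are assumptions; only the cap of 5 retries comes from the fix itself.

```python
def backoff_delay(attempt, base=1.0, cap=60.0):
    """Delay in seconds before retry number `attempt` (1-based): base * 2^(attempt-1), capped."""
    return min(cap, base * 2 ** (attempt - 1))


def run_with_retries(job, execute, dead_letters, max_retries=5, sleep=lambda s: None):
    """Try the job once, then retry up to `max_retries` times; park it in the DLQ if all fail."""
    delays = []
    for attempt in range(max_retries + 1):
        try:
            execute(job)
            return "done", delays
        except Exception as err:
            if attempt == max_retries:
                # Retries exhausted: record the job for manual inspection.
                dead_letters.append({"job": job, "error": str(err)})
                return "dead", delays
            delay = backoff_delay(attempt + 1)
            delays.append(delay)
            sleep(delay)  # injected so the sketch can run without real waiting
```

With these defaults, a permanently failing job waits 1, 2, 4, 8, and 16 seconds between attempts and then moves to the dead letter list, so it stops consuming connections and queue memory.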
