<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Anish Ratnawat's Tech Blog]]></title><description><![CDATA[Anish Ratnawat's Tech Blog]]></description><link>https://anishratnawat.com</link><generator>RSS for Node</generator><lastBuildDate>Thu, 16 Apr 2026 10:50:05 GMT</lastBuildDate><atom:link href="https://anishratnawat.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Model Context Protocol (MCP) -- Overview & Performance Benchmarks]]></title><description><![CDATA[What is MCP?
The Model Context Protocol (MCP) is an open standard created by Anthropic that provides a universal interface for connecting AI models to external data sources, tools, and services.
Think]]></description><link>https://anishratnawat.com/model-context-protocol-mcp-overview-performance-benchmarks</link><guid isPermaLink="true">https://anishratnawat.com/model-context-protocol-mcp-overview-performance-benchmarks</guid><category><![CDATA[mcp]]></category><dc:creator><![CDATA[Anish Ratnawat]]></dc:creator><pubDate>Tue, 07 Apr 2026 15:23:29 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/64da397a01c2b50cc13d9656/8f7809ee-8b12-4e6f-9691-fb402a730bf3.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>What is MCP?</h2>
<p>The <strong>Model Context Protocol (MCP)</strong> is an open standard created by Anthropic that provides a <strong>universal interface</strong> for connecting AI models to external data sources, tools, and services.</p>
<p>Think of it as a <strong>USB-C port for AI</strong> -- one standardized protocol instead of custom integrations for every tool.</p>
<h3>Core Capabilities</h3>
<table>
<thead>
<tr>
<th>Capability</th>
<th>Description</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Tool Execution</strong></td>
<td>Let LLMs call functions, APIs, and services in a controlled way</td>
</tr>
<tr>
<td><strong>Resource Access</strong></td>
<td>Expose files, databases, and live data to AI models</td>
</tr>
<tr>
<td><strong>Prompt Templates</strong></td>
<td>Share reusable prompt templates &amp; workflows across clients</td>
</tr>
<tr>
<td><strong>Sampling</strong></td>
<td>Servers can request LLM completions back through the client</td>
</tr>
</tbody></table>
<hr />
<h2>MCP Architecture</h2>
<pre><code>Host               MCP Client           MCP Server          Data Sources
(Claude Desktop,   (1:1 connection      (Exposes tools,     (APIs, DBs,
 IDE, custom app)   per server)          resources &amp;          filesystems,
                                         prompts)            SaaS services)
      |                  |                    |                    |
      | ──── creates ──&gt; |                    |                    |
      |                  | ── JSON-RPC 2.0 ─&gt; |                    |
      |                  |                    | ── queries/calls ─&gt;|
      |                  |                    | &lt;── responses ──── |
      |                  | &lt;── responses ──── |                    |
      | &lt;── displays ─── |                    |                    |
</code></pre>
<ul>
<li><strong>Host</strong> -- The user-facing application (e.g. Claude Desktop, VS Code, a custom app). Creates and manages MCP clients.</li>
<li><strong>Client</strong> -- Lives inside the host. Each client holds a stateful 1:1 session with one MCP server. Handles capability negotiation and message routing.</li>
<li><strong>Server</strong> -- A lightweight process that exposes tools, resources, and prompts over the MCP protocol. Can be local or remote.</li>
</ul>
<hr />
<h2>MCP Transport Modes</h2>
<h3>1. stdio (Local Only)</h3>
<p>Communication over <strong>standard input/output</strong> streams. The host spawns the server as a child process. Simplest setup -- no networking needed.</p>
<p><strong>Best for:</strong> Local tools, CLI integrations, IDE extensions, development workflows.</p>
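<p>To make the stdio flow concrete, here is a hedged, minimal sketch (plain stdlib, not the official MCP SDK) of a host exchanging one newline-delimited JSON-RPC message with a spawned child process:</p>

```python
import json, subprocess, sys

# Toy stand-in for an MCP server: reads one JSON-RPC request from stdin,
# writes one response to stdout. Real servers would use an MCP SDK.
child_src = (
    "import json, sys\n"
    "req = json.loads(sys.stdin.readline())\n"
    "resp = {'jsonrpc': '2.0', 'id': req['id'], 'result': {'tools': []}}\n"
    "print(json.dumps(resp), flush=True)\n"
)

# The host spawns the server as a child process -- no networking involved
proc = subprocess.Popen(
    [sys.executable, "-c", child_src],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)
request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}
proc.stdin.write(json.dumps(request) + "\n")
proc.stdin.flush()
response = json.loads(proc.stdout.readline())
proc.wait()
```

<p>The transport is nothing more than line-delimited JSON over the child's stdin/stdout, which is why stdio is the simplest mode to set up.</p>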
<h3>2. SSE -- HTTP + Server-Sent Events (Remote / Legacy)</h3>
<p>Client sends requests via <strong>HTTP POST</strong> and receives streaming responses over an <strong>SSE channel</strong>. Works over the network.</p>
<p><strong>Best for:</strong> Remote servers, web-based clients, existing HTTP infrastructure.</p>
<h3>3. Streamable HTTP (Recommended)</h3>
<p>The <strong>latest spec</strong> transport. Pure HTTP with optional streaming via SSE. Supports both stateful sessions and stateless request/response patterns.</p>
<p><strong>Best for:</strong> Production deployments, scalable architectures, cloud-native services.</p>
<blockquote>
<p>All transports use <strong>JSON-RPC 2.0</strong> as the message format. The protocol supports three message types: <strong>requests</strong> (expect response), <strong>responses</strong> (reply to request), and <strong>notifications</strong> (fire-and-forget).</p>
</blockquote>
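<p>The three message shapes look like this (payload contents are illustrative):</p>

```python
import json

# Request: expects a response, so it carries an "id"
request = {
    "jsonrpc": "2.0", "id": 7,
    "method": "tools/call",
    "params": {"name": "fetch_external_data", "arguments": {"url": "https://example.com"}},
}

# Response: reply to a request, echoing the same "id"
response = {
    "jsonrpc": "2.0", "id": 7,
    "result": {"content": [{"type": "text", "text": "ok"}]},
}

# Notification: fire-and-forget, so it has no "id" at all
notification = {
    "jsonrpc": "2.0",
    "method": "notifications/progress",
    "params": {"progress": 0.5},
}

wire = json.dumps(request)  # what actually crosses the transport
```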
<hr />
<h2>Performance Benchmarks</h2>
<h3>Test Overview</h3>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Value</th>
</tr>
</thead>
<tbody><tr>
<td>Total Requests</td>
<td><strong>3.9 million</strong></td>
</tr>
<tr>
<td>Error Rate</td>
<td><strong>0%</strong> (all implementations)</td>
</tr>
<tr>
<td>Languages Tested</td>
<td>Java, Go, Node.js, Python</td>
</tr>
<tr>
<td>Test Rounds</td>
<td>3 independent runs</td>
</tr>
</tbody></table>
<hr />
<h3>Benchmark Tools Used</h3>
<p>Each MCP server implemented 4 tool types covering different workload profiles:</p>
<table>
<thead>
<tr>
<th>Tool</th>
<th>Category</th>
<th>Description</th>
</tr>
</thead>
<tbody><tr>
<td><code>calculate_fibonacci</code></td>
<td>CPU-Bound</td>
<td>Pure computation. Calculates Fibonacci numbers to stress-test raw CPU performance and function call overhead with no I/O.</td>
</tr>
<tr>
<td><code>fetch_external_data</code></td>
<td>I/O-Bound</td>
<td>Network I/O. Simulates fetching data from an external API to measure async I/O and network latency handling.</td>
</tr>
<tr>
<td><code>process_json_data</code></td>
<td>Data Processing</td>
<td>Serialization. Parses, transforms, and serializes JSON payloads to benchmark memory allocation, parsing speed, and GC pressure.</td>
</tr>
<tr>
<td><code>simulate_database_query</code></td>
<td>Latency-Sensitive</td>
<td>Simulated DB query with ~10 ms built-in delay. Measures overhead each runtime adds on top of a fixed-latency operation.</td>
</tr>
</tbody></table>
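<p>The benchmark harness itself isn't reproduced here, but the workload profiles are easy to picture. A hedged Python sketch of three of the four tools (function names mirror the table; the bodies are my assumptions, not the actual benchmark code):</p>

```python
import asyncio, json, time

def calculate_fibonacci(n: int) -> int:
    # CPU-bound: pure iterative computation, no I/O
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def process_json_data(payload: str) -> str:
    # Data processing: parse, transform, re-serialize
    data = json.loads(payload)
    data["processed"] = True
    return json.dumps(data)

async def simulate_database_query(query: str) -> dict:
    # Latency-sensitive: a fixed ~10 ms sleep stands in for a DB round-trip,
    # so any latency measured beyond 10 ms is pure runtime overhead
    start = time.perf_counter()
    await asyncio.sleep(0.010)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return {"query": query, "elapsed_ms": round(elapsed_ms, 2)}
```

<p>The fixed sleep in the last tool is what makes the "Tool-Specific Latency" table below meaningful: Java's 10.37 ms on that tool is ~0.37 ms of runtime overhead, while Python's 42.57 ms is ~32 ms.</p>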
<hr />
<h3>Latency &amp; Throughput</h3>
<table>
<thead>
<tr>
<th>Server</th>
<th>Avg Latency</th>
<th>p95 Latency</th>
<th>Throughput (RPS)</th>
<th>Total Requests</th>
<th>Error Rate</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Java</strong></td>
<td>0.835 ms</td>
<td>10.19 ms</td>
<td>1,624</td>
<td>1,559,520</td>
<td>0%</td>
</tr>
<tr>
<td><strong>Go</strong></td>
<td>0.855 ms</td>
<td>10.03 ms</td>
<td>1,624</td>
<td>1,558,000</td>
<td>0%</td>
</tr>
<tr>
<td><strong>Node.js</strong></td>
<td>10.66 ms</td>
<td>53.24 ms</td>
<td>559</td>
<td>534,150</td>
<td>0%</td>
</tr>
<tr>
<td><strong>Python</strong></td>
<td>26.45 ms</td>
<td>73.23 ms</td>
<td>292</td>
<td>280,605</td>
<td>0%</td>
</tr>
</tbody></table>
<ul>
<li>Java &amp; Go deliver <strong>~3x</strong> the throughput of Node.js and <strong>~5.5x</strong> that of Python</li>
<li>On average latency, Python is <strong>~31x slower</strong> than Go/Java and Node.js is <strong>~12x slower</strong></li>
</ul>
<hr />
<h3>Resource Utilization</h3>
<table>
<thead>
<tr>
<th>Server</th>
<th>Avg CPU</th>
<th>Avg Memory</th>
<th>RPS per MB Memory</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Java</strong></td>
<td>28.8%</td>
<td>226 MB</td>
<td>7.2</td>
</tr>
<tr>
<td><strong>Go</strong></td>
<td>31.8%</td>
<td>18 MB</td>
<td><strong>92.6</strong></td>
</tr>
<tr>
<td><strong>Node.js</strong></td>
<td>98.7%</td>
<td>110 MB</td>
<td>5.1</td>
</tr>
<tr>
<td><strong>Python</strong></td>
<td>93.9%</td>
<td>98 MB</td>
<td>3.1</td>
</tr>
</tbody></table>
<ul>
<li>Go uses just <strong>18 MB</strong> of memory -- <strong>12.5x less</strong> than Java, with identical throughput</li>
<li>Go delivers <strong>12.8x</strong> more throughput per MB than Java -- crucial for container/K8s environments</li>
</ul>
<hr />
<h3>Tool-Specific Latency (ms)</h3>
<table>
<thead>
<tr>
<th>Tool</th>
<th>Java</th>
<th>Go</th>
<th>Node.js</th>
<th>Python</th>
</tr>
</thead>
<tbody><tr>
<td><code>calculate_fibonacci</code></td>
<td>0.369</td>
<td>0.388</td>
<td>7.11</td>
<td>30.83</td>
</tr>
<tr>
<td><code>fetch_external_data</code></td>
<td>1.316</td>
<td>1.292</td>
<td>19.18</td>
<td>80.92</td>
</tr>
<tr>
<td><code>process_json_data</code></td>
<td>0.352</td>
<td>0.443</td>
<td>7.48</td>
<td>34.24</td>
</tr>
<tr>
<td><code>simulate_database_query</code></td>
<td>10.37</td>
<td>10.71</td>
<td>26.71</td>
<td>42.57</td>
</tr>
</tbody></table>
<ul>
<li>DB-bound operations narrow the gap; compute &amp; I/O tasks show the widest spread</li>
</ul>
<hr />
<h2>Key Findings</h2>
<ol>
<li><strong>Java &amp; Go</strong> are effectively tied on latency and throughput -- both deliver sub-millisecond averages and <strong>1,624 RPS</strong>.</li>
<li><strong>Go's memory footprint</strong> is dramatically lower at <strong>18 MB</strong> vs Java's 226 MB -- a <strong>12.5x advantage</strong> for containerized workloads.</li>
<li><strong>Node.js &amp; Python</strong> consume &gt;93% CPU under load while Java and Go remain under 32%, leaving significant headroom.</li>
<li><strong>Node.js</strong> is 10-12x slower due to per-request MCP server instantiation for security isolation.</li>
<li><strong>All implementations</strong> achieved a <strong>0% error rate</strong> across 3.9M requests -- stability is not the differentiator.</li>
</ol>
<hr />
<h2>Production Recommendations</h2>
<h3>Go -- Cloud-Native &amp; Cost-Optimized</h3>
<p>Best for Kubernetes, horizontal scaling, and cloud deployments. 12.8x better memory efficiency than Java means fewer pods and lower infrastructure cost.</p>
<h3>Java -- Lowest Latency &amp; Mature Ecosystem</h3>
<p>Best when absolute lowest latency matters and your team needs a rich ecosystem for complex business logic. Higher memory cost is the trade-off.</p>
<h3>Node.js -- Moderate Traffic (&lt;500 RPS)</h3>
<p>Viable for teams with existing JavaScript expertise. Security-focused per-request isolation adds overhead -- acceptable at moderate scale.</p>
<h3>Python -- Dev / Test / Low Traffic</h3>
<p>Best suited for development, testing, prototyping, or very low-traffic scenarios (&lt;100 RPS). Not recommended for production workloads at scale.</p>
<hr />
<h2>Conclusion</h2>
<ul>
<li>For <strong>maximum efficiency</strong> --&gt; <strong>Go</strong></li>
<li>For <strong>lowest latency + ecosystem depth</strong> --&gt; <strong>Java</strong></li>
<li>For <strong>moderate loads with JS teams</strong> --&gt; <strong>Node.js</strong></li>
<li>Keep <strong>Python</strong> for <strong>dev &amp; prototyping</strong></li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Agent Harness: The Infrastructure Layer That Makes AI Actually Work]]></title><description><![CDATA[Table of Contents

The AI Reliability Problem

What Is an Agent Harness?

Harness vs. Orchestrator — What's the Difference?

The 5 Core Components of a Good Harness

Advanced Pattern: Persistent Memor]]></description><link>https://anishratnawat.com/agent-harness-the-infrastructure-layer-that-makes-ai-actually-work</link><guid isPermaLink="true">https://anishratnawat.com/agent-harness-the-infrastructure-layer-that-makes-ai-actually-work</guid><category><![CDATA[Harness]]></category><category><![CDATA[agent-harness]]></category><category><![CDATA[agent-bug-knowledge]]></category><category><![CDATA[agent-bug-memory]]></category><dc:creator><![CDATA[Anish Ratnawat]]></dc:creator><pubDate>Sat, 28 Mar 2026 05:00:00 GMT</pubDate><content:encoded><![CDATA[<hr />
<h2>Table of Contents</h2>
<ol>
<li><p><a href="#1-the-ai-reliability-problem">The AI Reliability Problem</a></p>
</li>
<li><p><a href="#2-what-is-an-agent-harness">What Is an Agent Harness?</a></p>
</li>
<li><p><a href="#3-harness-vs-orchestrator--whats-the-difference">Harness vs. Orchestrator — What's the Difference?</a></p>
</li>
<li><p><a href="#4-the-5-core-components-of-a-good-harness">The 5 Core Components of a Good Harness</a></p>
</li>
<li><p><a href="#5-advanced-pattern-persistent-memory">Advanced Pattern: Persistent Memory</a></p>
</li>
<li><p><a href="#6-advanced-pattern-bug-knowledge-base">Advanced Pattern: Bug Knowledge Base</a></p>
</li>
<li><p><a href="#7-real-world-examples-from-production">Real-World Examples from Production</a></p>
</li>
<li><p><a href="#8-harness-engineering-as-a-discipline">Harness Engineering as a Discipline</a></p>
</li>
<li><p><a href="#9-why-the-harness-is-the-moat-not-the-model">Why the Harness Is the Moat, Not the Model</a></p>
</li>
<li><p><a href="#10-the-future-self-optimizing-harnesses">The Future: Self-Optimizing Harnesses</a></p>
</li>
<li><p><a href="#11-conclusion">Conclusion</a></p>
</li>
</ol>
<hr />
<h2>1. The AI Reliability Problem</h2>
<p>You've seen the demos. An AI agent writes code, browses the web, makes decisions — all autonomously. It looks magical. Then you try to run it in production for a real task with 50+ steps, and it quietly goes off the rails.</p>
<p>This is the reliability gap. Models are getting smarter, but smart alone doesn't mean reliable. Benchmarks measure one-shot performance. Real production tasks are multi-step, long-running, and full of edge cases.</p>
<blockquote>
<p><strong>Think of it this way:</strong> A Formula 1 engine is incredible. But without a chassis, steering wheel, brakes, and tires, it doesn't go anywhere useful. The engine is the model. Everything else is the harness.</p>
</blockquote>
<p>The question developers need to answer in 2026 isn't "which model is best?" It's "how do we wrap models so they work reliably?" That's what an agent harness solves.</p>
<hr />
<h2>2. What Is an Agent Harness?</h2>
<p>An agent harness is the complete infrastructure that wraps around an AI model to manage long-running tasks. It is <strong>not the model itself</strong>. It is everything else the model needs to work reliably in the real world.</p>
<pre><code class="language-plaintext">Agent = Model + Harness
</code></pre>
<p>The model generates responses. The harness handles everything else: memory between sessions, which tools the model can access, guardrails that prevent catastrophic failures, the feedback loops that help it self-correct, and the observability layer that lets humans monitor what's happening.</p>
<p>If you've used Claude Code, you've experienced a harness. What makes it powerful isn't Claude alone — it's the harness around Claude: context management, filesystem controls, tool orchestration, session persistence, and the permission model that keeps it safe.</p>
<pre><code class="language-plaintext">harness/
├── context/        # memory, session state, compaction
├── tools/          # what the agent can do
├── guardrails/     # what the agent must not do
├── planner/        # how tasks are broken down
├── evaluator/      # checking output quality
└── lifecycle/      # start, handoff, end of sessions
</code></pre>
<hr />
<h2>3. Harness vs. Orchestrator — What's the Difference?</h2>
<p>This trips up a lot of developers. The terms sound similar but they operate at different layers.</p>
<table>
<thead>
<tr>
<th></th>
<th>Orchestrator</th>
<th>Harness</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Concern</strong></td>
<td>Logic and control flow</td>
<td>Capabilities and infrastructure</td>
</tr>
<tr>
<td><strong>Does what</strong></td>
<td>Decides what to do next</td>
<td>Gives the model its tools</td>
</tr>
<tr>
<td><strong>Manages</strong></td>
<td>Task sequencing, routing</td>
<td>Memory, context, side-effects</td>
</tr>
<tr>
<td><strong>Enforces</strong></td>
<td>Reasoning loop (ReAct, etc.)</td>
<td>Guardrails, permissions</td>
</tr>
<tr>
<td><strong>Analogy</strong></td>
<td>The brain of the operation</td>
<td>The hands and infrastructure</td>
</tr>
</tbody></table>
<p>They work together. The orchestrator says "invoke the model with this prompt." The harness ensures when the model is invoked, it has the right tools, context, and environment. You need both. Improving either one dramatically improves real-world performance.</p>
<hr />
<h2>4. The 5 Core Components of a Good Harness</h2>
<p>Production-grade harnesses are built around five key responsibilities. Neglect any one of them, and reliability breaks down.</p>
<h3>Component 1: Human-in-the-loop controls</h3>
<p>Agents must pause at high-stakes decisions. Deleting a database, charging a credit card, sending emails to customers -- these need human approval. A harness defines exactly where those checkpoints are and blocks execution until a human confirms.</p>
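<p>A minimal sketch of such a checkpoint (the tool names and the approval callback are illustrative, not from any specific framework):</p>

```python
# High-stakes tools that must block until a human confirms (illustrative set)
HIGH_STAKES = {"delete_database", "charge_card", "send_customer_email"}

def execute_tool(name: str, args: dict, approve=input) -> dict:
    """Run a tool, pausing for human approval on high-stakes actions."""
    if name in HIGH_STAKES:
        answer = approve(f"Agent wants {name}({args}). Allow? [y/N] ")
        if answer.strip().lower() != "y":
            return {"status": "blocked", "tool": name}
    # ... dispatch to the real tool implementation here ...
    return {"status": "executed", "tool": name}
```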
<h3>Component 2: Context and memory management</h3>
<p>LLMs have no memory between sessions by default. A harness solves this with context compaction, session handoff artifacts, and dynamic retrieval (RAG). Anthropic's harness maintains a <code>claude-progress.txt</code> log so long tasks can resume where they left off.</p>
<h3>Component 3: Tool call orchestration</h3>
<p>Bad orchestration creates infinite loops and cascading failures. Good harnesses define which tools are available, when to use them, the correct order, and how to handle errors gracefully. Vercel famously removed 80% of their agent's tools and got better results — fewer choices, fewer mistakes.</p>
<h3>Component 4: Sub-agent coordination</h3>
<p>Complex tasks need specialized agents. One researches, another writes, a third reviews. The harness manages communication between them, merges their outputs, and resolves conflicts.</p>
<h3>Component 5: Prompt preset management</h3>
<p>Different tasks need different instructions. A harness stores, versions, and selects the right system prompt for each task type — rather than pasting the same monolithic prompt everywhere.</p>
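<p>A sketch of what "stores, versions, and selects" can mean in practice (preset keys and contents are made up for illustration):</p>

```python
# Versioned prompt presets keyed by (task_type, version)
PRESETS = {
    ("bug_fix", "v1"): "You fix bugs.",
    ("bug_fix", "v2"): "You are a debugging agent. Reproduce the bug before patching.",
    ("research", "v1"): "Gather and cite sources before answering.",
}

def select_prompt(task_type: str, version=None) -> str:
    # Pin a version explicitly, or default to the newest one for the task type
    # (lexicographic max is fine for v1..v9; real systems need proper semver)
    if version is not None:
        return PRESETS[(task_type, version)]
    latest = max(v for (t, v) in PRESETS if t == task_type)
    return PRESETS[(task_type, latest)]
```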
<hr />
<h2>5. Advanced Pattern: Persistent Memory</h2>
<p>By default, every time you start a new session with an LLM, it has no idea who you are, what you worked on yesterday, or what bugs you fixed last week. The model is stateless. Persistent memory is the harness layer that fixes this.</p>
<p>This isn't about stuffing old conversations into the context window — that's expensive and hits limits fast. It's about selectively storing, indexing, and retrieving the right memories at the right time.</p>
<h3>The three layers of memory</h3>
<p><strong>L1 — In-context memory (ephemeral)</strong><br />What's currently in the context window. Fast, but lost when the session ends. The harness manages what lives here via compaction — summarizing older turns, dropping irrelevant tool outputs, keeping only what the model needs right now.</p>
<p><strong>L2 — External memory store (session-persistent)</strong><br />A vector database (Pinecone, pgvector, Chroma) or key-value store that survives session boundaries. The harness writes summaries, decisions, and facts here — and retrieves them via semantic search when starting a new session.</p>
<p><strong>L3 — Structured state (long-term)</strong><br />A progress file or structured JSON document the harness maintains across days. Anthropic's Claude Code harness uses a <code>claude-progress.txt</code> for exactly this — a human-readable, agent-writable log of what has been done, what is pending, and what decisions were made.</p>
<h3>How it works end-to-end</h3>
<ul>
<li><p><strong>Session start:</strong> Harness queries L2/L3 for relevant memories → injects them into the system prompt → agent starts informed, not blank.</p>
</li>
<li><p><strong>Session end:</strong> Harness extracts key facts, decisions, and unfinished tasks → compresses them → writes to L2/L3 → next session picks up exactly here.</p>
</li>
</ul>
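<p>The start/end loop above can be sketched with a plain JSON file standing in for the L3 progress document (file name and schema are my assumptions):</p>

```python
import json, pathlib

PROGRESS = pathlib.Path("progress.json")  # L3-style structured state

def start_session(task: str) -> str:
    # Inject prior state into the system prompt so the agent starts informed
    state = json.loads(PROGRESS.read_text()) if PROGRESS.exists() else {}
    pending = state.get("pending", [])
    return f"Task: {task}\nPending from last session: {pending}"

def end_session(done: list, pending: list) -> None:
    # Persist a compact summary for the next session to pick up
    PROGRESS.write_text(json.dumps({"done": done, "pending": pending}))
```

<p>A real harness would write summaries to a vector store as well (the L2 layer); this shows only the structured-state half of the loop.</p>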
<h3>Tradeoffs</h3>
<table>
<thead>
<tr>
<th>Advantages</th>
<th>Risks</th>
</tr>
</thead>
<tbody><tr>
<td>Agent builds team-wide context over time</td>
<td>Stale memories can mislead the agent</td>
</tr>
<tr>
<td>No repeated re-explanation across sessions</td>
<td>Retrieval quality depends on embedding model</td>
</tr>
<tr>
<td>Faster task start — agent arrives informed</td>
<td>Privacy: memories may contain sensitive code</td>
</tr>
<tr>
<td>Survives model swaps — memory is external</td>
<td>Storage grows unboundedly without pruning</td>
</tr>
<tr>
<td>Human-readable audit trail of decisions</td>
<td>Debugging retrieval failures is non-trivial</td>
</tr>
</tbody></table>
<blockquote>
<p><strong>Critical implementation note:</strong> Never inject all memories — inject only what's relevant to the current task. Keep retrieved memory injections under ~500 tokens. Use tags and metadata filters aggressively.</p>
</blockquote>
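<p>One way to enforce that budget, sketched with an assumed memory schema (<code>text</code>, <code>tags</code>, <code>score</code>) and a rough 4-characters-per-token heuristic:</p>

```python
def select_memories(memories: list, tags: set, budget_tokens: int = 500) -> str:
    # Tag-filter first, then pack highest-scoring memories under the budget
    relevant = [m for m in memories if set(m["tags"]) & tags]
    chosen, used = [], 0
    for m in sorted(relevant, key=lambda m: -m["score"]):
        cost = len(m["text"]) // 4  # rough chars-per-token estimate
        if used + cost > budget_tokens:
            continue  # skip oversized entries; smaller ones may still fit
        chosen.append(m["text"])
        used += cost
    return "\n".join(chosen)
```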
<hr />
<h2>6. Advanced Pattern: Bug Knowledge Base</h2>
<p>Every developer has lived this: you spend three hours debugging a cryptic error, find the fix, close the ticket — and six months later a colleague hits the exact same bug and spends three hours on it too. The knowledge died with the PR comment.</p>
<p>A bug knowledge base is a harness component that captures bug-fix pairs at the moment of resolution and makes them retrievable — by the agent, for any future developer, automatically. This turns individual debugging effort into compounding team intelligence.</p>
<h3>The data model</h3>
<pre><code class="language-plaintext">BugEntry fields:
  bug_id          string    Linked GitHub issue or Jira ticket ID
  error_signature string    Canonical error message or symptom description
  root_cause      string    Why the bug occurred — human or agent-authored
  fix_summary     string    What was changed and why, in plain language
  diff            string    Actual code diff (sanitized of secrets)
  affected_files  string[]  Files involved — for scoped retrieval
  tags            string[]  e.g. ["auth", "race-condition", "postgres"]
  resolved_by     string    Developer or agent — for attribution
</code></pre>
<h3>The four capture points</h3>
<p><strong>Capture 1 — Merged PR (automatic)</strong><br />When a PR labelled <code>bug-fix</code> merges into <code>main</code>, a GitHub Actions webhook fires. An extraction agent reads the diff and PR description, structures it into a BugEntry, and writes it to the KB. Zero manual effort after the label is applied.</p>
<p><strong>Capture 2 — Agent runtime error (automatic)</strong><br />When an agent hits an exception while executing a task, the harness error hook intercepts it before the agent attempts a fix. It queries the KB for similar past bugs and injects the top matches into context.</p>
<p><strong>Capture 3 — Manual developer submission (on-demand)</strong><br />For bugs fixed outside normal PRs — hotfixes, config changes, infrastructure bugs, tribal knowledge — a developer submits directly via a CLI script or internal tool.</p>
<p><strong>Capture 4 — Post-fix agent write-back (automatic feedback loop)</strong><br />After the debugging agent resolves an issue, the harness writes the new bug-fix pair back to the KB. Every new fix the agent makes enriches the KB for the next run.</p>
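<p>A hedged sketch of the Capture-2 error hook. The <code>step</code> callable and the KB API are assumptions, and the keyword-overlap scoring is a deliberately naive stand-in for the full-text or vector retrieval described in the tiers below:</p>

```python
class BugKB:
    def __init__(self, entries: list):
        self.entries = entries  # BugEntry dicts, as in the data model above

    def retrieve_similar(self, error: str, k: int = 3) -> list:
        # Naive keyword-overlap ranking -- a stand-in for FTS/vector search
        words = set(error.lower().split())
        def score(e):
            return len(words & set(e["error_signature"].lower().split()))
        return sorted(self.entries, key=score, reverse=True)[:k]

def run_with_kb(step, task: str, kb: BugKB):
    # Intercept the failure, fetch similar past bugs, retry once with hints
    try:
        return step(task, hint=None)
    except Exception as exc:
        matches = kb.retrieve_similar(str(exc))
        hint = "\n".join(m["fix_summary"] for m in matches)
        return step(task, hint=hint)
```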
<hr />
<h3>Choosing your storage backend — four tiers</h3>
<p>The vector DB is just one option. Start at Tier 1 and migrate only when you feel the limitations. These tiers are additive — Markdown is always the source of truth.</p>
<h4>Tier 1 — Markdown files in the repo ✅ Recommended to start</h4>
<p>One <code>.md</code> file per bug inside a <code>.bugs/</code> directory. Git-native, versioned, PR-reviewable, human-editable. Agent retrieves via grep.</p>
<p><strong>When to use:</strong> Small teams (&lt;10 devs), &lt;300 bugs, or just getting started. Zero friction.</p>
<pre><code class="language-markdown">&lt;!-- .bugs/BUG-2026-042.md --&gt;

## Error
TypeError: Cannot read properties of undefined (reading 'token')
at AuthMiddleware.verify (src/auth/middleware.ts:34)

## Root cause
JWT refresh ran before user session was hydrated.
Race condition between session.init() and token.verify().

## Fix
Awaited session.init() before token.verify() in middleware.
Added guard: if (!session.ready) throw new SessionNotReady().

## Files changed
src/auth/middleware.ts, src/session/index.ts

## Tags
auth, race-condition, jwt, async

## Resolved by
@priya — 2026-03-18 — BUG-2026-042
</code></pre>
<pre><code class="language-python"># Harness retrieval — grep across .bugs/
import subprocess, pathlib

def retrieve_md(error: str, bugs_dir=".bugs") -&gt; str:
    # -F searches for fixed strings, so regex metacharacters in error
    # messages (parentheses, brackets, dots) can't break the match
    keywords = error.split()[:6]
    hits = set()
    for kw in keywords:
        out = subprocess.run(
            ["grep", "-rlF", kw, bugs_dir],
            capture_output=True, text=True
        ).stdout.strip()
        hits.update(out.splitlines())
    docs = [pathlib.Path(p).read_text() for p in sorted(hits)[:3]]
    return "\n\n---\n\n".join(docs)
</code></pre>
</code></pre>
<table>
<thead>
<tr>
<th>Advantages</th>
<th>Limitations</th>
</tr>
</thead>
<tbody><tr>
<td>Lives in repo — versioned in Git</td>
<td>Keyword retrieval only — no semantic</td>
</tr>
<tr>
<td>PRs review the KB alongside code</td>
<td>Slow at scale (500+ files)</td>
</tr>
<tr>
<td>Devs read and edit directly</td>
<td>Misses paraphrase matches</td>
</tr>
<tr>
<td>Zero infra — works offline</td>
<td>Duplicate detection is manual</td>
</tr>
</tbody></table>
<hr />
<h4>Tier 2 — Markdown source + SQLite FTS5 index ✅ Recommended at scale</h4>
<p>Markdown stays the human-readable source of truth. A SQLite FTS5 database provides fast full-text search. Index is rebuilt by CI when <code>.bugs/*.md</code> files change. No external services.</p>
<p><strong>When to use:</strong> 300–2000 bugs, or when grep is getting slow.</p>
<pre><code class="language-python"># index_bugs.py — run in CI on .bugs/ changes
import sqlite3, glob, pathlib, re

def parse_md_sections(text: str) -&gt; dict:
    # Split "## Heading\nbody" markdown into a {heading: body} dict
    pattern = r"^## (.+?)\n(.*?)(?=^## |\Z)"
    return {m.group(1).strip(): m.group(2).strip()
            for m in re.finditer(pattern, text, re.M | re.S)}

conn = sqlite3.connect("bugs.db")
# FTS5 tables have no unique constraints, so rebuild from scratch each run
conn.execute("DROP TABLE IF EXISTS bugs")
conn.execute("""
  CREATE VIRTUAL TABLE bugs USING fts5(
    bug_id, error_signature, root_cause, fix_summary, tags
  )""")

for path in glob.glob(".bugs/*.md"):
    text = pathlib.Path(path).read_text()
    sections = parse_md_sections(text)
    conn.execute(
        "INSERT INTO bugs VALUES (?,?,?,?,?)",
        (pathlib.Path(path).stem,
         sections.get("Error", ""), sections.get("Root cause", ""),
         sections.get("Fix", ""),   sections.get("Tags", ""))
    )
conn.commit()

# Retrieval — FTS5 ranked full-text search
def retrieve_fts(query: str):
    rows = conn.execute(
        "SELECT * FROM bugs WHERE bugs MATCH ? "
        "ORDER BY rank LIMIT 3", (query,)
    ).fetchall()
    return rows
</code></pre>
<table>
<thead>
<tr>
<th>Advantages</th>
<th>Limitations</th>
</tr>
</thead>
<tbody><tr>
<td>Markdown stays editable and readable</td>
<td>Index must rebuild when files change</td>
</tr>
<tr>
<td>SQLite is a single local file — no server</td>
<td>Still keyword-based — not semantic</td>
</tr>
<tr>
<td>FTS5 is very fast at 10,000+ entries</td>
<td>Two things to keep in sync</td>
</tr>
<tr>
<td>No embedding model or API needed</td>
<td>Misses paraphrase matches</td>
</tr>
</tbody></table>
<hr />
<h4>Tier 3 — JSONL flat file (Agent-heavy teams)</h4>
<p>One JSON object per line, append-only. Best as the agent write target, with markdown as the human read layer.</p>
<p><strong>When to use:</strong> Agents are writing frequently, or you want zero-overhead append writes.</p>
<pre><code class="language-python">import json

# Retrieval by tag filter
def retrieve_jsonl(query_tags: list, path="bugs.jsonl"):
    results = []
    with open(path) as f:
        for line in f:
            bug = json.loads(line)
            if any(t in bug["tags"] for t in query_tags):
                results.append(bug)
    return results[:3]

# Write — agent appends after fixing a bug
def store_jsonl(entry: dict, path="bugs.jsonl"):
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")
</code></pre>
<hr />
<h4>Tier 4 — Vector database (1000+ bugs, semantic search)</h4>
<p>Embeddings-based similarity search. Finds bugs even when wording differs. Add this on top of markdown, never instead of it.</p>
<p><strong>When to use:</strong> 1000+ bugs, or when keyword search misses too many relevant matches.</p>
<pre><code class="language-python">import json

import chromadb
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
db = chromadb.PersistentClient(path="./bug_kb")
bugs = db.get_or_create_collection("bug_entries")

def store_vector(entry: dict):
    text = entry["error_signature"] + " " + entry["root_cause"]
    embedding = encoder.encode(text).tolist()
    bugs.add(
        ids=[entry["bug_id"]],
        embeddings=[embedding],
        documents=[json.dumps(entry)],
        metadatas=[{"tags": json.dumps(entry["tags"])}]
    )

def retrieve_vector(error: str, k=3) -&gt; list:
    embedding = encoder.encode(error).tolist()
    results = bugs.query(query_embeddings=[embedding], n_results=k)
    return [json.loads(d) for d in results["documents"][0]]
</code></pre>
<hr />
<h3>Decision guide</h3>
<table>
<thead>
<tr>
<th>Situation</th>
<th>Recommended approach</th>
</tr>
</thead>
<tbody><tr>
<td>Greenfield, any size team</td>
<td>Start with Markdown files</td>
</tr>
<tr>
<td>Grep getting slow (&gt;300 bugs)</td>
<td>Add SQLite FTS5 index</td>
</tr>
<tr>
<td>Agent writing frequently</td>
<td>JSONL as write target + Markdown for humans</td>
</tr>
<tr>
<td>Semantic misses (&gt;1000 bugs)</td>
<td>Add Vector DB on top of Markdown</td>
</tr>
<tr>
<td>Migrating tiers later</td>
<td>Markdown is always the source of truth</td>
</tr>
</tbody></table>
<blockquote>
<p><strong>Don't start with a vector DB.</strong> The bottleneck on a young codebase is knowledge capture, not retrieval speed. A <code>.bugs/</code> folder with 50 markdown files and a grep retriever will outperform an over-engineered vector store from day one.</p>
</blockquote>
<hr />
<h2>7. Real-World Examples from Production</h2>
<h3>Anthropic — Claude Code: the three-agent harness</h3>
<p>Anthropic uses a multi-agent harness for long-running coding tasks. One agent plans, one generates code, and one evaluates quality. Context resets are paired with structured handoff artifacts — so the next agent starts from a known, clean state. This solved the classic problem of context drift over multi-hour sessions.</p>
<h3>Manus: 5 harness rewrites, same model, 5× better reliability</h3>
<p>Manus rewrote their harness architecture five times in six months. The underlying model didn't change. Each rewrite improved task completion rates purely through better structure: smarter context handling, tighter tool definitions, and cleaner sub-agent coordination. The model was never the bottleneck.</p>
<h3>Microsoft — Azure SRE Agent: 40.5 hours to 3 minutes</h3>
<p>Microsoft's SRE agent harness wires MCP tools, telemetry, code repos, and incident management into a single pipeline. "Intent Met" score rose from 45% to 75% on novel incidents after shifting from bespoke tooling to a file-based context system. The system has handled 35,000+ production incidents autonomously.</p>
<h3>Vercel: subtraction as harness improvement</h3>
<p>Vercel's team removed 80% of their agent's available tools. The result: fewer steps, fewer tokens, faster responses, and higher task success rate. Right-sizing the toolset is a harness decision, not a model decision.</p>
<hr />
<h2>8. Harness Engineering as a Discipline</h2>
<p>Harness engineering is now a standalone discipline — distinct from MLOps and DevOps, though it borrows from both.</p>
<ul>
<li><p><strong>MLOps</strong> — model performance over time (training, deployment, retraining)</p>
</li>
<li><p><strong>DevOps</strong> — software delivery pipelines (CI/CD, infrastructure)</p>
</li>
<li><p><strong>Harness engineering</strong> — agent behavior in real-time execution, right now, on this task</p>
</li>
</ul>
<p>Key tools as of early 2026:</p>
<pre><code class="language-plaintext">Claude Agent SDK   → general-purpose harness, built-in context mgmt
CrewAI Flows       → event-driven multi-agent orchestration
LangChain          → composable harness primitives
AutoHarness        → automated harness engineering (6-step governance)
AutoAgent          → meta-agent that writes and optimizes its own harness
</code></pre>
<blockquote>
<p><strong>Emerging role:</strong> "Harness engineer" is entering job descriptions at companies building agent-powered products. The skillset combines software engineering with AI-specific knowledge of context management, prompt design, and agent evaluation.</p>
</blockquote>
<hr />
<h2>9. Why the Harness Is the Moat, Not the Model</h2>
<p>The model is becoming a commodity. Claude, GPT, Gemini — on static benchmarks, the gap is shrinking fast. The real differentiation is now infrastructure.</p>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Result</th>
</tr>
</thead>
<tbody><tr>
<td>Manus harness rewrites, same model</td>
<td>5× reliability improvement</td>
</tr>
<tr>
<td>Tools Vercel removed to improve reliability</td>
<td>80%</td>
</tr>
<tr>
<td>Microsoft SRE "Intent Met" score improvement</td>
<td>+30%</td>
</tr>
<tr>
<td>Benchmark swing from harness setup alone</td>
<td>5+ points</td>
</tr>
</tbody></table>
<p>All of these came from changing the harness — not the model. You can fine-tune a competitive model in weeks. Building production-ready harnesses takes months or years. That's the moat.</p>
<hr />
<h2>10. The Future: Self-Optimizing Harnesses</h2>
<p>AutoAgent (April 2026) lets a meta-agent build and iterate on a harness autonomously overnight — modifying the system prompt, tools, and orchestration, running benchmarks, and keeping only changes that improve scores. In a 24-hour run, it hit #1 on SpreadsheetBench (96.5%) and top score on TerminalBench (55.1%) — beating every hand-engineered entry.</p>
<p>The human's job shifted from "engineer who edits <code>agent.py</code>" to "director who writes <code>program.md</code>."</p>
<p>Looking ahead: harnesses will become the primary tool for solving model drift — detecting exactly when a model stops reasoning correctly after its 100th step, feeding that data back into training. We're heading toward a convergence of training and inference environments, and the harness is at the center of that shift.</p>
<hr />
<h2>11. Conclusion</h2>
<p>2025 proved agents could work. 2026 is about making them work reliably at scale. The model is a component. The harness is the system.</p>
<p>Constrain what agents can do. Inform them about what they should do. Verify their work. Correct their mistakes. Keep humans in the loop at high-stakes decisions. This is harness engineering — and it's the most important infrastructure skill for developers building AI products right now.</p>
<p>The engine matters. But the car is what wins races.</p>
<hr />
<p>[Thoughts by Anish, rephrased by Claude]</p>
<p><em>Tags: agent-harness · AI infrastructure · LLM · harness-engineering · claude-code · multi-agent · developer</em></p>
]]></content:encoded></item><item><title><![CDATA[Consistent Hashing: Explained with Implementation Steps]]></title><description><![CDATA[In distributed systems, managing data placement and load balancing efficiently is crucial. One powerful tool for addressing these challenges is consistent hashing. This blog will explore consistent hashing, provide an example, and discuss why it is s...]]></description><link>https://anishratnawat.com/consistent-hashing-explained-with-implementation-steps</link><guid isPermaLink="true">https://anishratnawat.com/consistent-hashing-explained-with-implementation-steps</guid><category><![CDATA[consistent hashing]]></category><dc:creator><![CDATA[Anish Ratnawat]]></dc:creator><pubDate>Fri, 07 Mar 2025 18:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1736843099451/a023c638-61b2-4202-9217-fa8b55159142.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In distributed systems, managing data placement and load balancing efficiently is crucial. One powerful tool for addressing these challenges is <strong>consistent hashing</strong>. This blog will explore consistent hashing, provide an example, and discuss why it is superior to other approaches in certain scenarios.</p>
<h1 id="heading-what-is-consistent-hashing">What is Consistent Hashing?</h1>
<p>To understand consistent hashing, it is helpful to first examine <strong>traditional hashing</strong> and its limitations.</p>
<h2 id="heading-traditional-hashing">Traditional Hashing:</h2>
<p>In traditional hashing, a <strong>hash function</strong> maps keys directly to buckets (nodes).</p>
<p>A hash function is any function that maps values from an arbitrarily sized domain to a fixed-size domain, usually called the hash space. The values these functions produce are typically used as keys to enable efficient lookups of the original entity.</p>
<p><strong>For example:</strong></p>
<ul>
<li><p>Suppose you have a <strong>Distributed File Store system</strong> where users can <strong><em>upload</em></strong> and <strong><em>read files</em></strong>. The files are stored across <strong><em>4 servers</em></strong>, and the hash function assigns files to servers using the formula <code>hash(fileName) % number_of_servers</code>.</p>
<ul>
<li>If the number of servers is 4, and <code>hash(fileName)</code> returns 9 for a specific file (e.g., "image.png"), it will be stored on server <code>9 % 4 = 1</code>. Similarly, a request to read "image.png" will be routed to server 1 using the same logic.</li>
</ul>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1736840095751/e262b09d-34e7-41ad-be3c-7539677827fb.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-limitations-of-traditional-hashing">Limitations of Traditional Hashing:</h3>
<ol>
<li><p><strong>High Disruption with Node Changes:</strong></p>
<ul>
<li><p>If a new server is added or an existing server is removed, almost all keys need to be rehashed and redistributed.</p>
<ul>
<li>For example, consider a <strong>Distributed File Store system</strong> where users upload and read files, and files are distributed across 4 servers using <code>hash(fileName) % 4</code>. If a new server is added (making it 5 servers), the formula changes to <code>hash(fileName) % 5</code>. As a result, files previously mapped to a specific server will now likely be assigned to different servers. For instance, a file that was on Server 3 with <code>hash(fileName) % 4 = 3</code> might now be moved to Server 4 with <code>hash(fileName) % 5 = 4</code>.</li>
</ul>
</li>
<li><p>This results in significant overhead and potential performance degradation.</p>
</li>
</ul>
</li>
<li><p><strong>Load Imbalance:</strong></p>
<ul>
<li>If the hash function does not distribute keys evenly, some servers may become overloaded while others remain underutilized.</li>
</ul>
</li>
<li><p><strong>Scalability Issues:</strong></p>
<ul>
<li>Scaling up or down in response to load is not seamless due to the need for global rehashing.</li>
</ul>
</li>
</ol>
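<p>This disruption is easy to quantify. The Python sketch below (illustrative only, using MD5 so the result is stable across runs) hashes 10,000 file names with mod-4 and then mod-5 and counts how many change servers:</p>
<pre><code class="lang-python">import hashlib

def bucket(key, num_servers):
    # Stable hash so the result is reproducible across runs
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_servers

keys = [f"file-{i}.png" for i in range(10_000)]
moved = sum(1 for k in keys if bucket(k, 4) != bucket(k, 5))
print(f"{moved / len(keys):.0%} of keys moved")  # prints roughly "80% of keys moved"
</code></pre>
<p>A key stays on its server only when <code>h % 4 == h % 5</code>, which happens for just 4 of every 20 hash residues, so adding a single server relocates roughly 80% of all keys.</p>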
<h2 id="heading-how-consistent-hashing-is-better">How Consistent Hashing is Better:</h2>
<p>Consistent hashing addresses these issues by using a different approach. Instead of directly mapping keys to nodes, both <strong>keys and nodes are placed on a virtual ring</strong>. Keys are assigned to the nearest node in the clockwise direction.</p>
<p>When a node is added to the ring, only the keys that fall between the new node and its predecessor are remapped (they move from the new node's clockwise successor). Similarly, when a node is removed, only that node's keys are remapped, to its clockwise successor. This ensures that when a node is added or removed, only a subset of keys needs to be remapped, making the system more resilient to changes.</p>
<p><strong>The Key Idea:</strong></p>
<ul>
<li><p>The hash space is visualised as a circle (0 to 2^m - 1, where <em>m</em> is the number of bits in the hash).</p>
</li>
<li><p>Each node (e.g., server) is assigned a position on the circle using a hash function.</p>
</li>
<li><p>Each key is also assigned a position on the circle.</p>
</li>
<li><p>A key is assigned to the first node clockwise from its position.</p>
</li>
</ul>
<h3 id="heading-example">Example</h3>
<p>Suppose we have four servers (N1, N2, N3, and N4) in a distributed file storage system, where users upload and read files, and we use consistent hashing to distribute files.<br />We create three virtual nodes for each server so that load is distributed more evenly among them, reducing the risk of cascading failure.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1736841766101/b8414e34-5198-4b09-a506-a68652a45df0.png" alt class="image--center mx-auto" /></p>
<ol>
<li><p><strong>Step 1: Assign Nodes to the Ring</strong></p>
<ul>
<li><p>Server N1's virtual nodes N1a, N1b, and N1c are hashed to positions 10, 60, and 80.</p>
</li>
<li><p>Server N2's virtual nodes N2a, N2b, and N2c are hashed to positions 30, 90, and 110.</p>
</li>
<li><p>Server N3's virtual nodes N3a, N3b, and N3c are hashed to positions 120, 20, and 50.</p>
</li>
<li><p>Server N4's virtual nodes N4a, N4b, and N4c are hashed to positions 40, 70, and 100.</p>
</li>
</ul>
</li>
<li><p><strong>Step 2: Map Files to the Ring</strong></p>
<ul>
<li><p>File F1 ("image1.png") is hashed to position 5.</p>
</li>
<li><p>File F2 ("doc1.pdf") is hashed to position 25.</p>
</li>
<li><p>File F3 ("video1.mp4") is hashed to position 70.</p>
</li>
</ul>
</li>
<li><p><strong>Step 3: Place Files</strong></p>
<ul>
<li><p>F1 (position 5) is assigned to Server N1a (first node clockwise).</p>
</li>
<li><p>F2 (position 25) is assigned to Server N2a.</p>
</li>
<li><p>F3 (position 70) is assigned to Server N4b.</p>
</li>
</ul>
</li>
</ol>
<p><strong>Adding a new Node</strong></p>
<p>Suppose a new server, N5, is added and hashed to position 27.</p>
<p>Only one file (F2, at position 25) is reassigned to N5, illustrating minimal disruption.</p>
<p><strong>Removing a Node</strong></p>
<p>Suppose Server N4 goes down and is removed. Only the keys mapped to Server N4 need to be remapped.<br />In this example, only file F3 (at position 70) is reassigned, to Server N1c, the next node clockwise.</p>
<hr />
<h3 id="heading-implementation-details">Implementation Details</h3>
<p>Here is a basic Java implementation of consistent hashing:</p>
<pre><code class="lang-java"><span class="hljs-keyword">import</span> java.security.MessageDigest;
<span class="hljs-keyword">import</span> java.security.NoSuchAlgorithmException;
<span class="hljs-keyword">import</span> java.util.*;

<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">ConsistentHashing</span> </span>{
    <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> <span class="hljs-keyword">int</span> numReplicas;
    <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> SortedMap&lt;Integer, String&gt; ring;

    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-title">ConsistentHashing</span><span class="hljs-params">(<span class="hljs-keyword">int</span> numReplicas)</span> </span>{
        <span class="hljs-keyword">this</span>.numReplicas = numReplicas;
        <span class="hljs-keyword">this</span>.ring = <span class="hljs-keyword">new</span> TreeMap&lt;&gt;();
    }

    <span class="hljs-comment">// Hash function - MD5</span>
    <span class="hljs-function"><span class="hljs-keyword">private</span> <span class="hljs-keyword">int</span> <span class="hljs-title">hash</span><span class="hljs-params">(String key)</span> </span>{
        <span class="hljs-keyword">try</span> {
            MessageDigest md = MessageDigest.getInstance(<span class="hljs-string">"MD5"</span>);
            <span class="hljs-keyword">byte</span>[] digest = md.digest(key.getBytes());
            <span class="hljs-keyword">return</span> ((digest[<span class="hljs-number">0</span>] &amp; <span class="hljs-number">0xFF</span>) &lt;&lt; <span class="hljs-number">24</span>) | ((digest[<span class="hljs-number">1</span>] &amp; <span class="hljs-number">0xFF</span>) &lt;&lt; <span class="hljs-number">16</span>) | ((digest[<span class="hljs-number">2</span>] &amp; <span class="hljs-number">0xFF</span>) &lt;&lt; <span class="hljs-number">8</span>) | (digest[<span class="hljs-number">3</span>] &amp; <span class="hljs-number">0xFF</span>);
        } <span class="hljs-keyword">catch</span> (NoSuchAlgorithmException e) {
            <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> RuntimeException(e);
        }
    }

    <span class="hljs-comment">// Adding new node in the ring</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">addNode</span><span class="hljs-params">(String node)</span> </span>{
        <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; numReplicas; i++) {
            String replicaKey = node + <span class="hljs-string">":"</span> + i;
            ring.put(hash(replicaKey), node);
        }
    }

    <span class="hljs-comment">// Removing node from the ring</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">removeNode</span><span class="hljs-params">(String node)</span> </span>{
        <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; numReplicas; i++) {
            String replicaKey = node + <span class="hljs-string">":"</span> + i;
            ring.remove(hash(replicaKey));
        }
    }

    <span class="hljs-comment">// Get the node/server to map the given key/value</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> String <span class="hljs-title">getNode</span><span class="hljs-params">(String key)</span> </span>{
        <span class="hljs-keyword">if</span> (ring.isEmpty()) {
            <span class="hljs-keyword">return</span> <span class="hljs-keyword">null</span>;
        }
        <span class="hljs-keyword">int</span> hashKey = hash(key);
        <span class="hljs-keyword">if</span> (!ring.containsKey(hashKey)) {
            SortedMap&lt;Integer, String&gt; tailMap = ring.tailMap(hashKey);
            hashKey = tailMap.isEmpty() ? ring.firstKey() : tailMap.firstKey();
        }
        <span class="hljs-keyword">return</span> ring.get(hashKey);
    }

    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">static</span> <span class="hljs-keyword">void</span> <span class="hljs-title">main</span><span class="hljs-params">(String[] args)</span> </span>{
        ConsistentHashing ch = <span class="hljs-keyword">new</span> ConsistentHashing(<span class="hljs-number">3</span>);
        ch.addNode(<span class="hljs-string">"A"</span>);
        ch.addNode(<span class="hljs-string">"B"</span>);
        ch.addNode(<span class="hljs-string">"C"</span>);

        System.out.println(ch.getNode(<span class="hljs-string">"K1"</span>)); <span class="hljs-comment">// Node responsible for K1</span>
        System.out.println(ch.getNode(<span class="hljs-string">"K2"</span>)); <span class="hljs-comment">// Node responsible for K2</span>

        ch.addNode(<span class="hljs-string">"D"</span>); <span class="hljs-comment">// Add a new node</span>
        System.out.println(ch.getNode(<span class="hljs-string">"K2"</span>)); <span class="hljs-comment">// Node responsible for K2 after adding D</span>
    }
}
</code></pre>
<hr />
<h3 id="heading-benefits-of-consistent-hashing">Benefits of Consistent Hashing</h3>
<ol>
<li><p><strong>Minimal Key Movement:</strong></p>
<ul>
<li>When a node joins or leaves, only a small portion of keys are remapped. This is in contrast to traditional hashing, where all keys might need to be redistributed.</li>
</ul>
</li>
<li><p><strong>Load Balancing:</strong></p>
<ul>
<li>Keys are distributed across nodes more evenly, especially when using techniques like <strong>virtual nodes</strong> (assigning multiple positions for each physical node on the ring).</li>
</ul>
</li>
<li><p><strong>Scalability:</strong></p>
<ul>
<li>Adding or removing nodes is seamless, making consistent hashing ideal for systems with dynamic scaling requirements, such as cloud-based applications.</li>
</ul>
</li>
<li><p><strong>Fault Tolerance:</strong></p>
<ul>
<li>When a node fails, its keys are redistributed to adjacent nodes on the ring, ensuring system continuity.</li>
</ul>
</li>
</ol>
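<p>The load-balancing effect of virtual nodes can be demonstrated directly. The sketch below is an illustrative Python counterpart to the Java implementation (node names and key counts are arbitrary): it builds a ring with 1 and then 100 virtual nodes per server and counts how many of 10,000 keys land on each:</p>
<pre><code class="lang-python">import bisect
import hashlib
from collections import Counter

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def build_ring(nodes, replicas):
    # Each physical node gets `replicas` positions on the ring (virtual nodes)
    return sorted((h(f"{node}:{i}"), node) for node in nodes for i in range(replicas))

def owner(ring, key):
    # First node clockwise from the key's position, wrapping around the ring
    positions = [pos for pos, _ in ring]
    idx = bisect.bisect(positions, h(key)) % len(ring)
    return ring[idx][1]

keys = [f"file-{i}" for i in range(10_000)]
for replicas in (1, 100):
    ring = build_ring(["A", "B", "C", "D"], replicas)
    load = Counter(owner(ring, k) for k in keys)
    print(replicas, dict(load))
</code></pre>
<p>With a single position per server the split is typically lopsided; with 100 virtual nodes per server, each one ends up close to the fair share of 2,500 keys.</p>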
<hr />
<h3 id="heading-comparison-with-other-hashing-techniques">Comparison with Other Hashing Techniques</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Feature</td><td>Traditional Hashing</td><td>Consistent Hashing</td></tr>
</thead>
<tbody>
<tr>
<td>Key Movement on Changes</td><td>High (many keys remapped)</td><td>Low (few keys remapped)</td></tr>
<tr>
<td>Scalability</td><td>Poor (requires full rehash)</td><td>Excellent</td></tr>
<tr>
<td>Load Balancing</td><td>Depends on hash function</td><td>Enhanced with virtual nodes</td></tr>
<tr>
<td>Resilience to Failures</td><td>Limited</td><td>High</td></tr>
</tbody>
</table>
</div><hr />
<h3 id="heading-applications-of-consistent-hashing">Applications of Consistent Hashing</h3>
<ol>
<li><p><strong>Distributed Caching:</strong></p>
<ul>
<li>Systems like Memcached and Redis use consistent hashing to distribute keys across nodes.</li>
</ul>
</li>
<li><p><strong>Load Balancers:</strong></p>
<ul>
<li>Consistent hashing helps in assigning incoming requests to servers in web applications.</li>
</ul>
</li>
<li><p><strong>Distributed Databases:</strong></p>
<ul>
<li>Databases like Cassandra and DynamoDB leverage consistent hashing for data partitioning and replication.</li>
</ul>
</li>
</ol>
<hr />
<h3 id="heading-conclusion">Conclusion</h3>
<p>Consistent hashing is a cornerstone of modern distributed systems, enabling efficient and resilient data placement. Its ability to handle dynamic changes with minimal disruption makes it a go-to strategy for scalable and fault-tolerant applications.</p>
<p>Whether you’re building a distributed cache, a load balancer, or a database, understanding and implementing consistent hashing can significantly enhance your system's performance and reliability.</p>
]]></content:encoded></item><item><title><![CDATA[Exploring Retrieval Augmented Generation (RAG) with Vector Databases and AI Agents]]></title><description><![CDATA[One of the recent breakthroughs is Retrieval Augmented Generation (RAG). This concept blends the power of generative models with external retrieval systems to enhance the quality and accuracy of responses. When coupled with vector databases and AI ag...]]></description><link>https://anishratnawat.com/exploring-retrieval-augmented-generation-rag-with-vector-databases-and-ai-agents</link><guid isPermaLink="true">https://anishratnawat.com/exploring-retrieval-augmented-generation-rag-with-vector-databases-and-ai-agents</guid><category><![CDATA[RAG ]]></category><dc:creator><![CDATA[Anish Ratnawat]]></dc:creator><pubDate>Mon, 03 Feb 2025 17:40:42 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1738604686406/1541e28b-0c0a-4244-9357-2822842d4866.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>One of the recent breakthroughs is <strong>Retrieval Augmented Generation (RAG)</strong>. This concept blends the power of generative models with external retrieval systems to enhance the quality and accuracy of responses. When coupled with <strong>vector databases</strong> and <strong>AI agents</strong>, RAG creates a highly dynamic and intelligent system capable of delivering more contextually relevant and fact-based outputs. In this blog, we will dive into how RAG works, the role of vector databases, and how AI agents enhance this process.</p>
<h2 id="heading-what-is-retrieval-augmented-generation-rag">What is Retrieval Augmented Generation (RAG)?</h2>
<p>Retrieval Augmented Generation (RAG) is an approach where a generative model doesn't rely solely on its learned parameters but instead enhances its output by retrieving information from a large corpus of data. The retrieval process ensures that the AI system has access to the most up-to-date and accurate information during its response generation.</p>
<p>Traditional models often struggle with answering questions that require factual knowledge not seen during training, leading to hallucinations or incorrect answers. RAG improves upon this by using an external database to retrieve relevant information and passing that information to the generative model for more context-aware and grounded responses.</p>
<h2 id="heading-how-does-rag-work">How Does RAG Work?</h2>
<ol>
<li><p><strong>Query Input</strong>: A user inputs a query, much like any other question or request posed to a system.</p>
</li>
<li><p><strong>Retrieval</strong>: The system first searches an external source of knowledge (like a vector database) for documents, texts, or passages related to the query.</p>
</li>
<li><p><strong>Document Ranking</strong>: Using techniques like <strong>semantic search</strong> or <strong>nearest neighbor search</strong>, relevant documents are ranked and selected based on how similar they are to the query.</p>
</li>
<li><p><strong>Generation</strong>: The retrieved documents are then passed as context to a generative model, like GPT-3 or GPT-4. This model uses the retrieved information along with its internal knowledge to generate a well-informed, accurate response.</p>
</li>
<li><p><strong>Response Output</strong>: The generative model creates a response that incorporates the retrieved information, ensuring it is grounded in facts and highly relevant to the user's query.</p>
</li>
</ol>
<h2 id="heading-the-role-of-vector-databases-in-rag">The Role of Vector Databases in RAG</h2>
<p>Vector databases play a critical role in the retrieval process of RAG. These databases store embeddings (dense vector representations) of large datasets, which makes it easy to perform efficient similarity searches.</p>
<p>When a query arrives, it is transformed into a vector through a process known as <strong>embedding</strong>. This vector is then compared against the vectors stored in the database to find the most relevant documents. Vector databases are optimized for this task, offering high-performance similarity search capabilities. Some popular vector databases include:</p>
<ul>
<li><p><strong>FAISS (Facebook AI Similarity Search)</strong>: An open-source library that allows fast similarity search in high-dimensional spaces.</p>
</li>
<li><p><strong>Pinecone</strong>: A managed vector database service that offers scalable similarity search.</p>
</li>
<li><p><strong>Weaviate</strong>: An open-source vector search engine that can integrate with various machine learning models.</p>
</li>
</ul>
<p>These vector databases help ensure that RAG can retrieve the most relevant documents in real-time, even from massive data corpora.</p>
<h3 id="heading-vector-representation-of-texts">Vector Representation of Texts</h3>
<p>To efficiently perform search, text data (such as documents, articles, or websites) must be converted into vectors. This is done using embedding models like <strong>Sentence-BERT</strong> or <strong>OpenAI’s embedding models</strong>. These models convert each piece of text into a vector of fixed dimensionality, which can then be indexed by the vector database. Similarity measures such as <strong>cosine similarity</strong> or <strong>Euclidean distance</strong> are used to rank the retrieved documents.</p>
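<p>Cosine similarity, the most common of these measures, is straightforward to compute. Here is a minimal sketch (the toy three-dimensional vectors are illustrative; real embeddings have hundreds or thousands of dimensions):</p>
<pre><code class="lang-python">import math

def cosine_similarity(a, b):
    # Dot product of the vectors divided by the product of their lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query_vec = [0.1, 0.3, 0.5]
doc_vec = [0.2, 0.25, 0.55]
print(round(cosine_similarity(query_vec, doc_vec), 3))  # prints 0.983
</code></pre>
<p>A score near 1 means the document points in almost the same direction as the query in embedding space; documents are ranked by this score during retrieval.</p>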
<h2 id="heading-a-practical-example-of-rag-with-vector-databases-and-ai-agents">A Practical Example of RAG with Vector Databases and AI Agents</h2>
<p>Let’s consider an example of a chatbot built using RAG with a vector database and AI agent:</p>
<h3 id="heading-scenario-an-ai-powered-virtual-assistant-for-technical-support">Scenario: An AI-powered Virtual Assistant for Technical Support</h3>
<p>Imagine you are building a virtual assistant for technical support in a software company. Users will ask questions about the software’s features, installation guides, troubleshooting steps, etc.</p>
<p>Here’s how RAG can be used:</p>
<ol>
<li><p><strong>User Query</strong>: "How do I install the software on Linux?"</p>
</li>
<li><p><strong>AI Agent</strong>: The AI agent processes the query, recognizes that the user is asking about software installation on a Linux system, and formulates a precise query for the vector database: "Linux installation guide for software."</p>
</li>
<li><p><strong>Retrieval</strong>: The vector database retrieves relevant documents, such as installation guides, forums, or knowledge base articles related to Linux installations.</p>
</li>
<li><p><strong>Generation</strong>: The generative model takes these documents and crafts a coherent, step-by-step installation guide tailored to the user’s query.</p>
</li>
<li><p><strong>Response</strong>: The AI agent outputs: "To install the software on Linux, follow these steps... [steps from the retrieved guide]"</p>
</li>
</ol>
<p>The AI agent ensures the process is seamless and context-aware, making it easy for the user to get accurate and relevant answers without needing to sift through long documentation.</p>
<h3 id="heading-implementation">Implementation</h3>
<p>To implement the example above using LangChain and a vector database, you can follow the steps outlined below. We will break the process into three key components:</p>
<ol>
<li><p><strong>Setting up the vector database</strong> to store and retrieve relevant documents.</p>
</li>
<li><p><strong>Integrating LangChain</strong> to connect the vector database with a language model for retrieval-augmented generation (RAG).</p>
</li>
<li><p><strong>Building the AI Agent</strong> that processes the user query, retrieves relevant data, and generates a response.</p>
</li>
</ol>
<p><strong>Step 1: Setup Vector Database</strong></p>
<p>We will use <strong>FAISS</strong> (Facebook AI Similarity Search) or <strong>Pinecone</strong> as the vector database to store document embeddings. First, we need to create embeddings for your documents and store them in the database.</p>
<pre><code class="lang-python">from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# Initialize OpenAI embeddings model (or any other model)
embeddings = OpenAIEmbeddings()

# Assuming documents are available as a list of strings
documents = [
    "Linux installation guide for software...",
    "How to troubleshoot software on Linux...",
    "Windows installation steps for software...",
    # more documents here
]

# Build the FAISS index from the documents; from_texts embeds
# each document and stores the resulting vectors in the index
faiss_index = FAISS.from_texts(documents, embeddings)

# Optionally persist the index to disk for later reuse
faiss_index.save_local("path/to/faiss_index")
</code></pre>
<p><strong>Step 2: Retrieval-augmented Generation (RAG) Setup</strong></p>
<p>Now we integrate LangChain to allow the agent to retrieve relevant documents from the vector database and generate responses based on that.</p>
<pre><code class="lang-python">from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.agents import initialize_agent, Tool

# Create a retriever over the FAISS index
retriever = faiss_index.as_retriever(search_kwargs={"k": 3})  # Retrieve top 3 results

# Use OpenAI or any other LLM for generation
llm = OpenAI(temperature=0.7)

# Create a RetrievalQA chain (combines retrieval and generation)
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

# Define the tools for the agent (including the QA system)
tools = [
    Tool(
        name="Technical Support Assistant",
        func=qa_chain.run,
        description="Retrieve technical support documents from the knowledge base."
    )
]

# Initialize the agent with the tools and an LLM
agent = initialize_agent(tools, llm, agent="zero-shot-react-description", verbose=True)
</code></pre>
<p><strong>Step 3: Implement the AI Agent for User Queries</strong></p>
<p>The agent will now be capable of handling user queries related to technical support. When a user asks, for example, "How do I install the software on Linux?", the agent will retrieve relevant documents and generate a response.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Simulating user query</span>
user_query = <span class="hljs-string">"How do I install the software on Linux?"</span>

<span class="hljs-comment"># Pass the query to the agent for processing</span>
response = agent.run(user_query)

<span class="hljs-comment"># Display the AI's response</span>
print(response)
</code></pre>
<p><strong>Final Workflow</strong></p>
<ol>
<li><p><strong>User submits a query</strong>: "How do I install the software on Linux?"</p>
</li>
<li><p><strong>Vector database retrieves relevant documents</strong> based on query embeddings.</p>
</li>
<li><p><strong>LangChain agent</strong> processes the retrieved documents and passes them to the generative model (OpenAI, or any other model you're using).</p>
</li>
<li><p><strong>AI agent generates a response</strong> combining retrieved documents in a coherent way, such as a step-by-step guide on how to install the software on Linux.</p>
</li>
</ol>
<p><strong>Additional Notes:</strong></p>
<ul>
<li><p>You can store documents as embeddings in your vector database using various models, including OpenAI, SentenceTransformers, or other pre-trained models.</p>
</li>
<li><p>Ensure the knowledge base is regularly updated to maintain accuracy and relevance.</p>
</li>
<li><p>The agent can be enhanced with more advanced features like error handling, multi-step reasoning, or including additional tools for different types of queries.</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Exploring AI Agents: Step-by-Step Implementation Insights]]></title><description><![CDATA[Artificial Intelligence (AI) has evolved significantly over the years, with Large Language Models (LLMs) leading the way in natural language understanding and generation. However, a new paradigm is emerging—AI Agents. Unlike traditional LLMs, AI Agen...]]></description><link>https://anishratnawat.com/exploring-ai-agents-step-by-step-implementation-insights</link><guid isPermaLink="true">https://anishratnawat.com/exploring-ai-agents-step-by-step-implementation-insights</guid><category><![CDATA[ai agents]]></category><category><![CDATA[AI Agents Explained]]></category><dc:creator><![CDATA[Anish Ratnawat]]></dc:creator><pubDate>Fri, 24 Jan 2025 18:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1738594448058/bb32b0e2-fe2d-41d6-8208-8c6903f5e5e4.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Artificial Intelligence (AI) has evolved significantly over the years, with Large Language Models (LLMs) leading the way in natural language understanding and generation. However, a new paradigm is emerging—AI Agents. Unlike traditional LLMs, AI Agents possess autonomy, memory, and the ability to perform goal-oriented tasks, making them more efficient in real-world applications. In this blog, we will explore what AI Agents are, how they differ from LLMs, how to develop custom AI Agents, and their real-world use cases. Finally, we will walk through an example of an AI Agent designed for customer support in an online ticket booking system.</p>
<hr />
<h3 id="heading-how-ai-agents-differ-from-traditional-llms"><strong>How AI Agents Differ from Traditional LLMs</strong></h3>
<p>While both AI Agents and LLMs leverage natural language processing, they differ in key aspects:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<th>Feature</th><th>Traditional LLMs</th><th>AI Agents</th></tr>
</thead>
<tbody>
<tr>
<td><strong>Autonomy</strong></td><td>Passive, responds to prompts</td><td>Active, initiates tasks based on goals</td></tr>
<tr>
<td><strong>Memory</strong></td><td>Stateless, no memory retention</td><td>Stateful, can store and retrieve information</td></tr>
<tr>
<td><strong>Task Execution</strong></td><td>Provides responses without action</td><td>Can execute tasks and interact with external systems</td></tr>
<tr>
<td><strong>Multi-Step Reasoning</strong></td><td>Processes a single query at a time</td><td>Can break complex problems into sub-tasks and complete them</td></tr>
</tbody>
</table>
</div><p>Traditional LLMs require human intervention to drive conversations, whereas AI Agents can operate independently, making decisions and performing tasks dynamically.</p>
<hr />
<h3 id="heading-how-to-develop-custom-ai-agents"><strong>How to Develop Custom AI Agents</strong></h3>
<p>Developing a custom AI Agent involves several key steps:</p>
<ol>
<li><p><strong>Define the Objective:</strong> Identify the purpose of the AI Agent. For example, automating customer service interactions.</p>
</li>
<li><p><strong>Choose a Framework:</strong> Libraries such as LangChain, AutoGen, and OpenAI's Function Calling API can help build AI Agents.</p>
</li>
<li><p><strong>Implement Memory:</strong> Utilize vector databases like Pinecone or Redis to provide persistent memory.</p>
</li>
<li><p><strong>Incorporate Tools &amp; APIs:</strong> Equip the agent with access to databases, APIs, and external tools to complete tasks.</p>
</li>
<li><p><strong>Implement a Decision-Making Process:</strong> Use reinforcement learning or rule-based logic for better decision-making.</p>
</li>
<li><p><strong>Deploy and Monitor:</strong> Deploy the agent to production and continuously optimize its performance.</p>
</li>
</ol>
<h4 id="heading-coding-example-using-langchain"><strong>Coding Example Using LangChain</strong></h4>
<p>Below is a simple example of building a custom AI Agent using LangChain:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> langchain.llms <span class="hljs-keyword">import</span> OpenAI
<span class="hljs-keyword">from</span> langchain.agents <span class="hljs-keyword">import</span> initialize_agent, AgentType
<span class="hljs-keyword">from</span> langchain.tools <span class="hljs-keyword">import</span> Tool

<span class="hljs-comment"># Define an LLM instance</span>
llm = OpenAI(model_name=<span class="hljs-string">"gpt-4"</span>)

<span class="hljs-comment"># Define a tool for the agent to use</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fetch_ticket_availability</span>(<span class="hljs-params">query</span>):</span>
    <span class="hljs-keyword">return</span> <span class="hljs-string">"Available tickets for your destination: Flight A, Flight B, Flight C"</span>

tool = Tool(
    name=<span class="hljs-string">"TicketAvailability"</span>,
    func=fetch_ticket_availability,
    description=<span class="hljs-string">"Fetch available tickets based on user query"</span>
)

<span class="hljs-comment"># Initialize the agent</span>
agent = initialize_agent(
    tools=[tool],
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=<span class="hljs-literal">True</span>
)

<span class="hljs-comment"># Test the agent</span>
response = agent.run(<span class="hljs-string">"Find me flights from New Delhi to New York for next Monday"</span>)
print(response)
</code></pre>
<p>This example demonstrates how to integrate a LangChain-powered AI Agent with an external tool to fetch flight availability based on user input.</p>
<p><strong>The agent determines whether to call a tool based on the input query</strong> and its internal reasoning process. In LangChain, this is achieved through a combination of:</p>
<ol>
<li><p><strong>Tool Descriptions</strong>: Each tool (like <code>TicketAvailability</code> in this case) has a description that helps the agent understand when to use it.</p>
</li>
<li><p><strong>LLM Decision-Making</strong>: The agent uses an LLM to analyze the input query and decide if any tool needs to be invoked.</p>
</li>
<li><p><strong>ReAct Framework</strong>: LangChain's <code>ZERO_SHOT_REACT_DESCRIPTION</code> agent type follows the ReAct (Reasoning + Acting) paradigm, meaning it first reasons about the input, decides on an action, and then executes the appropriate tool.</p>
</li>
<li><p><strong>Execution Flow:</strong></p>
<ul>
<li><p>The agent receives a user query.</p>
</li>
<li><p>It parses the intent (e.g., finding flights).</p>
</li>
<li><p>If the query matches the function of a registered tool (e.g., fetching ticket availability), the agent calls that tool.</p>
</li>
<li><p>The tool executes its function and returns a response.</p>
</li>
<li><p>The agent processes the response and provides a final answer to the user.</p>
</li>
</ul>
</li>
</ol>
<p>Thus, when the user asks, <em>"Find me flights from New Delhi to New York for next Monday"</em>, the agent:</p>
<ul>
<li><p>Recognizes that the query is related to flight availability.</p>
</li>
<li><p>Identifies <code>TicketAvailability</code> as a relevant tool.</p>
</li>
<li><p>Calls the function <code>fetch_ticket_availability()</code>, retrieves the results, and returns them to the user.</p>
</li>
</ul>
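<p>The selection step described above can be sketched in plain Python. In the real agent the LLM reasons over each tool's description; the hypothetical <code>pick_tool</code> below imitates that decision with simple keyword overlap, purely to make the flow concrete.</p>

```python
# Toy stand-in for the agent's tool-selection step: the real agent lets the
# LLM reason over each tool's description; here we fake it with keyword overlap.
TOOLS = {
    "TicketAvailability": "Fetch available tickets based on user query",
    "WeatherLookup": "Get the weather forecast for a city",
}

def pick_tool(query):
    query_words = set(query.lower().split())
    best, best_score = None, 0
    for name, description in TOOLS.items():
        score = len(query_words & set(description.lower().split()))
        if score > best_score:
            best, best_score = name, score
    return best  # None means: answer directly, no tool call

print(pick_tool("Find me available tickets from New Delhi to New York"))
```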
<hr />
<h3 id="heading-use-cases-of-ai-agents"><strong>Use Cases of AI Agents</strong></h3>
<p>AI Agents can be applied in multiple domains, including:</p>
<ul>
<li><p><strong>Customer Support:</strong> Handling queries, resolving complaints, and managing bookings.</p>
</li>
<li><p><strong>Healthcare:</strong> Assisting with medical diagnoses and patient follow-ups.</p>
</li>
<li><p><strong>Finance:</strong> Providing investment recommendations and fraud detection.</p>
</li>
<li><p><strong>E-commerce:</strong> Offering personalized shopping assistance and order tracking.</p>
</li>
<li><p><strong>Software Development:</strong> Automating bug detection and generating code snippets.</p>
</li>
</ul>
<hr />
<h3 id="heading-example-ai-agent-for-customer-support-in-online-ticket-booking"><strong>Example: AI Agent for Customer Support in Online Ticket Booking</strong></h3>
<p>Let’s walk through an example of an AI Agent designed for customer support in an online ticket booking system.</p>
<h4 id="heading-objective"><strong>Objective</strong></h4>
<p>To automate customer queries and assist with ticket bookings, cancellations, and modifications.</p>
<h4 id="heading-architecture"><strong>Architecture</strong></h4>
<ol>
<li><p><strong>LLM for Natural Language Understanding</strong> - GPT-based model for conversational interface.</p>
</li>
<li><p><strong>Memory Store</strong> - A Redis-based database for storing user history.</p>
</li>
<li><p><strong>APIs for Integration</strong> - Connecting with ticket booking systems (e.g., airline, train, event platforms).</p>
</li>
<li><p><strong>Decision Engine</strong> - Rule-based or reinforcement learning model for handling customer queries.</p>
</li>
</ol>
<h4 id="heading-workflow"><strong>Workflow</strong></h4>
<ol>
<li><p><strong>User Query:</strong> "I want to book a flight from New Delhi to New York for next Monday."</p>
</li>
<li><p><strong>Intent Recognition:</strong> The agent extracts key details: origin (New Delhi), destination (New York), date (next Monday).</p>
</li>
<li><p><strong>API Call:</strong> The agent fetches available flights and presents options.</p>
</li>
<li><p><strong>User Confirmation:</strong> The user selects a preferred flight.</p>
</li>
<li><p><strong>Booking Completion:</strong> The agent books the flight and provides a confirmation.</p>
</li>
<li><p><strong>Follow-up:</strong> If needed, the agent can assist with cancellations, seat selection, or meal preferences.</p>
</li>
</ol>
<hr />
<h3 id="heading-sample-implementation-of-ai-agent-for-customer-support-based-on-above">Sample Implementation of <strong>AI Agent for Customer Support</strong> based on above</h3>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> langchain.chat_models <span class="hljs-keyword">import</span> ChatOpenAI
<span class="hljs-keyword">from</span> langchain.memory <span class="hljs-keyword">import</span> RedisChatMessageHistory
<span class="hljs-keyword">from</span> langchain.agents <span class="hljs-keyword">import</span> initialize_agent, Tool
<span class="hljs-keyword">from</span> langchain.tools <span class="hljs-keyword">import</span> tool
<span class="hljs-keyword">import</span> requests
<span class="hljs-keyword">import</span> datetime

<span class="hljs-comment"># Define a function to fetch available flights</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fetch_flights</span>(<span class="hljs-params">origin, destination, date</span>):</span>
    <span class="hljs-comment"># Placeholder function: Replace with actual API calls to airline or travel service</span>
    <span class="hljs-keyword">return</span> [
        {<span class="hljs-string">"flight"</span>: <span class="hljs-string">"AI 101"</span>, <span class="hljs-string">"departure"</span>: <span class="hljs-string">"10:00 AM"</span>, <span class="hljs-string">"arrival"</span>: <span class="hljs-string">"2:00 PM"</span>, <span class="hljs-string">"price"</span>: <span class="hljs-string">"$500"</span>},
        {<span class="hljs-string">"flight"</span>: <span class="hljs-string">"UA 202"</span>, <span class="hljs-string">"departure"</span>: <span class="hljs-string">"1:00 PM"</span>, <span class="hljs-string">"arrival"</span>: <span class="hljs-string">"5:00 PM"</span>, <span class="hljs-string">"price"</span>: <span class="hljs-string">"$550"</span>},
    ]

<span class="hljs-comment"># Define a function to book flights</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">book_flight</span>(<span class="hljs-params">flight_id, user_details</span>):</span>
    <span class="hljs-comment"># Placeholder function: Replace with actual API call to book the flight</span>
    <span class="hljs-keyword">return</span> {<span class="hljs-string">"status"</span>: <span class="hljs-string">"confirmed"</span>, <span class="hljs-string">"flight_id"</span>: flight_id, <span class="hljs-string">"user"</span>: user_details}

<span class="hljs-comment"># LangChain Tool for fetching flights</span>
<span class="hljs-meta">@tool</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_flights</span>(<span class="hljs-params">origin: str, destination: str, date: str</span>):</span>
    <span class="hljs-string">"""Fetches available flights given origin, destination, and date."""</span>
    flights = fetch_flights(origin, destination, date)
    <span class="hljs-keyword">return</span> flights

<span class="hljs-comment"># LangChain Tool for booking flights</span>
<span class="hljs-meta">@tool</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">book_flight_tool</span>(<span class="hljs-params">flight_id: str = None, user_details: dict = None</span>):</span>
    <span class="hljs-string">"""Books a flight given flight ID and user details. If missing, prompts user for input."""</span>

    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> flight_id:
        <span class="hljs-keyword">return</span> <span class="hljs-string">"Please provide a flight ID from the available options."</span>

    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> user_details <span class="hljs-keyword">or</span> <span class="hljs-string">"name"</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> user_details <span class="hljs-keyword">or</span> <span class="hljs-string">"email"</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> user_details:
        <span class="hljs-keyword">return</span> <span class="hljs-string">"Please provide user details including name and email."</span>

    booking = book_flight(flight_id, user_details)
    <span class="hljs-keyword">return</span> booking


<span class="hljs-comment"># Memory store: wrap the Redis-backed history in ConversationBufferMemory</span>
<span class="hljs-comment"># so the agent can use it as conversational memory</span>
<span class="hljs-keyword">from</span> langchain.memory <span class="hljs-keyword">import</span> ConversationBufferMemory
memory = ConversationBufferMemory(
    memory_key=<span class="hljs-string">"chat_history"</span>,
    chat_memory=RedisChatMessageHistory(url=<span class="hljs-string">"redis://localhost:6379/0"</span>),
)

<span class="hljs-comment"># Initialize LLM</span>
llm = ChatOpenAI(model_name=<span class="hljs-string">"gpt-4"</span>)

<span class="hljs-comment"># Define the agent (the conversational agent type supports chat memory)</span>
tools = [get_flights, book_flight_tool]
agent = initialize_agent(
    tools, llm, agent=<span class="hljs-string">"conversational-react-description"</span>, verbose=<span class="hljs-literal">True</span>, memory=memory
)

<span class="hljs-comment"># Example query</span>
response = agent.run(<span class="hljs-string">"I want to book a flight from New Delhi to New York for next Monday."</span>)
print(response)
</code></pre>
<p><strong>How AI Agents handle memory</strong></p>
<p>The agent writes conversation history into Redis using the <code>RedisChatMessageHistory</code> memory store. Specifically, it stores messages exchanged between the user and the agent, allowing the AI to maintain context across interactions.</p>
<p><strong>What Gets Stored in Redis?</strong></p>
<ol>
<li><p><strong>User Messages:</strong> The queries or requests made by the user (e.g., <em>"I want to book a flight from New Delhi to New York for next Monday."</em>).</p>
</li>
<li><p><strong>Agent Responses:</strong> The replies generated by the AI (e.g., <em>"Here are the available flights for your route."</em>).</p>
</li>
<li><p><strong>Contextual Memory:</strong> If the user continues the conversation (e.g., <em>"Book the first one."</em>), the agent remembers the previous flight options presented.</p>
</li>
</ol>
<p><strong>How Does Redis Store the Data?</strong></p>
<ul>
<li><p>The <code>RedisChatMessageHistory</code> class stores messages as a key-value structure in Redis.</p>
</li>
<li><p>Each user session is typically associated with a unique key (e.g., <code>chat:&lt;session_id&gt;</code>).</p>
</li>
<li><p>Messages are stored in chronological order, allowing retrieval for context-based responses.</p>
</li>
</ul>
<p><strong>Example of Stored Data in Redis</strong></p>
<pre><code class="lang-json">{<span class="hljs-attr">"chat:session_123"</span>: [
        {<span class="hljs-attr">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-attr">"message"</span>: <span class="hljs-string">"I want to book a flight from New Delhi to New York for next Monday."</span>},
        {<span class="hljs-attr">"role"</span>: <span class="hljs-string">"agent"</span>, <span class="hljs-attr">"message"</span>: <span class="hljs-string">"Here are the available flights: AI 101 - $500, UA 202 - $550."</span>},
        {<span class="hljs-attr">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-attr">"message"</span>: <span class="hljs-string">"Book AI 101."</span>},
        {<span class="hljs-attr">"role"</span>: <span class="hljs-string">"agent"</span>, <span class="hljs-attr">"message"</span>: <span class="hljs-string">"Your booking for AI 101 is confirmed."</span>}
    ]
}
</code></pre>
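<p>The storage pattern above can be sketched with an in-memory stand-in for Redis (no server required). The key layout, <code>chat:&lt;session_id&gt;</code> mapping to an ordered list of messages, mirrors the structure shown above; the class here is illustrative and is not LangChain's API.</p>

```python
import json

class InMemoryChatHistory:
    """Illustrative stand-in for a Redis-backed chat history.

    Messages are appended in chronological order under a per-session key,
    mirroring the chat:<session_id> layout shown above.
    """

    def __init__(self):
        self.store = {}  # key -> list of JSON-encoded messages

    def add_message(self, session_id, role, message):
        key = f"chat:{session_id}"
        self.store.setdefault(key, []).append(
            json.dumps({"role": role, "message": message})
        )

    def get_messages(self, session_id):
        key = f"chat:{session_id}"
        return [json.loads(m) for m in self.store.get(key, [])]

history = InMemoryChatHistory()
history.add_message("session_123", "user", "I want to book a flight.")
history.add_message("session_123", "agent", "Here are the available flights.")
print(history.get_messages("session_123"))
```

<p>Swapping the dict for a Redis list (e.g. <code>RPUSH</code>/<code>LRANGE</code>) gives the same behavior with persistence across processes.</p>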
<p>The agent utilizes the message history stored in Redis to maintain context and continuity in the conversation. Here’s how it works:</p>
<hr />
<p><strong>1. Retrieving Conversation History</strong></p>
<p>The <code>RedisChatMessageHistory</code> memory store acts as a persistent message history. Each time the user interacts with the agent, it retrieves past interactions from Redis, allowing it to remember the conversation.</p>
<ul>
<li><p>When a user starts a new session, Redis retrieves previous messages using a unique session key (e.g., <code>chat:&lt;session_id&gt;</code>).</p>
</li>
<li><p>The LangChain memory module feeds this history into the LLM, enabling it to generate responses based on past exchanges.</p>
</li>
</ul>
<hr />
<p><strong>2. Contextual Understanding</strong></p>
<p>Since the agent maintains history, it can:</p>
<p>✅ <strong>Understand Follow-up Queries</strong><br />If a user says:</p>
<ul>
<li><p><strong>User:</strong> <em>"I want to book a flight from New Delhi to New York for next Monday."</em></p>
</li>
<li><p><strong>Agent:</strong> <em>"Here are the available flights: AI 101 - $500, UA 202 - $550."</em></p>
</li>
<li><p><strong>User:</strong> <em>"Book the first one."</em></p>
</li>
</ul>
<p>The agent remembers "AI 101" as the first option without needing the user to repeat it.</p>
<p>✅ <strong>Maintain Personalization</strong><br />If a user previously requested vegetarian meals or window seats, the agent can recall this preference.</p>
<p>✅ <strong>Handle Multi-turn Conversations</strong></p>
<ul>
<li><p><strong>User:</strong> <em>"What’s my booking status?"</em></p>
</li>
<li><p><strong>Agent:</strong> (retrieves previous booking confirmation) <em>"Your flight AI 101 is confirmed."</em></p>
</li>
</ul>
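<p>Resolving a follow-up like <em>"Book the first one"</em> relies on the options stored in the conversation history. A minimal sketch of that resolution step (illustrative only, not LangChain code):</p>

```python
# Toy resolver for ordinal follow-ups ("the first one", "the second one"),
# using the options the agent previously presented in this session.
ORDINALS = {"first": 0, "second": 1, "third": 2}

def resolve_reference(query, last_options):
    for word, index in ORDINALS.items():
        if word in query.lower() and index < len(last_options):
            return last_options[index]
    return None  # no ordinal found; ask the user to clarify

last_options = ["AI 101", "UA 202"]  # what the agent offered last turn
print(resolve_reference("Book the first one", last_options))
```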
<hr />
<p><strong>3. How LangChain Uses History</strong></p>
<p>LangChain’s memory mechanism ensures that past interactions are passed as part of the conversation context.</p>
<ul>
<li><p><strong>Example without memory:</strong></p>
<ul>
<li><p>User: <em>"Book the first one."</em></p>
</li>
<li><p>Agent: <em>"I don’t understand. Which flight?"</em></p>
</li>
</ul>
</li>
<li><p><strong>Example with memory:</strong></p>
<ul>
<li><p>User: <em>"Book the first one."</em></p>
</li>
<li><p>Agent: <em>(Remembers previous options)</em> <em>"Your flight AI 101 is confirmed."</em></p>
</li>
</ul>
</li>
</ul>
<hr />
]]></content:encoded></item><item><title><![CDATA[Guide to Choosing the Right Database for Your App]]></title><description><![CDATA[Selecting the appropriate data store for your application is crucial. Here’s an overview of database types and when to choose each for specific use cases:

Relational Databases (RDBMS)

Characteristics:

Organize data into tables with predefined sche...]]></description><link>https://anishratnawat.com/guide-to-choosing-the-right-database-for-your-app</link><guid isPermaLink="true">https://anishratnawat.com/guide-to-choosing-the-right-database-for-your-app</guid><category><![CDATA[chooserightdatabase]]></category><dc:creator><![CDATA[Anish Ratnawat]]></dc:creator><pubDate>Sat, 12 Oct 2024 18:30:00 GMT</pubDate><content:encoded><![CDATA[<p>Selecting the appropriate data store for your application is crucial. Here’s an overview of database types and when to choose each for specific use cases:</p>
<hr />
<h1 id="heading-relational-databases-rdbms"><strong>Relational Databases (RDBMS)</strong></h1>
<ul>
<li><p><strong>Characteristics:</strong></p>
<ul>
<li><p>Organize data into tables with predefined schemas.</p>
</li>
<li><p>Support strong ACID (Atomicity, Consistency, Isolation, Durability) properties.</p>
</li>
<li><p>Use SQL for querying.</p>
</li>
<li><p>Best for applications needing complex queries, transactions, and structured data.</p>
</li>
<li><p>Sharding is possible but not well supported (there is no built-in support).</p>
<ul>
<li>SQL guarantees consistency, but waiting for all shards to agree on a transaction can be costly.</li>
</ul>
</li>
</ul>
</li>
<li><p><strong>Databases:</strong></p>
<ul>
<li>Microsoft SQL Server, PostgreSQL, MySQL, Oracle Database.</li>
</ul>
</li>
<li><p><strong>When to Choose:</strong></p>
<ul>
<li><p>Applications with <strong>structured data</strong></p>
</li>
<li><p>Applications with <strong>fixed/strict schemas.</strong></p>
</li>
<li><p>Scenarios needing complex relationships between entities (e.g., e-commerce, banking) or <strong>complex joins</strong>.</p>
</li>
<li><p>Use cases requiring <strong>strict consistency and ACID transaction</strong> support.</p>
</li>
</ul>
</li>
<li><p><strong>Examples</strong></p>
<ul>
<li><p>Inventory/Order/Reporting management</p>
</li>
<li><p>Accounting/ Banking</p>
</li>
</ul>
</li>
</ul>
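<p>The ACID guarantees above can be demonstrated with SQLite, which ships with Python. The sketch below shows atomicity: a transfer that fails midway rolls back, leaving both account balances untouched.</p>

```python
import sqlite3

# In-memory database: schema and seed data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money atomically: both updates commit together, or neither does."""
    try:
        with conn:  # opens a transaction; rolls back automatically on exception
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            remaining = conn.execute("SELECT balance FROM accounts WHERE name = ?",
                                     (src,)).fetchone()[0]
            if remaining < 0:
                raise ValueError("insufficient funds")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
        return True
    except ValueError:
        return False

transfer(conn, "alice", "bob", 30)    # succeeds: alice 70, bob 80
transfer(conn, "alice", "bob", 1000)  # fails and rolls back: balances unchanged
print(dict(conn.execute("SELECT name, balance FROM accounts")))
```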
<hr />
<h1 id="heading-nosql-databases"><strong>NoSQL Databases</strong></h1>
<p><strong>When to Choose:</strong></p>
<ul>
<li><p>Applications with <strong>semi-structured data</strong></p>
</li>
<li><p>Applications with a <strong>dynamic schema</strong></p>
</li>
<li><p><strong>Little need</strong> for <strong>complex joins</strong></p>
</li>
<li><p>Need to store many <strong>TBs of data</strong> while remaining <strong>highly scalable</strong></p>
</li>
</ul>
<h3 id="heading-key-value-stores"><strong>Key-Value Stores</strong></h3>
<ul>
<li><p><strong>Characteristics:</strong></p>
<ul>
<li><p>Simplest type of NoSQL database.</p>
</li>
<li><p>Data is stored as key-value pairs.</p>
</li>
<li><p>Optimized for fast reads and writes.</p>
</li>
</ul>
</li>
<li><p><strong>Databases:</strong></p>
<ul>
<li>Redis, DynamoDB.</li>
</ul>
</li>
<li><p><strong>When to Choose:</strong></p>
<ul>
<li><p>Data is accessed using a single key, like a dictionary.</p>
</li>
<li><p>No joins, locks, or unions are required.</p>
</li>
<li><p>No aggregation mechanisms are used.</p>
</li>
<li><p>Quick lookups are needed and relationships are minimal.</p>
</li>
</ul>
</li>
<li><p><strong>Example:</strong></p>
<ul>
<li>Caching, session management, and real-time analytics.</li>
</ul>
</li>
</ul>
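<p>The caching use case maps naturally onto a key-value store's get/set-with-expiry interface. A minimal in-process sketch with Redis-like TTL semantics (no server involved):</p>

```python
import time

class TTLCache:
    """Tiny key-value cache with per-key expiry, mimicking SET key value EX ttl."""

    def __init__(self):
        self.data = {}  # key -> (value, expires_at or None)

    def set(self, key, value, ttl=None):
        expires_at = time.monotonic() + ttl if ttl else None
        self.data[key] = (value, expires_at)

    def get(self, key):
        entry = self.data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and time.monotonic() >= expires_at:
            del self.data[key]  # lazy expiry, as Redis does on access
            return None
        return value

cache = TTLCache()
cache.set("session:42", {"user": "anish"}, ttl=0.05)
print(cache.get("session:42"))  # session present before expiry
time.sleep(0.06)
print(cache.get("session:42"))  # gone after the TTL
```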
<h3 id="heading-document-databases"><strong>Document Databases</strong></h3>
<ul>
<li><p><strong>Characteristics:</strong></p>
<ul>
<li><p>Store semi-structured data as JSON or BSON documents.</p>
</li>
<li><p>Allow flexible schemas and hierarchical data.</p>
</li>
</ul>
</li>
<li><p><strong>Databases:</strong></p>
<ul>
<li>MongoDB, DynamoDB, Cosmos DB.</li>
</ul>
</li>
<li><p><strong>When to Choose:</strong></p>
<ul>
<li><p>Flexible schemas are required, with varying indexes needed across multiple fields</p>
</li>
<li><p>Scenarios where data structure varies across records.</p>
</li>
</ul>
</li>
<li><p><strong>Example:</strong></p>
<ul>
<li>Content management systems, product catalogs, or applications needing flexible schemas.</li>
</ul>
</li>
</ul>
<h3 id="heading-column-family-stores"><strong>Column-Family Stores</strong></h3>
<ul>
<li><p><strong>Characteristics:</strong></p>
<ul>
<li><p>Data is stored in tables but optimized for columnar storage instead of rows. Each column is part of a column family.</p>
</li>
<li><p>Scalable and efficient for write-heavy workloads.</p>
</li>
<li><p>Update and delete operations are rare.</p>
</li>
<li><p>Designed to provide high throughput and low-latency access.</p>
</li>
</ul>
</li>
<li><p><strong>Databases:</strong></p>
<ul>
<li>Cassandra, HBase, ScyllaDB.</li>
</ul>
</li>
<li><p><strong>When to Choose:</strong></p>
<ul>
<li><p>Heavy write operations</p>
</li>
<li><p>High throughput and low latency</p>
</li>
</ul>
</li>
<li><p><strong>Example:</strong></p>
<ul>
<li>Recommendations, Personalization, Sensor data, Telemetry, Messaging, Social media analytics, Activity monitoring, Weather and other time-series data</li>
</ul>
</li>
</ul>
<h3 id="heading-graph-databases"><strong>Graph Databases</strong></h3>
<ul>
<li><p><strong>Characteristics:</strong></p>
<ul>
<li><p>Represent data as nodes and edges, ideal for modeling relationships.</p>
</li>
<li><p>Use graph-based querying languages like Gremlin or Cypher.</p>
</li>
</ul>
</li>
<li><p><strong>Examples:</strong></p>
<ul>
<li>Neo4j, Cosmos DB (Gremlin API), Amazon Neptune.</li>
</ul>
</li>
<li><p><strong>When to Choose:</strong></p>
<ul>
<li><p>Social networks, recommendation engines, fraud detection.</p>
</li>
<li><p>Use cases where relationships are critical and highly interconnected.</p>
</li>
</ul>
</li>
</ul>
<h2 id="heading-time-series-databases"><strong>Time-Series Databases</strong></h2>
<ul>
<li><p><strong>Characteristics:</strong></p>
<ul>
<li><p>Designed to handle time-stamped or sequential data.</p>
</li>
<li><p>Optimized for time-series analysis and aggregation.</p>
</li>
</ul>
</li>
<li><p><strong>Examples:</strong></p>
<ul>
<li>InfluxDB, TimescaleDB, OpenTSDB.</li>
</ul>
</li>
<li><p><strong>When to Choose:</strong></p>
<ul>
<li>IoT sensor data, financial transactions, performance monitoring.</li>
</ul>
</li>
</ul>
<h2 id="heading-search-databases"><strong>Search Databases</strong></h2>
<ul>
<li><p><strong>Characteristics:</strong></p>
<ul>
<li><p>Optimized for full-text search and analysis.</p>
</li>
<li><p>Provide advanced query and indexing capabilities for text.</p>
</li>
</ul>
</li>
<li><p><strong>Examples:</strong></p>
<ul>
<li>Elasticsearch, Solr, Azure Cognitive Search.</li>
</ul>
</li>
<li><p><strong>When to Choose:</strong></p>
<ul>
<li><p>Applications needing text search or log analysis.</p>
</li>
<li><p>Use cases like e-commerce product search or log monitoring.</p>
</li>
</ul>
</li>
</ul>
<h2 id="heading-object-storage"><strong>Object Storage</strong></h2>
<ul>
<li><p><strong>Characteristics:</strong></p>
<ul>
<li><p>Designed for storing large, unstructured data (files, images, videos).</p>
</li>
<li><p>Offers flat namespaces and metadata for objects.</p>
</li>
</ul>
</li>
<li><p><strong>Examples:</strong></p>
<ul>
<li>Amazon S3, Azure Blob Storage.</li>
</ul>
</li>
<li><p><strong>When to Choose:</strong></p>
<ul>
<li>Media storage, backups, archival, or big data pipelines.</li>
</ul>
</li>
</ul>
<h2 id="heading-in-memory-databases"><strong>In-Memory Databases</strong></h2>
<ul>
<li><p><strong>Characteristics:</strong></p>
<ul>
<li><p>Data is stored in memory for ultra-fast access.</p>
</li>
<li><p>Often used as caching layers.</p>
</li>
</ul>
</li>
<li><p><strong>Examples:</strong></p>
<ul>
<li>Redis, Memcached.</li>
</ul>
</li>
<li><p><strong>When to Choose:</strong></p>
<ul>
<li>Real-time analytics, session stores, or scenarios requiring low-latency data access.</li>
</ul>
</li>
</ul>
<hr />
<h1 id="heading-choosing-the-right-database"><strong>Choosing the Right Database</strong></h1>
<ol>
<li><p><strong>Workload Type:</strong></p>
<ul>
<li><p>OLTP (Online Transaction Processing): Relational or document databases.</p>
</li>
<li><p>OLAP (Online Analytical Processing): Column-family or time-series databases.</p>
</li>
</ul>
</li>
<li><p><strong>Data Relationships:</strong></p>
<ul>
<li><p>Strong relationships: Relational or graph databases.</p>
</li>
<li><p>Weak or no relationships: NoSQL (key-value, document).</p>
</li>
</ul>
</li>
<li><p><strong>Scalability:</strong></p>
<ul>
<li><p>Horizontal scalability: NoSQL databases.</p>
</li>
<li><p>Vertical scalability: Relational databases.</p>
</li>
</ul>
</li>
<li><p><strong>Consistency vs. Availability:</strong></p>
<ul>
<li><p>Strict consistency: Relational databases.</p>
</li>
<li><p>Eventual consistency: NoSQL databases.</p>
</li>
</ul>
</li>
<li><p><strong>Schema Flexibility:</strong></p>
<ul>
<li><p>Fixed schema: Relational databases.</p>
</li>
<li><p>Flexible schema: Document or key-value stores.</p>
</li>
</ul>
</li>
<li><p><strong>Querying Needs:</strong></p>
<ul>
<li><p>Complex queries: Relational or graph databases.</p>
</li>
<li><p>Simple queries: Key-value or column-family stores.</p>
</li>
</ul>
</li>
</ol>
<p>By analyzing your use case across these dimensions, you can confidently choose the database that best fits your requirements.</p>
<p><strong>Which one to choose, and when</strong></p>
<ul>
<li><p>If ACID is required, choose SQL; otherwise consider NoSQL.</p>
<ul>
<li>Replication and sharding can be achieved in both, but sharding is harder in SQL because there is no built-in support.</li>
</ul>
</li>
<li><p>If you need high availability and can compromise on consistency, choose NoSQL.</p>
</li>
<li><p>Consider the factors below while choosing a database:</p>
<ul>
<li>Do you have <strong>Structured</strong> data? Do you need complex <strong>Joins</strong>? Do you need <strong>Transactions</strong>? What <strong>Consistency</strong> level? Do you need high <strong>Scalability</strong>? <strong>(SJTCS)</strong></li>
</ul>
</li>
</ul>
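<p>The SJTCS checklist above can be encoded as a rough rule of thumb. Treat this as a mnemonic rather than a real decision procedure; actual choices weigh many more factors (cost, team expertise, ecosystem).</p>

```python
def suggest_database(structured, complex_joins, transactions,
                     strict_consistency, high_scalability):
    """Rough rule-of-thumb encoding of the SJTCS checklist above."""
    # ACID transactions, strict consistency, or complex joins point to SQL.
    if transactions or strict_consistency or complex_joins:
        return "SQL (RDBMS)"
    # Massive scale or loosely structured data points to NoSQL.
    if high_scalability or not structured:
        return "NoSQL"
    return "Either; decide on other factors (cost, team expertise, ecosystem)"

print(suggest_database(structured=True, complex_joins=True, transactions=True,
                       strict_consistency=True, high_scalability=False))
```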
<p><strong>References:</strong></p>
<p><a target="_blank" href="https://learn.microsoft.com/en-us/azure/architecture/guide/technology-choices/data-store-overview">https://learn.microsoft.com/en-us/azure/architecture/guide/technology-choices/data-store-overview</a></p>
]]></content:encoded></item><item><title><![CDATA[Understanding the CAP Theorem]]></title><description><![CDATA[What is CAP Theorem
In the world of distributed systems, the CAP theorem is a fundamental concept that guides the design and architecture of these systems. Proposed by Eric Brewer in 2000, the CAP theorem states that it is impossible for a distribute...]]></description><link>https://anishratnawat.com/understanding-the-cap-theorem</link><guid isPermaLink="true">https://anishratnawat.com/understanding-the-cap-theorem</guid><category><![CDATA[CAP-Theorem]]></category><category><![CDATA[#CAPTheorem ]]></category><dc:creator><![CDATA[Anish Ratnawat]]></dc:creator><pubDate>Sat, 17 Aug 2024 18:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1736788999264/febbebd1-e892-4730-a400-1c20cf3c9648.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-what-is-cap-theorem">What is CAP Theorem</h2>
<p>In the world of distributed systems, the CAP theorem is a fundamental concept that guides the design and architecture of these systems. Proposed by Eric Brewer in 2000, the CAP theorem states that it is impossible for a distributed system to simultaneously guarantee all three of the following properties:</p>
<ol>
<li><p><strong>Consistency</strong></p>
</li>
<li><p><strong>Availability</strong></p>
</li>
<li><p><strong>Partition Tolerance</strong></p>
</li>
</ol>
<h3 id="heading-consistency"><strong>Consistency</strong></h3>
<p>Consistency ensures that <strong><em>every read request reflects the most recent write.</em></strong></p>
<p>In other words, <strong><em>all nodes have the same view of the data/state</em></strong> at any given time. When a client queries the system, it always retrieves the latest data.</p>
<h3 id="heading-availability"><strong>Availability</strong></h3>
<p>Availability ensures that <strong><em>every request (whether a read of recent or stale data, or a write) always receives a response, even if there is a node failure or a partition breakdown</em></strong>. This means the system remains operational, providing a response to every query.</p>
<h3 id="heading-partition-tolerance"><strong>Partition Tolerance</strong></h3>
<p>Partition tolerance guarantees that the system <strong><em>continues to operate despite any number of communication breakdowns/ network partitions between the nodes</em></strong>. In a distributed environment, <strong><em>network partitions are inevitable</em></strong> due to hardware failures, network congestion, or other issues.</p>
<hr />
<h2 id="heading-deep-dive-into-cap-theorem">Deep Dive into CAP Theorem</h2>
<p>A distributed system always needs to be partition tolerant; we shouldn't build a system where a network partition brings down the whole system.<br />So, a distributed system is always built to be <strong>Partition Tolerant</strong>.</p>
<p>In simple words, the <strong>CAP theorem</strong> says that when a network partition occurs, if you want your system to keep functioning, you can provide either <strong>Availability</strong> or <strong>Consistency</strong>, but not both.</p>
<h3 id="heading-how-a-distributed-system-breaks-consistency-or-availability"><strong>How a Distributed System breaks Consistency or Availability?</strong></h3>
<p><strong>Scenario 1: Multi-node system where multiple nodes can handle reads/writes, and a node fails to propagate an update to the other nodes.</strong></p>
<p>Consider a cluster with two nodes, N1 and N2, both capable of handling read and write requests.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1736787874402/139e4b5d-2b2f-4d69-9841-6765aaa9b7de.png" alt class="image--center mx-auto" /></p>
<p>In the diagram above, N1 receives an update request for <code>id=2</code>, modifying the salary from 800 to 1000. However, due to a network partition, N1 cannot propagate this update to N2.</p>
<p>When a read request is directed to N2, the node has two possible responses:</p>
<ol>
<li><p><strong>Respond with its current data</strong> (salary = 800) and update it later once the network partition is resolved. This approach makes the system <strong>available</strong> but <strong>not consistent</strong>.</p>
</li>
<li><p><strong>Return an error</strong>, indicating it does not have the latest data. This ensures <strong>consistency</strong> by avoiding the return of stale data but compromises <strong>availability</strong>.</p>
</li>
</ol>
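<p>The choice N2 faces can be sketched in a few lines of Python (node names and values here are illustrative, mirroring the diagram above):</p>

```python
# Minimal sketch of the choice a partitioned node faces: answer with
# possibly-stale data (AP) or refuse to answer (CP).

class Node:
    def __init__(self, data):
        self.data = dict(data)   # this node's local copy of the records
        self.reachable = True    # can this node reach its peers?

def read(node, key, mode):
    if node.reachable:
        return node.data[key]    # no partition: safe to answer
    if mode == "AP":
        return node.data[key]    # answer anyway -- may be stale
    raise RuntimeError("cannot confirm latest value")  # CP: refuse

# N1 took the write (salary 800 -> 1000) but cannot reach N2.
n2 = Node({"id=2": 800})
n2.reachable = False

print(read(n2, "id=2", "AP"))    # stale 800, but the system stays available
try:
    read(n2, "id=2", "CP")
except RuntimeError as e:
    print(e)                     # consistent, but this read is unavailable
```
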
<p><strong>Scenario 2: Single-leader system for read and write operations</strong></p>
<p>In a single-leader system, all read and write operations come to the leader, while other nodes remain synchronized with the leader and act as standby nodes in case the leader fails.</p>
<p>The challenge arises if the leader becomes disconnected from the cluster or clients cannot connect to it due to a network partition. In such cases, the system cannot process write requests until a new leader is elected, making the system <strong>consistent</strong> but not <strong>available</strong> during the transition.</p>
<p>However, if the system allows reads from a read replica, it can respond even when the leader is unreachable, which makes the system <strong>highly available</strong> but <strong>not consistent</strong> for reads.</p>
<p>A single-leader system that serves both reads and writes from the leader <strong>should not</strong> be classified as <strong>highly available</strong>.</p>
<p><strong>RDBMS(MySQL, Oracle, MS SQL Server, etc)</strong></p>
<p>It’s a no-brainer that all <strong>RDBMS are consistent</strong>, as all reads and writes go to a single node/server.</p>
<p>How about availability? You might say it is a single server and hence a single point of failure, so how is it categorized under Availability?</p>
<p>As noted earlier, CAP availability is not the same as the day-to-day availability/downtime we usually talk about. In a single-node system there can be no network partition, so as long as the node is up it will always return a successful response for any read/write operation, and is therefore available.</p>
<p>Thus, an RDBMS can be both highly available and consistent.</p>
<hr />
<h3 id="heading-trade-offs-in-cap-theorem"><strong>Trade-offs in CAP Theorem</strong></h3>
<p>The CAP theorem highlights three trade-off scenarios in distributed systems:</p>
<ol>
<li><p><strong>Consistency and Availability (CA):</strong><br /> Ensures identical data across all nodes and responsiveness to every request, but only while the network is reliable; such a system cannot survive a partition.</p>
</li>
<li><p><strong>Consistency and Partition Tolerance (CP):</strong><br /> Prioritizes data consistency across nodes despite network partitions. The system may become temporarily unavailable to preserve data integrity.</p>
</li>
<li><p><strong>Availability and Partition Tolerance (AP):</strong><br /> Focuses on staying operational during network disruptions. Sacrifices strict consistency, accepting temporary data inconsistencies to ensure accessibility.</p>
</li>
</ol>
<hr />
<h3 id="heading-practical-implications"><strong>Practical Implications</strong></h3>
<p>In real-world applications, the choice between consistency, availability, and partition tolerance depends on the specific use case:</p>
<ul>
<li><p><strong>Financial Systems:</strong> Strong consistency is critical to ensure accurate transactions.</p>
</li>
<li><p><strong>Social Media Platforms:</strong> Prioritize availability, allowing users to interact with slightly stale data.</p>
</li>
<li><p><strong>Global Systems:</strong> Partition tolerance is essential to maintain operations across distributed regions.</p>
</li>
</ul>
<p>Understanding the CAP theorem and its trade-offs helps engineers design systems that align with the unique requirements of their applications, ensuring reliability and performance in distributed environments.</p>
<hr />
<h3 id="heading-probing-the-cap-theorem"><strong>Probing the CAP Theorem</strong></h3>
<ol>
<li><p><strong>Can you only have 2 out of 3 CAP properties?</strong><br /> No, CAP means you must choose between <strong>Consistency</strong> and <strong>Availability</strong> during a partition, not abandon one entirely.</p>
</li>
<li><p><strong>Does partition tolerance eliminate partition challenges?</strong><br /> No, it ensures operation during partitions but doesn’t resolve consistency or availability issues.</p>
</li>
<li><p><strong>Example of a non-partition-tolerant system:</strong><br /> A centralized database or a multi-node system with synchronous replication halts during partitions due to dependency on full communication.</p>
</li>
<li><p><strong>How to make systems partition-tolerant?</strong></p>
<ul>
<li><p>Use <strong>eventual consistency</strong> to allow independent node decisions and reconcile later.</p>
</li>
<li><p>Adopt <strong>asynchronous replication</strong> to accept writes without waiting for acknowledgment.</p>
</li>
<li><p>Employ quorum-based systems for majority agreement.</p>
</li>
</ul>
</li>
<li><p><strong>Is partition tolerance optional?</strong><br /> No, distributed systems must handle partitions; the trade-off is between consistency and availability.</p>
</li>
<li><p><strong>What are CA systems?</strong><br /> CA systems prioritize consistency and availability but fail during partitions, making them non-partition-tolerant.</p>
</li>
<li><p><strong>Does 99.999% uptime mean high availability?</strong><br /> Not in CAP terms. Availability requires every request to a non-failing node to receive a valid response, even during partitions.</p>
</li>
<li><p><strong>Do timeout errors count as availability?</strong><br /> No, errors or timeouts compromise availability in CAP’s definition.</p>
</li>
<li><p><strong>Does eventual consistency meet CAP's consistency?</strong><br /> No, CAP’s consistency refers to strong consistency, which eventual consistency does not satisfy.</p>
</li>
<li><p><strong>Does relaxing consistency always lead to eventual consistency?</strong><br />Not always; it might result in unresolved inconsistencies without conflict resolution mechanisms.</p>
</li>
<li><p><strong>Can strong consistency be achieved with a majority quorum?</strong><br />Yes, but it sacrifices availability, adhering to CAP’s trade-offs.</p>
</li>
<li><p><strong>Does CAP apply to microservices?</strong><br />Yes, CAP principles are relevant to microservices as well as distributed databases.</p>
</li>
<li><p><strong>What if partition tolerance is ignored?</strong><br />Ignoring partition tolerance works in systems with reliable networks but risks failure during real-world partitions.</p>
</li>
<li><p><strong>When can partition tolerance be ignored?</strong><br />In tightly controlled environments (e.g., single-node systems or highly reliable networks), partitions are negligible. Examples: MySQL on a single server or Google Spanner with controlled infrastructure.</p>
</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[Understanding Caching and Cache Strategies]]></title><description><![CDATA[In the world of software engineering and distributed systems, caching is a fundamental technique for improving performance and scalability. By storing frequently accessed data closer to the user or application, caching reduces latency, minimizes load...]]></description><link>https://anishratnawat.com/understanding-caching-and-cache-strategies</link><guid isPermaLink="true">https://anishratnawat.com/understanding-caching-and-cache-strategies</guid><category><![CDATA[caching]]></category><category><![CDATA[caching strategies]]></category><dc:creator><![CDATA[Anish Ratnawat]]></dc:creator><pubDate>Sat, 03 Aug 2024 18:30:00 GMT</pubDate><content:encoded><![CDATA[<p>In the world of software engineering and distributed systems, caching is a fundamental technique for improving performance and scalability. By storing frequently accessed data closer to the user or application, caching reduces latency, minimizes load on backend systems, and enhances the overall user experience. In this article, we’ll explore the basics of caching, common cache strategies, and best practices for implementing an effective caching solution.</p>
<hr />
<h2 id="heading-what-is-caching">What is Caching?</h2>
<p>Caching is the process of storing a copy of data in a temporary storage location, called a cache, so that it can be retrieved more quickly on subsequent requests. Caches are typically placed in-memory, which allows for faster read/write operations compared to disk-based storage or database queries.</p>
<p>Caching is widely used in various layers of an application stack, including:</p>
<ul>
<li><p><strong>Database caching:</strong> To reduce query execution time.</p>
</li>
<li><p><strong>Application caching:</strong> To store results of expensive computations.</p>
</li>
<li><p><strong>Content delivery network (CDN):</strong> To cache static resources like images, CSS, and JavaScript files closer to the user.</p>
</li>
</ul>
<hr />
<h2 id="heading-cache-strategies">Cache Strategies</h2>
<h3 id="heading-1-write-through">1. <strong>Write-through</strong></h3>
<p>In the write-through strategy, every write operation is applied to both the cache and the underlying data store. This ensures that the cache and the database remain consistent.</p>
<ul>
<li><p><strong>Process:</strong></p>
<ol>
<li><p>Write data to the cache.</p>
</li>
<li><p>Propagate the write to the database.</p>
</li>
</ol>
</li>
<li><p><strong>Pros:</strong></p>
<ul>
<li>Ensures consistency between cache and database.</li>
</ul>
</li>
<li><p><strong>Cons:</strong></p>
<ul>
<li><p>Slower write operations due to dual writes.</p>
</li>
<li><p>Potentially redundant cache entries if the data is infrequently read.</p>
</li>
</ul>
</li>
</ul>
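<p>A minimal sketch of write-through, with plain dicts standing in for the real cache and database:</p>

```python
# Write-through sketch: every write hits the cache and the database in
# the same operation, so the two never diverge. `cache` and `db` are
# plain dicts standing in for real stores.

cache, db = {}, {}

def write_through(key, value):
    cache[key] = value   # 1. write to the cache
    db[key] = value      # 2. propagate to the database (synchronously)

def read(key):
    return cache.get(key, db.get(key))

write_through("user:1", {"salary": 1000})
assert cache["user:1"] == db["user:1"]   # cache and database stay consistent
```
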
<hr />
<h3 id="heading-2-write-back-write-behind">2. <strong>Write-back (Write-behind)</strong></h3>
<p>In this strategy, write operations are performed on the cache and written to the datastore asynchronously later.</p>
<ul>
<li><p><strong>Process:</strong></p>
<ol>
<li><p>Write data to the cache.</p>
</li>
<li><p>Periodically flush changes from the cache to the database.</p>
</li>
</ol>
</li>
<li><p><strong>Pros:</strong></p>
<ul>
<li><p>Faster writes as only the cache is updated initially.</p>
</li>
<li><p>Reduces write load on the database.</p>
</li>
</ul>
</li>
<li><p><strong>Cons:</strong></p>
<ul>
<li>Risk of data loss if the cache is not properly persisted before failure.</li>
</ul>
</li>
</ul>
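<p>A minimal sketch of write-back, again with plain dicts as stand-ins; the <code>dirty</code> set tracks what has not yet been flushed:</p>

```python
# Write-back (write-behind) sketch: writes land in the cache immediately,
# and a dirty set records which keys still need flushing to the database.

cache, db, dirty = {}, {}, set()

def write_back(key, value):
    cache[key] = value
    dirty.add(key)            # database update is deferred

def flush():
    for key in dirty:
        db[key] = cache[key]  # periodic flush to the datastore
    dirty.clear()

write_back("user:1", 1000)
# db is stale until flush() runs -- this window is the data-loss risk
# mentioned above if the cache fails before persisting.
flush()
```
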
<hr />
<h3 id="heading-3-write-around">3. <strong>Write-Around</strong></h3>
<p>In this strategy, write operations go directly to the datastore, bypassing the cache. Reads then use a cache-aside load.</p>
<h3 id="heading-cache-aside-load"><strong>Cache-aside Load</strong></h3>
<p>On a cache miss, the data is loaded into the cache. The application code is responsible for checking the cache first before fetching data from the source of truth (e.g., a database).</p>
<ul>
<li><p><strong>Process:</strong></p>
<ol>
<li><p>Check if the data is in the cache.</p>
</li>
<li><p>If found, return the data.</p>
</li>
<li><p>If not, fetch the data from the database, store it in the cache, and return it.</p>
</li>
</ol>
</li>
<li><p><strong>Pros:</strong></p>
<ul>
<li><p>Simple to implement.</p>
</li>
<li><p>Provides fine-grained control over cache behavior.</p>
</li>
</ul>
</li>
<li><p><strong>Cons:</strong></p>
<ul>
<li>Potential for stale data if not properly invalidated.</li>
</ul>
</li>
</ul>
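<p>The cache-aside steps above can be sketched as follows (the <code>db_fetch</code> helper is a hypothetical stand-in for a real database query):</p>

```python
# Cache-aside load sketch: check the cache first, and only on a miss
# fetch from the source of truth and populate the cache.

cache = {}

def db_fetch(key):                 # hypothetical stand-in for a DB query
    return {"user:1": "Anish"}.get(key)

def get(key):
    if key in cache:               # 1. check the cache
        return cache[key]          # 2. hit: return it
    value = db_fetch(key)          # 3. miss: go to the source of truth
    cache[key] = value             #    store it for future requests
    return value

get("user:1")                      # first call: miss, loads from the db
get("user:1")                      # second call: served from the cache
```
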
<hr />
<h2 id="heading-cache-eviction-policies">Cache Eviction Policies</h2>
<p>Caching systems have limited storage, so eviction policies determine which data to remove when the cache is full. Common eviction policies include:</p>
<ol>
<li><p><strong>Least Recently Used (LRU):</strong> Evicts the least recently accessed items first.</p>
</li>
<li><p><strong>Least Frequently Used (LFU):</strong> Evicts items accessed the least number of times.</p>
</li>
<li><p><strong>First In, First Out (FIFO):</strong> Evicts items in the order they were added.</p>
</li>
<li><p><strong>Random:</strong> Evicts random items to reduce complexity.</p>
</li>
</ol>
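<p>As an illustration, LRU eviction can be sketched with Python's <code>OrderedDict</code> (a simplified model, not a production cache):</p>

```python
# LRU eviction sketch: the least recently accessed key is evicted once
# capacity is exceeded. OrderedDict keeps keys in access order.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)        # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False) # evict the least recently used

lru = LRUCache(2)
lru.put("a", 1); lru.put("b", 2)
lru.get("a")          # "a" is now most recently used
lru.put("c", 3)       # cache is full, so "b" is evicted
```
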
<hr />
<h2 id="heading-best-practices-for-caching">Best Practices for Caching</h2>
<ol>
<li><p><strong>Use Appropriate Expiration Times:</strong></p>
<ul>
<li>Set reasonable TTL values to avoid serving stale data.</li>
</ul>
</li>
<li><p><strong>Monitor Cache Performance:</strong></p>
<ul>
<li>Continuously track hit/miss rates to evaluate effectiveness.</li>
</ul>
</li>
<li><p><strong>Implement Cache Invalidation Strategies:</strong></p>
<ul>
<li>Use mechanisms like versioning or explicit invalidation to ensure data consistency.</li>
</ul>
</li>
<li><p><strong>Avoid Over-Caching:</strong></p>
<ul>
<li>Cache only what is necessary to prevent excessive memory usage.</li>
</ul>
</li>
<li><p><strong>Secure Your Cache:</strong></p>
<ul>
<li>Use encryption and access controls to protect sensitive data.</li>
</ul>
</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[Basics of Content Delivery Network]]></title><description><![CDATA[A CDN is a distributed network of servers strategically placed across different geographical locations. These servers work together to deliver content, such as HTML pages, JavaScript files, stylesheets, images, and videos, to users based on their pro...]]></description><link>https://anishratnawat.com/basics-of-content-delivery-network</link><guid isPermaLink="true">https://anishratnawat.com/basics-of-content-delivery-network</guid><category><![CDATA[CDN]]></category><dc:creator><![CDATA[Anish Ratnawat]]></dc:creator><pubDate>Sat, 20 Jul 2024 18:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1737555529203/d195a1e4-e1c9-44d3-8586-ab0efea66b30.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A CDN is a distributed network of servers strategically placed across different geographical locations. These servers work together to deliver content, such as HTML pages, JavaScript files, stylesheets, images, and videos, to <strong>users</strong> based on their <strong>proximity</strong> to a server.</p>
<p>By reducing the distance between users and servers, CDNs minimize latency, improve load times, and enhance the overall user experience.</p>
<h3 id="heading-how-does-a-cdn-work">How Does a CDN Work?</h3>
<p>At its core, a CDN works by caching content on multiple servers spread across various geographical locations, also known as edge servers. Here’s a step-by-step breakdown of how a CDN operates:</p>
<ol>
<li><p><strong>Content Caching:</strong> With a push CDN, the origin server uploads content to the CDN’s edge servers in advance.</p>
</li>
<li><p><strong>User Request:</strong> With a pull CDN, when a user requests a resource (e.g., a webpage or an image), the request is routed to the nearest CDN edge server based on the user's location.</p>
</li>
<li><p><strong>Cache Lookup:</strong> The edge server checks if the requested content is cached.</p>
<ul>
<li><p>If cached, the content is delivered directly to the user, ensuring minimal latency.</p>
</li>
<li><p>If not cached, the edge server retrieves the content from the origin server, serves it to the user, and caches it for future requests.</p>
</li>
</ul>
</li>
<li><p><strong>Content Delivery:</strong> The content is delivered to the user from the edge server, reducing load on the origin server and enhancing the user experience.</p>
</li>
</ol>
<p>This distributed approach ensures faster load times, reduced bandwidth costs, and improved reliability, even during traffic spikes.</p>
<h3 id="heading-cdn-architecture">CDN Architecture</h3>
<p>A typical CDN architecture consists of the following components:</p>
<ol>
<li><p><strong>Origin Server:</strong> The central repository where the original content is stored, typically the website’s hosting server.</p>
</li>
<li><p><strong>Edge Servers:</strong> Distributed servers located in various geographical locations. These servers cache content to serve users from the nearest possible location.</p>
</li>
<li><p><strong>Points of Presence (PoPs):</strong> Physical data centers housing edge servers, strategically placed to maximize coverage and minimize latency.</p>
</li>
<li><p><strong>Load Balancer:</strong> Distributes incoming traffic across multiple servers to prevent overloading any single server.</p>
</li>
<li><p><strong>Content Routing Mechanism:</strong> Uses algorithms to direct user requests to the most optimal edge server based on factors like proximity, server health, and cache availability.</p>
</li>
<li><p><strong>Analytics and Monitoring Tools:</strong> Collect data on performance metrics, user behavior, and system health, providing actionable insights for optimization.</p>
</li>
</ol>
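<p>As a toy illustration of the content-routing mechanism, the sketch below picks the PoP nearest to the user by straight-line distance. Real CDNs combine DNS/anycast routing with server health and cache state; the PoP names and coordinates here are made up:</p>

```python
# Proximity-based routing sketch: choose the PoP with the smallest
# straight-line distance to the user. Purely illustrative -- production
# routing also weighs server load, health, and cache availability.
import math

pops = {
    "us-east": (38.9, -77.0),   # hypothetical PoP coordinates (lat, lon)
    "eu-west": (53.3, -6.2),
    "ap-south": (19.1, 72.9),
}

def nearest_pop(user_lat, user_lon):
    def dist(name):
        lat, lon = pops[name]
        return math.hypot(lat - user_lat, lon - user_lon)
    return min(pops, key=dist)

print(nearest_pop(48.8, 2.3))   # a user near Paris -> "eu-west"
```
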
<p>CDN providers typically do not dedicate a single edge server to a specific origin server. Instead, edge servers are shared among multiple origin servers and content providers. This shared infrastructure approach allows CDNs to maximize resource utilization, distribute traffic efficiently, and offer cost-effective solutions to their clients.</p>
<p>However, some enterprise-level CDN services, such as Akamai or AWS CloudFront, may provide dedicated or isolated resources for specific high-demand clients. This could include private caching configurations or dedicated PoPs (Points of Presence) for clients with unique security, compliance, or performance requirements.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1737555392068/c2d43ecc-e35c-4596-a49c-7371f6aeb3ee.png" alt class="image--center mx-auto" /></p>
<p>This modular architecture enables CDNs to deliver content efficiently, handle high traffic volumes, and provide resilience against server outages or DDoS attacks.</p>
<h3 id="heading-key-benefits-of-using-a-cdn">Key Benefits of Using a CDN</h3>
<ul>
<li><p><strong>Reduced Latency:</strong> Faster load times for end-users.</p>
</li>
<li><p><strong>Improved Availability:</strong> Enhanced uptime and reliability.</p>
</li>
<li><p><strong>Reduced Server Load:</strong> Offloads traffic from the origin server.</p>
</li>
<li><p><strong>Better Scalability:</strong> Handles traffic spikes efficiently.</p>
</li>
</ul>
<h3 id="heading-push-vs-pull-cdns">Push vs. Pull CDNs</h3>
<p>CDNs can operate in two primary modes: push and pull. Each mode has its unique use cases, advantages, and trade-offs.</p>
<h4 id="heading-push-cdn">Push CDN</h4>
<p>In a push CDN, the content is manually uploaded to the CDN's servers by the content provider. The provider is responsible for ensuring that the latest version of the content is available on the CDN.</p>
<p><strong>How It Works:</strong></p>
<ol>
<li><p>The website owner uploads content (e.g., images, videos) to the CDN.</p>
</li>
<li><p>The CDN stores this content in its servers.</p>
</li>
<li><p>When a user requests the content, it is served directly from the CDN’s servers.</p>
</li>
</ol>
<p><strong>Example Use Case:</strong> A media company hosting high-quality videos might use a push CDN to pre-upload their videos to ensure users always get the best experience without latency.</p>
<h4 id="heading-pull-cdn">Pull CDN</h4>
<p>In a pull CDN, content is fetched dynamically from the origin server and cached on the CDN’s edge servers when a user requests it for the first time.</p>
<p><strong>How It Works:</strong></p>
<ol>
<li><p>A user requests content.</p>
</li>
<li><p>If the content is not already cached in the CDN, it is fetched from the origin server.</p>
</li>
<li><p>The fetched content is cached for subsequent requests.</p>
</li>
</ol>
<p><strong>Example Use Case:</strong> An e-commerce website with frequently updated product images and descriptions can leverage a pull CDN to ensure users always receive the latest content.</p>
<h3 id="heading-key-differences-between-push-and-pull-cdns">Key Differences Between Push and Pull CDNs</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<th>Feature</th><th>Push CDN</th><th>Pull CDN</th></tr>
</thead>
<tbody>
<tr>
<td><strong>Content Upload</strong></td><td>Manual push</td><td>Automatic pull by CDN</td></tr>
<tr>
<td><strong>Best For</strong></td><td>Static, infrequently updated content</td><td>Dynamic, frequently updated content</td></tr>
<tr>
<td><strong>Initial Latency</strong></td><td>Low</td><td>Higher (during the first request)</td></tr>
<tr>
<td><strong>Management Effort</strong></td><td>Higher</td><td>Lower</td></tr>
<tr>
<td><strong>Cost Predictability</strong></td><td>More predictable</td><td>Depends on cache hit/miss ratio</td></tr>
</tbody>
</table>
</div><h3 id="heading-hybrid-approach">Hybrid Approach</h3>
<p>Some CDN providers offer a hybrid model, combining the best of both push and pull CDNs. This allows businesses to push critical static assets while relying on pull mechanisms for dynamic content.</p>
<h3 id="heading-real-world-cdn-providers">Real-World CDN Providers</h3>
<ol>
<li><p><strong>Cloudflare:</strong> Primarily operates as a pull CDN, suitable for dynamic websites.</p>
</li>
<li><p><strong>Akamai:</strong> Offers both push and pull CDN configurations for enterprise-level applications.</p>
</li>
<li><p><strong>Amazon CloudFront:</strong> Supports a hybrid approach with extensive customization options.</p>
</li>
</ol>
<h3 id="heading-how-cdn-sends-analytics-back-to-the-server">How CDN Sends Analytics Back to the Server</h3>
<p>CDNs not only deliver content efficiently but also provide valuable analytics to help content providers monitor performance and user behavior. Here's how it works:</p>
<ol>
<li><p><strong>Data Collection:</strong> The CDN edge servers collect data on metrics such as user location, content type, request times, cache hits and misses, and bandwidth usage.</p>
</li>
<li><p><strong>Aggregation:</strong> This data is aggregated in real-time or near real-time to provide a comprehensive view of content delivery performance.</p>
</li>
<li><p><strong>Transmission to the Origin Server:</strong> The CDN transmits this aggregated data back to the origin server or a centralized analytics system, often via APIs or dashboards.</p>
</li>
<li><p><strong>Actionable Insights:</strong> Content providers can use these insights to optimize delivery strategies, improve cache efficiency, and enhance user experience.</p>
</li>
</ol>
<p>For example, a streaming platform can monitor which regions have the most users experiencing latency, allowing them to strategically deploy additional edge servers in those locations. Analytics also help in identifying popular content, assisting in targeted marketing campaigns.</p>
]]></content:encoded></item><item><title><![CDATA[Latency Numbers reference for System Design]]></title><description><![CDATA[Latency numbers can provide valuable context during system design , especially when discussing performance optimization, scalability, and trade-offs. Here are some common latency numbers worth referen]]></description><link>https://anishratnawat.com/latency-numbers-reference-for-system-design</link><guid isPermaLink="true">https://anishratnawat.com/latency-numbers-reference-for-system-design</guid><category><![CDATA[latency]]></category><dc:creator><![CDATA[Anish Ratnawat]]></dc:creator><pubDate>Sat, 20 Jul 2024 18:30:00 GMT</pubDate><content:encoded><![CDATA[<p>Latency numbers can provide valuable context during system design , especially when discussing performance optimization, scalability, and trade-offs. Here are some common latency numbers worth referencing:</p>
<h3><strong>Modern Hardware Limits</strong></h3>
<p>Today’s servers have massive capacities that change the "distributed vs. single machine" trade-off.</p>
<ul>
<li><p><strong>Compute/Memory:</strong> Standard high-end instances (like AWS M6i) offer 128 vCPUs and 512 GB of RAM, while specialized instances can go up to <strong>24 TB of RAM</strong>.</p>
</li>
<li><p><strong>Storage:</strong> Local SSDs can handle <strong>60 TB</strong> on a single instance, and HDDs can reach over <strong>300 TB</strong>.</p>
</li>
<li><p><strong>Networking:</strong> 25–100 Gbps is standard within data centers. Latency is sub-1ms within an Availability Zone (AZ) and ~1-2ms between AZs.</p>
</li>
</ul>
<hr />
<h3><strong>Component Capacities (Single Node)</strong></h3>
<table style="min-width:50px"><colgroup><col style="min-width:25px"></col><col style="min-width:25px"></col></colgroup><tbody><tr><td><p><strong>Component</strong></p></td><td><p><strong>Modern Capacity / Throughput</strong></p></td></tr><tr><td><p><strong>Server Memory (High-end)</strong></p></td><td><p>Up to 4 TB (Standard) to 24 TB (Metal)</p></td></tr><tr><td><p><strong>Local SSD Storage</strong></p></td><td><p>60 TB+ (e.g., AWS i3en instances)</p></td></tr><tr><td><p><strong>Database Storage (Single Node)</strong></p></td><td><p>5-10 TB (before sharding is strictly necessary)</p></td></tr><tr><td><p><strong>SQL Writes (Postgres/MySQL)</strong></p></td><td><p>10k - 50k writes/sec (well-tuned)</p></td></tr><tr><td><p><strong>SQL Reads (Indexed)</strong></p></td><td><p>100k+ reads/sec</p></td></tr><tr><td><p><strong>Redis Throughput</strong></p></td><td><p>100k - 1M operations/sec</p></td></tr><tr><td><p><strong>App Server Connections</strong></p></td><td><p>10k - 50k concurrent connections (Async I/O)</p></td></tr><tr><td><p><strong>Network Bandwidth</strong></p></td><td><p>25 Gbps - 100 Gbps</p></td></tr></tbody></table>

<h3><strong>Key Rules of Thumb</strong></h3>
<ul>
<li><p><strong>The "1TB Rule":</strong> If your total dataset is under 1TB, it can likely fit entirely in the RAM of a few high-memory cache nodes or on the disk of a single modern database instance.</p>
</li>
<li><p><strong>The "Sharding Threshold":</strong> Don't suggest sharding a database for <em>storage</em> reasons unless you exceed <strong>5-10 TB</strong>. Don't shard for <em>write throughput</em> unless you exceed <strong>20k-50k writes/second</strong>.</p>
</li>
<li><p><strong>The "Cache-First" Fallacy:</strong> Modern NVMe SSDs are so fast (10-50μs) that if your database query is a simple primary key lookup, you might not even need Redis for performance; use it for scaling read-heavy traffic or reducing DB load instead.</p>
</li>
<li><p><strong>Concurrency:</strong> One single modern application server can handle almost any "mid-sized" startup’s total traffic. When designing for millions of users, think in dozens of servers, not thousands.</p>
</li>
</ul>
<hr />
<h3><strong>Basic Operations</strong></h3>
<ul>
<li><p><strong>L1 cache reference:</strong> ~1 nanoseconds</p>
</li>
<li><p><strong>L2 cache reference:</strong> ~7 nanoseconds</p>
</li>
<li><p><strong>Main memory (RAM) reference:</strong> ~100 nanoseconds</p>
</li>
<li><p><strong>SSD I/O (read/write):</strong> ~100 microseconds</p>
</li>
<li><p><strong>Disk I/O (HDD, seek):</strong> ~10 milliseconds</p>
</li>
</ul>
<p>When you read from a database or a remote cache, you aren't just paying for the time it takes to find the data; you are paying for the round-trip journey.</p>
<ul>
<li><p><strong>Remote Cache (e.g., Redis on a separate VM):</strong></p>
<ul>
<li><p><strong>Internal Processing:</strong> ~0.1 ms</p>
</li>
<li><p><strong>Network Overhead:</strong> ~0.5 ms to 1.0 ms (within the same Availability Zone)</p>
</li>
<li><p><strong>Total:</strong> <strong>~1.1 ms</strong></p>
</li>
</ul>
</li>
<li><p><strong>Remote Database (e.g., Postgres/MySQL):</strong></p>
<ul>
<li><p><strong>Internal Processing:</strong> ~5 ms to 50 ms (index lookup + disk I/O)</p>
</li>
<li><p><strong>Network Overhead:</strong> ~0.5 ms to 1.0 ms</p>
</li>
<li><p><strong>Total:</strong> <strong>~5.5 ms to 51 ms</strong></p>
</li>
</ul>
</li>
</ul>
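<p>These round-trip totals are simple arithmetic; the sketch below just adds the internal processing time to the network overhead from the figures above:</p>

```python
# Back-of-envelope totals from the numbers above (all values in ms):
# a remote read pays its internal processing time plus the network
# round trip. Figures are the low/high bounds quoted in the article.

net_rtt = (0.5, 1.0)        # same-AZ round trip, low/high
cache_internal = 0.1        # in-memory lookup (e.g., Redis)
db_internal = (5.0, 50.0)   # index lookup + disk I/O range

cache_total = (cache_internal + net_rtt[0], cache_internal + net_rtt[1])
db_total = (db_internal[0] + net_rtt[0], db_internal[1] + net_rtt[1])

print(f"remote cache: ~{cache_total[0]:.1f}-{cache_total[1]:.1f} ms")
print(f"remote db:    ~{db_total[0]:.1f}-{db_total[1]:.1f} ms")
```
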
<hr />
<h3><strong>Data Processing</strong></h3>
<ul>
<li><p><strong>Reading 1 MB from RAM:</strong> ~250 microseconds</p>
</li>
<li><p><strong>Reading 1 MB from SSD:</strong> ~1 millisecond</p>
</li>
<li><p><strong>Reading 1 MB from HDD:</strong> ~10 milliseconds</p>
</li>
</ul>
<hr />
<h3><strong>Network Latencies</strong></h3>
<ul>
<li><p><strong>1 KB data transfer on 1 Gbps network:</strong> ~10 microseconds</p>
</li>
<li><p><strong>Round trip within the same AZ:</strong> &lt; 1 millisecond</p>
</li>
<li><p><strong>Round trip between cross AZ (same region):</strong> ~ 1-2 ms</p>
</li>
<li><p><strong>Round trip between two data centers (different regions):</strong> ~60-200 ms, depending on distance</p>
</li>
<li><p><strong>Intercontinental round trip:</strong> ~150 ms, depending on distance</p>
</li>
</ul>
<hr />
<h3><strong>Cloud Services</strong></h3>
<ul>
<li><p><strong>API gateway call latency:</strong> ~1-10 milliseconds</p>
</li>
<li><p><strong>Query on a NoSQL database (e.g., DynamoDB):</strong> ~5-20 milliseconds</p>
</li>
<li><p><strong>Query on an SQL database:</strong> ~5-10 milliseconds for simple queries; complex queries can take seconds.</p>
</li>
</ul>
<hr />
<h2>FAQ</h2>
<ol>
<li><p><strong>Main memory (RAM) reference is 100 nanoseconds but Reading 1 MB from RAM is 250 microseconds, explain ?</strong></p>
<p><strong>Answer:</strong></p>
<p><strong>100 ns:</strong> Time to access a single memory location (latency), i.e., fetching a small chunk of data (e.g., a 64-byte cache line).</p>
<p><strong>250 µs:</strong> Time to read 1 MB, including latency and transfer time.</p>
<ul>
<li><p>Modern RAM modules have high bandwidth, often in the range of <strong>tens of GB/s</strong>. For example:</p>
</li>
<li><p>Assume a memory bandwidth of <strong>20 GB/s</strong> (DDR4/DDR5 range).</p>
</li>
<li><p>Time to transfer 1 MB = 1 MB / 20 GB/s = 2^20 bytes / (20×10^9 bytes/s) ≈ <strong>50 μs</strong></p>
</li>
</ul>
<p>However, the transfer process also incurs <strong>latency overheads</strong> for accessing multiple addresses and managing the memory bus, which is why the total time to read 1 MB is closer to <strong>~250 microseconds</strong> rather than the raw bandwidth estimate.</p>
</li>
<li><p><strong>Disk I/O (HDD, seek) is 10 milliseconds and Reading 1 MB from HDD is also 10 milliseconds, why ?</strong></p>
<p><strong>Answer:</strong></p>
<p>Disk I/O refers to the <strong>seek time</strong>, which is the delay required for the hard disk drive (HDD) to position its read/write head over the correct track on the spinning disk. This latency happens <strong>before any data is read</strong> and is <strong>independent of the data size</strong>.</p>
<p><strong>Reading 1 MB from HDD: ~10 milliseconds</strong></p>
<ul>
<li><p>This is the total time required to read <strong>1 MB of data</strong> from the disk, including:</p>
<ol>
<li><p><strong>Seek time (~10 ms)</strong>: Positioning the read head.</p>
</li>
<li><p><strong>Data transfer time</strong>: Time to physically transfer 1 MB from the spinning disk to memory.</p>
</li>
</ol>
<ul>
<li><p>Modern HDDs have sequential read speeds of ~100 MB/s. Therefore:</p>
<ul>
<li>Transfer time for 1 MB = 1 MB / 100 MB/s = 0.01 seconds = <strong>10 ms</strong></li>
</ul>
</li>
</ul>
</li>
</ul>
<p>For small reads (e.g., a few KB or even 1 byte), the <strong>seek time dominates</strong>, so the total latency is still close to 10 ms.</p>
<p>For larger reads (e.g., 1 MB), the <strong>transfer time adds to the seek time</strong>, but because the transfer speed is high, it doesn’t increase latency significantly for moderate data sizes like 1 MB.</p>
</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[Understanding Gateway, Load Balancer, Forward Proxy, and Reverse Proxy]]></title><description><![CDATA[Gateway
A gateway acts as a single entry point into a system. It is commonly used in microservices architectures to route client requests to appropriate backend services. Gateways often incorporate additional functionalities like:

Authentication and...]]></description><link>https://anishratnawat.com/understanding-gateway-load-balancer-forward-proxy-and-reverse-proxy</link><guid isPermaLink="true">https://anishratnawat.com/understanding-gateway-load-balancer-forward-proxy-and-reverse-proxy</guid><category><![CDATA[Load Balancer]]></category><category><![CDATA[gateway]]></category><category><![CDATA[forward-proxy]]></category><category><![CDATA[Reverse Proxy]]></category><dc:creator><![CDATA[Anish Ratnawat]]></dc:creator><pubDate>Sat, 15 Jun 2024 18:30:00 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-gateway">Gateway</h2>
<p>A <strong>gateway</strong> acts as a single entry point into a system. It is commonly used in microservices architectures to route client requests to appropriate backend services. Gateways often incorporate additional functionalities like:</p>
<ul>
<li><p><strong>Authentication and Authorization:</strong> Ensuring only authorized users access certain resources.</p>
</li>
<li><p><strong>Protocol Translation:</strong> Converting protocols like HTTP to WebSocket or gRPC.</p>
</li>
<li><p><strong>Request Aggregation:</strong> Combining responses from multiple services into a single response.</p>
</li>
<li><p><strong>Service Discovery:</strong> Finding and locating available service instances within a distributed system.</p>
</li>
<li><p><strong>Rate Limiting:</strong> Limiting the number of requests an API will handle in a given time frame.</p>
</li>
<li><p><strong>SSL Termination:</strong> Decrypting encrypted traffic before passing it along to a web server.</p>
</li>
</ul>
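<p>As a sketch of one of these responsibilities, rate limiting is often implemented as a token bucket. The snippet below is a minimal, illustrative version, not a production gateway component:</p>
<pre><code class="lang-python">import time

class TokenBucket:
    """Allow bursts up to `capacity` requests, refilled at `rate` tokens/second."""
    def __init__(self, capacity, rate):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        # Refill tokens for the time elapsed since the last check.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=3, rate=1.0)  # burst of 3, then 1 request/second
print([bucket.allow() for _ in range(5)])   # first 3 allowed, the rest rejected
</code></pre>
<p>A gateway would consult such a bucket per client or per API key before forwarding a request.</p>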
<p><strong>Popular Tools:</strong></p>
<ul>
<li><p><strong>API Gateway:</strong> Tools like Kong, Apigee, or AWS API Gateway manage APIs by abstracting backend services and enforcing policies.</p>
</li>
<li><p><strong>Service Gateway:</strong> Tools like Istio function within service meshes, facilitating inter-service communication and applying policies.</p>
</li>
</ul>
<hr />
<h2 id="heading-load-balancer">Load Balancer</h2>
<p>A <strong>load balancer</strong> distributes incoming network traffic across multiple servers to ensure high availability and reliability. It helps achieve fault tolerance, scalability, and optimal resource utilization.</p>
<p><strong>Types of Load Balancers:</strong></p>
<ol>
<li><p><strong>Layer 4 Load Balancers:</strong> Operate at the transport layer (TCP/UDP) and use information like IP address and port for routing.</p>
<ul>
<li>Example: AWS Elastic Load Balancer (Classic).</li>
</ul>
</li>
<li><p><strong>Layer 7 Load Balancers:</strong> Operate at the application layer, making routing decisions based on HTTP headers, URLs, and other application-level data.</p>
<ul>
<li>Example: NGINX, HAProxy.</li>
</ul>
</li>
</ol>
<p><strong>Load Balancing Algorithms:</strong></p>
<ul>
<li><p><strong>Round Robin:</strong> Requests are distributed sequentially across servers.</p>
</li>
<li><p><strong>Least Connections:</strong> Routes to the server with the fewest active connections.</p>
</li>
<li><p><strong>IP Hash:</strong> Routes based on client IP, ensuring session persistence.</p>
</li>
</ul>
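<p>These algorithms can be sketched in a few lines of Python; the server names are hypothetical and the code is purely illustrative:</p>
<pre><code class="lang-python">import hashlib
from itertools import cycle

servers = ["app-1", "app-2", "app-3"]  # hypothetical backend pool

# Round robin: hand out servers in a fixed rotation.
rotation = cycle(servers)
def round_robin():
    return next(rotation)

# Least connections: pick the server with the fewest active connections.
active = {s: 0 for s in servers}
def least_connections():
    server = min(active, key=active.get)
    active[server] += 1
    return server

# IP hash: the same client IP always maps to the same server (sticky sessions).
def ip_hash(client_ip):
    digest = int(hashlib.md5(client_ip.encode()).hexdigest(), 16)
    return servers[digest % len(servers)]

print([round_robin() for _ in range(4)])  # app-1, app-2, app-3, app-1
print(ip_hash("203.0.113.7") == ip_hash("203.0.113.7"))  # True: sticky
</code></pre>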
<hr />
<h2 id="heading-gateway-vs-load-balancers">Gateway vs Load Balancers</h2>
<p>Two scenarios help clear up the confusion between gateways and load balancers. A microservices example is used here because the distinction is most meaningful in that context.</p>
<p><strong>Scenario 1: You have a cluster of API Gateways</strong></p>
<p>User ---&gt; Load Balancer (provided by Cloud Providers like AWS or your own) ---&gt; API Gateway Cluster ---&gt; Service Discovery Agent (like <em>eureka</em>) ---&gt; Microservice A ---&gt; Client Side Load Balancer ---&gt; Microservice B</p>
<p><strong>Scenario 2: You have a <em>single API Gateway</em></strong></p>
<p>User ---&gt; API Gateway ---&gt; Service Discovery Agent (like <em>Eureka</em>) ---&gt; Microservice A ---&gt; Client Side Load Balancer -&gt; Microservice B</p>
<p>In Scenario 1, a load balancer is needed in front of the API gateway because multiple gateway instances run to handle large traffic volumes. The gateway itself can have several responsibilities to manage, so rather than burdening a single instance, the load balancer distributes incoming requests across the gateway cluster.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1737557188303/1ff802cb-e127-4d7a-a5d8-f499e2048887.png" alt class="image--center mx-auto" /></p>
<hr />
<h2 id="heading-forward-proxy">Forward Proxy</h2>
<p>A <strong>forward proxy</strong> acts on behalf of clients: it forwards client requests to the destination server while hiding the client's identity and IP address. It is typically used for:</p>
<ul>
<li><p><strong>Caching:</strong> Storing frequently accessed data to reduce latency.</p>
</li>
<li><p><strong>Anonymity:</strong> Masking the client's identity.</p>
</li>
<li><p><strong>Content Filtering:</strong> Blocking access to certain websites.</p>
</li>
</ul>
<p><strong>Use Case Example:</strong> An organization might use a forward proxy to allow employees to access the internet while restricting access to non-work-related sites.</p>
<hr />
<h2 id="heading-reverse-proxy">Reverse Proxy</h2>
<p>A <strong>reverse proxy</strong> operates on behalf of servers, handling requests from clients and forwarding them to appropriate backend servers. Common functionalities include:</p>
<ul>
<li><p><strong>Load Balancing:</strong> Distributing traffic among servers.</p>
</li>
<li><p><strong>SSL Termination:</strong> Offloading SSL decryption to reduce server load.</p>
</li>
<li><p><strong>Caching:</strong> Storing responses to reduce server load and latency for repeated requests.</p>
</li>
<li><p><strong>Security:</strong> Hiding backend server details and blocking malicious traffic.</p>
</li>
</ul>
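<p>The caching responsibility can be sketched as follows; <code>fetch_from_backend</code> is a hypothetical stand-in for a real upstream call:</p>
<pre><code class="lang-python"># Reverse-proxy-style response cache: serve repeated requests
# without hitting the backend again.
cache = {}
backend_hits = 0

def fetch_from_backend(path):
    global backend_hits
    backend_hits += 1
    return f"response for {path}"

def handle_request(path):
    if path not in cache:              # cache miss: forward to the backend
        cache[path] = fetch_from_backend(path)
    return cache[path]                 # cache hit: backend is not touched

handle_request("/users/1")
handle_request("/users/1")
print(backend_hits)  # 1: the second request was served from cache
</code></pre>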
<p><strong>Popular Tools:</strong> NGINX, Apache HTTP Server, Traefik.</p>
<p><strong>Use Case Example:</strong> A reverse proxy can sit in front of an application server, handling SSL termination, caching, and load balancing.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1737557312207/b0e428bc-f314-4181-81fd-7e3727a1bc30.png" alt class="image--center mx-auto" /></p>
]]></content:encoded></item><item><title><![CDATA[REST vs RPC vs HTTP vs TCP vs UDP: Understanding the Differences]]></title><description><![CDATA[REST, RPC, HTTP, TCP, and UDP—each operates at different levels of abstraction and serves different purposes in network communication:

📦 1. TCP (Transmission Control Protocol)

Type: Transport Layer Protocol (OSI Layer 4)

Purpose: Reliable, ordere...]]></description><link>https://anishratnawat.com/rest-vs-rpc-understanding-the-differences</link><guid isPermaLink="true">https://anishratnawat.com/rest-vs-rpc-understanding-the-differences</guid><category><![CDATA[restvsrpc]]></category><category><![CDATA[REST]]></category><category><![CDATA[RPC]]></category><dc:creator><![CDATA[Anish Ratnawat]]></dc:creator><pubDate>Sat, 04 May 2024 18:30:00 GMT</pubDate><content:encoded><![CDATA[<p><strong>REST</strong>, <strong>RPC</strong>, <strong>HTTP</strong>, <strong>TCP</strong>, and <strong>UDP</strong>—each operates at different levels of abstraction and serves different purposes in network communication:</p>
<hr />
<h3 id="heading-1-tcp-transmission-control-protocol">📦 1. <strong>TCP (Transmission Control Protocol)</strong></h3>
<ul>
<li><p><strong>Type</strong>: Transport Layer Protocol (OSI Layer 4)</p>
</li>
<li><p><strong>Purpose</strong>: Reliable, ordered, and error-checked delivery of data between applications</p>
</li>
<li><p><strong>Use Cases</strong>: Web (HTTP), Email (SMTP), FTP</p>
</li>
<li><p><strong>Key Features</strong>:</p>
<ul>
<li><p>Connection-oriented</p>
</li>
<li><p>Guarantees packet delivery</p>
</li>
<li><p>Slower due to overhead (acknowledgements, retransmission, flow control)</p>
</li>
</ul>
</li>
</ul>
<hr />
<h3 id="heading-2-udp-user-datagram-protocol">💨 2. <strong>UDP (User Datagram Protocol)</strong></h3>
<ul>
<li><p><strong>Type</strong>: Transport Layer Protocol (OSI Layer 4)</p>
</li>
<li><p><strong>Purpose</strong>: Fast, connectionless communication</p>
</li>
<li><p><strong>Use Cases</strong>: Video streaming, online gaming, DNS, VoIP</p>
</li>
<li><p><strong>Key Features</strong>:</p>
<ul>
<li><p>No guarantee of delivery or order</p>
</li>
<li><p>No connection setup — lightweight and fast</p>
</li>
<li><p>Suitable for latency-sensitive apps</p>
</li>
</ul>
</li>
</ul>
<hr />
<h3 id="heading-3-http-hypertext-transfer-protocol">🌐 3. <strong>HTTP (Hypertext Transfer Protocol)</strong></h3>
<ul>
<li><p><strong>Type</strong>: Application Layer Protocol (built on TCP)</p>
</li>
<li><p><strong>Purpose</strong>: Transmit hypermedia (HTML, JSON, etc.) between clients and servers</p>
</li>
<li><p><strong>Use Cases</strong>: Web APIs, browsers, REST APIs</p>
</li>
<li><p><strong>Key Features</strong>:</p>
<ul>
<li><p>Stateless, request-response protocol</p>
</li>
<li><p>Typically runs on port 80 (HTTP) or 443 (HTTPS)</p>
</li>
<li><p>Built on top of TCP</p>
</li>
</ul>
</li>
</ul>
<blockquote>
<p><strong>Note:</strong> HTTP is often used as the transport layer for both REST and RPC.</p>
</blockquote>
<hr />
<h3 id="heading-4-rpc-remote-procedure-call">🔁 4. <strong>RPC (Remote Procedure Call)</strong></h3>
<ul>
<li><p><strong>Type</strong>: Programming concept / communication pattern</p>
</li>
<li><p><strong>Purpose</strong>: Execute a function/procedure on a remote server as if it's local</p>
</li>
<li><p><strong>Use Cases</strong>: gRPC, Thrift, XML-RPC, JSON-RPC</p>
</li>
<li><p><strong>Key Features</strong>:</p>
<ul>
<li><p>Client invokes remote methods directly</p>
</li>
<li><p>Abstracts transport layer details</p>
</li>
<li><p>Can be tightly coupled (harder to evolve over time)</p>
</li>
</ul>
</li>
</ul>
<p><strong>Note:</strong> gRPC uses Protocol Buffers (protobuf) for data serialisation, which encodes data in a compact binary format. Protobuf can also be used on its own over HTTP as a replacement for JSON payloads.</p>
<hr />
<h3 id="heading-5-rest-representational-state-transfer">🌱 5. <strong>REST (Representational State Transfer)</strong></h3>
<ul>
<li><p><strong>Type</strong>: Architectural style using HTTP</p>
</li>
<li><p><strong>Purpose</strong>: Build scalable and loosely-coupled web APIs</p>
</li>
<li><p><strong>Use Cases</strong>: Public APIs, microservices communication</p>
</li>
<li><p><strong>Key Features</strong>:</p>
<ul>
<li><p>Resource-based (<code>GET /users/1</code>, <code>POST /orders</code>)</p>
</li>
<li><p>Stateless and cacheable</p>
</li>
<li><p>Uses HTTP verbs (GET, POST, PUT, DELETE)</p>
</li>
</ul>
</li>
</ul>
<hr />
<h3 id="heading-summary-comparison-table">🧠 Summary Comparison Table</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Feature</td><td>TCP</td><td>UDP</td><td>HTTP</td><td>RPC</td><td>REST</td></tr>
</thead>
<tbody>
<tr>
<td>Layer</td><td>Transport</td><td>Transport</td><td>App</td><td>App concept</td><td>App concept</td></tr>
<tr>
<td>Reliability</td><td>Yes</td><td>No</td><td>Yes</td><td>Depends</td><td>Yes</td></tr>
<tr>
<td>Protocol Style</td><td>Stream</td><td>Datagram</td><td>Request/Response</td><td>Function Call</td><td>Resource-based</td></tr>
<tr>
<td>Transport Used</td><td>N/A</td><td>N/A</td><td>TCP</td><td>TCP/HTTP/Custom</td><td>HTTP</td></tr>
<tr>
<td>Speed</td><td>Moderate</td><td>Fast</td><td>Moderate</td><td>Fast</td><td>Moderate</td></tr>
<tr>
<td>Use Case</td><td>Raw data</td><td>Real-time</td><td>Web APIs</td><td>Microservices</td><td>Web APIs</td></tr>
</tbody>
</table>
</div><hr />
<h3 id="heading-in-practice">🤔 In Practice</h3>
<ul>
<li><p><strong>TCP vs UDP</strong> = how data is transferred</p>
</li>
<li><p><strong>HTTP</strong> = how clients/servers communicate over the web</p>
</li>
<li><p><strong>REST vs RPC</strong> = how APIs are designed</p>
</li>
<li><p><strong>REST over HTTP</strong> is a common web API pattern</p>
</li>
<li><p><strong>RPC</strong> can be over HTTP (e.g., gRPC with HTTP/2), or directly on TCP</p>
</li>
</ul>
<h2 id="heading-choosing-between-rest-and-rpc">Choosing Between REST and RPC</h2>
<p>The choice between REST and RPC boils down to the needs of your application:</p>
<ul>
<li><p><strong>Choose REST</strong> if simplicity, compatibility, and resource orientation are key.</p>
</li>
<li><p><strong>Choose RPC</strong> if performance, compact payloads, and action orientation are critical.</p>
</li>
</ul>
<hr />
<h2 id="heading-case-study-of-linkedin-latency-optimization-by-60">Case Study of LinkedIn latency optimization by 60%</h2>
<p>LinkedIn significantly improved its latency—by up to <strong>60%</strong>—by replacing <strong>JSON</strong> with <strong>Protocol Buffers (Protobuf)</strong> for data serialization. Here’s how they achieved this:</p>
<hr />
<h3 id="heading-1-why-did-linkedin-replace-json"><strong>1. Why Did LinkedIn Replace JSON?</strong></h3>
<p>JSON is widely used for serialization due to its human readability and simplicity, but it has several drawbacks:</p>
<ul>
<li><p><strong>High serialization/deserialization time</strong>: JSON relies on text-based encoding, which requires expensive parsing.</p>
</li>
<li><p><strong>Large payload sizes</strong>: JSON data is verbose due to repeated keys and lack of efficient binary encoding.</p>
</li>
<li><p><strong>More Network Bandwidth:</strong> JSON's verbose text encoding consumes more network bandwidth, which increases latency.</p>
</li>
<li><p><strong>High CPU usage</strong>: Serialization and deserialization are computationally expensive, especially for large-scale distributed systems.</p>
</li>
</ul>
<p>LinkedIn, handling billions of requests per day, faced <strong>latency issues</strong> and <strong>increased infrastructure costs</strong> due to these inefficiencies.</p>
<hr />
<h3 id="heading-2-how-did-protobuf-help"><strong>2. How Did Protobuf Help?</strong></h3>
<p><strong>a) Compact Binary Encoding</strong></p>
<ul>
<li><p>Protobuf is a binary format, which means it requires <strong>less bandwidth</strong> and <strong>less memory</strong> for transmission compared to JSON.</p>
</li>
<li><p>JSON includes redundant key names, while Protobuf uses numeric field tags, reducing data size significantly.</p>
</li>
</ul>
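<p>The size difference can be illustrated with Python's standard library. Here <code>struct</code> stands in for a real Protobuf encoding: field names disappear from the wire and values become packed binary, with the schema living in code instead:</p>
<pre><code class="lang-python">import json
import struct

record = {"user_id": 123456789, "active": True, "score": 4.5}

# Text encoding: key names are repeated inside every message.
as_json = json.dumps(record).encode()

# Binary encoding (illustrative, not actual Protobuf): fixed-width
# fields, no key names on the wire.
as_binary = struct.pack("!q?d", record["user_id"], record["active"], record["score"])

print(len(as_json), len(as_binary))  # the binary payload is several times smaller
</code></pre>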
<p><strong>b) Faster Serialization &amp; Deserialization</strong></p>
<ul>
<li><p>JSON requires <strong>string parsing</strong>, while Protobuf directly maps to <strong>efficient binary representations</strong>, leading to <strong>faster encoding/decoding</strong>.</p>
</li>
<li><p>This improves <strong>CPU efficiency</strong> and reduces <strong>garbage collection overhead</strong> in JVM-based applications.</p>
</li>
</ul>
<p><strong>c) Schema Evolution Without Breaking Changes</strong></p>
<ul>
<li><p>Protobuf supports <strong>backward and forward compatibility</strong>, allowing LinkedIn to evolve APIs smoothly without impacting older clients.</p>
</li>
<li><p>JSON lacks built-in schema enforcement, increasing the risk of breaking changes.</p>
</li>
</ul>
<hr />
<h3 id="heading-3-measured-performance-gains"><strong>3. Measured Performance Gains</strong></h3>
<p>LinkedIn observed the following improvements after switching to Protobuf:</p>
<ul>
<li><p><strong>Latency reduced by 60%</strong> (mostly due to faster serialization/deserialization).</p>
</li>
<li><p><strong>Payload size reduced by 50-80%</strong>, leading to lower <strong>network bandwidth usage</strong>.</p>
</li>
<li><p><strong>CPU utilization dropped</strong>, allowing better resource utilization.</p>
</li>
</ul>
<hr />
<h3 id="heading-4-where-did-linkedin-apply-protobuf"><strong>4. Where Did LinkedIn Apply Protobuf?</strong></h3>
<p>LinkedIn initially introduced Protobuf in its <strong>Venice key-value store</strong> and later expanded it to <strong>other services</strong> such as:</p>
<ul>
<li><p><a target="_blank" href="http://Rest.li"><strong>Rest.li</strong></a> (LinkedIn's API framework)</p>
</li>
<li><p><strong>Kafka messages</strong> for event streaming</p>
</li>
<li><p><strong>Inter-service communication</strong> within microservices</p>
</li>
</ul>
<hr />
<h3 id="heading-5-lessons-for-other-companies"><strong>5. Lessons for Other Companies</strong></h3>
<p>If your system is <strong>high-scale and latency-sensitive</strong>, switching from JSON to Protobuf can:</p>
<ul>
<li><p>Improve <strong>API performance</strong> in microservices.</p>
</li>
<li><p>Reduce <strong>cloud/server costs</strong> due to lower CPU and bandwidth usage.</p>
</li>
<li><p>Enhance <strong>data consistency</strong> with schema enforcement.</p>
</li>
</ul>
<p>However, Protobuf is <strong>not human-readable</strong>, which can make debugging harder compared to JSON. For applications requiring <strong>human interaction with APIs (e.g., REST APIs for web clients)</strong>, JSON may still be preferable.</p>
<p><strong>Reference:</strong> <a target="_blank" href="https://www.linkedin.com/blog/engineering/infrastructure/linkedin-integrates-protocol-buffers-with-rest-li-for-improved-m">https://www.linkedin.com/blog/engineering/infrastructure/linkedin-integrates-protocol-buffers-with-rest-li-for-improved-m</a></p>
]]></content:encoded></item><item><title><![CDATA[Partitioning vs Sharding: Key Concepts for Scalable Systems]]></title><description><![CDATA[In the realm of distributed systems and databases, partitioning and sharding are two terms that often come up when discussing scalability and performance. While they share similarities, they serve distinct purposes and are implemented differently. Th...]]></description><link>https://anishratnawat.com/partitioning-vs-sharding-key-concepts-for-scalable-systems</link><guid isPermaLink="true">https://anishratnawat.com/partitioning-vs-sharding-key-concepts-for-scalable-systems</guid><category><![CDATA[partitioning]]></category><category><![CDATA[sharding]]></category><dc:creator><![CDATA[Anish Ratnawat]]></dc:creator><pubDate>Sat, 20 Apr 2024 18:30:00 GMT</pubDate><content:encoded><![CDATA[<p>In the realm of distributed systems and databases, <strong>partitioning</strong> and <strong>sharding</strong> are two terms that often come up when discussing scalability and performance. While they share similarities, they serve distinct purposes and are implemented differently. This blog explores the nuances of partitioning and sharding, their use cases, and how to choose the right approach for your system.</p>
<hr />
<h2 id="heading-what-is-partitioning">What is Partitioning?</h2>
<p>Partitioning is the process of dividing a dataset into smaller, more manageable pieces called <strong>partitions</strong>. These partitions are stored separately but are part of the same database or storage system. Partitioning can improve performance, manageability, and scalability by reducing the size of data that needs to be handled by any single operation.</p>
<h3 id="heading-types-of-partitioning">Types of Partitioning</h3>
<ol>
<li><p><strong>Horizontal Partitioning:</strong></p>
<ul>
<li><p>Data is split by rows.</p>
</li>
<li><p>Each partition contains a subset of the rows, often based on a range or a key.</p>
</li>
<li><p>Example: Splitting user data based on user IDs (e.g., 1–1000 in Partition A, 1001–2000 in Partition B).</p>
</li>
</ul>
</li>
<li><p><strong>Vertical Partitioning:</strong></p>
<ul>
<li><p>Data is split by columns.</p>
</li>
<li><p>Different partitions store subsets of the attributes (columns).</p>
</li>
<li><p>Example: Separating frequently accessed columns into one table and less-used columns into another.</p>
</li>
</ul>
</li>
<li><p><strong>List Partitioning:</strong></p>
<ul>
<li><p>Data is partitioned based on a list of values.</p>
</li>
<li><p>Example: Orders partitioned by regions, like <code>North</code>, <code>South</code>, <code>East</code>, and <code>West</code>.</p>
</li>
</ul>
</li>
<li><p><strong>Hash Partitioning:</strong></p>
<ul>
<li><p>A hash function determines the partition for each data entry.</p>
</li>
<li><p>Example: Using a hash of the user ID modulo the number of partitions.</p>
</li>
</ul>
</li>
</ol>
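<p>The hash-partitioning rule above (a hash of the key modulo the number of partitions) can be written directly:</p>
<pre><code class="lang-python">import hashlib

NUM_PARTITIONS = 4

def partition_for(user_id):
    # Hash the key, then take it modulo the partition count.
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# The same key always lands in the same partition.
print(partition_for(42) == partition_for(42))   # True
print({partition_for(uid) for uid in range(1000)})  # keys spread across 0..3
</code></pre>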
<hr />
<h2 id="heading-what-is-sharding">What is Sharding?</h2>
<p>Sharding is a subset of partitioning that involves distributing data across multiple <strong>independent databases or nodes</strong>. Each shard is a self-contained unit with its own database instance, enabling horizontal scaling and fault isolation.</p>
<h3 id="heading-key-characteristics-of-sharding">Key Characteristics of Sharding</h3>
<ol>
<li><p><strong>Independent Databases:</strong></p>
<ul>
<li><p>Each shard operates as a standalone database with its own schema and storage.</p>
</li>
<li><p>Example: Shard 1 might store data for users with IDs 1–1000, while Shard 2 handles IDs 1001–2000.</p>
</li>
</ul>
</li>
<li><p><strong>Scalability:</strong></p>
<ul>
<li>Sharding allows the system to scale out by adding more nodes as the dataset grows.</li>
</ul>
</li>
<li><p><strong>Fault Isolation:</strong></p>
<ul>
<li>Issues in one shard (e.g., hardware failure) do not directly impact other shards.</li>
</ul>
</li>
<li><p><strong>Custom Shard Keys:</strong></p>
<ul>
<li>The shard key determines how data is distributed across shards. A poorly chosen shard key can lead to uneven distribution and hotspots.</li>
</ul>
</li>
</ol>
<hr />
<h2 id="heading-key-differences-between-partitioning-and-sharding">Key Differences Between Partitioning and Sharding</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Aspect</strong></td><td><strong>Partitioning</strong></td><td><strong>Sharding</strong></td></tr>
</thead>
<tbody>
<tr>
<td><strong>Scope</strong></td><td>Divides data within a single database instance.</td><td>Distributes data across multiple databases.</td></tr>
<tr>
<td><strong>Complexity</strong></td><td>Easier to implement and manage.</td><td>More complex, especially with distributed systems.</td></tr>
<tr>
<td><strong>Scaling</strong></td><td>Vertical scaling (limited by a single instance).</td><td>Horizontal scaling (adding more nodes).</td></tr>
<tr>
<td><strong>Fault Isolation</strong></td><td>Single point of failure in the database instance.</td><td>Isolated faults due to independent shards.</td></tr>
<tr>
<td><strong>Performance</strong></td><td>Limited by the capacity of one database.</td><td>Scales with the number of shards.</td></tr>
</tbody>
</table>
</div><hr />
<h2 id="heading-choosing-between-partitioning-and-sharding">Choosing Between Partitioning and Sharding</h2>
<p>When deciding between partitioning and sharding, consider the following:</p>
<ol>
<li><p><strong>Dataset Size:</strong></p>
<ul>
<li><p>Use partitioning if your dataset can fit within a single database instance but needs optimization.</p>
</li>
<li><p>Use sharding if your dataset is too large for a single instance.</p>
</li>
</ul>
</li>
<li><p><strong>Scaling Needs:</strong></p>
<ul>
<li>If you anticipate significant growth, sharding offers better horizontal scalability.</li>
</ul>
</li>
<li><p><strong>Complexity vs. Benefits:</strong></p>
<ul>
<li><p>Partitioning is simpler but limited in scalability.</p>
</li>
<li><p>Sharding requires more effort but enables handling massive datasets.</p>
</li>
</ul>
</li>
<li><p><strong>Fault Tolerance:</strong></p>
<ul>
<li>If fault isolation is crucial, sharding is the better choice.</li>
</ul>
</li>
</ol>
<hr />
<h2 id="heading-real-world-examples">Real-World Examples</h2>
<ol>
<li><p><strong>Partitioning:</strong></p>
<ul>
<li>A retail application partitions order data by year to speed up queries for recent transactions.</li>
</ul>
</li>
<li><p><strong>Sharding:</strong></p>
<ul>
<li>A social media platform shards user data by user ID to ensure that no single database becomes a bottleneck.</li>
</ul>
</li>
</ol>
<hr />
<h2 id="heading-conclusion">Conclusion</h2>
<p>Partitioning and sharding are essential techniques for building scalable, high-performance systems. While partitioning focuses on dividing data within a single database, sharding takes it a step further by distributing data across multiple databases. Choosing the right approach depends on your system’s size, scaling needs, and complexity tolerance.</p>
<p>Understanding these techniques and their trade-offs will help you design robust systems that can handle growth efficiently.</p>
]]></content:encoded></item><item><title><![CDATA[An Overview of Basic, JWT, API Key, and OAuth Authentication Techniques]]></title><description><![CDATA[In the world of distributed systems and modern APIs, authentication plays a critical role in securing resources and validating users. Choosing the right authentication method depends on use cases, system architecture, and security requirements. This ...]]></description><link>https://anishratnawat.com/an-overview-of-basic-jwt-api-key-and-oauth-authentication-techniques</link><guid isPermaLink="true">https://anishratnawat.com/an-overview-of-basic-jwt-api-key-and-oauth-authentication-techniques</guid><category><![CDATA[authentication]]></category><category><![CDATA[API Key authentication]]></category><category><![CDATA[basic authentication]]></category><category><![CDATA[JWT token,JSON Web,Token,Token authentication,Access token,JSON token,JWT security,JWT authentication,Token-based authentication,JWT decoding,JWT implementation]]></category><category><![CDATA[JWT]]></category><category><![CDATA[OAuth2]]></category><dc:creator><![CDATA[Anish Ratnawat]]></dc:creator><pubDate>Sat, 17 Feb 2024 18:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1734851810724/36681355-4df9-40c2-9830-e89bb2bcee91.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the world of distributed systems and modern APIs, authentication plays a critical role in securing resources and validating users. Choosing the right authentication method depends on use cases, system architecture, and security requirements. This blog explores five popular authentication methods: <strong>Basic Authentication, JWT (JSON Web Tokens), API Keys</strong>, and <strong>OAuth</strong>, along with their use cases, pros, and cons.</p>
<hr />
<h2 id="heading-1-basic-authentication">1. <strong>Basic Authentication</strong></h2>
<h3 id="heading-overview">Overview</h3>
<p>Basic Authentication is a simple way to verify users in REST APIs by sending a <strong><em>username and password in HTTP headers</em></strong>. It’s easy to use but less secure, especially without HTTPS, making it unsuitable for sensitive data or production use.</p>
<p>Here's a quick summary of Basic Authentication in REST APIs:</p>
<ol>
<li><p><strong>Client Request:</strong> The client sends a request to the server with authentication details in the request headers.</p>
</li>
<li><p><strong>Encoding:</strong> Username and password are combined as <code>username:password</code> and base64-encoded. Note: This is not encryption.</p>
</li>
<li><p><strong>Header:</strong> The encoded credentials are added to the HTTP request header like this:</p>
</li>
</ol>
<pre><code class="lang-http"><span class="hljs-attribute">Authorization</span>: Basic base64(username:password)
</code></pre>
<ol start="4">
<li><p><strong>Server Check:</strong> The server decodes the header, retrieves the username and password, and verifies them.</p>
</li>
<li><p><strong>Response:</strong> If valid, the server processes the request. If not, it returns a 401 Unauthorized error.</p>
</li>
</ol>
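<p>Steps 2 and 3 can be reproduced with the standard library (the credentials below are hypothetical):</p>
<pre><code class="lang-python">import base64

username, password = "alice", "s3cret"  # hypothetical credentials

# Step 2: combine as username:password and base64-encode (NOT encryption).
encoded = base64.b64encode(f"{username}:{password}".encode()).decode()

# Step 3: place the result in the Authorization header.
headers = {"Authorization": f"Basic {encoded}"}
print(headers["Authorization"])  # Basic YWxpY2U6czNjcmV0

# Anyone can trivially reverse the encoding, which is why HTTPS is mandatory.
print(base64.b64decode(encoded).decode())  # alice:s3cret
</code></pre>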
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1734851397337/06031e9f-03b1-4ad2-8393-3841f5782387.png" alt class="image--center mx-auto" /></p>
<p>It’s important to use HTTPS when implementing Basic Authentication to encrypt the communication between the client and the server. Without it, the base64-encoded credentials travel in plain text and are vulnerable to interception.</p>
<h3 id="heading-use-cases">Use Cases</h3>
<ul>
<li><p>Legacy systems or internal applications.</p>
</li>
<li><p>Simple APIs with low security concerns.</p>
</li>
</ul>
<h3 id="heading-pros">Pros</h3>
<ul>
<li><p>Simple to implement and use.</p>
</li>
<li><p>Supported by all major HTTP clients and browsers.</p>
</li>
</ul>
<h3 id="heading-cons">Cons</h3>
<ul>
<li><p>Credentials are sent with every request, increasing exposure risk.</p>
</li>
<li><p>Base64 encoding is not encryption, so credentials are effectively sent in plain text; TLS is required for security.</p>
</li>
<li><p>No session management—every request re-sends credentials.</p>
</li>
</ul>
<h3 id="heading-real-life-example">Real-Life Example</h3>
<ul>
<li>Accessing internal tools or staging environments using a browser pop-up prompt.</li>
</ul>
<hr />
<h2 id="heading-2-token-based-authentication">2. <strong>Token Based Authentication</strong></h2>
<h3 id="heading-overview-1">Overview</h3>
<p>Token authentication is more secure than basic authentication since it involves using a <strong><em>unique token generated for each user</em></strong>. <strong>JSON Web Tokens (JWT)</strong> is a popular token-based authentication method.</p>
<p>JWTs are self-contained and can store user information, reducing the need for constant database queries. This token is sent with each request to authenticate the user. Token authentication is also a good choice for applications requiring frequent authentication, such as single-page or mobile applications.</p>
<p>Since the client does not send the password with every request, the user enters credentials once and receives a <strong><em>unique signed token</em></strong> valid for a specified <strong><em>session time</em></strong>. This makes authentication more efficient and able to handle more concurrent requests.</p>
<p>JWT (JSON Web Token) authentication works in a client-server interaction:</p>
<p><strong>1. User Login</strong></p>
<ul>
<li><p>The client sends login credentials (username and password) to the server.</p>
</li>
<li><p>The server verifies the credentials.</p>
</li>
</ul>
<p><strong>2. Token Generation</strong></p>
<ul>
<li><p>Upon successful authentication, the server generates a JWT.</p>
</li>
<li><p>The token contains:</p>
<ul>
<li><p><strong>Header</strong>: Specifies the token type (<code>JWT</code>) and signing algorithm.</p>
</li>
<li><p><strong>Payload</strong>: Contains user data (e.g., user ID) and claims.</p>
</li>
<li><p><strong>Signature</strong>: Ensures the token’s integrity using a secret key.</p>
</li>
</ul>
</li>
</ul>
<p>In serialised form, a JWT is a string in the following format:</p>
<pre><code class="lang-http">[header].[payload].[signature]
</code></pre>
<p>An actual JWT looks like this:</p>
<p><code>eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpZCI6MTIzNDU2Nzg5LCJuYW1lIjoiSm9zZXBoIn0.OpOSSw7e485LOP5PrzScxHb7SR6sAOMRckfFwi4rp7o</code></p>
<p>In deserialised form, the JWT looks like this:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"header"</span>: { <span class="hljs-attr">"alg"</span>: <span class="hljs-string">"HS256"</span>, <span class="hljs-attr">"typ"</span>: <span class="hljs-string">"JWT"</span> },
  <span class="hljs-attr">"payload"</span>: { <span class="hljs-attr">"sub"</span>: <span class="hljs-string">"1234567890"</span>, <span class="hljs-attr">"name"</span>: <span class="hljs-string">"John Doe"</span>, <span class="hljs-attr">"admin"</span>: <span class="hljs-literal">true</span> },
  <span class="hljs-attr">"signature"</span>: <span class="hljs-string">"signed-data"</span>
}
</code></pre>
<p><strong>3. Token Delivery</strong></p>
<ul>
<li>The server sends the JWT to the client (often in the response body or a cookie).</li>
</ul>
<p><strong>4. Token Usage</strong></p>
<ul>
<li><p>The client includes the JWT in the <code>Authorization</code> header (e.g., <code>Bearer &lt;token&gt;</code>) of subsequent requests.</p>
</li>
<li><p>The token serves as proof of authentication.</p>
</li>
</ul>
<p>Request Header has below:</p>
<pre><code class="lang-http"><span class="hljs-attribute">Authorization</span>: Bearer &lt;token&gt;
</code></pre>
<p><strong>5. Token Validation</strong></p>
<ul>
<li><p>The server validates the JWT by:</p>
<ul>
<li><p>Verifying the signature.</p>
</li>
<li><p>Checking the token’s expiration and claims.</p>
</li>
</ul>
</li>
</ul>
<p><strong>6. Access Granted</strong></p>
<ul>
<li><p>If the token is valid, the server processes the request and returns the response.</p>
</li>
<li><p>If invalid, the server denies access, often returning a <code>401 Unauthorized</code> status.</p>
</li>
</ul>
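<p>The sign-and-verify cycle can be sketched with only the standard library. This is a minimal HS256 illustration; production code should use a maintained JWT library:</p>
<pre><code class="lang-python">import base64, hashlib, hmac, json

SECRET = b"server-side-secret"  # hypothetical signing key

def b64url(data):
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign(payload):
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = b64url(json.dumps(payload).encode())
    sig = b64url(hmac.new(SECRET, f"{header}.{body}".encode(), hashlib.sha256).digest())
    return f"{header}.{body}.{sig}"  # [header].[payload].[signature]

def verify(token):
    header, body, sig = token.split(".")
    expected = b64url(hmac.new(SECRET, f"{header}.{body}".encode(), hashlib.sha256).digest())
    return hmac.compare_digest(sig, expected)  # constant-time comparison

token = sign({"sub": "1234567890", "name": "John Doe", "admin": True})
print(verify(token))  # True: signature matches

# Flip one character of the signature: verification must fail.
tampered = token[:-1] + ("A" if token[-1] != "A" else "B")
print(verify(tampered))  # False
</code></pre>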
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1734851228489/fb01cb61-5f48-4487-8deb-13a6be149d0e.png" alt class="image--center mx-auto" /></p>
<p><strong>Key Points:</strong></p>
<ul>
<li><p>JWTs are <strong>stateless</strong>, meaning the server doesn’t store session information.</p>
</li>
<li><p>Expired or invalid tokens require re-authentication.</p>
</li>
<li><p>Tokens can include additional claims for granular access control.</p>
</li>
</ul>
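<p>To make the signing and validation steps concrete, here is a minimal HS256 sketch using only the Python standard library (illustrative only; production code should use a vetted library such as PyJWT):</p>
<pre><code class="lang-python">import base64, hashlib, hmac, json

def b64url(data: bytes) -> str:
    # JWT uses base64url encoding with the padding stripped
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_jwt(payload: dict, secret: bytes) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = (b64url(json.dumps(header, separators=(",", ":")).encode())
                     + "." + b64url(json.dumps(payload, separators=(",", ":")).encode()))
    signature = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + b64url(signature)

def verify_jwt(token: str, secret: bytes) -> bool:
    signing_input, _, signature = token.rpartition(".")
    expected = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    # Constant-time comparison guards against timing attacks
    return hmac.compare_digest(b64url(expected), signature)

token = sign_jwt({"sub": "1234567890", "name": "John Doe", "admin": True}, b"my-secret")
</code></pre>
<p>Note that verification only checks the signature here; a real validator must also check the <code>exp</code> claim and any audience/issuer claims.</p>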
<h3 id="heading-use-cases-1">Use Cases</h3>
<ul>
<li><p>Short-lived API sessions.</p>
</li>
<li><p>Microservices communication.</p>
</li>
<li><p>Stateless authentication for web and mobile applications.</p>
</li>
<li><p>Single Sign-On (SSO) scenarios.</p>
</li>
</ul>
<h3 id="heading-pros-1">Pros</h3>
<ul>
<li><p>Self-contained and can carry user claims.</p>
</li>
<li><p>Stateless: No need to query the database after token issuance.</p>
</li>
<li><p>Supports token expiration and custom claims.</p>
</li>
</ul>
<h3 id="heading-cons-1">Cons</h3>
<ul>
<li><p>Larger token size compared to others.</p>
</li>
<li><p>If not properly invalidated, compromised tokens remain valid until expiry.</p>
</li>
<li><p>Need secure storage to avoid leakage.</p>
</li>
</ul>
<h3 id="heading-real-life-example-1">Real-Life Example</h3>
<ul>
<li>Accessing APIs in cloud platforms (e.g., AWS, Azure) and microservice architecture.</li>
</ul>
<hr />
<h2 id="heading-3-api-key-authentication">3. <strong>API Key Authentication</strong></h2>
<h3 id="heading-overview-2">Overview</h3>
<p>API keys are unique strings assigned to each client, included in requests to identify the client. They can be passed via headers, query parameters, or request bodies.</p>
<ol>
<li><p><strong>Obtaining API Key:</strong> Clients request an API key from the API provider. This is usually done through a developer portal or some registration process.</p>
</li>
<li><p><strong>Including API Key in Requests:</strong> Once the API key is obtained, it must be included in each API request.</p>
<pre><code class="lang-http"> GET /api/resource?api_key=123abc
</code></pre>
</li>
<li><p><strong>Server-Side Validation:</strong> The API server receives the request and extracts the API key from the specified location (URL parameter, header, etc.). The server checks the validity of the API key by comparing it against a list of authorized keys stored in its database.</p>
</li>
<li><p><strong>Authorization Check:</strong> Once the API key is validated, the server checks if the associated client or application has the necessary permissions to perform the requested action.</p>
</li>
</ol>
<p>While API keys are a straightforward authentication method, they have some limitations. The main concern is that API keys can be exposed easily if not handled securely, creating security risks. It is therefore essential to follow best practices such as using HTTPS, avoiding exposure of keys in client-side code or URLs, and implementing proper key management (rotation and revocation). For more sensitive applications, token-based authentication (e.g., JWTs) may be preferred.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1734850721518/aa708cd3-80ea-45a1-9df4-ced446bbcc6a.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-use-cases-2">Use Cases</h3>
<ul>
<li><p>Public APIs with limited scope access.</p>
</li>
<li><p>Service-to-service communication.</p>
</li>
</ul>
<h3 id="heading-pros-2">Pros</h3>
<ul>
<li><p>Simple and easy to use.</p>
</li>
<li><p>Can be restricted by IP or referrer for added security.</p>
</li>
</ul>
<h3 id="heading-cons-2">Cons</h3>
<ul>
<li><p>No built-in user identity (just a key).</p>
</li>
<li><p>Vulnerable to theft if exposed in public repositories or URLs.</p>
</li>
<li><p>Hard to revoke individual keys unless tracked explicitly.</p>
</li>
</ul>
<h3 id="heading-real-life-example-2">Real-Life Example</h3>
<ul>
<li>APIs for weather, currency conversion, or third-party integration services.</li>
</ul>
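<p>The server-side validation step described above can be sketched as follows (the <code>X-API-Key</code> header name and in-memory key store are assumptions for illustration; real services typically store keys in a database with per-key metadata and quotas):</p>
<pre><code class="lang-python">import hmac

# Hypothetical key store; a real service would back this with a database
VALID_KEYS = {"123abc": {"client": "weather-app", "scopes": ["read"]}}

def authenticate(headers: dict):
    key = headers.get("X-API-Key", "")
    for stored, meta in VALID_KEYS.items():
        # compare_digest avoids leaking key contents via timing differences
        if hmac.compare_digest(stored, key):
            return meta
    return None  # caller should respond 401 Unauthorized
</code></pre>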
<hr />
<h2 id="heading-5-oauth-20">4. <strong>OAuth 2.0</strong></h2>
<h3 id="heading-overview-3">Overview</h3>
<p>Applications often need to interact with each other on behalf of users. Whether it’s granting access to your Google account for a new app or allowing a third-party service to post to your social media, <strong>OAuth 2.0</strong> has become the go-to solution for secure, seamless access delegation.</p>
<p>In this blog, we’ll explore the core concepts of OAuth 2.0, why it’s essential, and how it enables secure interactions between applications.</p>
<h3 id="heading-what-is-oauth-20">What is OAuth 2.0?</h3>
<p>OAuth 2.0 (Open Authorization 2.0) is an <strong><em>open standard protocol</em></strong> designed to provide secure authorization for applications without exposing user credentials. It allows users to grant limited access to third-party applications.</p>
<p>For example, when you sign in to an app using your Google account, OAuth 2.0 facilitates the process, ensuring the app gets access to your profile details without revealing your password.</p>
<h3 id="heading-key-concepts-in-oauth-20">Key Concepts in OAuth 2.0</h3>
<p>OAuth 2.0 operates through several key roles:</p>
<ul>
<li><p><strong>Resource Owner (User):</strong> The individual who owns the data and can grant access to it.</p>
</li>
<li><p><strong>Client (Application):</strong> The app requesting access to the user’s data.</p>
</li>
<li><p><strong>Authorization Server:</strong> The server that authenticates the user and grants tokens.</p>
</li>
<li><p><strong>Resource Server:</strong> The server that holds the user’s data and validates tokens for access.</p>
</li>
</ul>
<h3 id="heading-token-based-access">Token-Based Access</h3>
<p>OAuth 2.0 uses tokens instead of credentials to grant access. These tokens are temporary and can be tailored for specific permissions, making them more secure and flexible.</p>
<p>There are two main types of tokens:</p>
<ul>
<li><p><strong>Access Token</strong>: Used to access protected resources.</p>
</li>
<li><p><strong>Refresh Token</strong>: Used to obtain a new access token when the current one expires.</p>
</li>
</ul>
<h3 id="heading-common-oauth-20-grant-flows"><strong>Common OAuth 2.0 Grant Flows</strong></h3>
<ol>
<li><p><strong>Authorization Code Flow</strong> (most secure, used for server-side applications):</p>
<ul>
<li><p><strong>Authorization Request</strong>: The client redirects the user to the authorization server to log in and approve access.</p>
</li>
<li><p><strong>Authorization Grant</strong>: The authorization server provides an authorization code.</p>
</li>
<li><p><strong>Token Request</strong>: The client exchanges the authorization code for an access token by making a back-channel request.</p>
</li>
<li><p><strong>Resource Request</strong>: The client uses the access token to access the protected resource.</p>
</li>
</ul>
</li>
</ol>
<p>The user is redirected to the authorization server to log in and approve access:</p>
<pre><code class="lang-http">https://localhost:8080/realms/myrealm/protocol/openid-connect/auth
  ?response_type=code
  &amp;client_id=myclient
  &amp;redirect_uri=http://localhost:3000/callback
  &amp;scope=openid
</code></pre>
<p>After successful authentication, the authorization server redirects the user to the redirect URI with an authorization code:</p>
<p><code>http://localhost:3000/callback?code=AUTHORIZATION_CODE</code></p>
<p>The client then exchanges the authorization code for tokens with a back-channel request:</p>
<pre><code class="lang-bash">curl -X POST "http://localhost:8080/realms/myrealm/protocol/openid-connect/token" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "grant_type=authorization_code" \
  -d "client_id=myclient" \
  -d "client_secret=mysecret" \
  -d "redirect_uri=http://localhost:3000/callback" \
  -d "code=AUTHORIZATION_CODE"
</code></pre>
<p>Response:</p>
<pre><code class="lang-json">{
  "access_token": "eyJhbGciOiJSUzI1NiIsInR5c...",
  "expires_in": 300,
  "refresh_token": "eyJhbGciOiJIUzI1NiIs...",
  "id_token": "eyJhbGciOiJSUzI1NiIsInR5c...",
  "token_type": "Bearer"
}
</code></pre>
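<p>Building the authorization request URL programmatically can be sketched like this (the Keycloak-style endpoint and the <code>myclient</code> values mirror the example above and are placeholders):</p>
<pre><code class="lang-python">from urllib.parse import urlencode, urlparse, parse_qs

# Parameters for the authorization request (step 1 of the flow)
params = {
    "response_type": "code",
    "client_id": "myclient",
    "redirect_uri": "http://localhost:3000/callback",
    "scope": "openid",
}
auth_url = ("https://localhost:8080/realms/myrealm/protocol/openid-connect/auth?"
            + urlencode(params))
</code></pre>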
<ol start="2">
<li><strong>Implicit Flow</strong> - deprecated (used for single-page applications):</li>
</ol>
<ul>
<li>The client directly receives the access token without an authorization code exchange.</li>
</ul>
<ol start="3">
<li><strong>Client Credentials Flow</strong> (used for machine-to-machine communication):</li>
</ol>
<ul>
<li><p>The client authenticates itself to the authorization server and directly obtains an access token.</p>
</li>
<li><p>Unlike the authorization code flow, it requires no callback URL and no authorization code exchange; the authorization server returns an access token (often a JWT) directly.</p>
</li>
<li><p>The client authenticates to the authorization server with its client ID and client secret.</p>
</li>
</ul>
<pre><code class="lang-bash">curl -X POST "https://auth.example.com/oauth/token" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "client_id=myclient" \
  -d "client_secret=mysecret" \
  -d "grant_type=client_credentials"
</code></pre>
<p>Response</p>
<pre><code class="lang-json">{
  "access_token": "eyJhbGciOiJSUzI1NiIsInR5cCI...",
  "expires_in": 300,
  "token_type": "Bearer"
}
</code></pre>
<p>The client credentials flow does not issue a refresh token.</p>
<ol start="4">
<li><strong>Resource Owner Password Credentials Flow</strong> (discouraged due to security concerns):</li>
</ol>
<ul>
<li>The client directly collects the user’s credentials and exchanges them for an access token.</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1734850283241/404624c2-37ea-4bd7-9ad1-88ca01ab876a.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-common-oauth-20-use-cases">Common OAuth 2.0 Use Cases</h3>
<ul>
<li><p><strong>Social Media Integration:</strong> Allowing apps to post on behalf of users.</p>
</li>
<li><p><strong>Cloud Storage Access:</strong> Enabling apps to fetch files from services like Google Drive or Dropbox.</p>
</li>
<li><p><strong>Payment Gateways:</strong> Granting access to payment platforms without sharing sensitive information.</p>
</li>
</ul>
<h3 id="heading-real-life-a-step-by-step-example">Real life : A Step-by-Step Example</h3>
<p>Let’s say you want to use a third-party app to analyze your Gmail data:</p>
<ol>
<li><p><strong>Authorization Request:</strong> The app redirects you to Google’s authorization server.</p>
</li>
<li><p><strong>User Consent:</strong> You log in and grant permission.</p>
</li>
<li><p><strong>Token Issuance:</strong> Google provides an access token to the app.</p>
</li>
<li><p><strong>Data Access:</strong> The app uses the token to fetch your Gmail data securely.</p>
</li>
</ol>
<h3 id="heading-pros-3"><strong>Pros</strong></h3>
<ul>
<li><p>Securely avoids sharing user credentials with third-party apps.</p>
</li>
<li><p>Supports Single Sign-On (SSO) for seamless user experience.</p>
</li>
<li><p>Provides granular access control through token scopes.</p>
</li>
<li><p>Offers token expiration and refresh for enhanced security.</p>
</li>
<li><p>Flexible for diverse use cases (e.g., mobile, web, API).</p>
</li>
<li><p>Widely adopted and integrated with major platforms.</p>
</li>
</ul>
<h3 id="heading-cons-3"><strong>Cons</strong></h3>
<ul>
<li><p>Complex implementation increases the risk of errors.</p>
</li>
<li><p>Token storage and management require careful handling.</p>
</li>
<li><p>Implicit flow deprecation impacts older implementations.</p>
</li>
<li><p>Reliance on third-party providers may raise privacy concerns.</p>
</li>
</ul>
<hr />
<h2 id="heading-choosing-the-right-authentication-method"><strong>Choosing the Right Authentication Method</strong></h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Authentication</strong></td><td><strong>Best for</strong></td><td><strong>Security</strong></td><td><strong>Ease of Implementation</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Basic</td><td>Simple apps</td><td>Low</td><td>High</td></tr>
<tr>
<td>JWT</td><td>SPAs, SSO</td><td>High</td><td>Moderate</td></tr>
<tr>
<td>API Key</td><td>Public APIs</td><td>Low</td><td>High</td></tr>
<tr>
<td>OAuth 2.0</td><td>Third-party access</td><td>Very High</td><td>Low</td></tr>
</tbody>
</table>
</div><hr />
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>Each authentication method has its own strengths and trade-offs. For simple use cases, <strong>API Keys</strong> or <strong>Basic Auth</strong> may suffice. For stateless, scalable systems, <strong>JWT</strong> is a strong choice. For delegated access and federated identity, <strong>OAuth 2.0</strong> is a robust option. Evaluate your security needs and system architecture before choosing the most appropriate solution.</p>
<p>Have any questions or insights about these methods? Let’s discuss in the comments! 👇</p>
]]></content:encoded></item><item><title><![CDATA[Understanding WebSockets: A Beginner’s Guide]]></title><description><![CDATA[WebSockets are a modern technology that makes real-time communication over the internet fast and efficient. Let’s break it down and explore why WebSockets are essential, how they work, and how they differ from traditional protocols like HTTP.
What Ar...]]></description><link>https://anishratnawat.com/understanding-websockets-a-beginners-guide</link><guid isPermaLink="true">https://anishratnawat.com/understanding-websockets-a-beginners-guide</guid><category><![CDATA[websockets]]></category><category><![CDATA[http]]></category><dc:creator><![CDATA[Anish Ratnawat]]></dc:creator><pubDate>Sat, 20 Jan 2024 18:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1734605944404/1716d668-15da-42ea-ab25-8eacc98b692f.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>WebSockets are a modern technology that makes real-time communication over the internet fast and efficient. Let’s break it down and explore why WebSockets are essential, how they work, and how they differ from traditional protocols like HTTP.</p>
<h3 id="heading-what-are-websockets">What Are WebSockets?</h3>
<p>WebSockets are a communication protocol that allows for full-duplex (two-way) communication between a client (like a web browser) and a server. This means both the client and the server can send and receive messages at any time without waiting for the other to finish.</p>
<hr />
<h3 id="heading-why-are-websockets-used">Why Are WebSockets Used?</h3>
<p>WebSockets are increasingly popular because they address the limitations of older protocols like HTTP for real-time communication. Here are the main reasons for their usage:</p>
<ul>
<li><p><strong>Real-Time Data Exchange</strong>: Ideal for applications requiring instant updates, such as chat apps, live sports scores, or stock market tracking.</p>
</li>
<li><p><strong>Reduced Latency</strong>: Unlike HTTP, WebSockets keep the connection open, reducing the delay caused by repeatedly opening and closing connections.</p>
</li>
<li><p><strong>Two-Way Communication</strong>: Enables seamless interaction, such as collaborative editing in documents or multiplayer online games.</p>
</li>
<li><p><strong>Scalable Architecture</strong>: Efficient resource use makes WebSockets suitable for large-scale real-time applications.</p>
</li>
</ul>
<hr />
<h3 id="heading-how-do-websockets-work">How Do WebSockets Work?</h3>
<p>Here’s a step-by-step explanation of how WebSockets operate:</p>
<ol>
<li><p><strong>Handshake</strong>:</p>
<ul>
<li><p>The client initiates a connection with an HTTP request with a special header (“Upgrade”).</p>
</li>
<li><p>The server responds with <strong>status 101 Switching Protocols</strong> if it agrees to upgrade the connection from HTTP to WebSocket, or with an error status code if it does not support the upgrade.</p>
</li>
</ul>
</li>
</ol>
<p>Client request header look like this:</p>
<pre><code class="lang-http"><span class="hljs-keyword">GET</span> <span class="hljs-string">/chat</span> HTTP/1.1
<span class="hljs-attribute">Host</span>: example.com
<span class="hljs-attribute">Upgrade</span>: websocket
<span class="hljs-attribute">Connection</span>: Upgrade
<span class="hljs-attribute">Sec-WebSocket-Key</span>: x3JJHMbDL1EzLkh9GBhXDw==
<span class="hljs-attribute">Sec-WebSocket-Version</span>: 13
</code></pre>
<p>The <code>Upgrade</code> field is set to <code>websocket</code> and the <code>Connection</code> field to <code>Upgrade</code>, which denotes that the client is requesting a WebSocket connection over HTTP. <strong>Sec-WebSocket-Key</strong> is a randomly generated, base64-encoded value.</p>
<p>The server reads the <code>Sec-WebSocket-Key</code>, concatenates it with a globally unique identifier fixed by the WebSocket specification, and computes the SHA-1 hash of the concatenated string. The base64-encoded hash is returned in the <strong>Sec-WebSocket-Accept</strong> header if the server is willing to accept the connection.</p>
<p>The server completes the handshake by sending the following headers to the client:</p>
<pre><code class="lang-http">HTTP/1.1 <span class="hljs-number">101</span> Switching Protocols
<span class="hljs-attribute">Upgrade</span>: websocket
<span class="hljs-attribute">Connection</span>: Upgrade
<span class="hljs-attribute">Sec-WebSocket-Accept</span>: HSmrc0sMlYUkAGmm5OPpG2HaGWk=
</code></pre>
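<p>The <code>Sec-WebSocket-Accept</code> computation can be reproduced in a few lines of Python (the GUID is fixed by RFC 6455; the key and accept values match the handshake example above):</p>
<pre><code class="lang-python">import base64
import hashlib

# Globally unique identifier mandated by RFC 6455
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

def websocket_accept(sec_websocket_key: str) -> str:
    # SHA-1 of key + GUID, then base64-encode the raw digest
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode()).digest()
    return base64.b64encode(digest).decode()
</code></pre>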
<ol start="2">
<li><p><strong>Connection Established</strong>:</p>
<ul>
<li>Once the handshake is complete, the connection stays open.</li>
</ul>
</li>
<li><p><strong>Real-Time Communication</strong>:</p>
<ul>
<li><p>Both the client and server can send messages to each other anytime.</p>
</li>
<li><p>Messages are exchanged using lightweight frames, ensuring efficiency.</p>
</li>
</ul>
</li>
<li><p><strong>Connection Closure</strong>:</p>
<ul>
<li>Either the client or the server can close the connection when the communication is complete.</li>
</ul>
</li>
</ol>
<p><strong>WebSocket has a default URI format</strong></p>
<pre><code class="lang-http">ws-URI  = "ws:" "//" host [ ":" port ] path [ "?" query ]
wss-URI = "wss:" "//" host [ ":" port ] path [ "?" query ]
</code></pre>
<p>wss denotes a secure WebSocket connection established over TLS. If no port is specified in the URI, ws defaults to port 80 and wss to port 443, the same defaults as HTTP and HTTPS.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1734604907483/18444df0-b0d6-44fc-9e95-2d7b407b92a2.png" alt class="image--center mx-auto" /></p>
<hr />
<h3 id="heading-how-websocket-different-from-http">How WebSockets Differ from HTTP</h3>
<p>WebSockets and HTTP may seem similar since they both run over TCP, but their behavior is fundamentally different.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Feature</td><td><strong>WebSockets</strong></td><td><strong>HTTP</strong></td></tr>
</thead>
<tbody>
<tr>
<td><strong>Connection</strong></td><td>Persistent (open for long-term)</td><td>Stateless (one request-response cycle)</td></tr>
<tr>
<td><strong>Communication</strong></td><td>Full-duplex (two-way)</td><td>Half-duplex (client initiates)</td></tr>
<tr>
<td><strong>Overhead</strong></td><td>Minimal (after handshake)</td><td>High (headers for every request)</td></tr>
<tr>
<td><strong>Real-Time Suitability</strong></td><td>Excellent</td><td>Requires polling/long-polling</td></tr>
</tbody>
</table>
</div><p>The WebSocket protocol is a TCP-based protocol. Its only relationship to HTTP is that its handshake is interpreted by HTTP servers as an upgrade request.</p>
<hr />
<h3 id="heading-when-is-http-preferred-over-websockets">When Is HTTP Preferred Over WebSockets?</h3>
<p>While WebSockets excel in real-time communication scenarios, HTTP is still preferred in many situations due to its simplicity and widespread use. Here are some cases where HTTP is a better choice:</p>
<ul>
<li><p><strong>Static Content Delivery</strong>:</p>
<ul>
<li>For serving static assets like HTML, CSS, JavaScript, or images, HTTP is more straightforward and efficient.</li>
</ul>
</li>
<li><p><strong>Simple Request-Response</strong>:</p>
<ul>
<li>For operations like form submissions, API calls, or fetching data where one request yields one response, HTTP is sufficient.</li>
</ul>
</li>
<li><p><strong>Short-Lived Connections</strong>:</p>
<ul>
<li>For actions that do not require a persistent connection, such as loading a webpage or making occasional API requests.</li>
</ul>
</li>
<li><p><strong>Browser and Server Compatibility</strong>:</p>
<ul>
<li>HTTP is universally supported and works seamlessly with all browsers, servers, and proxies.</li>
</ul>
</li>
<li><p><strong>Security and Caching</strong>:</p>
<ul>
<li>HTTP benefits from established security protocols and caching mechanisms, making it ideal for delivering resources efficiently.</li>
</ul>
</li>
<li><p><strong>Lower Complexity</strong>:</p>
<ul>
<li>HTTP does not require the additional implementation effort needed for managing WebSocket connections and messages.</li>
</ul>
</li>
</ul>
<hr />
<h3 id="heading-conclusion">Conclusion</h3>
<p>WebSockets are a game-changer for applications that demand real-time communication. By enabling persistent and efficient two-way communication, WebSockets reduce latency and overhead, making them a go-to choice for modern web and mobile applications. While they might not replace HTTP entirely, they complement it by addressing specific use cases like live updates and interactivity. As real-time applications grow, understanding and leveraging WebSockets can be a significant advantage for developers.</p>
]]></content:encoded></item><item><title><![CDATA[HTTP/1.1 vs. HTTP/2 vs. HTTP/3: A Comprehensive Comparison, Limitations, and Adoption Trends]]></title><description><![CDATA[The Hypertext Transfer Protocol (HTTP) is the backbone of the web, evolving over decades to address performance, scalability, and security needs. Each version brought significant changes to overcome the limitations of its predecessor. This blog delve...]]></description><link>https://anishratnawat.com/http11-vs-http2-vs-http3-a-comprehensive-comparison-limitations-and-adoption-trends</link><guid isPermaLink="true">https://anishratnawat.com/http11-vs-http2-vs-http3-a-comprehensive-comparison-limitations-and-adoption-trends</guid><category><![CDATA[http1]]></category><category><![CDATA[http]]></category><category><![CDATA[http2]]></category><category><![CDATA[http3]]></category><dc:creator><![CDATA[Anish Ratnawat]]></dc:creator><pubDate>Fri, 29 Dec 2023 18:30:00 GMT</pubDate><content:encoded><![CDATA[<p>The Hypertext Transfer Protocol (HTTP) is the backbone of the web, evolving over decades to address performance, scalability, and security needs. Each version brought significant changes to overcome the limitations of its predecessor. This blog delves into the key differences between HTTP/1.1, HTTP/2, and HTTP/3, their limitations, and their adoption trends.</p>
<hr />
<h2 id="heading-1-http11-the-foundation"><strong>1. HTTP/1.1: The Foundation</strong></h2>
<p>In HTTP/1.0, every request to the same server required a separate TCP connection. HTTP/1.1, an improvement over HTTP/1.0, introduced persistent connections, allowing a single connection to be reused for multiple requests.</p>
<h3 id="heading-key-features"><strong>Key Features</strong></h3>
<ul>
<li><p><strong>Persistent Connections</strong>: Keeps the connection open for multiple requests, reducing TCP handshake overhead.</p>
</li>
<li><p><strong>Pipelining (Theoretical)</strong>: Allows multiple requests to be sent without waiting for responses, though rarely used due to head-of-line blocking.</p>
</li>
<li><p><strong>Caching Enhancements</strong>: Improved caching with headers like <code>ETag</code> and <code>Cache-Control</code>.</p>
</li>
</ul>
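<p>As an illustration of the caching enhancements, a client can revalidate a cached resource with <code>If-None-Match</code>; if the <code>ETag</code> still matches, the server replies <code>304 Not Modified</code> and the body is not re-sent (header values below are hypothetical):</p>
<pre><code class="lang-http">GET /style.css HTTP/1.1
If-None-Match: "abc123"

HTTP/1.1 304 Not Modified
ETag: "abc123"
Cache-Control: max-age=3600
</code></pre>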
<h3 id="heading-limitations"><strong>Limitations</strong></h3>
<ul>
<li><p><strong>Head-of-Line (HOL) Blocking</strong>: Each HTTP/1.x connection could handle only one request at a time. This limitation often led to inefficient use of network resources, as subsequent requests had to wait for the previous request to complete.</p>
</li>
<li><p><strong>Lack of prioritization:</strong> HTTP/1.x did not offer a way to prioritize requests, which could lead to less critical resources blocking more important ones.</p>
</li>
</ul>
<p>All of these issues have a large performance impact, especially on the modern web.</p>
<hr />
<h2 id="heading-2-http2-multiplexing-for-the-modern-web"><strong>2. HTTP/2: Multiplexing for the Modern Web</strong></h2>
<p>HTTP/2, released in 2015, introduced major improvements in performance and efficiency. It solves the major issues of HTTP/1.x.</p>
<h3 id="heading-key-features-1"><strong>Key Features</strong></h3>
<ul>
<li><p><strong>Multiplexing</strong>: Solves HOL issue and allows multiple requests and responses to be sent simultaneously over a single TCP connection.</p>
</li>
<li><p><strong>Header Compression (HPACK)</strong>: Compresses headers to reduce overhead.</p>
</li>
<li><p><strong>Stream Prioritization</strong>: Enables prioritization of critical resources for faster page loading.</p>
</li>
<li><p><strong>Server Push</strong>: Allows servers to send resources proactively.</p>
</li>
</ul>
<h3 id="heading-limitations-1"><strong>Limitations</strong></h3>
<ul>
<li><p><strong>HOL Blocking at TCP Level</strong>: While HTTP/2 solves HOL blocking at the application level, it still exists at the TCP layer.</p>
</li>
<li><p><strong>Complexity</strong>: Implementing HTTP/2 requires more sophisticated server and client logic.</p>
</li>
</ul>
<p>HTTP/2 relies on the same underlying protocol in order to operate: TCP. This is both a positive and a negative. Because TCP is used by HTTP/1.x already it means adoption is much easier; browsers don't need to implement a new underlying protocol, and servers can continue operating as they are now with a few tweaks to implement the HTTP/2 features. The downside is that there are issues with TCP, especially in high-latency and lossy networks.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733163139477/58266277-94ed-4dfa-b3ad-893bb6430c98.png" alt class="image--center mx-auto" /></p>
<hr />
<h2 id="heading-3-http3-the-quic-revolution"><strong>3. HTTP/3: The QUIC Revolution</strong></h2>
<p>HTTP/3, finalized in 2022, represents a radical departure from the previous versions by using QUIC instead of TCP. QUIC is a transport protocol built on UDP, designed to reduce latency and improve resilience.</p>
<h3 id="heading-key-features-2"><strong>Key Features</strong></h3>
<ul>
<li><p><strong>QUIC Protocol</strong>: QUIC provides multiplexing without HOL blocking at the transport layer.</p>
</li>
<li><p><strong>Zero Round-Trip Time (0-RTT) Resumption</strong>: Reduces latency for repeat connections.</p>
</li>
<li><p><strong>Connection Migration</strong>: QUIC connections can seamlessly continue across network changes (e.g., switching from Wi-Fi to mobile).</p>
</li>
<li><p><strong>Built-in encryption:</strong> QUIC incorporates Transport Layer Security (TLS) 1.3 by default, ensuring a secure connection without the need for a separate TLS handshake. This reduces latency and improves connection establishment time.</p>
</li>
<li><p><strong>Improved congestion control:</strong> QUIC offers more advanced congestion control mechanisms, allowing it to better adapt to varying network conditions and improve overall performance.</p>
</li>
</ul>
<h3 id="heading-limitations-2"><strong>Limitations</strong></h3>
<ul>
<li><p><strong>UDP Overhead</strong>: QUIC uses UDP, which can be blocked or throttled by some network configurations.</p>
</li>
<li><p><strong>Adoption and Support</strong>: Requires updates to infrastructure (e.g., load balancers, firewalls) to handle QUIC.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733163485884/40149159-87c6-4447-a7ed-a6da7ac06a4b.png" alt class="image--center mx-auto" /></p>
<hr />
<h2 id="heading-performance-comparison"><strong>Performance Comparison</strong></h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Feature</td><td>HTTP/1.1</td><td>HTTP/2</td><td>HTTP/3</td></tr>
</thead>
<tbody>
<tr>
<td>Multiplexing</td><td>❌</td><td>✅ (Application Level)</td><td>✅ (Transport Level)</td></tr>
<tr>
<td>Head-of-Line Blocking</td><td>✅</td><td>✅ (TCP)</td><td>❌</td></tr>
<tr>
<td>Compression</td><td>❌</td><td>✅ (HPACK)</td><td>✅ (QPACK)</td></tr>
<tr>
<td>Connection Establishment</td><td>TCP (3-RTT)</td><td>TCP (3-RTT)</td><td>QUIC (1-RTT or 0-RTT)</td></tr>
<tr>
<td>Encryption</td><td>Optional (TLS)</td><td>Mandatory (TLS 1.2/1.3)</td><td>Built-in (TLS 1.3 in QUIC)</td></tr>
</tbody>
</table>
</div><hr />
<h2 id="heading-adoption-trends"><strong>Adoption Trends</strong></h2>
<ul>
<li><p><strong>HTTP/1.1</strong>: Still widely used, especially in legacy systems.</p>
</li>
<li><p><strong>HTTP/2</strong>: Adoption is strong, supported by most modern browsers and servers. Many CDNs default to HTTP/2.</p>
</li>
<li><p><strong>HTTP/3</strong>: Adoption is growing rapidly, led by major players like Google and Cloudflare. Browser support is robust, but infrastructure adoption is catching up.</p>
</li>
</ul>
<h3 id="heading-adoption-challenges-for-http3"><strong>Adoption Challenges for HTTP/3</strong></h3>
<ul>
<li><p><strong>Infrastructure Compatibility</strong>: Many middleboxes (firewalls, load balancers) need updates to handle QUIC.</p>
</li>
<li><p><strong>UDP Blockage</strong>: Some networks block or deprioritize UDP traffic, limiting QUIC’s effectiveness.</p>
</li>
</ul>
<hr />
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>Each version of HTTP addresses critical performance bottlenecks in its predecessor. While HTTP/1.1 laid the groundwork for modern web communication, HTTP/2 improved performance with multiplexing and compression. HTTP/3, with its use of QUIC, represents the future of web protocols by solving head-of-line blocking at the transport level and reducing latency.</p>
<p>As the web continues to evolve, HTTP/3 adoption will likely accelerate, driven by the demand for faster, more resilient connections. However, full adoption will depend on updating network infrastructure and overcoming the challenges associated with UDP-based protocols.</p>
<hr />
<h3 id="heading-further-reading"><strong>Further Reading</strong></h3>
<ul>
<li><p><a target="_blank" href="https://datatracker.ietf.org/doc/html/rfc9114">IETF HTTP/3 Standard</a></p>
</li>
<li><p><a target="_blank" href="https://datatracker.ietf.org/doc/html/rfc9000">QUIC Protocol</a></p>
</li>
<li><p><a target="_blank" href="https://www.chromium.org/quic/">Google’s QUIC Initiative</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Navigating System Design: How Scalability, Reliability, Availability, and Performance Shape Success]]></title><description><![CDATA[When designing modern distributed systems, four key attributes stand out as pillars of success: Scalability, Reliability, Availability, and Performance. Together, these "Fantastic Four" guide architects and engineers to build robust systems capable o...]]></description><link>https://anishratnawat.com/navigating-system-design-how-scalability-reliability-availability-and-performance-shape-success</link><guid isPermaLink="true">https://anishratnawat.com/navigating-system-design-how-scalability-reliability-availability-and-performance-shape-success</guid><category><![CDATA[System Design]]></category><category><![CDATA[Reliability]]></category><category><![CDATA[availability]]></category><category><![CDATA[scalability]]></category><category><![CDATA[performance]]></category><dc:creator><![CDATA[Anish Ratnawat]]></dc:creator><pubDate>Fri, 08 Dec 2023 18:30:00 GMT</pubDate><content:encoded><![CDATA[<p>When designing modern distributed systems, four key attributes stand out as pillars of success: <strong>Scalability</strong>, <strong>Reliability</strong>, <strong>Availability</strong>, and <strong>Performance</strong>. Together, these "Fantastic Four" guide architects and engineers to build robust systems capable of meeting diverse and demanding requirements. Let’s explore each of these in detail</p>
<hr />
<h3 id="heading-1-scalability-designing-for-growth">1. <strong>Scalability: Designing for Growth</strong></h3>
<p>Scalability refers to a system's ability to handle increased load by adding resources, either horizontally (more machines) or vertically (better machines). A well-designed scalable system maintains its performance as load grows.</p>
<h4 id="heading-few-key-considerations-for-scalability">Key Considerations for Scalability:</h4>
<ul>
<li><p><strong>Load Balancing:</strong> Distributing requests across multiple servers to avoid bottlenecks.</p>
</li>
<li><p><strong>Stateless Services:</strong> Stateless design enables easier scaling as each request can be handled independently.</p>
</li>
<li><p><strong>Partitioning/Sharding:</strong> Splitting data across different databases or servers.</p>
</li>
</ul>
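<p>As a concrete sketch of partitioning, the routing decision can be as small as one hash function. This is an illustrative, minimal example (the key format and shard count are assumptions), shown here in Python:</p>

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Deterministically map a key to a shard via a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Every server computes the same answer for the same key,
# so no central lookup table is needed to route a request.
print(shard_for("user:42", num_shards=4))
```

<p>Note that with plain modulo hashing, changing the shard count remaps most keys; production systems often use consistent hashing instead to limit data movement when nodes are added or removed.</p>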
<p><strong>Example:</strong><br />Think of an e-commerce platform during a holiday sale. The system must scale to handle millions of requests and transactions simultaneously.</p>
<hr />
<h3 id="heading-2-reliability-building-trust-in-the-system">2. <strong>Reliability: Building Trust in the System</strong></h3>
<p>Reliability ensures that the system performs correctly under expected conditions and gracefully degrades under unexpected conditions. A reliable system minimizes failures and provides consistent results.</p>
<h4 id="heading-techniques-to-enhance-reliability">Techniques to Enhance Reliability:</h4>
<ul>
<li><p><strong>Redundancy:</strong> Duplicating critical components to avoid single points of failure.</p>
</li>
<li><p><strong>Failover Mechanisms:</strong> Automatically switching to backup systems during a failure.</p>
</li>
<li><p><strong>Data Replication:</strong> Keeping multiple copies of data across different nodes or regions.</p>
</li>
</ul>
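<p>A failover mechanism can be sketched as a client that tries replicas in order until one succeeds. This is a simplified illustration (the replica interface is an assumption), not a production pattern:</p>

```python
def call_with_failover(replicas, request):
    """Try each replica in order; return the first successful response."""
    last_error = None
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError as err:
            last_error = err  # this replica is down, try the next one
    raise RuntimeError("all replicas failed") from last_error

def primary(_request):
    raise ConnectionError("primary unreachable")  # simulated outage

def backup(request):
    return "handled:" + request

print(call_with_failover([primary, backup], "ping"))  # handled:ping
```

<p>Real failover adds timeouts, retry budgets, and backoff so that a slow or flapping replica does not stall every request.</p>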
<p><strong>Example:</strong><br />Payment gateways rely heavily on reliability. Even a minor glitch can lead to financial losses or customer dissatisfaction.</p>
<hr />
<h3 id="heading-3-availability-ensuring-uptime">3. <strong>Availability: Ensuring Uptime</strong></h3>
<p>Availability is about how often a system is operational and accessible. It is measured as an <strong>uptime</strong> percentage. High availability (HA) systems aim for <strong>99.99% uptime</strong> or better, which allows roughly 52 minutes of downtime per year.</p>
<h4 id="heading-strategies-to-achieve-high-availability">Strategies to Achieve High Availability:</h4>
<ul>
<li><p><strong>Load Balancers and Health Checks:</strong> Continuously monitor services and route traffic to healthy nodes.</p>
</li>
<li><p><strong>Distributed Systems:</strong> Spreading services across multiple data centers ensures availability even during regional outages.</p>
</li>
<li><p><strong>Graceful Degradation:</strong> Allowing partial functionality when full service is not possible (e.g., read-only mode for a database).</p>
</li>
</ul>
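<p>The first strategy can be sketched as a round-robin balancer that skips nodes failing their health check. The node and health-check interfaces here are illustrative:</p>

```python
class HealthCheckedBalancer:
    """Round-robin over nodes, skipping any that fail the health check."""

    def __init__(self, nodes, is_healthy):
        self.nodes = nodes
        self.is_healthy = is_healthy
        self._next = 0

    def route(self, request):
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next]
            self._next = (self._next + 1) % len(self.nodes)
            if self.is_healthy(node):
                return node(request)
        raise RuntimeError("no healthy node available")

def node_a(request):
    return "a:" + request

def node_b(request):
    return "b:" + request

healthy = {node_b}  # node_a is currently failing its health check
balancer = HealthCheckedBalancer([node_a, node_b], lambda n: n in healthy)
print(balancer.route("GET /"))  # b:GET /
```

<p>Traffic keeps flowing to the healthy node; when node_a passes its check again, it rejoins the rotation automatically.</p>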
<p><strong>Example:</strong><br />Social media platforms prioritize availability to ensure users can access their accounts at any time, across the globe.</p>
<hr />
<h3 id="heading-4-performance-speed-and-efficiency">4. <strong>Performance: Speed and Efficiency</strong></h3>
<p>Performance measures how fast and efficiently a system processes requests and delivers results. Poor performance can drive users away, regardless of other attributes.</p>
<h4 id="heading-key-performance-metrics">Key Performance Metrics:</h4>
<ul>
<li><p><strong>Latency:</strong> Time taken to process a request.</p>
</li>
<li><p><strong>Throughput:</strong> Number of requests processed per unit time.</p>
</li>
<li><p><strong>Resource Utilization:</strong> CPU, memory, and network bandwidth usage.</p>
</li>
</ul>
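<p>Latency is usually reported as percentiles rather than averages, because a few slow requests can hide behind a good mean. A minimal percentile calculation (nearest-rank method, with made-up sample timings):</p>

```python
import math

def latency_percentile(latencies_ms, pct):
    """Return the pct-th percentile latency using the nearest-rank method."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

samples = [12, 15, 11, 90, 14, 13, 16, 12, 15, 250]  # illustrative timings
print(latency_percentile(samples, 50))  # 14  (typical request)
print(latency_percentile(samples, 99))  # 250 (tail latency)
```

<p>The p50 and p99 here differ by more than an order of magnitude, which is exactly what an average would obscure.</p>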
<h4 id="heading-performance-optimization-techniques">Performance Optimization Techniques:</h4>
<ul>
<li><p><strong>Caching:</strong> Storing frequently accessed data in memory to reduce response times.</p>
</li>
<li><p><strong>Content Delivery Networks (CDNs):</strong> Distributing static content closer to users.</p>
</li>
<li><p><strong>Asynchronous Processing:</strong> Handling non-critical tasks in the background.</p>
</li>
</ul>
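<p>The caching idea can be sketched as a small in-memory cache whose entries expire after a time-to-live. The names and the 60-second TTL are illustrative:</p>

```python
import time

class TTLCache:
    """In-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired: evict lazily on read
            return None
        return value

cache = TTLCache(ttl_seconds=60)
cache.set("product:7", {"name": "widget"})
print(cache.get("product:7"))  # served from memory, no database round trip
```

<p>The TTL bounds staleness: a value is served from memory for at most 60 seconds before the next read falls through to the source of truth again.</p>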
<p><strong>Example:</strong><br />Search engines like Google prioritize performance to return search results in milliseconds, enhancing user experience.</p>
<hr />
<h3 id="heading-final-thoughts">Final Thoughts</h3>
<p>The "Fantastic Four" of system design—scalability, reliability, availability, and performance—are not just buzzwords but essential principles that drive the architecture of modern systems. Mastering these concepts empowers engineers to build systems that not only meet today’s demands but are also prepared for future challenges.</p>
<p>In your next project, consider these pillars as guiding stars to ensure success in the ever-evolving landscape of distributed systems.</p>
]]></content:encoded></item><item><title><![CDATA[Scaling: Horizontal vs Vertical – What You Need to Know]]></title><description><![CDATA[Scaling is a crucial aspect of designing systems that can handle increasing workloads. Whether you're building a distributed system, a web application, or a backend service, choosing the right scaling strategy can significantly impact performance, co...]]></description><link>https://anishratnawat.com/scaling-horizontal-vs-vertical-what-you-need-to-know</link><guid isPermaLink="true">https://anishratnawat.com/scaling-horizontal-vs-vertical-what-you-need-to-know</guid><category><![CDATA[horizontal scaling]]></category><category><![CDATA[vertical scaling]]></category><dc:creator><![CDATA[Anish Ratnawat]]></dc:creator><pubDate>Mon, 04 Dec 2023 18:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1733078277456/6febf95d-cfbc-467b-a2dc-db4b664e13f6.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Scaling is a crucial aspect of designing systems that can handle increasing workloads. Whether you're building a distributed system, a web application, or a backend service, choosing the right scaling strategy can significantly impact performance, cost, and manageability. In this post, we'll explore two primary scaling strategies: <strong>horizontal scaling</strong> and <strong>vertical scaling</strong>, and help you decide when to use each.</p>
<hr />
<h2 id="heading-what-is-vertical-scaling">What is Vertical Scaling?</h2>
<p>Vertical scaling (also known as <strong>scaling up</strong>) involves adding more resources to a single machine. This can include:</p>
<ul>
<li><p>Adding more <strong>CPU cores</strong></p>
</li>
<li><p>Increasing <strong>RAM</strong></p>
</li>
<li><p>Upgrading to faster <strong>storage</strong> (SSDs)</p>
</li>
</ul>
<h3 id="heading-benefits-of-vertical-scaling"><strong>Benefits of Vertical Scaling:</strong></h3>
<ul>
<li><p><strong>Simplicity:</strong> No need to modify your application architecture.</p>
</li>
<li><p><strong>Quick to implement:</strong> Often requires only hardware upgrades or moving to a larger instance in cloud environments.</p>
</li>
<li><p><strong>Consistent performance:</strong> No need for load balancing or data replication.</p>
</li>
</ul>
<h3 id="heading-challenges"><strong>Challenges:</strong></h3>
<ul>
<li><p><strong>Hardware limits:</strong> There’s a ceiling to how much you can scale a single machine.</p>
</li>
<li><p><strong>Downtime risks:</strong> Upgrading a machine often requires downtime, impacting availability.</p>
</li>
<li><p><strong>Single point of failure:</strong> The system remains dependent on one machine.</p>
</li>
</ul>
<h3 id="heading-when-to-use-vertical-scaling"><strong>When to Use Vertical Scaling:</strong></h3>
<ul>
<li><p>Applications with <strong>monolithic architectures</strong>.</p>
</li>
<li><p>Systems where downtime for upgrades is acceptable.</p>
</li>
<li><p>When simplicity is a priority and workloads are predictable.</p>
</li>
</ul>
<hr />
<h2 id="heading-what-is-horizontal-scaling">What is Horizontal Scaling?</h2>
<p>Horizontal scaling (also known as <strong>scaling out</strong>) involves adding more machines (nodes) to distribute the load. In cloud environments, this often means deploying more instances of your application.</p>
<h3 id="heading-benefits-of-horizontal-scaling"><strong>Benefits of Horizontal Scaling:</strong></h3>
<ul>
<li><p><strong>Near-unlimited scaling potential:</strong> You can keep adding nodes as load grows.</p>
</li>
<li><p><strong>High availability:</strong> If one node fails, others can continue handling the load.</p>
</li>
<li><p><strong>Resilience:</strong> With proper load balancing, the system can tolerate failures better.</p>
</li>
</ul>
<h3 id="heading-challenges-1"><strong>Challenges:</strong></h3>
<ul>
<li><p><strong>Complexity:</strong> Requires changes to the application architecture to support distributed systems.</p>
</li>
<li><p><strong>Data consistency:</strong> Maintaining data consistency across multiple nodes can be challenging.</p>
</li>
<li><p><strong>Load balancing:</strong> You need effective strategies to distribute traffic across nodes.</p>
</li>
</ul>
<h3 id="heading-when-to-use-horizontal-scaling"><strong>When to Use Horizontal Scaling:</strong></h3>
<ul>
<li><p>Systems that need to handle <strong>large-scale traffic</strong> or have unpredictable workloads.</p>
</li>
<li><p>Applications built using <strong>microservices</strong> or <strong>distributed architectures</strong>.</p>
</li>
<li><p>Scenarios where <strong>high availability</strong> is a requirement.</p>
</li>
</ul>
<hr />
<h2 id="heading-a-comparative-table-horizontal-vs-vertical-scaling">A Comparative Table: Horizontal vs Vertical Scaling</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Feature</strong></td><td><strong>Vertical Scaling</strong></td><td><strong>Horizontal Scaling</strong></td></tr>
</thead>
<tbody>
<tr>
<td><strong>Cost Efficiency</strong></td><td>Expensive</td><td>More cost-effective at scale</td></tr>
<tr>
<td><strong>Complexity</strong></td><td>Low complexity</td><td>Higher complexity</td></tr>
<tr>
<td><strong>Fault Tolerance</strong></td><td>Low (single point of failure)</td><td>High (redundancy across nodes)</td></tr>
<tr>
<td><strong>Downtime</strong></td><td>Potential downtime during upgrades</td><td>Minimal downtime with new nodes</td></tr>
<tr>
<td><strong>Scaling Limit</strong></td><td>Hardware limitations</td><td>Virtually unlimited</td></tr>
<tr>
<td><strong>Application Changes</strong></td><td>Minimal</td><td>Requires architecture changes</td></tr>
</tbody>
</table>
</div><hr />
<h2 id="heading-key-considerations-when-choosing-a-scaling-strategy">Key Considerations When Choosing a Scaling Strategy</h2>
<ul>
<li><p><strong>Workload Characteristics:</strong><br />  If your application has bursty traffic, horizontal scaling can handle spikes better with load balancing.</p>
</li>
<li><p><strong>Budget Constraints:</strong><br />  Vertical scaling might be suitable for smaller applications where the cost of multiple nodes is prohibitive.</p>
</li>
<li><p><strong>Architecture Design:</strong><br />  Microservices and stateless applications thrive with horizontal scaling, while monolithic apps often require vertical scaling.</p>
</li>
<li><p><strong>Cloud Provider Features:</strong><br />  Cloud platforms like AWS, Azure, and GCP offer auto-scaling groups, making horizontal scaling more accessible.</p>
</li>
</ul>
<hr />
<h2 id="heading-real-world-examples">Real-World Examples</h2>
<ul>
<li><p><strong>Vertical Scaling:</strong><br />  A relational database like <strong>PostgreSQL</strong> on a single server can benefit from vertical scaling by adding more CPU and RAM.</p>
</li>
<li><p><strong>Horizontal Scaling:</strong><br />  Web applications using <strong>Kubernetes</strong> can deploy additional pods to handle increased traffic, making horizontal scaling seamless.</p>
</li>
</ul>
<hr />
<h2 id="heading-conclusion">Conclusion</h2>
<p>Both horizontal and vertical scaling have their place in system design. While vertical scaling offers simplicity and quick upgrades, horizontal scaling provides better fault tolerance and scalability. As a software engineer, understanding your application’s needs and workload patterns is critical to making the right decision.</p>
<p>Do you have insights on scaling strategies? Share them in the comments!</p>
]]></content:encoded></item><item><title><![CDATA[Stateful vs Stateless Applications: Key Differences and Design Considerations]]></title><description><![CDATA[In the world of distributed systems and modern application architecture, understanding whether to design an application as stateful or stateless can significantly impact performance, scalability, and user experience. This blog explores these concepts...]]></description><link>https://anishratnawat.com/stateful-vs-stateless-applications-key-differences-and-design-considerations</link><guid isPermaLink="true">https://anishratnawat.com/stateful-vs-stateless-applications-key-differences-and-design-considerations</guid><category><![CDATA[#StatefulApplications]]></category><category><![CDATA[StateLESS]]></category><dc:creator><![CDATA[Anish Ratnawat]]></dc:creator><pubDate>Tue, 13 Jul 2021 18:30:00 GMT</pubDate><content:encoded><![CDATA[<p>In the world of distributed systems and modern application architecture, understanding whether to design an application as <strong>stateful</strong> or <strong>stateless</strong> can significantly impact performance, scalability, and user experience. This blog explores these concepts, highlights their pros and cons, and provides design guidelines.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733075272437/d48a200c-daea-4428-ae67-a1e6f5cfc703.png" alt class="image--center mx-auto" /></p>
<hr />
<h2 id="heading-stateful-applications"><strong>Stateful Applications</strong></h2>
<p>A <strong>stateful application</strong> maintains state information between requests. This means the server keeps track of the user’s interactions, storing data like session information, user preferences, and temporary data across multiple client requests.</p>
<h3 id="heading-key-characteristics"><strong>Key Characteristics:</strong></h3>
<ul>
<li><p><strong>Session Management:</strong> State information is maintained on the server (e.g., session IDs, user data).</p>
</li>
<li><p><strong>Resource Dependence:</strong> Requires consistent access to the same server or storage.</p>
</li>
<li><p><strong>Failure Handling:</strong> Complex, as state restoration is necessary after crashes.</p>
</li>
</ul>
<h3 id="heading-examples"><strong>Examples:</strong></h3>
<ul>
<li><p>Banking applications (maintaining session data)</p>
</li>
<li><p>Video conferencing tools (preserving connection state)</p>
</li>
</ul>
<h3 id="heading-pros"><strong>Pros:</strong></h3>
<ul>
<li><p>Simplified user experience since state persistence allows continuity.</p>
</li>
<li><p>Easier to handle complex workflows where intermediate data is needed.</p>
</li>
</ul>
<h3 id="heading-cons"><strong>Cons:</strong></h3>
<ul>
<li><p>Scalability challenges as servers need to retain state.</p>
</li>
<li><p>Complex failure handling, as state recovery is needed after server crashes.</p>
</li>
</ul>
<hr />
<h2 id="heading-stateless-applications"><strong>Stateless Applications</strong></h2>
<p>Stateless applications treat each request independently. No session data is stored on the server, and each request carries all the information needed for processing.</p>
<h3 id="heading-key-characteristics-1"><strong>Key Characteristics:</strong></h3>
<ul>
<li><p><strong>Independent Requests:</strong> Every request is self-contained.</p>
</li>
<li><p><strong>Scalable Architecture:</strong> Easy to scale by adding more servers.</p>
</li>
<li><p><strong>Fault Tolerance:</strong> No state recovery is needed if a server crashes.</p>
</li>
</ul>
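<p>One common way to make each request self-contained is a signed token: any server holding the shared key can verify it, so no session store is needed. This is a bare-bones sketch (the key and payload are illustrative; real systems use a vetted format such as JWT):</p>

```python
import base64
import hashlib
import hmac
import json

SECRET_KEY = b"shared-secret"  # illustrative; keep real keys out of source code

def issue_token(payload):
    """Encode the payload and sign it so tampering is detectable."""
    body = base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()
    signature = hmac.new(SECRET_KEY, body.encode(), hashlib.sha256).hexdigest()
    return body + "." + signature

def verify_token(token):
    """Return the payload if the signature checks out, else None."""
    body, _, signature = token.rpartition(".")
    expected = hmac.new(SECRET_KEY, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return None  # forged or tampered token
    return json.loads(base64.urlsafe_b64decode(body))

token = issue_token({"user": "alice", "role": "viewer"})
print(verify_token(token))  # {'user': 'alice', 'role': 'viewer'}
```

<p>Because every server with the key can verify the token independently, any node can handle any request, which is what makes horizontal scaling straightforward.</p>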
<h3 id="heading-use-cases"><strong>Use Cases:</strong></h3>
<ul>
<li><p>RESTful APIs and microservices.</p>
</li>
<li><p>Content delivery networks (CDNs).</p>
</li>
<li><p>Serverless computing platforms.</p>
</li>
</ul>
<h3 id="heading-pros-and-cons"><strong>Pros and Cons:</strong></h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Pros</td><td>Cons</td></tr>
</thead>
<tbody>
<tr>
<td>Easy horizontal scaling.</td><td>Repetitive data transmission.</td></tr>
<tr>
<td>Simplified fault tolerance.</td><td>Complex workflows need extra handling.</td></tr>
</tbody>
</table>
</div><hr />
<h2 id="heading-stateful-vs-stateless-side-by-side-comparison"><strong>Stateful vs Stateless: Side-by-Side Comparison</strong></h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Feature</strong></td><td><strong>Stateful Applications</strong></td><td><strong>Stateless Applications</strong></td></tr>
</thead>
<tbody>
<tr>
<td><strong>State Management</strong></td><td>Maintains state across sessions.</td><td>No state is maintained.</td></tr>
<tr>
<td><strong>Scalability</strong></td><td>Challenging due to session affinity.</td><td>Easy horizontal scaling.</td></tr>
<tr>
<td><strong>Failure Handling</strong></td><td>Requires state restoration.</td><td>No state recovery needed.</td></tr>
<tr>
<td><strong>Performance</strong></td><td>Can be slower due to overhead.</td><td>Generally faster and simpler.</td></tr>
</tbody>
</table>
</div><hr />
<h2 id="heading-design-considerations"><strong>Design Considerations</strong></h2>
<p>When choosing between stateful and stateless architecture, consider the following factors:</p>
<ol>
<li><p><strong>Scalability Needs:</strong><br /> Stateless systems are ideal for applications that require scaling across multiple servers.</p>
</li>
<li><p><strong>User Experience:</strong><br /> Stateful applications provide smoother, continuous experiences, which are essential for certain workflows.</p>
</li>
<li><p><strong>Failure Handling:</strong><br /> Stateless applications are easier to manage in case of server failures.</p>
</li>
<li><p><strong>Resource Management:</strong><br /> Stateful applications may require more resources to manage sessions and state.</p>
</li>
</ol>
<h3 id="heading-hybrid-approach"><strong>Hybrid Approach:</strong></h3>
<p>Many modern systems adopt a hybrid approach, where critical user interactions are stateful (e.g., login sessions), while the rest of the interactions remain stateless.</p>
<hr />
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>Choosing between stateful and stateless architectures depends on your application's requirements. Stateless systems offer simplicity and scalability, making them a popular choice for cloud-native applications. However, stateful systems are essential for delivering rich, interactive user experiences where context matters. Understanding these trade-offs is key to designing robust, scalable, and efficient distributed systems.</p>
<p><strong>Tip for Software Engineers:</strong> For large-scale systems, consider using <strong>stateless microservices</strong> with a <strong>stateful data store</strong> to balance performance and user experience.</p>
<hr />
<p>Are you working on designing stateful or stateless systems? Share your thoughts and experiences in the comments below!</p>
]]></content:encoded></item><item><title><![CDATA[Long-Polling vs WebSockets vs Server-Sent Events]]></title><description><![CDATA[Long-Polling, WebSockets, and Server-Sent Events are popular communication protocols between a client like a web browser and a web server. First, let’s start with understanding what a standard HTTP web request looks like. Following are a sequence of ...]]></description><link>https://anishratnawat.com/long-polling-vs-websockets-vs-server-sent-events</link><guid isPermaLink="true">https://anishratnawat.com/long-polling-vs-websockets-vs-server-sent-events</guid><category><![CDATA[longpolling]]></category><category><![CDATA[websockets]]></category><category><![CDATA[events]]></category><dc:creator><![CDATA[Anish Ratnawat]]></dc:creator><pubDate>Mon, 20 May 2019 18:30:00 GMT</pubDate><content:encoded><![CDATA[<p>Long-Polling, WebSockets, and Server-Sent Events are popular techniques for communication between a client, such as a web browser, and a web server. First, let’s start with understanding what a standard HTTP web request looks like. The following is the sequence of events for a regular HTTP request:</p>
<ol>
<li><p>Client opens a connection and requests data from the server.</p>
</li>
<li><p>The server calculates the response.</p>
</li>
<li><p>The server sends the response back to the client on the opened request.</p>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733071284900/e44d1145-5504-46b9-b877-7563c54bea0a.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-ajax-polling">Ajax Polling</h3>
<p>Polling is a standard technique used by the vast majority of AJAX applications. The basic idea is that the client repeatedly polls (or requests) a server for data. The client makes a request and waits for the server to respond with data. If no data is available, an empty response is returned.</p>
<ol>
<li><p>Client opens a connection and requests data from the server using regular HTTP.</p>
</li>
<li><p>The requested webpage sends requests to the server at regular intervals (e.g., 0.5 seconds).</p>
</li>
<li><p>The server calculates the response and sends it back, just like regular HTTP traffic.</p>
</li>
<li><p>Client repeats the above three steps periodically to get updates from the server.</p>
</li>
</ol>
<p>The problem with polling is that the client has to keep asking the server for new data. As a result, many of the responses are empty, creating unnecessary HTTP overhead.</p>
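<p>The waste is easy to see in a toy simulation, where the "server" answers every poll immediately even when it has nothing new:</p>

```python
def serve_poll(pending_updates):
    """Plain polling: respond immediately, with data or with an empty body."""
    return pending_updates.pop(0) if pending_updates else None

pending_updates = ["event-1"]
responses = [serve_poll(pending_updates) for _ in range(5)]
print(responses)  # ['event-1', None, None, None, None]
```

<p>Four of the five round trips carried no data, yet each still paid the full HTTP request/response cost.</p>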
<h3 id="heading-http-long-polling">HTTP Long-Polling</h3>
<p>A variation of the traditional polling technique that allows the server to push information to a client, whenever the data is available. With Long-Polling, the client requests information from the server exactly as in normal polling, but with the expectation that the server may not respond immediately. That’s why this technique is sometimes referred to as a “Hanging GET”.</p>
<ul>
<li><p>If the server does not have any data available for the client, instead of sending an empty response, the server holds the request and waits until some data becomes available.</p>
</li>
<li><p>Once the data becomes available, a full response is sent to the client. The client then immediately re-requests information from the server, so that the server will almost always have a waiting request available that it can use to deliver data in response to an event.</p>
</li>
</ul>
<p>The basic life cycle of an application using HTTP Long-Polling is as follows:</p>
<ol>
<li><p>The client makes an initial request using regular HTTP and then waits for a response.</p>
</li>
<li><p>The server delays its response until an update is available, or until a timeout has occurred.</p>
</li>
<li><p>When an update is available, the server sends a full response to the client.</p>
</li>
<li><p>The client typically sends a new long-poll request, either immediately upon receiving a response or after a pause to allow an acceptable latency period.</p>
</li>
<li><p>Each Long-Poll request has a timeout. The client has to reconnect periodically after the connection is closed, due to timeouts.</p>
</li>
</ol>
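<p>The hold-until-data behavior in the steps above can be simulated with a blocking queue, where the queue timeout plays the role of the long-poll timeout (the event name and delays are made up):</p>

```python
import queue
import threading

events = queue.Queue()

def long_poll(timeout_s):
    """Hold the request until data arrives or the timeout expires."""
    try:
        return events.get(timeout=timeout_s)
    except queue.Empty:
        return None  # timed out: the client reconnects and polls again

# An update becomes available 0.1 s after the client starts waiting.
threading.Timer(0.1, events.put, args=("price-update",)).start()
print(long_poll(timeout_s=2))  # price-update
```

<p>Unlike plain polling, the response goes out the moment the data exists, and no empty responses are sent in the meantime.</p>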
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733071354969/a7459a05-79fc-4fcb-9bba-f63ae5c9a940.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-websockets"><strong>WebSockets</strong></h3>
<p>WebSocket provides full-duplex communication channels over a single TCP connection. It establishes a persistent connection between a client and a server that both parties can use to start sending data at any time. The client initiates a WebSocket connection through a process known as the WebSocket handshake. If the handshake succeeds, the server and client can exchange data in both directions at any time. The WebSocket protocol enables communication between a client and a server with lower overhead, facilitating real-time data transfer to and from the server. It provides a standardized way for the server to send content to the browser without being asked by the client, and allows messages to be passed back and forth while keeping the connection open. In this way, a two-way (bidirectional) ongoing conversation can take place between a client and a server.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733071398433/22c203c5-24a1-4ba0-9415-cf6df2249535.png" alt class="image--center mx-auto" /></p>
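<p>The handshake mentioned above is an ordinary HTTP Upgrade request. The server proves it understood the WebSocket protocol by answering the client’s <code>Sec-WebSocket-Key</code> header with an accept key derived as specified in RFC 6455:</p>

```python
import base64
import hashlib

WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"  # fixed constant from RFC 6455

def accept_key(sec_websocket_key):
    """Compute the Sec-WebSocket-Accept header value for a handshake response."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode()).digest()
    return base64.b64encode(digest).decode()

# The sample key used in RFC 6455 itself:
print(accept_key("dGhlIHNhbXBsZSBub25jZQ=="))  # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```

<p>After this exchange the HTTP semantics end, and both sides speak the WebSocket framing protocol over the same TCP connection.</p>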
<h3 id="heading-server-sent-events-sses">Server-Sent Events (SSEs)</h3>
<p>With SSE, the client establishes a persistent, long-lived connection with the server. The server uses this connection to push data to the client. If the client wants to send data to the server, it must use another channel, such as a regular HTTP request.</p>
<ol>
<li><p>Client requests data from a server using regular HTTP.</p>
</li>
<li><p>The requested webpage opens a connection to the server.</p>
</li>
<li><p>The server sends the data to the client whenever there’s new information available.</p>
</li>
</ol>
<p>SSEs are best when we need real-time traffic from the server to the client, or when the server generates data in a loop and sends multiple events to the client.</p>
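<p>On the wire, SSE is plain text: each event is one or more <code>data:</code> lines terminated by a blank line. A minimal parser for that framing (ignoring optional fields such as <code>event:</code> and <code>id:</code>):</p>

```python
def parse_sse(stream_text):
    """Collect the data payloads from a text/event-stream body."""
    events, data_lines = [], []
    for line in stream_text.splitlines():
        if line.startswith("data:"):
            data_lines.append(line[len("data:"):].lstrip())
        elif line == "" and data_lines:
            events.append("\n".join(data_lines))  # blank line ends the event
            data_lines = []
    return events

body = "data: first update\n\ndata: second update\n\n"
print(parse_sse(body))  # ['first update', 'second update']
```

<p>This text framing is why SSE works over plain HTTP with no protocol upgrade; in the browser, the <code>EventSource</code> API does this parsing for you.</p>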
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733071434110/89df28e6-10c9-4703-9fd0-dfd1fe5bfc47.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-conclusion">Conclusion</h3>
<p>Long-Polling, WebSockets, and Server-Sent Events each offer unique advantages and are suited to different use cases in client-server communication.</p>
<p>Long-Polling is a more efficient version of traditional polling, reducing unnecessary server requests by waiting for data to become available.</p>
<p>WebSockets provide a full-duplex communication channel, allowing for real-time, bidirectional data exchange with minimal overhead, making them ideal for applications requiring constant interaction, such as chat applications or live updates.</p>
<p>Server-Sent Events are optimal for scenarios where the server needs to push updates to the client, such as live news feeds or stock price updates, but do not require client-to-server communication.</p>
<p>Understanding the strengths and limitations of each protocol can help developers choose the most appropriate solution for their specific application needs.</p>
]]></content:encoded></item></channel></rss>