Agents and Tool Use — Building Autonomous AI Systems
LLMs that plan, use tools, and execute multi-step tasks autonomously. ReAct, tool calling, memory, and the architecture patterns behind production AI agents.
Module 54 built a toy ReAct agent with text parsing. This module builds production agents — structured tool calling, persistent memory, failure recovery, and the architectural patterns Indian tech teams actually ship.
The gap between a demo agent and a production agent is enormous. A demo agent works when everything goes right. A production agent handles tool failures gracefully, detects when it is stuck in a loop, maintains context across sessions, asks for clarification instead of hallucinating, and refuses irreversible actions without confirmation. These are not edge cases — they are the majority of real interactions.
Razorpay's internal dispute resolution agent handles merchant queries that span 8–12 tool calls: look up transaction, check dispute status, retrieve relevant policy, draft response, validate response, send email, update CRM, close ticket. Any step can fail. Any step can return unexpected data. The agent must handle all of this without a human in the loop on every call. Getting this right is an engineering problem as much as an ML problem.
A junior employee (chatbot) answers questions. A senior employee (basic agent) uses tools when needed. A reliable professional (production agent) uses tools correctly, handles failures without panicking, escalates when genuinely stuck, keeps records of what they did and why, and never sends an important email without double-checking the draft. The gap between junior and reliable professional is not knowledge — it is judgment, error handling, and knowing when to stop and ask.
Production agents are not smarter LLMs. They are better-engineered systems around the same LLMs. The reliability comes from the scaffolding — structured tool schemas, retry logic, loop detection, confirmation gates, and comprehensive logging.
Structured tool calling — JSON schemas, not text parsing
Module 54 parsed tool calls by extracting text between "Action:" and "(" with regex. This breaks constantly — the LLM formats output slightly differently each run, adds punctuation, or skips the format entirely. Structured tool calling solves this: you define tools as JSON schemas, the API enforces that the LLM returns a structured tool_call object, and you execute the corresponding function with validated arguments. No regex. No parsing.
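A minimal sketch of that flow, using the OpenAI-compatible function-schema shape. The tool name `lookup_transaction` and its fields are illustrative, not a real Razorpay API; the `tool_call` object at the bottom simulates what the API would return.

```python
import json

# Hypothetical tool schema in the OpenAI-compatible function-calling format.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "lookup_transaction",
        "description": "Fetch a transaction by its ID. Read-only, safe to retry.",
        "parameters": {
            "type": "object",
            "properties": {"txn_id": {"type": "string"}},
            "required": ["txn_id"],
        },
    },
}]

def lookup_transaction(txn_id: str) -> dict:
    # Stand-in for a real database lookup.
    return {"txn_id": txn_id, "status": "disputed", "amount_inr": 4999}

REGISTRY = {"lookup_transaction": lookup_transaction}

def execute_tool_call(tool_call: dict) -> str:
    """Dispatch a structured tool_call object -- no regex, no text parsing."""
    fn = REGISTRY[tool_call["function"]["name"]]
    args = json.loads(tool_call["function"]["arguments"])  # structured JSON, not free text
    return json.dumps(fn(**args))

# Simulated structured response, in the shape the API returns:
call = {"function": {"name": "lookup_transaction",
                     "arguments": '{"txn_id": "txn_123"}'}}
print(execute_tool_call(call))
```

The key design point: the LLM never emits "Action: lookup_transaction(...)" as prose. It emits a `tool_call` object whose arguments are already valid JSON, so the dispatch code is a dictionary lookup plus `json.loads`, nothing more.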
Production agent — loop detection, failure recovery, confirmation gates
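One part of this, loop detection, can be sketched in a few lines: hash every tool call (name plus canonicalized arguments) and break when the same hash repeats. The class name and threshold below are illustrative.

```python
import hashlib
import json

class LoopDetector:
    """Detect when an agent repeats the exact same tool call -- a sketch
    of the hash-every-call approach."""

    def __init__(self, max_repeats: int = 2):
        self.max_repeats = max_repeats
        self.counts: dict[str, int] = {}

    def record(self, tool_name: str, arguments: dict) -> bool:
        """Return True if this exact call has now repeated too many times."""
        key = hashlib.sha256(
            (tool_name + json.dumps(arguments, sort_keys=True)).encode()
        ).hexdigest()
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key] > self.max_repeats

detector = LoopDetector(max_repeats=2)
for _ in range(3):
    stuck = detector.record("lookup_transaction", {"txn_id": "txn_123"})
print(stuck)  # the third identical call trips the detector
```

Note the `sort_keys=True`: without it, `{"a": 1, "b": 2}` and `{"b": 2, "a": 1}` would hash differently and the loop would go undetected.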
Three memory types — conversation, episodic, and semantic
A stateless agent forgets every conversation the moment it ends. Production agents need three types of memory working together. Conversation memory is the message history within the current session — the agent knows what was said earlier in this conversation. Episodic memory stores summaries of past sessions — the agent knows this merchant called last week about the same issue. Semantic memory is the knowledge base (RAG from Module 67) — the agent knows Razorpay's policies and documentation.
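The three tiers can be sketched as one class. The in-process dict and list here are stand-ins for Redis/Postgres (episodic) and a vector store (semantic); the policy string is illustrative.

```python
from collections import deque

class AgentMemory:
    """Sketch of the three memory tiers working together."""

    def __init__(self, max_turns: int = 10):
        self.conversation = deque(maxlen=max_turns)   # current session, bounded
        self.episodic: dict[str, list[str]] = {}      # per-user past-session summaries
        self.semantic = ["Disputes must be answered within 7 days."]  # stand-in for RAG

    def add_turn(self, role: str, content: str) -> None:
        self.conversation.append({"role": role, "content": content})

    def end_session(self, user_id: str, summary: str) -> None:
        # In production: an LLM call summarizes the session before storage.
        self.episodic.setdefault(user_id, []).append(summary)

    def build_context(self, user_id: str) -> dict:
        return {
            "history": list(self.conversation),
            "past_sessions": self.episodic.get(user_id, []),
            "knowledge": self.semantic,  # in practice: retrieved per query
        }
```

Each tier answers a different question at prompt-assembly time: what was just said (conversation), what this user did before (episodic), and what the organization knows (semantic).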
Task decomposition and planning — breaking multi-step work into reliable steps
Simple queries need one tool call. Complex tasks — "process this batch of 50 dispute emails and resolve what you can, escalate the rest" — need a plan. Planning separates the reasoning about what to do from the execution of doing it. A planner LLM call generates the sequence of steps. Each step is then executed independently with its own error handling. This separation makes complex tasks more reliable because each step can be retried or skipped without re-planning the entire task.
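The execution side of that separation can be sketched as follows. In practice the plan comes from a planner LLM call; here it is hard-coded (with made-up step names) to show the structure: per-step retries, and dependents of a failed step marked skipped rather than crashing the task.

```python
# Illustrative plan -- in production this is generated by a planner LLM call.
PLAN = [
    {"id": "lookup", "depends_on": []},
    {"id": "check_policy", "depends_on": ["lookup"]},
    {"id": "draft_reply", "depends_on": ["check_policy"]},
]

def execute_plan(plan, run_step, max_retries=2):
    """Run steps in order; retry each independently, skip downstream of failures."""
    status = {}
    for step in plan:
        if any(status.get(dep) != "done" for dep in step["depends_on"]):
            status[step["id"]] = "skipped"  # upstream failed: don't re-plan, don't crash
            continue
        for attempt in range(max_retries + 1):
            try:
                run_step(step["id"])
                status[step["id"]] = "done"
                break
            except Exception:
                if attempt == max_retries:
                    status[step["id"]] = "failed"
    return status
```

The returned `status` dict is what makes this debuggable: after a run you can see exactly which step failed and which were skipped because of it, and retry just that step.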
Observability, rate limiting, and graceful degradation
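A minimal sketch of two of these concerns together: a hard per-task call budget (rate limiting against runaway loops) with every spend logged (observability). The class name and the limit of 20 are illustrative.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent")

class CallBudget:
    """Hard cap on tool calls per task, with an audit-trail log line per call."""

    def __init__(self, max_calls: int = 20):
        self.max_calls = max_calls
        self.used = 0

    def spend(self, tool_name: str) -> None:
        if self.used >= self.max_calls:
            # Graceful degradation point: the agent should catch this and
            # escalate to a human rather than keep burning tokens.
            raise RuntimeError(f"call budget exhausted at {tool_name}")
        self.used += 1
        log.info("tool=%s call=%d/%d", tool_name, self.used, self.max_calls)
```

Calling `budget.spend(name)` before every tool dispatch gives you both the hard max_calls enforcement and a per-call log line in one place.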
Every common production agent mistake — explained and fixed
The Generative AI section is complete. Section 11 — MLOps and Production — begins next.
You have now covered the full generative AI landscape across 9 modules: what generative AI is, GANs, VAEs, diffusion models, LLM pretraining and RLHF, LLM fine-tuning, multimodal models, advanced RAG, and production agents. Each module built on the last. Section 11 shifts from building models to shipping them — ML pipelines, experiment tracking, model deployment, monitoring, and the full MLOps lifecycle that keeps production models healthy over time.
Feature pipelines, training pipelines, inference pipelines. Feast and Tecton for feature stores. Airflow, Kubeflow, and Prefect for orchestration.
🎯 Key Takeaways
- ✓ Production agents differ from demo agents in error handling, not capability. The gap is: loop detection (hash every tool call, break on repetition), confirmation gates (irreversible actions must pause for explicit human approval), retry logic with backoff, hard max_calls enforcement, and comprehensive logging of every decision for debugging.
- ✓ Structured tool calling via JSON schemas eliminates text parsing failures. Define tools as OpenAI-compatible function schemas — the API returns structured tool_call objects with validated arguments. Never parse tool calls from LLM text output. The tool description must be precise: what the tool does, what arguments it needs, and whether it is irreversible.
- ✓ Three memory types work together: conversation buffer (deque of recent messages, compressed to summary when full), episodic memory (per-user summaries of past sessions stored in Redis/Postgres, injected into system prompt), semantic memory (RAG knowledge base, retrieved per query). Each addresses a different temporal scale of context.
- ✓ Task planning separates reasoning from execution. A planner LLM call generates a dependency graph of steps. Each step executes independently with its own retry logic. Failed steps mark dependent steps as skipped. This structure makes complex multi-step tasks debuggable — you can see exactly which step failed and why, and retry it without re-planning.
- ✓ Four production infrastructure requirements: logging (every tool call is an audit trail with inputs, outputs, and latency), rate limiting (prevent runaway costs from looping agents), caching (identical tool calls within a session hit the database once), and metrics (success rate, tool failure rate, loop detection rate, latency percentiles).
- ✓ Latency management: use a fast model for tool selection (Groq LLaMA-3: 300ms), reserve slower models for final generation only. Cache tool results across turns. Stream final answers token by token. Show progress indicators during multi-step execution. Target first visible output under 2 seconds even when full resolution takes 10+ seconds.
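The caching point above — identical tool calls within a session hit the backend once — can be sketched as a small per-session cache keyed on tool name plus canonicalized arguments. Names are illustrative.

```python
import json

class ToolCache:
    """Per-session memo of tool results so repeated identical calls are free."""

    def __init__(self):
        self.store: dict = {}
        self.hits = 0

    def call(self, fn, name: str, **kwargs):
        key = (name, json.dumps(kwargs, sort_keys=True))
        if key in self.store:
            self.hits += 1          # repeat call: serve from cache
        else:
            self.store[key] = fn(**kwargs)  # first call: hit the backend
        return self.store[key]
```

Scoping the cache to a session matters: a transaction's status can change between sessions, so the cache must die with the conversation, not persist globally.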