Notes on Building Agentic Workflows (Andrew Ng)

I. Introduction to Agentic Workflows

1.1 What Is Agentic AI

Core definition: An LLM-based application that completes tasks through multi-step execution flows.

Non-agentic: Single prompt, one-shot completion (like writing an essay without a backspace key)
Agentic: Multi-step flow — outline → decide if research is needed → execute searches → write draft → reflect and revise → final output

Analogy: Making a stir-fry dish with multiple AIs each handling a role (prep / cooking / plating / review)

The blue labels mark different stages of AI evolution: from prompt engineer to content engineer to Hermes engineer (Agent).

1.2 Levels of Autonomy

Agentic is an adjective, not a noun — this sidesteps the debate over “what really counts as an Agent.”

Low Autonomy	High Autonomy
All steps predefined, tool calls hardcoded	Agent dynamically decides the step sequence
AI only generates text	Agent can create new tools on its own
“Obedient but brainless assistant”	“Smart, accountable intern”

The essence: not just “can do work,” but “knows how to think about the work, what tools to use, and can self-correct.”
A proper Agentic AI should be capable of autonomous planning (Planning — selecting tools on its own) and autonomous reflection (Reflection — with memory and self-review).

1.3 Three Major Benefits

Significant performance gains: On HumanEval, an agentic GPT-3.5 can outperform a non-agentic GPT-4 (though both sound like ancient history now)
Parallel speedup: Multiple LLM instances search and read simultaneously, then aggregate results
Modular design: Freely swap components (search engines, different LLMs, different tools)

1.4 Task Decomposition Method

Core methodology:

Observe human behavior → 2. Break into sub-steps → 3. Assess LLM/tool feasibility → 4. Iterate and improve

Case study — progressive decomposition of article writing:

1 step: Generate directly (shallow)
3 steps: Outline → Search → Write (better, but may feel disconnected)
5 steps: Outline → Search → Draft → Self-critique → Revise (best — simulates the human write-reflect-revise loop)
Core principle: “If a step produces poor results, break it down into smaller sub-steps.”

Building blocks: Model (LLM) + Tools (APIs, information retrieval, code execution)

1.5 Agentic AI Evaluation (Evals — think before you build)

Andrew Ng emphasizes: The biggest predictor distinguishing effective from ineffective practitioners is a rigorous development process built around evaluation and error analysis.

Methodology:

Build first, observe outputs, then evaluate (don’t design all metrics upfront): just see how things work.
Identify low-quality outputs and define error types.
Build evaluation metrics to track errors: write scripts to automatically scan all agent outputs and count how often error outputs appear.
For subjective criteria, use LLM as Judge (1–5 scoring, but don’t let the model score directly without guidance).

Two types of evaluation: End-to-end evaluation (overall output quality) / Component-level evaluation (per-step quality)

1.6 Overview of the Four Design Patterns

Pattern	Core Idea
Reflection	Multiple agents check, evaluate, and improve their own outputs.
Tool Use	Gives LLMs the ability to call external tools/functions
Planning	The model autonomously decides the sequence of steps for complex tasks
Multi-Agent	Multiple agents with different specializations collaborate

Real-World Example: oh-my-claudecode (OMC)

OMC (oh-my-claudecode) is a real Agent system that perfectly maps to these four design patterns:

Reflection: code-reviewer / verifier agent — after the executor writes code, an independent reviewer examines it. This is exactly what section 2.1 describes: “use different models, one to generate and one to review (1+1>2).” OMC’s rule is Never self-approve in the same active context — writing code and reviewing code must always be two separate agents.
Tool Use: MCP servers (Context7 for docs, filesystem for file ops, LSP for code analysis), the Skill system (/commit, /plan, and other callable skills), Bash tools — matching section 3.1’s “tools are functions, the model autonomously decides when to use them.” Claude independently judges which tool to use for each task.
Planning: planner agent, /plan skill, plan mode — matching section 5.2’s “LLM outputs a structured plan before executing.” In plan mode, Claude first explores the codebase and designs an approach, only writing code after you approve.
Multi-Agent: Team mode (/team) can launch multiple specialized agents simultaneously (explorer for search, executor for writing code, reviewer for review, designer for UI…), sharing a TaskList and collaborating via SendMessage.

II. The Reflection Design Pattern

使用反射模式真的是能提升效果的 — Reflection really does improve output quality. So maybe taking more time to reflect on yourself actually leads to growth. 🐶

2.1 Reflection Improves Task Output

Core analogy: Humans review and revise drafts — AI can do the same.

Email writing example:

AI generates V1 → Feed V1 back to the LLM with a reflection prompt → Generate improved V2

Progressive path for code writing:

Basic: LLM writes code V1 → LLM reviews and generates V2
Advanced: Use different models — one to generate, one reasoning model to review (1+1>2)
Ultimate: Combine external feedback — execute V1 in a sandbox, capture errors, feed back to LLM to generate V2

Key insight: Reflection is an engineering practice, not magic; external feedback is the critical differentiator

2.2 Internal — Two Golden Rules of Self-Reflection

Explicitly instruct the reflection action: Say “review,” “check,” “verify” (specific actions), not just “improve.” For objective tasks: build ground-truth datasets + automated code evaluation (e.g., checking SQL query correctness).
Specify concrete criteria: List explicit evaluation dimensions (e.g., “professional tone,” “factually accurate”). For subjective tasks: use a Rubric to guide LLM scoring — avoid direct comparison (which introduces positional bias).

The paper “Self-Refine” shows that reflection consistently improved performance across all 7 tasks and 4 models tested.

2.3 External — Breaking Through the Performance Ceiling

Three performance curves:

🔴 No reflection: rapid gains from prompt engineering, then plateaus
🔵 With reflection: breaks through the plateau to a higher level
🟡 Reflection + external feedback: breaks through again to the highest level

Three types of external feedback: regex matching (avoid mentioning competitors) → search validation (fact-checking) → word count checks (format constraints)

External feedback breaks the model out of its information silo, addressing inherent weaknesses (precise counting, fact verification) and enabling closed-loop optimization.

截屏2026-04-09 03.33.25 — Performance curve comparison: reflection vs. external feedback

III. Tool Use

3.1 What Tools Actually Are

Tools are functions — the model autonomously decides when to use them.

Key capability — conditional invocation: the model intelligently judges when a tool is needed.

“How much caffeine is in green tea?” → Answer from internal knowledge
“What time is it now?” → Call the get_current_time tool

Multi-tool chaining: a calendar assistant can chain check_calendar → make_appointment

3.2 How Does an LLM Actually “Call” a Function?

In theory, an LLM never touches the execution layer of any function — from start to finish, it does exactly one thing: generate text.
What we call “calling a function” is fundamentally a text relay protocol.

sequenceDiagram
    participant U as User
    participant S as System / Engineering Code
    participant L as LLM

    U->>L: "What time is it?"

    Note over S,L: Step 1: System prompt tells LLM which tools are available
    Note right of L: System prompt includes:<br/>function name: get_current_time<br/>description: get the current time<br/>parameters: none

    Note over S,L: Step 2: LLM outputs structured text (nothing is executed!)
    L-->>S: tool_calls: [{<br/>  name: "get_current_time",<br/>  arguments: {}<br/>}]

    Note over S: Step 3: Outer code intercepts<br/>actually executes the function<br/>gets result "15:20:45"

    Note over S,L: Step 4: Result is fed back into the LLM context
    S->>L: { role: "tool",<br/>  content: "15:20:45" }

    L-->>S: "It is currently 3:20 PM."
    S->>U: "It is currently 3:20 PM."

Three key insights:

The LLM is not “calling” a function — it is predicting “the next output should be a JSON snippet expressing that I want to use this tool.” This is a pattern learned from training on large amounts of code and API documentation.
The outer engineering code is what actually executes the function — frameworks like Claude Code, the OpenAI SDK, and LangChain parse the LLM’s structured text output, execute the function, and feed the result back.
The LLM’s core ability is “judgment” — it decides “should I answer this user’s question from internal knowledge, or should I call a tool?” The “green tea caffeine” (internal knowledge) vs. “what time is it” (call a tool) example in section 3.1 illustrates exactly this judgment.

Early on, hand-written prompt templates were needed to trigger tool calls (e.g., “FUNCTION: get_current_time()”). Modern LLMs natively understand tool calling without hardcoded trigger syntax.

3.3 Tool Syntax and AI SDK (Writing Functions for LLMs to Call)

The AI SDK (from Andrew Ng’s team) unifies access to multiple LLM providers:

Function name → Python function name
Description → docstring
Parameter types → automatically extracted

3.4 Models That Write Their Own Code (LLM Writes and Calls Its Own Code)

Traditional approach (predefined add/subtract, etc.) vs. code execution (let the model write code itself):

Model outputs code inside <execute_python> tags
Code is extracted and executed in a sandbox
Error messages are fed back to the model for reflection and revision

⚠️ Security warning: A real-world case — an agentic code executor ran rm *.py and deleted all project files. Sandbox environments (Docker, E2B) are mandatory.

3.5 MCP (Model Context Protocol)

MCP standardizes how LLMs access external tools and data sources, expanding the “tool surface” available to the LLM.

Problem: m applications × n tools = m×n amount of work
MCP solution: Build n shared MCP Servers, m applications connect → work reduced to m+n
Client: The application that needs tools (Cursor, Claude Desktop, etc.)
Server: The tool/data provider (Slack, GitHub, PostgreSQL, etc.)

IV. Practical Tips for Building Agentic AI

4.1 Evals in Practice

Evaluation approaches can be divided along two dimensions, forming a 2×2 matrix to guide evaluation design:

Evaluation Dimension	Objective Evals (check with code)	Subjective Evals (use LLM as judge)
Each question has a unique correct answer (Per-Example Ground Truth)	Case 1: Invoice date extraction (each invoice has a different correct date — use code to check for a match)	Case 3: Counting gold-standard points (each topic has different key ideas — use LLM to check coverage)
Only unified rules / format / standards, no fixed answer (No Per-Example Ground Truth)	Case 2: Marketing copy length (all headlines must be 10 words — use code to check compliance)	Rubric Grading (e.g., evaluate charts against a unified clarity rubric)

Start fast and rough: Don’t be intimidated into treating evals as a massive project or spend endless time on theory first. Start with 10–20 examples and get some quick metrics to complement manual observation.
Iterate on your evals:
1. As the system and evals mature, scale up the evaluation set.
2. If the system improves but eval scores don’t go up, it’s time to improve the evals themselves.
Take inspiration from expert behavior: For systems automating human tasks, observe where the system underperforms human experts and use that as the focus for the next phase of work.

4.2 Error Analysis and Prioritization

As system complexity grows, intuition-driven debugging becomes unreliable — systematic analysis is required.

Core method:

Inspect traces and intermediate outputs: Each step’s output is called a “span”; combined, they form a “trace.”
Focus on error cases and quantify them: Build a table tracking failure rates per component.
- Example: 45% unsatisfactory search results vs. 5% poor search keyword generation → prioritize improving the search component.
- Make a habit of regularly reading the conversation log between the LLM and its tools.

4.3 Component-Level Evaluation

Analogous to unit tests vs. integration tests. Advantages: faster iteration, cleaner signal, teams can work in parallel.

Workflow: Error analysis pinpoints the problem component → Component-level eval for tuning → End-to-end eval to validate overall improvement

4.4 Strategies for Problem-Solving

Non-LLM components: Tune parameters/hyperparameters (number of search results, RAG similarity threshold), switch vendors.

LLM components (by priority):

Improve the prompt (clearer instructions, few-shot examples)
Try different LLMs (use evals to test multiple models)
Task decomposition (break complex steps into generate + reflect)
Fine-tuning (last resort — highest cost)

4.5 Latency and Cost Optimization

For early-stage teams, output quality matters far more than latency and cost. Optimize quality first, then latency, then cost.
Apply the same modular thinking: first identify which component is slowest/most expensive, then optimize it specifically (e.g., refine the prompt, switch models, reduce call frequency).

4.6 Four Phases of the Development Process

Phase	Focus	Analysis Activity
1. Rapid Prototype	Get the end-to-end flow working (“build the garbage first”)	Manually inspect outputs, read traces
2. Initial Evaluation	Go beyond manual observation	Build 10–20 example end-to-end evals
3. Rigorous Analysis	Need precise improvement direction	Error analysis, quantify component failure rates
4. Efficient Tuning	System is mature, improve at component level	Component-level evals

Two main developer activities: building (writing code) and analyzing (deciding where to focus). Teams typically spend too much time building and too little time analyzing.

V. Patterns for Highly Autonomous Agents

5.1 Planning Workflows

Planning pattern: The agent autonomously decides the tool-call sequence — no hardcoding.

Case study — customer service assistant (tools: get description, get price, check inventory, check orders, process purchase, process returns):

User asks: “Do you have round sunglasses under $100?”
LLM plans: get description → check inventory → get price → output answer

Advantages: Rich capabilities, no need to pre-orchestrate. Risks: The LLM’s plan is unpredictable and may be unstable.

5.2 Structured Plans

Natural language plans are ambiguous → require the LLM to output a structured plan (JSON/XML):

[
  {"description": "Find round sunglasses", "tool": "get_item_descriptions", "arguments": {"query": "round sunglasses"}},
  {"description": "Check inventory", "tool": "check_inventory", "arguments": {"items": "$step1_result"}}
]

5.3 Code As Action

截屏2026-04-12 03.48.14 — Code As Action — HuggingFace smolagents

Drawing from the CodeAgent concept in HuggingFace smolagents — letting the LLM write code directly to express multi-step plans.

Advantages: Can call large libraries (hundreds of Pandas functions), highly expressive, research shows better performance than JSON/text plans.
Risks: The code the LLM writes must be executed in a sandbox environment.

5.4 Multi-Agent Workflows

Even when all agents use the same LLM, splitting complex tasks into independent roles is more effective.

My personal intuition is that it works because different prompts/contexts cause the model to focus on different things.

Advantages:

Task decomposition: Natural division of work by role/skill
Focus: Developers build one role at a time; simpler tasks = better output
Modular reuse: General-purpose agents (e.g., “chart designer”) can be reused across applications
Bypasses context limits: Each agent handles its own context (critical for 128k context constraints)
Cost savings: Shorter contexts = fewer tokens = lower cost and faster response

5.5 Four Communication Patterns

Pattern	Structure	Pros	Cons	Best For
Linear	Sequential, one-directional	Simple	Inflexible	Fixed-flow tasks
Hierarchy (two-tier)	Manager coordinates all subordinates	Easy to control	Manager bottleneck	Multi-task coordination
Deep Hierarchy	Sub-agents have their own sub-agents	Scalable, modular	Complex, hard to debug	Large systems
All-to-All (Decentralized)	All agents communicate freely	Creative	Unpredictable results	Exploratory / generative tasks

Given current LLM capabilities, linear and hierarchical patterns are more practical (the deeper the hierarchy, the more information is lost in transmission).

Beyond these four patterns, there is also a conversation pattern — a downgraded version of the decentralized model. In conversation mode, only two agents talk to each other at a time: one executes the task, the other reviews it, and together they hand off a result both are satisfied with.

5.6 Framework Recommendations

LangChain: Linear workflows
smolagents: Hierarchical workflows (author’s recommendation — simple, low abstraction, @tool decorator makes development easy)
MetaGPT / CamelAI: Decentralized workflows

Summary and Personal Reflections

I previously built a Skill at work that used Claude Code to call MCP tools to inspect UE assets and parse build error logs (though maybe that Skill doesn’t quite qualify as Agentic AI). Thinking about it through the lens of Agentic AI, it probably could have been built much more robustly. I was also genuinely surprised by the stability of the test projects in Andrew Ng’s course.

1. Planning: After a timer fires, Claude Code first analyzes the error log, decides which tool to call (check docs, check code, check historical error records, etc.), then executes the tools — or even writes its own database query code on the fly.
1. Reflection: After receiving the tool result, Claude Code performs a self-review to determine if the result is useful. If it isn’t satisfied, it adjusts the query parameters and calls the tool again, repeating until it gets a result it’s happy with.
1. Multi-Agent: You could design multiple specialized agents — one dedicated to log analysis, one to querying docs, one to querying code — collaborating through a shared context.
1. Evals: You could design automated evaluation scripts to quantify Claude Code’s performance on resolving errors — metrics like success rate, average time to resolution, etc. (Each completed result could auto-upload a JSON record to a server for the admin to review weekly. Users could also be asked whether the AI’s suggested solution actually solved their problem, building up a solution database so the AI can reference past resolutions for similar future issues.)

One more thing: different models may suit different harnesses, since their capabilities vary (as mentioned in Hung-yi Lee’s course — for example, Sonnet has a kind of “context anxiety,” meaning its capabilities noticeably degrade when the context gets very long).

Finally, returning to the Tool Use design pattern: one key practical insight is that the design quality of MCP tools directly determines the capability ceiling of the agent. Drawing from my experience with the UE MCP project, here are six tool design principles I’ve distilled:

Description is the most important design decision — the description is the interface: The caller of an MCP tool is the LLM, not a human. The LLM relies on the description field to decide “when to call this and how to call it.” A good description includes: what it does, when to use it, boundary constraints, parameter semantics, and what the return value means. A poor description makes the tool dead weight.
Granularity control — use subsystems as boundaries: Tools that are too fine-grained (e.g., splitting node creation by coordinate axis) lead to long call chains and compounding errors; tools that are too coarse (e.g., generating an entire character blueprint in one call) become black boxes where the LLM can’t localize failures. Use engine subsystems as boundaries — each tool does one complete thing.
Return values must be “LLM-friendly”: Return values must include enough decision context — on success, indicate what operations are available next; on failure, provide error_type, error_message, and suggestion so the agent can self-correct rather than blindly retry.
Separate reads from writes, make side effects transparent: When uncertain, LLMs tend to call tools that “look safe.” Read-only tools and write operations should be clearly categorized, with write operations explicitly annotating side effects in the description (e.g., “creates a new file on disk,” “irreversible operation”).
Idempotent design — make the LLM willing to retry: The LLM may call the same tool repeatedly due to timeouts or misjudgments. Design tools to be safe to call multiple times (e.g., if the asset already exists, return the existing asset instead of throwing an error).
Layered tool structure: High-level tools (task-oriented complete workflows, e.g., setup_character_blueprint()) reduce the number of calls needed; mid-level tools (single-step operations) preserve flexibility; low-level APIs should not be directly exposed to the LLM. Guide the LLM in the description to prefer high-level paths.

The one-sentence summary: Good MCP tool design, at its core, means “designing tools so the LLM can use them correctly, as if it had read the documentation.” This perfectly aligns with the core idea of the Tool Use pattern in Andrew Ng’s course — the quality of your tools sets the upper bound on your agent’s autonomous decision-making.

References

https://www.bilibili.com/video/BV1DfrdByE2H — Course video (Bilibili)
Original GitHub notes: Contains runnable code you can study as Jupyter Notebooks in VSCode — very convenient.
Another video mentioned later in the course: Agentic Knowledge Graph Construction
Original course link
Hung-yi Lee’s course: A good companion — feels a bit like an Agent course in its own right

AI AgenticAI Framework