5 Hard-Earned Lessons from Working with AI Agents (Developers Must Know)

12 min read · Cong Dinh

Over the past two years, I've journeyed from skeptic to passionate advocate of AI Agents. From automating code reviews and generating training content to building complex automation workflows - each project has taught me invaluable lessons.

This isn't a tutorial on "How to build AI Agents." Instead, these are 5 battle-tested insights I wish I knew before starting. If you're a developer exploring or preparing to work with AI Agents, these lessons will help you avoid many pitfalls and save dozens of debugging hours.

Lesson 1: AI Agents Aren't as Smart as You Think (And That's Good)

Expectations vs Reality

When I first started, I had unrealistic expectations: "The AI Agent will understand context automatically, make the right decisions, and complete tasks perfectly like a senior developer."

Reality:

  • AI Agents excel at well-structured, repetitive tasks
  • They cannot reason when context is missing
  • They will "hallucinate" when uncertain
  • Output quality depends 80% on how you design prompts and workflows

Case Study: Code Review Agent Gone Wrong

I once built a Code Review Agent to automatically review pull requests. Initially, I thought providing the diff and a generic prompt would suffice:

// ❌ Too generic prompt
const prompt = `
  Review this code and provide feedback.
  Code diff: ${diff}
`;

Result? The agent generated meaningless comments like:

  • "This code looks good!"
  • "Consider adding more comments"
  • Suggested refactoring perfectly fine code

Solution: I had to redesign with specific structure:

// ✅ Structured prompt
const prompt = `
You are a senior code reviewer. Analyze this PR with these specific criteria:
 
1. **Security:** Check for SQL injection, XSS vulnerabilities, exposed secrets
2. **Performance:** Identify N+1 queries, unnecessary loops, memory leaks
3. **Maintainability:** Check naming conventions, code duplication (>3 lines)
4. **Testing:** Verify edge cases are covered
 
Code diff:
${diff}
 
Output format (JSON):
{
  "severity": "high|medium|low",
  "category": "security|performance|maintainability|testing",
  "line": <line_number>,
  "issue": "<specific issue>",
  "suggestion": "<actionable fix>",
  "example": "<code example if applicable>"
}
 
Only report issues with medium or high severity. Skip minor style suggestions.
`;

Key Takeaway

Treat AI Agents like junior developers: You need to provide specific guidance, examples, and set clear expectations. Don't expect them to "think" like seniors.

Action items:

  • ✅ Design prompts with clear structure (input format → processing steps → output format)
  • ✅ Provide examples in prompts (few-shot learning)
  • ✅ Limit scope of each task (break down instead of one giant task)
  • ✅ Validate output with rules/schemas (don't blindly trust)
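The validation item deserves a concrete shape. Below is a minimal sketch of schema-checking the review JSON defined earlier; it uses a hand-rolled validator to stay dependency-free (a library like Zod works equally well), and `parseReviewOutput` plus the exact field checks are my own illustration, not part of the original agent:

```typescript
// Shape of one review finding, mirroring the JSON output format in the prompt.
interface ReviewIssue {
  severity: 'high' | 'medium' | 'low';
  category: 'security' | 'performance' | 'maintainability' | 'testing';
  line: number;
  issue: string;
  suggestion: string;
}

// Parse and validate the agent's raw text output before trusting it.
function parseReviewOutput(raw: string): ReviewIssue[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    throw new Error('Agent returned non-JSON output');
  }
  const items = Array.isArray(parsed) ? parsed : [parsed];
  return items.map((item, i) => {
    const o = item as Record<string, unknown>;
    if (!['high', 'medium', 'low'].includes(o.severity as string)) {
      throw new Error(`Item ${i}: invalid severity "${o.severity}"`);
    }
    if (typeof o.line !== 'number' || typeof o.issue !== 'string') {
      throw new Error(`Item ${i}: missing line number or issue text`);
    }
    return o as unknown as ReviewIssue;
  });
}
```

Rejecting malformed output at this boundary means a hallucinated or truncated response fails loudly instead of silently posting a garbage comment on the PR.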

Lesson 2: Context is King - Design for Explainability

Problem: The Black Box Syndrome

One of the biggest challenges working with AI Agents is lack of transparency. When an agent makes a wrong decision, you don't know:

  • What information did it "see"?
  • How did it reason?
  • Why did it choose this action over others?

I once debugged a Training Content Generator Agent that suddenly started producing inaccurate content after a dataset update. It took me 3 days to find the cause because there was no visibility into the reasoning process.

Solution: Design for Observability

Now, I require every AI Agent to log a reasoning trace:

interface AgentTrace {
  taskId: string;
  timestamp: string;
  input: {
    userQuery: string;
    context: Record<string, any>;
    availableTools: string[];
  };
  reasoning: {
    step: number;
    thought: string;
    action: string;
    observation: string;
  }[];
  output: any;
  metadata: {
    tokensUsed: number;
    latency: number;
    cost: number;
  };
}
 
// Example trace
{
  "taskId": "review-pr-1234",
  "reasoning": [
    {
      "step": 1,
      "thought": "I need to analyze code diff for security issues",
      "action": "analyze_diff",
      "observation": "Found 3 potential SQL injection points"
    },
    {
      "step": 2,
      "thought": "Need to verify if there's input validation",
      "action": "check_validation",
      "observation": "No parameterized queries used"
    },
    {
      "step": 3,
      "thought": "This is high severity, need to report immediately",
      "action": "create_comment",
      "observation": "Comment created successfully"
    }
  ]
}

Key Takeaway

Explainability isn't a nice-to-have; it's a must-have. You can't debug what you can't see.

Action items:

  • ✅ Log entire reasoning chain (thought → action → observation)
  • ✅ Track context provided to agent (to verify information quality)
  • ✅ Implement versioning for prompts (to rollback when needed)
  • ✅ Build debugging UI to visualize agent's decision-making
  • ✅ Store conversation history to reproduce issues
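The first two items can be sketched as a small recorder that builds up the `AgentTrace` reasoning array defined above. The `TraceRecorder` class is my own illustration, not a real library API:

```typescript
// One thought → action → observation step, matching AgentTrace.reasoning.
interface ReasoningStep {
  step: number;
  thought: string;
  action: string;
  observation: string;
}

// Accumulates reasoning steps for one task so failed runs can be replayed.
class TraceRecorder {
  private steps: ReasoningStep[] = [];
  constructor(private taskId: string) {}

  record(thought: string, action: string, observation: string): void {
    this.steps.push({
      step: this.steps.length + 1, // auto-number steps in order
      thought,
      action,
      observation,
    });
  }

  // Serialize for storage (e.g. alongside the task's input and output).
  toJSON() {
    return {
      taskId: this.taskId,
      timestamp: new Date().toISOString(),
      reasoning: this.steps,
    };
  }
}
```

Wrapping every tool call in `record(...)` is the cheap version of the debugging UI: even before you build visualization, you can grep the stored traces for the step where reasoning went wrong.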

Lesson 3: Start Small, Iterate Fast (MVP Mindset for AI)

The Temptation of Over-Engineering

When I started with AI Agents, I made a classic mistake: designing a super-agent that could do everything.

Example: I wanted to build a "DevOps Assistant Agent" that could:

  • Auto-deploy applications
  • Monitor infrastructure
  • Troubleshoot issues
  • Optimize costs
  • Generate reports

After 2 weeks, I had a complex codebase with 15+ tools, 200+ lines of prompt templates, and... nothing worked properly.

The MVP Approach That Worked

I reset and applied MVP mindset:

Sprint 1 (1 week): Build an agent that does one thing - Deploy a Next.js app to Vercel

  • Input: GitHub repo URL
  • Output: Deployment URL or error message
  • No fancy features, just happy path

Sprint 2 (1 week): Add error handling

  • Parse deployment errors
  • Suggest fixes based on common issues
  • Retry logic

Sprint 3 (1 week): Expand scope

  • Support multiple platforms (Vercel, Netlify, AWS)
  • Add pre-deployment validation
  • Generate deployment summary

After 3 weeks, I had an agent that actually worked and deployed 20+ production apps.

Key Takeaway

Start with the smallest useful task. An agent that does one thing well beats one that does 10 things poorly.

Action items:

  • ✅ Identify the single most valuable task an agent can automate
  • ✅ Build MVP in 1-2 weeks (max)
  • ✅ Test with real users, gather feedback
  • ✅ Iterate based on actual usage patterns (not assumptions)
  • ✅ Scale complexity gradually, not all at once

Prioritization framework:

Value = (Time saved × Frequency) / (Complexity × Risk)

Choose task with highest Value to start.
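The formula can be turned into a tiny scoring helper. The units here are my assumptions: hours saved per run, runs per week, and complexity/risk each on a 1-5 scale where higher is worse:

```typescript
// A candidate task for your first agent, scored by the Value formula above.
interface TaskCandidate {
  name: string;
  timeSavedHours: number;    // hours saved per run
  frequencyPerWeek: number;  // runs per week
  complexity: number;        // 1-5, higher = harder to build
  risk: number;              // 1-5, higher = scarier to automate
}

function valueScore(t: TaskCandidate): number {
  return (t.timeSavedHours * t.frequencyPerWeek) / (t.complexity * t.risk);
}

// Pick the highest-Value candidate to build first.
function pickFirstAgentTask(candidates: TaskCandidate[]): TaskCandidate {
  return [...candidates].sort((a, b) => valueScore(b) - valueScore(a))[0];
}
```

Note how the formula naturally favors boring, safe, frequent tasks: a low-risk daily summary often outscores a flashy but risky deployment agent.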

Lesson 4: Human-in-the-Loop is Must-Have, Not Nice-to-Have

The Autonomous Agent Myth

There's a common misconception: "AI Agents must be fully autonomous to be valuable."

Production reality: The most production-ready agents I've built all have human oversight at critical checkpoints.

When to Add Human Checkpoints

I apply this rule:

Full automation (no human needed):

  • ✅ Low-risk, reversible actions (e.g., format code, generate test data)
  • ✅ Read-only operations (e.g., analyze logs, generate reports)
  • ✅ Well-defined, repetitive tasks (e.g., daily standup summaries)

Human-in-the-loop (approval required):

  • ⚠️ Actions affecting production (e.g., deploy, database migrations)
  • ⚠️ Financial implications (e.g., provision cloud resources)
  • ⚠️ Customer-facing content (e.g., email responses, documentation)
  • ⚠️ Security-critical operations (e.g., access control changes)

Implementation Pattern

interface AgentAction {
  type: 'automated' | 'requires_approval';
  action: string;
  reasoning: string;   // why the agent chose this action (shown to the approver)
  impact: 'low' | 'medium' | 'high';
  reversible: boolean;
  preview?: string;    // dry-run summary of the changes
}
 
async function executeAction(action: AgentAction) {
  if (action.type === 'requires_approval') {
    // Send notification to human
    const approval = await requestHumanApproval({
      action: action.action,
      reasoning: action.reasoning,
      estimatedImpact: action.impact,
      previewChanges: action.preview,
      deadline: '30 minutes', // Auto-reject if no response
    });
    
    if (!approval.approved) {
      await logRejection(approval.reason);
      return { status: 'rejected', reason: approval.reason };
    }
  }
  
  // Execute action
  const result = await performAction(action);
  
  // Always log, even for automated actions
  await logExecution(action, result);
  
  return result;
}

Real Example: Auto-merge PR Agent

I built an agent to auto-merge PRs once CI/CD passes, but with checkpoints:

  1. Auto-merge if:

    • ✅ All tests passed
    • ✅ Approved by 2+ reviewers
    • ✅ No conflicts
    • ✅ Changes < 100 lines
    • ✅ Doesn't touch critical files (auth, payment, database schemas)
  2. Request approval if:

    • ⚠️ Changes > 100 lines
    • ⚠️ Touches critical files
    • ⚠️ New dependencies added
    • ⚠️ Performance regression detected

Result: 70% of PRs were auto-merged (time saved), while 30% required human review (risk mitigated).
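The two rule sets above can be condensed into one decision function. The field names, the critical-path list, and the `blocked` outcome for PRs that fail the hard gates are my assumptions for illustration:

```typescript
interface PullRequest {
  testsPassed: boolean;
  approvals: number;
  hasConflicts: boolean;
  linesChanged: number;
  touchedFiles: string[];
  newDependencies: boolean;
  performanceRegression: boolean;
}

// Hypothetical critical paths (auth, payment, database schemas).
const CRITICAL_PATHS = ['auth/', 'payment/', 'db/schema'];

function mergeDecision(pr: PullRequest): 'auto-merge' | 'needs-approval' | 'blocked' {
  // Hard gates: never merge without green CI, 2+ approvals, no conflicts.
  if (!pr.testsPassed || pr.approvals < 2 || pr.hasConflicts) return 'blocked';

  const touchesCritical = pr.touchedFiles.some(f =>
    CRITICAL_PATHS.some(p => f.startsWith(p)));

  // Escalation triggers: size, critical files, new deps, perf regressions.
  if (pr.linesChanged >= 100 || touchesCritical ||
      pr.newDependencies || pr.performanceRegression) {
    return 'needs-approval';
  }
  return 'auto-merge';
}
```

Keeping the rules in one pure function like this also makes the policy trivially testable, which matters once the agent touches production branches.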

Key Takeaway

Trust but verify. Design agents with appropriate checkpoints. Full automation isn't always the goal.

Action items:

  • ✅ Classify actions by risk level (low/medium/high)
  • ✅ Implement approval workflows for high-risk actions
  • ✅ Add preview/dry-run mode (show what will happen before doing it)
  • ✅ Set timeouts for approval requests (auto-reject if no response)
  • ✅ Build rollback mechanisms for all destructive actions
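The timeout item can be sketched with `Promise.race`. Here `waitForHuman` stands in for whatever channel (Slack, email) delivers the real decision; the names are illustrative:

```typescript
interface ApprovalResult {
  approved: boolean;
  reason: string;
}

// Resolves with the human's decision, or auto-rejects after `timeoutMs`
// so a pending approval can never block the agent forever.
function withApprovalTimeout(
  waitForHuman: Promise<ApprovalResult>,
  timeoutMs: number,
): Promise<ApprovalResult> {
  const autoReject = new Promise<ApprovalResult>(resolve =>
    setTimeout(() => resolve({ approved: false, reason: 'timed out' }), timeoutMs));
  return Promise.race([waitForHuman, autoReject]);
}
```

Defaulting to rejection on timeout is the safe failure mode: a forgotten Slack message should never silently turn into an approved production deploy.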

Lesson 5: Cost Optimization > Feature Richness

The Hidden Cost of AI Agents

One of the biggest shocks when I moved AI Agents to production: API costs skyrocketed.

Real case study: An agent I built to generate training questions:

  • Week 1 (testing): $15
  • Week 2 (beta with 10 users): $120
  • Week 3 (rollout to 50-person team): $680 💸

Projected cost for 200-person team: $2,700/week = $140K/year.

This is when I realized: Feature creep in AI Agents = Cost creep.

Cost Optimization Strategies

1. Right-size Your Models

Not every task needs GPT-4 or Claude 3.5 Sonnet.

// ❌ Using GPT-4 for every task
const response = await openai.chat.completions.create({
  model: 'gpt-4-turbo',  // $10/1M input tokens
  messages: [{ role: 'user', content: simplePrompt }],
});
 
// ✅ Route tasks to appropriate models
function selectModel(task: Task): ModelConfig {
  if (task.requiresReasoning || task.complexity === 'high') {
    return { model: 'gpt-4-turbo', maxTokens: 4000 };
  }
  
  if (task.type === 'classification' || task.type === 'extraction') {
    return { model: 'gpt-3.5-turbo', maxTokens: 1000 };  // $0.5/1M tokens
  }
  
  if (task.type === 'simple-generation') {
    return { model: 'gpt-3.5-turbo', maxTokens: 500 };
  }
  
  return { model: 'gpt-3.5-turbo', maxTokens: 2000 };
}

Impact: Reduced cost from $680 to $180/week (73% reduction).

2. Implement Aggressive Caching

interface CacheStrategy {
  // Cache deterministic outputs
  cacheKey: string;  // hash(prompt + context)
  ttl: number;       // Time to live
  invalidateOn: string[];  // Events trigger cache clear
}
 
// Example: Cache code review results
const cacheKey = hashPrompt(diff + reviewCriteria);
const cached = await redis.get(cacheKey);
 
if (cached) {
  // Cache hit: $0 cost! A changed diff hashes to a new key, so stale
  // results are never served.
  return JSON.parse(cached);
}
 
const result = await agent.review(diff);
await redis.setex(cacheKey, 3600, JSON.stringify(result));
return result;

Impact: 60% cache hit rate → 60% cost reduction.

3. Batch Processing

// ❌ Process one at a time
for (const item of items) {
  await agent.process(item);  // 100 API calls
}
 
// ✅ Batch processing
const batches = chunk(items, 10);  // 10 items per batch
for (const batch of batches) {
  await agent.processBatch(batch);  // 10 API calls
}
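The `chunk` helper used above isn't defined in the snippet; a minimal version:

```typescript
// Split an array into fixed-size batches; the final batch may be shorter.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```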

4. Set Token Limits Aggressively

// ❌ No limits
const response = await openai.chat.completions.create({
  model: 'gpt-4-turbo',
  messages: [...],
  // Agent can generate 4000+ tokens if it wants
});
 
// ✅ Strict limits based on use case
const response = await openai.chat.completions.create({
  model: 'gpt-4-turbo',
  messages: [...],
  max_tokens: 500,  // Enough for most outputs, prevent rambling
  temperature: 0.3, // Lower = more focused, less creative waste
});

5. Monitor and Alert

// Daily cost tracking
interface CostMetrics {
  dailySpend: number;
  costPerTask: number;
  topExpensiveAgents: Agent[];
  unusualSpikes: Alert[];
}
 
// Set budget alerts
if (metrics.dailySpend > DAILY_BUDGET * 1.2) {
  await notify.slack({
    channel: '#ai-costs',
    message: `⚠️ AI costs 20% over budget: $${metrics.dailySpend}`,
  });
  
  // Auto-throttle if critical
  if (metrics.dailySpend > DAILY_BUDGET * 1.5) {
    await throttleAgents({ rateLimit: 0.5 }); // Reduce to 50% capacity
  }
}

The Cost vs Value Framework

I apply this framework to decide whether to optimize:

ROI = (Time Saved × Hourly Rate × Users) - Monthly AI Cost

If ROI > 3x → Keep and improve
If ROI 1-3x → Optimize cost
If ROI < 1x → Shut down or pivot

Example:

  • Agent: Auto-generate unit tests
  • Time saved: 2 hours/developer/week
  • Users: 20 developers
  • Hourly rate: $50
  • Monthly AI cost: $400
ROI = (2h × $50 × 20 devs × 4 weeks) - $400
    = $8,000 - $400
    = $7,600 (19x return)

Keep it running, but still optimize to increase margin!
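The framework and the worked example can be folded into one helper. I'm using the article's convention of 4 weeks per month, and expressing the "x" thresholds as net return over cost, which matches the 19x figure above:

```typescript
interface AgentROIInput {
  hoursSavedPerUserPerWeek: number;
  hourlyRate: number;      // e.g. $50
  users: number;
  monthlyAICost: number;   // e.g. $400
}

function roiDecision(i: AgentROIInput): 'keep' | 'optimize' | 'shut down' {
  // Monthly value delivered, assuming 4 working weeks per month.
  const monthlySavings =
    i.hoursSavedPerUserPerWeek * i.hourlyRate * i.users * 4;
  // Net return expressed as a multiple of cost: ($8,000 - $400) / $400 = 19x.
  const returnMultiple = (monthlySavings - i.monthlyAICost) / i.monthlyAICost;
  if (returnMultiple > 3) return 'keep';
  if (returnMultiple >= 1) return 'optimize';
  return 'shut down';
}
```

Running this monthly against every deployed agent is an easy way to operationalize the "kill underperforming agents" rule below.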

Key Takeaway

Measure everything. Without visibility into costs, you can't optimize. Treat the AI budget like a cloud infrastructure budget.

Action items:

  • ✅ Track cost per task, per agent, per user
  • ✅ Set up budget alerts (daily, weekly, monthly)
  • ✅ Implement caching strategy for repetitive tasks
  • ✅ Right-size models based on task complexity
  • ✅ Review cost/value ratio monthly, kill underperforming agents

Conclusion: From Hype to Reality

AI Agents aren't a silver bullet. They won't replace developers, nor will they automatically solve all problems. But when designed correctly, they're incredibly powerful tools to:

  • ✅ Automate repetitive, well-defined tasks
  • ✅ Augment human decision-making with insights and suggestions
  • ✅ Scale expertise (one senior developer can support more teams)

5 golden principles I learned:

  1. Treat agents like junior devs - Clear instructions, examples, validation
  2. Design for explainability - You need to understand why agents do what they do
  3. Start small, iterate fast - MVP mindset > Big bang approach
  4. Human-in-the-loop - Trust but verify, especially for high-risk actions
  5. Optimize costs ruthlessly - Measure, cache, right-size, monitor

Next Steps

If you're considering building AI Agents:

Start with this question: "What task in my team's daily work is repetitive, well-structured, and takes the most time?"

→ That's the perfect candidate for your first AI Agent.

Share your experience: Have you worked with AI Agents? What lessons did you learn? Connect with me on LinkedIn or email congdinh2021@gmail.com to discuss!

Interested in AI/AI Agents training for your team? I provide consulting and training services on applying AI Agents to software development workflows. Contact me to discuss!


This article is part of the "AI for Developers" series. Subscribe to receive the latest posts on AI, DevOps, and Software Architecture.

Cong Dinh

Technology Consultant | Trainer | Solution Architect

With over 10 years of experience in web development and cloud architecture, I help businesses build modern and sustainable technology solutions. Expertise: Next.js, TypeScript, AWS, and Solution Architecture.
