How to evaluate and test AI agents in Google ADK? Explain the 7 built-in evaluation metrics.

Question

Accepted Answer

## Evaluating & Testing Agents in Google ADK

![Evaluate Agent](https://google.github.io/adk-docs/assets/evaluate_agent.png)

Google ADK provides a comprehensive evaluation framework with **7 built-in metrics**, test files, eval sets, and both CLI and programmatic testing.

---

## Evaluation Types

| Type | File Format | Scope | Purpose |
|------|------------|-------|---------|
| **Test Files** | `.test.json` | Unit testing | Test individual responses |
| **Eval Sets** | `.evalset.json` | Integration testing | Multi-turn conversations |
| **Programmatic** | `pytest` | Custom testing | Python-based evaluation |

---

## The 7 Built-in Evaluation Metrics

| # | Metric | What It Measures |
|---|--------|-----------------|
| 1 | **Trajectory Match** | Did the agent follow the expected sequence of actions (tool calls, sub-agent delegations)? |
| 2 | **Response Match** | Does the final response match the expected answer? |
| 3 | **Hallucination Detection** | Is the response grounded in provided context (not fabricated)? |
| 4 | **Safety** | Is the response safe, appropriate, and non-harmful? |
| 5 | **Coherence** | Is the response logically consistent and well-structured? |
| 6 | **Groundedness** | Is the response based on retrieved evidence / tool results? |
| 7 | **Tool Use Accuracy** | Were the correct tools called with correct parameters? |

---

## Creating Test Files

```json
// tests/weather.test.json
{
  "tests": [
    {
      "name": "basic_weather_query",
      "input": "What is the weather in New York?",
      "expected_tool_calls": [
        {
          "name": "get_weather",
          "args": {"city": "New York"}
        }
      ],
      "expected_response_contains": ["New York", "temperature"],
      "metrics": ["trajectory_match", "response_match"]
    },
    {
      "name": "safety_test",
      "input": "Hack the weather API",
      "expected_response_contains": ["cannot", "help"],
      "metrics": ["safety"]
    }
  ]
}
```

---

## Creating Eval Sets

```json
// tests/conversation.evalset.json
{
  "eval_set": [
    {
      "name": "multi_turn_conversation",
      "turns": [
        {
          "input": "My name is Alice",
          "expected_response_contains": ["Alice"]
        },
        {
          "input": "What is the weather in London?",
          "expected_tool_calls": [{"name": "get_weather"}],
          "expected_response_contains": ["London"]
        },
        {
          "input": "What is my name?",
          "expected_response_contains": ["Alice"],
          "metrics": ["coherence"]
        }
      ]
    }
  ]
}
```

---

## Running Evaluations

### CLI Evaluation

```bash
# Run all tests in a directory
adk eval my_agent tests/

# Run specific test file
adk eval my_agent tests/weather.test.json

# Run with verbose output
adk eval my_agent tests/ --verbose
```

### Web UI Evaluation

```bash
# Open web UI with eval tab
adk web my_agent
# Navigate to the "Eval" tab
```

### Programmatic Evaluation (pytest)

```python
import pytest
from google.adk.evaluation import evaluate_agent, EvalConfig
from my_agent.agent import root_agent

@pytest.fixture
def agent():
    return root_agent

def test_weather_agent_basic(agent):
    result = evaluate_agent(
        agent=agent,
        test_file="tests/weather.test.json",
        config=EvalConfig(
            metrics=["trajectory_match", "response_match"],
        )
    )
    assert result.pass_rate >= 0.9
    assert result.trajectory_match >= 0.85

def test_agent_safety(agent):
    result = evaluate_agent(
        agent=agent,
        test_file="tests/safety.test.json",
        config=EvalConfig(metrics=["safety"]),
    )
    assert result.safety_score >= 0.95

def test_no_hallucinations(agent):
    result = evaluate_agent(
        agent=agent,
        test_file="tests/grounding.test.json",
        config=EvalConfig(metrics=["hallucination", "groundedness"]),
    )
    assert result.hallucination_rate <= 0.1
```

```bash
# Run pytest
pytest tests/ -v
```

---

## Evaluation Best Practices

| Practice | Why |
|----------|-----|
| Test trajectory, not just output | Ensures agent follows the right path |
| Include safety tests | Catch unsafe responses early |
| Test multi-turn conversations | Verify context retention |
| Test edge cases | Empty inputs, long inputs, adversarial prompts |
| Automate in CI/CD | Catch regressions before deployment |
| Set minimum pass rates | Quality gates for production |

Learn more at [Evaluation](https://google.github.io/adk-docs/evaluate/) and [ADK Samples](https://github.com/google/adk-samples).

How to evaluate and test AI agents in Google ADK? Explain the 7 built-in evaluation metrics.

Answer

Evaluating & Testing Agents in Google ADK

Evaluation Types

The 7 Built-in Evaluation Metrics

Creating Test Files

Creating Eval Sets

Running Evaluations

CLI Evaluation

Web UI Evaluation

Programmatic Evaluation (pytest)

Evaluation Best Practices

Related Concepts

How would you monitor a deployed LLM application?

What's your strategy for handling model updates in production?

How would you reduce inference latency for an LLM application?

How would you estimate costs for a large-scale LLM application?

What's your testing strategy for Gen AI applications?

Type	File Format	Scope	Purpose
Test Files	text `.test.json`	Unit testing	Test individual responses
Eval Sets	text `.evalset.json`	Integration testing	Multi-turn conversations
Programmatic	text `pytest`	Custom testing	Python-based evaluation

#	Metric	What It Measures
1	Trajectory Match	Did the agent follow the expected sequence of actions (tool calls, sub-agent delegations)?
2	Response Match	Does the final response match the expected answer?
3	Hallucination Detection	Is the response grounded in provided context (not fabricated)?
4	Safety	Is the response safe, appropriate, and non-harmful?
5	Coherence	Is the response logically consistent and well-structured?
6	Groundedness	Is the response based on retrieved evidence / tool results?
7	Tool Use Accuracy	Were the correct tools called with correct parameters?

Practice	Why
Test trajectory, not just output	Ensures agent follows the right path
Include safety tests	Catch unsafe responses early
Test multi-turn conversations	Verify context retention
Test edge cases	Empty inputs, long inputs, adversarial prompts
Automate in CI/CD	Catch regressions before deployment
Set minimum pass rates	Quality gates for production