# Testing agents

> Verify agent behavior with text-only tests using pytest

The `vision_agents.testing` module provides a lightweight testing layer for verifying agent behavior — tool calls, arguments, responses, and intent — without spinning up audio/video infrastructure.

<Info>
  This framework uses familiar pytest patterns. No custom test runner required.
</Info>

## Installation

The testing module is included with Vision Agents:

```sh
uv add vision-agents
```

Async tests need the `pytest-asyncio` plugin, which provides the `asyncio_mode` option. Enable its auto mode in `pytest.ini`:

```ini
[pytest]
asyncio_mode = auto
```
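
In `auto` mode, `pytest-asyncio` collects a plain `async def test_...` function and runs it on an event loop, so no `@pytest.mark.asyncio` decorator is needed. A minimal sketch of what that looks like, where `fetch_greeting` is a hypothetical coroutine standing in for real agent code:

```python
import asyncio


async def fetch_greeting() -> str:
    """Hypothetical coroutine standing in for an agent call."""
    await asyncio.sleep(0)  # yield to the event loop once
    return "hello"


# No @pytest.mark.asyncio decorator: with asyncio_mode = auto,
# pytest-asyncio runs this coroutine as a regular test.
async def test_fetch_greeting():
    assert await fetch_greeting() == "hello"
```

The test examples on this page rely on this behavior, which is why none of them carry an asyncio marker.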

## Core concepts

| Component                    | Purpose                                                  |
| ---------------------------- | -------------------------------------------------------- |
| `TestSession`                | Async context manager that wraps an LLM for testing      |
| `TestResponse`               | Result of a conversation turn with events and assertions |
| `LLMJudge`                   | Evaluates agent responses against target intents         |
| `TestSession.mock_functions` | Wraps tools in `AsyncMock` for call tracking             |

## Basic usage

### Testing a greeting

```python
from vision_agents.plugins import gemini
from vision_agents.testing import LLMJudge, TestSession

async def test_greeting():
    llm = gemini.LLM()
    judge = LLMJudge(gemini.LLM())

    async with TestSession(llm=llm, instructions="Be friendly") as session:
        response = await session.simple_response("Hello")

        # Verify no tools were called
        assert response.function_calls == []

        # Judge the response intent
        verdict = await judge.evaluate(
            response.chat_messages[0],
            intent="Friendly greeting"
        )
        assert verdict.success, verdict.reason
```

### Testing tool calls

```python
from vision_agents.plugins import gemini
from vision_agents.testing import LLMJudge, TestSession

async def test_weather():
    llm = gemini.LLM()
    judge = LLMJudge(gemini.LLM())

    @llm.register_function(description="Get weather for a location")
    async def get_weather(location: str) -> dict:
        return {"temp": 72, "condition": "sunny"}

    async with TestSession(llm=llm, instructions="You can check weather") as session:
        response = await session.simple_response("Weather in Tokyo?")

        # Assert the tool was called with expected arguments
        response.assert_function_called("get_weather", arguments={"location": "Tokyo"})

        # Judge the response
        verdict = await judge.evaluate(
            response.chat_messages[0],
            intent="Reports weather for Tokyo"
        )
        assert verdict.success, verdict.reason
```

## TestResponse assertions

`TestResponse` provides built-in assertion methods:

### assert\_function\_called

Verifies a tool was called with expected arguments (partial match):

```python
# Check function was called with specific argument
response.assert_function_called("get_weather", arguments={"location": "Tokyo"})

# Check function was called (any arguments)
response.assert_function_called("get_weather")

# Check any function was called
response.assert_function_called()
```
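
One plausible reading of "partial match" is that only the keys you pass are compared, and any extra arguments the model supplies are ignored. The sketch below illustrates that reading with a hypothetical `arguments_match` helper. It is not the library's implementation, so check the actual behavior before relying on it.

```python
def arguments_match(expected: dict, actual: dict) -> bool:
    """Return True if every expected key is present in `actual` with an equal value.

    Keys that appear only in `actual` are ignored.
    """
    return all(
        key in actual and actual[key] == value
        for key, value in expected.items()
    )


# The model passed an extra `unit` argument; the match still succeeds.
actual_call = {"location": "Tokyo", "unit": "celsius"}
assert arguments_match({"location": "Tokyo"}, actual_call)
assert not arguments_match({"location": "Berlin"}, actual_call)
assert arguments_match({}, actual_call)  # no expectations: any arguments pass
```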

### assert\_function\_output

Verifies a tool's output, or that the call ended in an error:

```python
# Check exact output
response.assert_function_output("get_weather", output={"temp": 72, "condition": "sunny"})

# Check if output was an error
response.assert_function_output("get_weather", is_error=True)
```

### Accessing events directly

```python
# Pre-computed lists for inspection
response.function_calls  # List of FunctionCallEvent
response.chat_messages   # List of ChatMessageEvent
response.events          # All events in order
response.output          # Final assistant message text
response.duration_ms     # Response time in milliseconds
```
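
When the built-in assertions are not specific enough, these lists can be filtered with ordinary Python. The sketch below uses a hypothetical `calls_named` helper and `SimpleNamespace` stand-ins that carry only the `name` and `arguments` fields. In a real test you would pass `response.function_calls` instead. The `get_time` tool is also made up for illustration.

```python
from types import SimpleNamespace


def calls_named(function_calls, name: str) -> list:
    """Return the call events whose `name` matches, preserving order."""
    return [call for call in function_calls if call.name == name]


# Stand-in events; in a real test pass `response.function_calls` instead.
events = [
    SimpleNamespace(name="get_weather", arguments={"location": "Tokyo"}),
    SimpleNamespace(name="get_time", arguments={"timezone": "Asia/Tokyo"}),
    SimpleNamespace(name="get_weather", arguments={"location": "Osaka"}),
]

weather_calls = calls_named(events, "get_weather")
assert [call.arguments["location"] for call in weather_calls] == ["Tokyo", "Osaka"]
```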

## Mocking LLM functions

### mock\_functions

Use `TestSession.mock_functions` to wrap registered tools in `AsyncMock`, so you can track calls with standard `unittest.mock` assertions:

```python
from vision_agents.plugins import gemini
from vision_agents.testing import TestSession

async def test_with_mock_functions():
    llm = gemini.LLM()

    @llm.register_function(description="Get weather")
    async def get_weather(location: str) -> dict:
        return {"temp": 72}

    async def fake_weather(**_) -> dict:
        return {"temp": 55, "condition": "rainy"}

    async with TestSession(llm=llm, instructions="...") as session:
        with session.mock_functions(
            {"get_weather": fake_weather}
        ) as mocked:
            response = await session.simple_response("Weather in Berlin?")

            # unittest.mock assertions
            mocked["get_weather"].assert_called_once()
            mocked["get_weather"].assert_called_with(location="Berlin")

            # TestResponse assertion
            response.assert_function_output(
                "get_weather",
                output={"temp": 55, "condition": "rainy"}
            )
```
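
The call-tracking assertions above come from the standard library's `AsyncMock`. The standalone sketch below shows the same assertions without an LLM. Wiring the replacement in through `side_effect` is an assumption made for illustration, and `mock_functions` may wrap the function differently.

```python
import asyncio
from unittest.mock import AsyncMock


async def fake_weather(**_) -> dict:
    return {"temp": 55, "condition": "rainy"}


async def main() -> dict:
    # side_effect delegates each call to fake_weather while the mock records it.
    mocked = AsyncMock(side_effect=fake_weather)
    result = await mocked(location="Berlin")

    mocked.assert_awaited_once()
    mocked.assert_called_with(location="Berlin")
    assert mocked.call_args.kwargs == {"location": "Berlin"}
    return result


result = asyncio.run(main())
assert result == {"temp": 55, "condition": "rainy"}
```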

## LLM-as-judge

`LLMJudge` uses a separate LLM instance to evaluate whether agent responses match target intents:

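```python
from vision_agents.plugins import gemini
from vision_agents.testing import LLMJudge, JudgeVerdict

# Use a separate LLM instance for judging
judge = LLMJudge(gemini.LLM())

# Evaluate a response. `response` comes from `session.simple_response(...)`,
# and the `await` must run inside an async test function.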
verdict: JudgeVerdict = await judge.evaluate(
    response.chat_messages[0],
    intent="Provides a helpful, accurate weather report"
)

if verdict.success:
    print(f"Passed: {verdict.reason}")
else:
    print(f"Failed: {verdict.reason}")
```

<Tip>
  Use a separate LLM instance for the judge to avoid polluting the agent's conversation history.
</Tip>
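
The evaluate-then-assert pair repeats in every test, so it can be folded into a small helper. The sketch below uses a hypothetical `assert_intent` function. `StubJudge` and `StubVerdict` are stand-ins that let the example run without an LLM, and real tests would pass an `LLMJudge` and a chat message event instead.

```python
import asyncio
from dataclasses import dataclass


async def assert_intent(judge, message, intent: str) -> None:
    """Fail with the judge's reason when the intent is not fulfilled."""
    verdict = await judge.evaluate(message, intent=intent)
    assert verdict.success, verdict.reason


# Stand-ins so the sketch runs without an LLM; use LLMJudge in real tests.
@dataclass
class StubVerdict:
    success: bool
    reason: str


class StubJudge:
    async def evaluate(self, message: str, intent: str) -> StubVerdict:
        ok = intent.lower() in message.lower()
        return StubVerdict(ok, "intent text found" if ok else "intent text missing")


asyncio.run(assert_intent(StubJudge(), "Sunny weather in Tokyo", "weather"))
```

Because the verdict's `reason` is used as the assertion message, a failing test reports why the judge rejected the response.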

## Event types

The framework captures three event types during a conversation turn:

| Event                     | Description               | Fields                                            |
| ------------------------- | ------------------------- | ------------------------------------------------- |
| `ChatMessageEvent`        | Assistant or user message | `role`, `content`                                 |
| `FunctionCallEvent`       | Tool invocation request   | `name`, `arguments`, `tool_call_id`               |
| `FunctionCallOutputEvent` | Tool execution result     | `name`, `output`, `is_error`, `execution_time_ms` |
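
To make the table concrete, the sketch below models the three events as dataclasses with the listed fields. These are illustrative stand-ins: the field names come from the table, while the types and defaults are assumptions and the library's real definitions may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ChatMessageEvent:
    role: str  # for example "assistant" or "user"
    content: str


@dataclass
class FunctionCallEvent:
    name: str
    arguments: dict = field(default_factory=dict)
    tool_call_id: Optional[str] = None


@dataclass
class FunctionCallOutputEvent:
    name: str
    output: Any = None
    is_error: bool = False
    execution_time_ms: Optional[float] = None


# A tool-using turn typically yields a call, its output, then a message.
events = [
    FunctionCallEvent(name="get_weather", arguments={"location": "Tokyo"}, tool_call_id="call_1"),
    FunctionCallOutputEvent(name="get_weather", output={"temp": 72}, execution_time_ms=12.5),
    ChatMessageEvent(role="assistant", content="It is 72 degrees in Tokyo."),
]
kinds = [type(event).__name__ for event in events]
```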

## Complete example

```python
import pytest
from vision_agents.plugins import gemini
from vision_agents.testing import LLMJudge, TestSession

MODEL = "gemini-flash-lite-latest"
INSTRUCTIONS = """You are a helpful assistant.
You can check the weather using the get_weather tool."""


def setup_llm(model: str):
    llm = gemini.LLM(model)

    @llm.register_function(description="Get weather for a location")
    async def get_weather(location: str) -> dict:
        return {"temp_f": 72, "condition": "sunny"}

    return llm


@pytest.mark.integration
async def test_greeting():
    """Agent gives a friendly, short greeting."""
    llm = setup_llm(MODEL)
    judge = LLMJudge(gemini.LLM(MODEL))

    async with TestSession(llm=llm, instructions=INSTRUCTIONS) as session:
        response = await session.simple_response("Hey there!")
        assert response.function_calls == []
        verdict = await judge.evaluate(
            response.chat_messages[0],
            intent="Friendly, short greeting"
        )
        assert verdict.success, verdict.reason


@pytest.mark.integration
async def test_weather_tool_call():
    """Agent calls get_weather with the right location."""
    llm = setup_llm(MODEL)
    judge = LLMJudge(gemini.LLM(MODEL))

    async with TestSession(llm=llm, instructions=INSTRUCTIONS) as session:
        response = await session.simple_response("What's the weather in Berlin?")
        response.assert_function_called("get_weather", arguments={"location": "Berlin"})
        verdict = await judge.evaluate(
            response.chat_messages[0],
            intent="Reports current weather for Berlin"
        )
        assert verdict.success, verdict.reason


@pytest.mark.integration
async def test_weather_mocked():
    """Verify tool calls with mocked implementation."""
    llm = setup_llm(MODEL)
    judge = LLMJudge(gemini.LLM(MODEL))

    async def fake_weather(**_) -> dict:
        return {"temp_f": 55, "condition": "rainy"}

    async with TestSession(llm=llm, instructions=INSTRUCTIONS) as session:
        with session.mock_functions(
            {"get_weather": fake_weather}
        ) as mocked:
            response = await session.simple_response("Weather in Berlin?")

            mocked["get_weather"].assert_called_once()
            mocked["get_weather"].assert_called_with(location="Berlin")

            response.assert_function_output(
                "get_weather",
                output={"temp_f": 55, "condition": "rainy"}
            )

            verdict = await judge.evaluate(
                response.chat_messages[0],
                intent="Reports rainy weather for Berlin"
            )
            assert verdict.success, verdict.reason
```

Run only the tests marked `integration`:

```sh
uv run pytest tests/ -m integration
```
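
`integration` is a custom marker, and pytest emits an unknown-mark warning (or an error under `--strict-markers`) for markers that are not registered. One way to register it is a `pytest_configure` hook in `conftest.py`, sketched below. The marker description text is an assumption, so adjust it to your project.

```python
# conftest.py
def pytest_configure(config):
    """Register the custom marker so `-m integration` runs without warnings."""
    config.addinivalue_line(
        "markers",
        "integration: tests that call a real LLM provider",
    )
```

The same registration can also live under a `markers` entry in `pytest.ini` if you prefer to keep configuration in one file.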

## API reference

### TestSession

| Parameter      | Type  | Description                        |
| -------------- | ----- | ---------------------------------- |
| `llm`          | `LLM` | LLM instance with tools registered |
| `instructions` | `str` | System instructions for the agent  |

**Methods:**

* `simple_response(text: str) -> TestResponse` — Send user text and capture response
* `mock_functions(mocks: dict) -> ContextManager[dict[str, AsyncMock]]` — Mock tools with call tracking

### TestResponse

| Property         | Type                      | Description                   |
| ---------------- | ------------------------- | ----------------------------- |
| `input`          | `str`                     | User input text               |
| `output`         | `str \| None`             | Final assistant message text  |
| `events`         | `list[RunEvent]`          | All captured events, in order |
| `function_calls` | `list[FunctionCallEvent]` | Tool call events              |
| `chat_messages`  | `list[ChatMessageEvent]`  | Message events                |
| `duration_ms`    | `float`                   | Response time in milliseconds |

### LLMJudge

| Parameter | Type  | Description                          |
| --------- | ----- | ------------------------------------ |
| `llm`     | `LLM` | Separate LLM instance for evaluation |

**Methods:**

* `evaluate(event: ChatMessageEvent, intent: str) -> JudgeVerdict` — Evaluate response against intent

### JudgeVerdict

| Property  | Type   | Description                      |
| --------- | ------ | -------------------------------- |
| `success` | `bool` | Whether the intent was fulfilled |
| `reason`  | `str`  | Explanation of the verdict       |

## Next steps

<CardGroup cols={2}>
  <Card title="MCP and function calling" icon="plug" href="/guides/mcp-tool-calling">
    Register tools for your agent
  </Card>

  <Card title="Simple agent example" icon="code" href="/examples/simple-agent">
    Build a basic agent with tools
  </Card>
</CardGroup>