Most agents need access to your content and documentation: URLs, markdown docs, PDFs, and other material. Giving your agent access to this information is called RAG (retrieval-augmented generation). Best practices for RAG are complex and go beyond the scope of the vision agents project, so we recommend two options for getting started:
  • Option 1: Gemini File Search. A high-level RAG service that handles the complexity for you.
  • Option 2: Turbopuffer. A very efficient database for building your own RAG, giving you full control.
For a full example, see examples/03_phone_and_rag_example.

1. Easy RAG with Gemini

Gemini’s File Search is the easiest way to add RAG to your agent. It handles chunking, embedding, and retrieval automatically.

Using the wrapper

from vision_agents.plugins import gemini

# Create and populate a file search store
store = gemini.GeminiFilesearchRAG(name="my-knowledge-base")
await store.create()  # Reuses existing store if found
await store.add_directory("./knowledge")  # Skips duplicates via content hash

# Use with GeminiLLM
llm = gemini.LLM(
    model="gemini-2.5-flash",
    tools=[gemini.tools.FileSearch(store)]
)
The wrapper provides:
  • Store reuse: Automatically finds and reuses existing stores with the same name
  • Content deduplication: Skips uploading files that already exist (via SHA-256 hash)
  • Batch uploads: Uploads multiple files concurrently

2. RAG with Turbopuffer

Turbopuffer example

Here’s an example that uses Turbopuffer with vector & BM25 search.
from vision_agents.plugins import turbopuffer, gemini

# Initialize TurboPuffer RAG with hybrid search
rag = turbopuffer.TurboPufferRAG(
    namespace="my-knowledge",
    chunk_size=10000,  # Larger chunks = more context
    chunk_overlap=200,
)
await rag.add_directory("./knowledge")

# Create LLM with function calling
llm = gemini.LLM("gemini-2.5-flash")

@llm.register_function(description="Search the knowledge base")
async def search_knowledge(query: str) -> str:
    return await rag.search(query, top_k=5, mode="hybrid")

Understanding RAG

Sooner or later you’ll want full control over your RAG pipeline, and RAG can get pretty complex. Let’s go over what a typical RAG pipeline looks like:

1. Gathering documents

First, you have to gather documents from URLs, folders, images, PDFs, and external APIs (Slack, Notion, etc.).
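For the simplest case, gathering local files, a small helper is enough. This is a minimal sketch (the function name and extension list are illustrative, not part of vision agents):

```python
from pathlib import Path

# Hypothetical extension filter; adjust to the formats your pipeline supports.
SUPPORTED = {".md", ".txt", ".pdf"}

def gather_documents(root: str) -> list[Path]:
    """Recursively collect files under `root` with a supported extension."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )
```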

2. Parsing/enriching documents

Images, PDFs, and URLs all need some parsing before they can be used. Tools like markdownify, Beautiful Soup, and WebBaseLoader come in handy for URLs. For OCR, see the OCR benchmark: https://huggingface.co/spaces/ling99/OCRBench-v2-leaderboard
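To show what parsing does at its core, here is a stdlib-only sketch that strips HTML down to plain text (in practice you would reach for markdownify or Beautiful Soup, which handle far more edge cases):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text content, skipping <script> and <style> blocks."""
    def __init__(self):
        super().__init__()
        self.parts: list[str] = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)
```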

3. Chunking & Contextual retrieval

Large documents need to be split into smaller chunks for effective retrieval. Common strategies:
  • Fixed size: Split every N characters with overlap
  • Semantic: Split at sentence or paragraph boundaries
  • Recursive: Try multiple separators (paragraphs → sentences → words)
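The fixed-size strategy is the simplest of the three and is what the `chunk_size`/`chunk_overlap` parameters above control. A minimal sketch:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Fixed-size chunking: slide a window of `chunk_size` characters,
    stepping forward by (chunk_size - overlap) so neighbouring chunks
    share `overlap` characters of context."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

The overlap matters: without it, a sentence split across a chunk boundary would be unretrievable from either chunk.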

4. Embedding

Next, you need a way to translate text into an embedding. An embedding is a vector representation of a text’s meaning: semantically similar texts map to nearby vectors. The leaderboard for embedding models is available here: https://huggingface.co/spaces/mteb/leaderboard
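The embeddings themselves come from a model API, but "nearby vectors" is usually measured with cosine similarity, which you can compute with the stdlib alone:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors.
    1.0 means same direction (similar meaning), 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```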

5. Vector database

Next, you want to store these embeddings in a vector database. One of the most innovative options in the space is Turbopuffer: https://turbopuffer.com/docs/hybrid
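Conceptually, a vector search is a nearest-neighbour lookup: score every stored vector against the query and keep the best matches. Real vector databases use approximate indexes to make this fast at scale, but a brute-force sketch shows the idea:

```python
import math

def top_k(query: list[float], index: dict[str, list[float]], k: int = 3) -> list[tuple[str, float]]:
    """Brute-force nearest-neighbour search: rank every stored vector by
    cosine similarity to the query and return the k best (id, score) pairs."""
    def cos(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

    scored = ((doc_id, cos(query, vec)) for doc_id, vec in index.items())
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]
```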

6. Combined queries

The best practice is to combine full-text and vector search. The Turbopuffer guide on hybrid search is a good starting point: https://turbopuffer.com/docs/hybrid. It’s also common to use AI to create different variations of the original search query text: https://developers.llamaindex.ai/python/examples/query_transformations/query_transform_cookbook/

7. Reranking

When you gather the results of vector and full-text search, you typically want to rerank (or summarize) them into a single list before handing them to the LLM.
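A common merge strategy is Reciprocal Rank Fusion (RRF), which the Turbopuffer wrapper below also uses: each document scores the sum of 1 / (k + rank) over every result list it appears in, so documents ranked highly by both search methods rise to the top. A minimal sketch:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked result lists with Reciprocal Rank Fusion.
    `k` dampens the influence of top ranks; 60 is the value from the
    original RRF paper."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```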

Advanced RAG example with Turbopuffer

The Turbopuffer RAG provides:
  • Hybrid search: Combines vector (semantic) and BM25 (keyword) search
  • Reciprocal Rank Fusion: Merges results from both search methods
  • Configurable chunking: Control chunk size and overlap
You can see the full phone + RAG example in the repo: examples/03_phone_and_rag_example

Choosing between Gemini and Turbopuffer

| Feature          | Gemini File Search   | Turbopuffer                  |
| ---------------- | -------------------- | ---------------------------- |
| Setup complexity | Simple               | More setup                   |
| Chunking         | Automatic            | Configurable                 |
| Search type      | Managed              | Hybrid (vector + BM25)       |
| Control          | Less                 | Full control                 |
| Cost             | Included with Gemini | Separate service             |
| Best for         | Quick prototypes     | Production with custom needs |