RLMs for Video Understanding
Spent the last few weeks going deep on RLMs — and I finally have something worth sharing. This is the story of Sanjaya, my RLM for video understanding.
The stack started with monty (from Pydantic). The first idea was to recreate the classic RLM example, collating research via RLMs, but built on Pydantic AI + monty.
RLMs aren't just tool-calling agents. The model writes programs, with real variables, loops, and branching. The transcript isn't dumped into the prompt; it's accessed like a real variable, through code the model wrote itself.
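To make that concrete, here is a minimal sketch of the core idea: the model emits a Python program, and the runtime executes it in an environment where the transcript is just a variable. This is not Sanjaya's actual implementation; `llm_write_code` is a hypothetical stand-in for a real model call, hard-coded here so the example runs.

```python
def llm_write_code(question: str) -> str:
    # Stand-in for a real model call: an actual RLM would generate
    # this program itself, conditioned on the question.
    return (
        "hits = [seg for seg in transcript if 'talking' in seg['text']]\n"
        "result = [seg['start'] for seg in hits]"
    )

def run_rlm(question: str, transcript: list[dict]) -> list[float]:
    code = llm_write_code(question)
    # The transcript is exposed as a variable in the execution
    # environment, never pasted into the prompt.
    env = {"transcript": transcript}
    exec(code, env)
    return env["result"]

transcript = [
    {"start": 0.0, "text": "intro music"},
    {"start": 12.5, "text": "two people talking at a table"},
]
print(run_rlm("What scenes show people talking?", transcript))  # [12.5]
```

The point of the shape: the program, not the prompt, decides which parts of the transcript get touched.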
Here's the RLM loop in action. From the outside it's one call; inside, Sanjaya writes its own search strategy:
from sanjaya import Agent

agent = Agent(model="openrouter:openai/gpt-5.3-codex")
answer = agent.ask(
    "What scenes show people talking in this video?",
    video="path/to/video.mp4",
)
print(answer.text)
The model decides the strategy. Sanjaya doesn't tell it how to search — it writes the code to do it:
windows = list_windows("people talking")    # Sanjaya's idea, not a scripted step
clips = [extract_clip(w) for w in windows]  # pull only the matching windows
results = vision_query_batched(clips)       # batch the vision calls over those clips
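For tools like `list_windows` to be callable from model-written code, they have to live in the namespace the sandbox executes against. Here is one hedged way that could look; every name below is illustrative, not Sanjaya's actual API, and the tool body is a placeholder rather than a real video search.

```python
TOOLS = {}

def tool(fn):
    """Register a function so model-written code can call it by name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def list_windows(query: str) -> list[tuple[float, float]]:
    # Placeholder: a real implementation would search a video index
    # for windows matching the query.
    return [(10.0, 40.0), (95.0, 120.0)]

def run_model_code(code: str) -> dict:
    # From the model's point of view, tools are just variables
    # already in scope.
    env = dict(TOOLS)
    exec(code, env)
    return env

env = run_model_code("windows = list_windows('people talking')")
print(env["windows"])  # [(10.0, 40.0), (95.0, 120.0)]
```

The design choice worth noting: the model never sees a tool schema, only names it can call like ordinary functions.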
Progressive scanning means the model tracks what it's already seen. Each call only returns fresh segments. No redundant processing, no context overflow.
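One way to implement that bookkeeping is a cursor that remembers the furthest point already scanned, so each call hands out only unseen windows. A minimal sketch, assuming fixed-length windows; names and parameters are illustrative, not Sanjaya's actual interface.

```python
class ProgressiveScanner:
    def __init__(self, duration: float, window: float = 30.0):
        self.duration = duration  # total video length, seconds
        self.window = window      # segment length, seconds
        self.cursor = 0.0         # end time of the last segment handed out

    def next_segments(self, budget: int = 4) -> list[tuple[float, float]]:
        """Return up to `budget` fresh (start, end) windows; never repeats."""
        segments = []
        while self.cursor < self.duration and len(segments) < budget:
            end = min(self.cursor + self.window, self.duration)
            segments.append((self.cursor, end))
            self.cursor = end
        return segments

scanner = ProgressiveScanner(duration=100.0)
print(scanner.next_segments())  # [(0.0, 30.0), (30.0, 60.0), (60.0, 90.0), (90.0, 100.0)]
print(scanner.next_segments())  # [] -- everything has been seen
```

Because the cursor only moves forward, no segment is processed twice and the context handed to the model stays bounded per call.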
Named it Sanjaya — from the Mahabharata. He had divine sight and narrated the war to a blind king. An RLM for video has divine sight over time — it can narrate what's in a video, finding what matters, whenever you ask.
Code: github.com/pratos/video-rlm
Docs: 22c055bb.sanjaya-x-article.pages.dev