Building in Public

Sanjaya

RLMs for Video Understanding

Prathamesh Sarang
@pratos
ML Engineer & solo dev. Building agentic AI systems. Former invideo.ai.
THREAD

Spent the last few weeks going deep on RLMs — and I finally have something worth sharing. This is the story of Sanjaya, my RLM for video understanding.

The why
Content work + cost constraints
I was doing a lot of content work — sequencd, research across client projects at agencies, slide decks, articles, video processing. That last one was interesting: we built a video pipeline to reliably extract segments of people talking from long videos.

Cost was always a constraint. And I was fascinated by RLMs — the idea that instead of prompting a model to answer, you prompt it to write a program that answers.
First attempt
The editorial agent (typical agent, not RLM)
The first version was an "editorial agent" — but it ended up as a typical tool-calling agent, not truly an RLM.

I ripped out the REPL implementation and replaced it with monty (from Pydantic). The next idea was to just collate research via RLMs — the classic RLM video example — but with Pydantic AI + monty.
The twist
Monty's limitation sparked an idea
Monty's limitation — it's not a sandbox like Daytona or Codespaces — gave me an idea for the next version.

Since I had figured it out just a day before the Codex Community Hackathon, I didn't plan to present it.
The epiphany
On the way to the hackathon
While I was travelling to the venue, chatting with ChatGPT, something clicked. Prime Intellect had posted about RLMs. Multi-model integration was the one thing not fully figured out. I searched for "long context video problems", found a few papers, and decided to vibe research/code using Codex that day.

Surprisingly, it worked — for 15-20 minute videos — though it still wasn't very "RLMish".
After the hackathon
2-3 days of iteration
Hacked on it for 2-3 more days, a few hours each session. Got to a version that actually works.

Now, after running a "poor man's GEPA" on 12 prompts, it feels solid.
Core insight

RLMs aren't just tool-calling agents. The model writes programs — with real variables, loops, branching. The transcript isn't dumped in the prompt. It's accessed like a real variable, through code the model wrote itself.

Here's the RLM loop in action — Sanjaya writing its own visualization strategy:

from sanjaya import Agent

agent = Agent(model="openrouter:openai/gpt-5.3-codex")
answer = agent.ask(
    "What scenes show people talking in this video?",
    video="path/to/video.mp4"
)
print(answer.text)
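Under the hood, an RLM loop roughly alternates between the model emitting code and a sandbox executing it, with execution results fed back into the model's context until it can answer. Here's a toy sketch of that loop — every name in it (`model_generate`, `run_in_sandbox`) is a hypothetical stand-in, not Sanjaya's or monty's actual API:

```python
# Toy RLM loop: the model writes a program, we execute it, and the
# result goes back into the model's context until it produces an answer.
# model_generate and run_in_sandbox are illustrative placeholders.

def model_generate(history):
    # Stand-in for an LLM call that returns either code to run or a
    # final answer. Here we hardcode a two-step trace for illustration.
    if not any(h["role"] == "tool" for h in history):
        return {"type": "code", "body": "windows = [(0, 60), (60, 120)]"}
    return {"type": "answer", "body": "People talk in the first two minutes."}

def run_in_sandbox(code):
    # Stand-in for executing model-written code (e.g. via an interpreter
    # like monty) and capturing the resulting variables.
    namespace = {}
    exec(code, namespace)
    return {k: v for k, v in namespace.items() if not k.startswith("__")}

def rlm_loop(question, max_steps=4):
    history = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        step = model_generate(history)
        if step["type"] == "answer":
            return step["body"]
        result = run_in_sandbox(step["body"])
        history.append({"role": "tool", "content": repr(result)})
    return None

answer = rlm_loop("What scenes show people talking?")
```

The point of the loop: code outputs become first-class context, so the model reasons over variables it created rather than over a transcript dump.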

The model decides the strategy. Sanjaya doesn't tell it how to search — the model writes the code to do it:

windows = list_windows("people talking")  # Sanjaya's idea
clips = [extract_clip(w) for w in windows]
results = vision_query_batched(clips)

Progressive scanning means the model tracks what it's already seen. Each call only returns fresh segments. No redundant processing, no context overflow.
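The progressive-scanning idea can be sketched as a scanner that remembers what it has handed out. This is a minimal illustration with made-up names (`ProgressiveScanner`, `Segment`) — not Sanjaya's actual implementation:

```python
# Sketch of progressive scanning: the scanner tracks segments it has
# already returned, so each call yields only fresh ones — no redundant
# processing, no re-feeding old segments into context.
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    start: float  # seconds
    end: float

class ProgressiveScanner:
    def __init__(self, segments):
        self._segments = segments          # all segments in the video
        self._seen: set[Segment] = set()   # segments already handed out

    def fresh(self, limit=2):
        """Return up to `limit` segments not seen in earlier calls."""
        out = []
        for seg in self._segments:
            if seg not in self._seen:
                self._seen.add(seg)
                out.append(seg)
                if len(out) == limit:
                    break
        return out

scanner = ProgressiveScanner([Segment(0, 60), Segment(60, 120), Segment(120, 180)])
first = scanner.fresh()   # first two unseen segments
second = scanner.fresh()  # only the remaining one; nothing repeats
```

Once the scanner is exhausted, `fresh()` returns an empty list — a natural stopping signal for the model's loop.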

~20-minute videos · 12 test prompts · 4 core tools
"Not a tool-calling agent. A model that writes programs."
— the RLM thesis

Named it Sanjaya — from the Mahabharata. He had divine sight and narrated the war to a blind king. An RLM for video has divine sight over time — it can narrate what's in a video, finding what matters, whenever you ask.

Code: github.com/pratos/video-rlm
Docs: 22c055bb.sanjaya-x-article.pages.dev

#RLM #VideoUnderstanding #AgenticAI #OpenSource #BuildingInPublic