Building in Public

Sanjaya

RLMs for Video Understanding

Prathamesh Sarang
@pratos
ML Engineer & solo dev. Building agentic AI systems. Former invideo.ai.
THREAD

Spent the last few weeks going deep on RLMs — and I finally have something worth sharing. This is the story of Sanjaya, my RLM for video understanding.

The why
Content work + cost constraints
I was doing a lot of content work — sequencd, research across client projects at agencies, slide decks, articles, video processing. That last one was interesting: we built a video pipeline to reliably extract segments of people talking from long videos.

Cost was always a constraint. And I was fascinated by RLMs — the idea that instead of prompting a model to answer, you prompt it to write a program that answers.
First attempt
The editorial agent (typical agent, not RLM)
The first version was an "editorial agent" — but it ended up as a typical tool-calling agent, not truly an RLM.

I ripped out the REPL implementation and replaced it with monty (from Pydantic). The next idea was to just collate research via RLMs — the classic RLM video example — but with Pydantic AI + monty.
The twist
Monty's limitation sparked an idea
Monty's limitation — it's not a sandbox like Daytona or Codespaces — gave me an idea for the next version.

Since I had figured it out just a day before the Codex Community Hackathon, I didn't plan to present it.
The epiphany
On the way to the hackathon
While I was travelling to the venue, chatting with ChatGPT, something clicked. Prime Intellect had posted about RLMs. Multi-model integration was the one thing not fully figured out. I searched for "long context video problems", found a few papers, and decided to vibe research/code using Codex that day.

Surprisingly, it worked — for 15-20 minute videos — though it still wasn't very "RLMish".
After the hackathon
2-3 days of iteration
Hacked on it for 2-3 more days, a few hours each session. Got to a version that actually works.

Now, after running a "poor man's GEPA" on 12 prompts, it feels solid.
Core insight

RLMs aren't just tool-calling agents. The model writes programs — with real variables, loops, branching. The transcript isn't dumped in the prompt. It's accessed like a real variable, through code the model wrote itself.

Here's the RLM loop in action — Sanjaya writing its own visualization strategy:

from sanjaya import Agent

agent = Agent(model="openrouter:openai/gpt-5.3-codex")
answer = agent.ask(
    "What scenes show people talking in this video?",
    video="path/to/video.mp4"
)
print(answer.text)
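Under the hood, an RLM loop roughly alternates between the model emitting code and a sandbox executing it, with execution results fed back into the model's context until it can answer. Here's a toy sketch of that loop — every name in it (`model_generate`, `run_in_sandbox`) is a hypothetical stand-in, not Sanjaya's or monty's actual API:

```python
# Toy RLM loop: the model writes a program, we execute it, and the
# result goes back into the model's context until it produces an answer.
# model_generate and run_in_sandbox are illustrative placeholders.

def model_generate(history):
    # Stand-in for an LLM call that returns either code to run or a
    # final answer. Here we hardcode a two-step trace for illustration.
    if not any(h["role"] == "tool" for h in history):
        return {"type": "code", "body": "windows = [(0, 60), (60, 120)]"}
    return {"type": "answer", "body": "People talk in the first two minutes."}

def run_in_sandbox(code):
    # Stand-in for executing model-written code (e.g. via an interpreter
    # like monty) and capturing the resulting variables.
    namespace = {}
    exec(code, namespace)
    return {k: v for k, v in namespace.items() if not k.startswith("__")}

def rlm_loop(question, max_steps=4):
    history = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        step = model_generate(history)
        if step["type"] == "answer":
            return step["body"]
        result = run_in_sandbox(step["body"])
        history.append({"role": "tool", "content": repr(result)})
    return None

answer = rlm_loop("What scenes show people talking?")
```

The point of the loop: code outputs become first-class context, so the model reasons over variables it created rather than over a transcript dump.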

The model decides the strategy. Sanjaya doesn't tell it how to search — the model writes the code to do it:

windows = list_windows("people talking")  # Sanjaya's idea
clips = [extract_clip(w) for w in windows]
results = vision_query_batched(clips)

Progressive scanning means the model tracks what it's already seen. Each call only returns fresh segments. No redundant processing, no context overflow.
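The progressive-scanning idea can be sketched as a scanner that remembers what it has handed out. This is a minimal illustration with made-up names (`ProgressiveScanner`, `Segment`) — not Sanjaya's actual implementation:

```python
# Sketch of progressive scanning: the scanner tracks segments it has
# already returned, so each call yields only fresh ones — no redundant
# processing, no re-feeding old segments into context.
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    start: float  # seconds
    end: float

class ProgressiveScanner:
    def __init__(self, segments):
        self._segments = segments          # all segments in the video
        self._seen: set[Segment] = set()   # segments already handed out

    def fresh(self, limit=2):
        """Return up to `limit` segments not seen in earlier calls."""
        out = []
        for seg in self._segments:
            if seg not in self._seen:
                self._seen.add(seg)
                out.append(seg)
                if len(out) == limit:
                    break
        return out

scanner = ProgressiveScanner([Segment(0, 60), Segment(60, 120), Segment(120, 180)])
first = scanner.fresh()   # first two unseen segments
second = scanner.fresh()  # only the remaining one; nothing repeats
```

Once the scanner is exhausted, `fresh()` returns an empty list — a natural stopping signal for the model's loop.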

~20-minute videos · 12 test prompts · 4 core tools
"Not a tool-calling agent. A model that writes programs."
— the RLM thesis

Named it Sanjaya — from the Mahabharata. He had divine sight and narrated the war to a blind king. An RLM for video has divine sight over time — it can narrate what's in a video, finding what matters, whenever you ask.

Code: github.com/pratos/video-rlm
Docs: 22c055bb.sanjaya-x-article.pages.dev

#RLM #VideoUnderstanding #AgenticAI #OpenSource #BuildingInPublic