Recently I shipped a slightly unhinged thing: an agentic system that generates and edits complex Form.io-based forms from natural language. Think: "build me an onboarding form for MSME loan applications" and the system orchestrates a bunch of agents to produce the full schema, validations, conditional logic, and data plumbing.
The first version lived the good life on a generous Gemini API setup, where the context window wasn't really a constraint and my worst sin was "eh, just stuff more context in." Then the infra reality hit: I had to move the whole thing to a GPU-hosted model with a 32k token window shared across system prompts, tools, examples, user input, and prior steps.
Overnight, the problem stopped being "how do I make this smart?" and became "how do I fit this?" The original multi-agent orchestration would casually munch through 60–65k tokens of context across calls. The new world needed it to run comfortably under 16k while still understanding a zoo of forms and their interactions.
Version 0: Multi-agent, everything-in-context chaos
The first architecture was classic 2024 agent hype:
- One agent had "global knowledge" of all existing forms, their JSON schemas, and cross-form relationships.
- Another agent handled UX tweaks: labels, help-text, microcopy.
- A third agent validated constraints: required fields, regexes, conditional visibility, data mappings.
- They all talked to each other through a controller that just… kept appending more context.
With Gemini, this was tolerable. I could afford to dump large chunks of form JSON, previous attempts, and some examples into a single call and let the model reason globally. The system was expensive but behaved surprisingly well: strong global coherence, good reuse of existing components, and decent explanations of changes.
The moment I moved to a self-hosted model with a 32k window, this stopped working. Even a single "validate + revise" pass across a handful of big forms would graze 30k tokens, and anything involving multi-form workflows simply refused to fit.
RAG was necessary, but not sufficient
The obvious first move was RAG (Retrieval-Augmented Generation). Take everything you've got—form JSONs, component descriptions, validation rules—embed them into a vector store, and on each request pull back the top few things that look relevant. In the early days this felt almost magical: ask for "a variant of this KYC form" or "add a GST field to this invoice form" and the retriever dutifully surfaces the right neighbors so the model doesn't need to see the whole zoo at once.
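If you haven't wired one of these up before, the retrieval half really is just a few lines. Here's a minimal sketch (not my production code), assuming sentence-transformers for the embeddings and a plain numpy cosine search instead of a real vector store; the corpus strings and model name are made up:

```python
# Minimal RAG sketch: embed form-related docs once, retrieve top-k per request.
# Assumes `sentence-transformers` is installed; the corpus here is illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works

corpus = [
    "Form: kyc_form_v1 | fields: pan, aadhaar, address_proof ...",
    "Form: invoice_form_v2 | fields: gstin, invoice_no, amount ...",
    "Validation rule: PAN must match [A-Z]{5}[0-9]{4}[A-Z] ...",
]
corpus_emb = model.encode(corpus, normalize_embeddings=True)

def retrieve(query: str, k: int = 3) -> list[str]:
    """Return the k corpus chunks most similar to the query."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = corpus_emb @ q               # cosine similarity (vectors are normalized)
    top = np.argsort(-scores)[:k]
    return [corpus[i] for i in top]

print(retrieve("add a GST field to the invoice form"))
```

Swap in whatever embedding model and vector store you already run; the only job here is "give me the top few neighbors for this request."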
But real-world form builders are slightly evil. People don't say "please edit field customer_pan." They say "make this behave like our employee onboarding, but for vendors," which secretly touches three forms, two workflows, and a legacy constraint buried in an example payload from 2022. Constraints are scattered across descriptions, schemas, sample submissions, even stray comments. Cross-form interactions like ID propagation live nowhere in particular. Plain RAG helps you not drown, but it doesn't tell you what to actually show the model for this step without wasting half the window.
Plain RAG kept me under the token limit for some tasks, but for complex workflows it still dragged in way too many chunks. I needed something that didn't just retrieve, but structured what was retrieved.
CAG: Context-Aware Generation (or, RAG with a brain)
So I ended up building a thin layer on top of RAG that I started calling CAG – Context-Aware Generation.
Short version: let RAG dig up the pieces, let CAG decide which tiny slice the model is actually allowed to see.
For the Form.io system, CAG meant admitting that the raw documents weren't the right unit of context. I ended up building a graph of forms, sections, and fields as first-class nodes, with edges for reuse and dependencies. On top of that graph I layered little summaries: one-line stories for each form ("MSME loan v2, 4 sections, 27 fields"), compact blurbs for sections ("KYC: 7 identity fields, 2 address fields, 1 document upload") and canonical descriptions for fields ("PAN: alphanumeric, regex X, reused in 5 forms"). When a task came in, I didn't just ask "what chunks are relevant?" I walked that graph to assemble a tiny dossier: a page or two of structured overview plus a handful of raw JSON examples.
So instead of dumping 30 pages of JSON into the prompt, each step now got a little "context packet"—enough structure for the model to understand the neighborhood it was operating in, plus a couple of real examples to keep it honest. RAG picked the neighborhood; CAG decided what actually made it into the backpack for this leg of the journey.
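To make the "dossier" idea concrete, here's a toy version of that graph walk. Everything below is illustrative: the graph is a hand-written dict, the summaries are invented, and the real thing lives in a database rather than in module-level constants.

```python
# CAG sketch: a tiny form graph plus a packet builder that walks it.
FORM_GRAPH = {
    "loan_application_v3": {
        "summary": "MSME loan v2, 4 sections, 27 fields",
        "sections": {
            "kyc_details": {
                "summary": "KYC: 7 identity fields, 2 address fields, 1 document upload",
                "fields": ["pan", "aadhaar", "address_proof"],
            },
        },
    },
}
FIELD_NOTES = {
    "pan": "PAN: alphanumeric, regex [A-Z]{5}[0-9]{4}[A-Z], reused in 5 forms",
}

def build_context_packet(form_id: str, section_id: str, raw_examples: list[dict]) -> dict:
    """Assemble the small 'dossier' a single task is allowed to see."""
    form = FORM_GRAPH[form_id]
    section = form["sections"][section_id]
    return {
        "form_summary": form["summary"],
        "section_summary": section["summary"],
        "field_notes": [FIELD_NOTES[f] for f in section["fields"] if f in FIELD_NOTES],
        "examples": raw_examples[:2],   # a couple of real JSON snippets, not the whole form
    }
```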
This alone knocked down a lot of waste. But the real problem remained: the overall pipeline was still shaped like "one giant interaction" in the model's eyes. Tokens leaked everywhere. I needed to change the shape of the computation, not just compress its inputs.
New constraint, new architecture: tokens vs calls
With Gemini, my mental model was:
"Every round-trip to the model costs money, so try to do as much as you can in one go." Fewer calls felt like good engineering hygiene. As long as the context window was huge, I could treat it as a cheap, global scratchpad and overstuff it with everything remotely relevant.
On the GPU box, that story died instantly. The GPU is already powered on and bored; one more forward pass is almost free compared to the cost of blowing past 32k tokens and face-planting. The constraint flipped: LLM calls became cheap, and context became the scarce resource you budget for like RAM on an old laptop.
The advice I got (which in hindsight feels obvious) was:
Short version: stop worshipping one fat agent, start shipping lots of small jobs on a queue.
Enter RabbitMQ: turning the agent into a queue of micro-tasks
The only way this was going to work was if the "one big agent" fantasy died too. That click actually came out of a long conversation with Harsh Nisar (co-founder at BharatDigital). I walked into that chat thinking mostly in terms of "smarter agents" and walked out with my first real introduction to message queues and producer–consumer patterns. Harsh patiently walked me through how you break work into messages, let workers pick them up, and wire the whole thing together. Off that nudge, I pulled the system apart and rewired it around RabbitMQ. Instead of one long chat with a very overloaded brain, every user request now gets broken into a bunch of small, well-labeled errands.
There’s even a little requirements_agent sitting in front of the queues now. It takes the big, messy human prompt and, using a mix of prompt tricks and hybrid NER, teases out all the individual asks hiding inside it (add a GST field here, mirror that KYC logic over there, make this section optional for vendors, and so on). Each of those requirements gets its own ID and a bit of metadata and becomes an independent task instead of vibes in a paragraph.
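Conceptually, the requirements_agent boils down to one structured LLM call plus some bookkeeping. A rough sketch, with call_llm standing in for whatever model client you use and the prompt heavily simplified (the real agent also leans on a hybrid NER pass that isn't shown here):

```python
# requirements_agent sketch: split a messy human prompt into atomic requirements.
import json
import uuid

SPLIT_PROMPT = """Split the request below into independent requirements.
Return a JSON array where each item has "text" and "target_hint" keys.

Request: {request}"""

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

def extract_requirements(user_request: str) -> list[dict]:
    raw = call_llm(SPLIT_PROMPT.format(request=user_request))
    return [
        {
            "requirement_id": str(uuid.uuid4()),      # each ask gets its own ID
            "text": item["text"],
            "target_hint": item.get("target_hint"),   # e.g. a form or section name
        }
        for item in json.loads(raw)
    ]
```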
Those tasks are what show up as messages on the queues. A message carries a task_type like draft_section_schema or infer_validations, a few IDs that say "we're talking about this form and this section and this requirement," a tiny CAG-built context packet, and pointers to any previous results stored in the database. Workers listen on their queues, grab one message at a time, do exactly one model call with a very small prompt, write the result back, and acknowledge. All the long-term memory lives in the DB and the messages; the model just gets a little local slice of it for the one decision it needs to make right now.
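The producer side is almost boring, which is the point. A bare-bones sketch with pika, one queue per task_type; connection details are illustrative, error handling is omitted, and the full message shape is shown in the next section:

```python
# Producer sketch: push one micro-task onto the queue named after its task_type.
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

def publish_task(task: dict) -> None:
    queue = task["task_type"]                     # one queue per task type
    channel.queue_declare(queue=queue, durable=True)
    channel.basic_publish(
        exchange="",
        routing_key=queue,
        body=json.dumps(task),
        properties=pika.BasicProperties(delivery_mode=2),  # persist the message
    )

publish_task({
    "task_id": "draft-section-kyc-002",
    "task_type": "draft_section_schema",
    "target_form_id": "loan_application_v3",
    "target_section": "kyc_details",
    "context_packet": {"note": "built by the CAG layer; see the message example below"},
})
```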
Global coherence stopped being the model's job and became the orchestrator's job. The LLM no longer has to remember everything; it just has to not mess up the tiny neighborhood it's currently editing.
What a single micro-task looks like
Here is what a typical RabbitMQ message looks like conceptually (simplified for the blog):
```jsonc
{
  "task_id": "draft-section-kyc-002",
  "task_type": "draft_section_schema",
  "target_form_id": "loan_application_v3",
  "target_section": "kyc_details",
  "context_packet": {
    "section_name": "[...]",        // name of the section
    "change_description": "[...]",  // description of the change
    "component_type": "[...]",      // type of the component
    "component_schema": [...]       // schema for the chosen component
  }
}
```
The worker that owns draft_section_schema doesn't "generate a section from scratch" in one shot. It receives a task, looks at the component_type and the current component_schema, reads the change_description, and then does the boring part: apply the change, regenerate the complete JSON for that component/section, and run validation. If validation passes, it ACKs the message and moves on to the next queued task. If it doesn't, it fixes what broke and tries again (still within the tight, local context of that one task).
Crucially, this call never needs to see the entire world. It doesn't know about every form ever created, just the 1–2 forms/sections that are actually relevant.
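For completeness, here's roughly what that worker loop looks like with pika. The queue mechanics are real; apply_change, validate_schema, and save_result are placeholders for the single small model call, the schema check, and the DB write:

```python
# Worker sketch for draft_section_schema: one message, one small model call,
# validate, write back, ack.
import json
import pika

def apply_change(task: dict) -> dict:
    """One tightly-scoped LLM call: regenerate just this component/section."""
    raise NotImplementedError

def validate_schema(schema: dict) -> bool:
    raise NotImplementedError

def save_result(task_id: str, schema: dict) -> None:
    raise NotImplementedError

def on_message(ch, method, properties, body):
    task = json.loads(body)
    for _ in range(3):                      # local retry budget, same tiny context
        schema = apply_change(task)
        if validate_schema(schema):
            save_result(task["task_id"], schema)
            ch.basic_ack(delivery_tag=method.delivery_tag)
            return
    # give up; lands in a dead-letter queue if one is configured
    ch.basic_nack(delivery_tag=method.delivery_tag, requeue=False)

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="draft_section_schema", durable=True)
channel.basic_qos(prefetch_count=1)         # one task at a time per worker
channel.basic_consume(queue="draft_section_schema", on_message_callback=on_message)
channel.start_consuming()
```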
Validation, repair, and the 3-pass context bomb
One of the nastiest hot spots wasn't even generation, it was validation and repair. The old design would:
- Generate a draft form.
- Run a "validation + suggestions" pass over the entire form.
- Run a "repair" pass that again saw the whole form + the validation commentary.
In practice, that meant the same bloated context was being fed back into the model up to 3 times. Every loop dragged along the full form JSON, prior errors, and explanations. On the Gemini setup this was just "a bit expensive." On the 32k window GPU, it was a context bomb: three passes was enough to blow past the limit for any realistically complex form.
In the queue-shaped world, validation and repair stopped being a monologue and turned into housekeeping. The controller emits lots of tiny validate_component or validate_section tasks, each seeing only a small patch of the form plus its immediate constraints. Anything that fails validation gets its own repair_component task. The model still ends up validating and fixing the whole form, but it does it in slices. Nothing ever has to lug the full JSON around three times just because one field decided to be weird.
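A sketch of that fan-out, with publish_task being the same helper as in the producer sketch earlier, and iter_components standing in for whatever walks a Form.io schema in your system:

```python
# Controller sketch: fan a form out into per-component validation tasks and
# turn each failure into its own repair task, so no single call carries the
# whole form.
def publish_task(task: dict) -> None:
    ...  # wire this to the RabbitMQ publish helper shown earlier

def iter_components(form_schema: dict):
    """Yield (section_key, component) pairs from a Form.io-style schema."""
    for section in form_schema.get("components", []):
        for component in section.get("components", []):
            yield section.get("key", "unknown"), component

def enqueue_validation(form_id: str, form_schema: dict) -> None:
    for section_id, component in iter_components(form_schema):
        publish_task({
            "task_id": f"validate-{form_id}-{component['key']}",
            "task_type": "validate_component",
            "target_form_id": form_id,
            "target_section": section_id,
            "context_packet": {"component_schema": component},  # just this patch
        })

def on_validation_failure(form_id: str, section_id: str,
                          component: dict, errors: list[str]) -> None:
    publish_task({
        "task_id": f"repair-{form_id}-{component['key']}",
        "task_type": "repair_component",
        "target_form_id": form_id,
        "target_section": section_id,
        "context_packet": {"component_schema": component, "errors": errors},
    })
```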
Edit mode: new vs existing components
The other huge offender was edit mode. When a user wanted to tweak an existing form ("add a GST field here", "make the PAN optional", "move this section below KYC"), the naive thing I did at first was: send the current_form JSON in full, plus the edit instruction, and ask the model to "apply the diff." Same story as before — every small edit paid for the entire form in tokens.
The fix was to make edit-mode itself conditional and local. When edit_mode is activated, the pipeline first runs a lightweight classification step:
Short version: take the edit prompt once, let the LLM quietly sort every instruction into "new component" or "existing component", and only then touch the JSON.
Once that decision is made, the rest of the flow is surprisingly easy. If the instruction is about a new component, the system quietly figures out where it should go (usually by grabbing the immediate previous and next elements in that section), asks the model to generate just the JSON for the newcomer, and then splices it into place. If it's about an existing component, it pulls only that component (or that small group), lets the model rewrite or repair it, and then drops the updated JSON back in.
Real users of course don't respect this neat separation; they'll say "add a new toggle here and also fix that label" in one breath. Internally that just becomes two little to-do lists: one for "things to add," one for "things to touch." Each list turns into its own batch of RabbitMQ tasks. Edit mode stopped being "send the whole form and hope the model patches it correctly" and turned into a structured diff engine that only ever ships the few components that actually need attention.
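A rough sketch of that classify-then-route step. As before, call_llm is a stand-in for your model client, and the task_type names here are illustrative rather than the exact ones in my pipeline:

```python
# Edit-mode sketch: classify each extracted instruction as touching a new or an
# existing component, then split them into the two batches described above.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

def classify_instruction(instruction: str, existing_keys: list[str]) -> str:
    prompt = (
        "Existing component keys: " + ", ".join(existing_keys) + "\n"
        f"Instruction: {instruction}\n"
        "Answer with exactly one word: new or existing."
    )
    return call_llm(prompt).strip().lower()

def route_edits(instructions: list[str],
                existing_keys: list[str]) -> tuple[list[dict], list[dict]]:
    add_tasks, touch_tasks = [], []          # "things to add" vs "things to touch"
    for text in instructions:
        kind = classify_instruction(text, existing_keys)
        task = {
            "task_type": "generate_new_component" if kind == "new" else "edit_existing_component",
            "instruction": text,
        }
        (add_tasks if kind == "new" else touch_tasks).append(task)
    return add_tasks, touch_tasks
```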
Numbers: from >60k tokens to <16k
Once the queue architecture was in place, the token story changed a lot:
- The "everything in one go" version would easily cross 60–65k tokens of accumulated context across a realistic multi-form task.
- With RAG+CAG but without queues, I could push that down, but complex flows still hovered around the upper 30ks.
- With RabbitMQ-style micro-tasks:
- Most individual calls stay within 0.5–2k tokens total (prompt + context + output).
- Even the "heavier" steps sit safely under 10–12k.
- The hard upper bound per call is now well under the 16k budget.
The system as a whole still "sees" a lot of information over time, but no single LLM invocation is asked to juggle everything at once. The GPU doesn't care that there are 20 calls instead of 5 as long as each call fits in the window.
Why a message queue beats a giant agent here
A few things became very obvious after the rewrite:
- Observability got easier. Each task type has narrow responsibilities and metrics:
- average tokens per call,
- average latency,
- failure rate (invalid JSON, violated constraints, etc.).
- Backpressure is natural. If validation starts lagging, messages pile up in that specific queue. I don't need to guess where the "agent is stuck."
- Retries are cheap. A failed micro-task can be retried with the same or slightly tweaked context without re-running the whole orchestration.
- Concurrency is safe. Different sections or even different forms can be processed in parallel without blowing the context budget on any one call.
And philosophically, it aligns with how I now think about LLM systems in general:
Short version: don't build one massive agent with vibes—build a boring little fleet of workers that only ever think locally.
Lessons learned (and what I'd do differently next time)
If I had to rebuild this from scratch, I'd start from the constraints instead of discovering them halfway through. Treat the token ceiling like a hard law of physics, not a suggestion. Treat plain RAG as table stakes and plan for some kind of CAG layer from day one, because "what to show the model for this specific step" turns out to be most of the real work. And I'd let the infra dictate the abstractions earlier: if your environment makes small calls cheap and big contexts painful, don't fight it out of purity. Embrace the fact that a queue of boring little jobs will usually beat one majestic agent chatting with itself.
The side effect of all this is that the system now looks less like an AI research toy and more like any other production backend: logs you can grep, queues you can stare at, DB tables that tell you what actually happened. The "intelligence" sits behind lots of small, inspectable steps instead of one giant prompt you pray over.
P.S. From the outside this won't look clever. You'll type a messy one-liner about "make this like that," hit enter, and something quiet will shuffle tasks through queues until the JSON behaves. All the queuing and token gymnastics exist so that part feels boring (in a good way). The intern from the last blog helped draft this one too, I just made it queue things properly.