Building a Multi-Tenant RAG Agent Platform with Go, Qdrant & the Vercel AI Gateway
How Nexora ingests PDFs, DOCX, websites and Q&A pairs, chunks and embeds them with OpenAI's text-embedding-3-small, stores 1536-dim vectors per-agent in Qdrant, and streams grounded answers via SSE. Real architecture: Go + Gin + GORM + Asynq workers + Vercel AI Gateway.
Last updated: May 2026 · By JB (Muke Johnbaptist) — architecture lifted from the Nexora repo I shipped this year.
A SaaS RAG platform isn't "stuff documents into a vector DB and call OpenAI". It's a half-dozen background jobs, three storage layers, careful tenant isolation, and a streaming response that's actually grounded in the right chunks. Get any one of those wrong and you ship a chatbot that hallucinates customer-support answers.
This guide is the full architecture behind Nexora, a multi-tenant RAG agent builder where every customer gets their own agents, their own knowledge bases, and their own per-agent vector collections. Go + Gin on the backend, Qdrant for vectors, Asynq for jobs, and the Vercel AI Gateway so the model layer is one config flip away from switching between Claude, Gemini, and GPT.
If you've read the Vercel AI Gateway setup and want to know what a real production RAG looks like sitting on top of it, this is the post.
TL;DR — what you're getting
- A multi-tenant RAG architecture: per-agent Qdrant collections, workspace-scoped APIs, credit-metered ingestion.
- Ingestion pipeline that handles PDF, DOCX, CSV, XLSX, URLs (crawled), and direct Q&A pairs.
- Chunking at 2048 chars with 200-char overlap on sentence boundaries.
- Embeddings via openai/text-embedding-3-small (1536 dims, cosine), batched 20 at a time.
- Streaming answers over SSE with a session → chunks → message → done event sequence.
- Background jobs in Asynq so ingestion never blocks the API.
- The whole model layer goes through the Vercel AI Gateway — swap providers with one string.
The big picture
┌─────────────────────┐
│ Admin UI (Next.js)  │  upload PDF, paste URL, add Q&A
└──────────┬──────────┘
           │ POST /api/agents/:id/sources
           ▼
┌─────────────────────┐
│ Gin HTTP API (Go)   │  validates, persists KnowledgeSource(status=PENDING)
└──────────┬──────────┘
           │ enqueue("ingest:source:123")
           ▼
┌─────────────────────┐
│ Asynq Worker (Go)   │
│ 1. download/parse   │
│ 2. chunk            │
│ 3. embed (batch)    │
│ 4. upsert Qdrant    │
│ 5. mark READY       │
└─────────────────────┘

                  ┌─────────────────────┐
End-user chat →   │ POST /chat (SSE)    │
                  │ 1. embed query      │
                  │ 2. Qdrant top-K     │
                  │ 3. streamText()     │
                  │ 4. emit events      │
                  └─────────────────────┘
The split matters: ingestion is async, chat is sync-streaming. They share Qdrant but never block each other.
Stack
| Layer | Choice | Why |
|---|---|---|
| API | Go + Gin | Type-safe, deploys as a single binary, easy to ship to a VPS |
| ORM | GORM on PostgreSQL | Migrations + tenant scoping |
| Vectors | Qdrant | Per-tenant collections, cheap cloud tier, gRPC + REST |
| Jobs | Asynq on Redis | Retries, schedulers, dashboard |
| Embeddings | openai/text-embedding-3-small via Vercel AI Gateway | 1536 dims, cheapest mainstream embed |
| Chat models | anthropic/claude-sonnet-4.6 (default), openai/gpt-5.4-nano (cheap routes) | Mix per workspace |
| Parsing | ledongthuc/pdf (PDF), xuri/excelize (XLSX), gocolly/colly (URL) | Battle-tested Go libs |
| Auth | JWT (15m access / 7d refresh) + Google/GitHub OAuth | Cookie-based, httpOnly |
| Hosting | Contabo VPS + Docker via Dokploy | Cheap, simple, full control |
Step 1 — The data model
// models/agent.go
type Agent struct {
	ID           string `gorm:"primaryKey"`
	WorkspaceID  string `gorm:"index"`
	Name         string
	Model        string // e.g. "anthropic/claude-sonnet-4.6"
	SystemPrompt string
	Temperature  float32
	CreatedAt    time.Time
}

type KnowledgeSource struct {
	ID         string `gorm:"primaryKey"`
	AgentID    string `gorm:"index"`
	Type       SourceType // PDF | DOCX | URL | QNA | CSV | XLSX
	Name       string
	Size       int64
	Status     SourceStatus // PENDING | PROCESSING | READY | FAILED
	StatusStep string       // "downloading" | "chunking" | "embedding" | "indexing"
	Error      *string
	ChunkCount int
	CreatedAt  time.Time
}

type Workspace struct {
	ID           string
	Plan         PlanTier // FREE | STARTER | GROWTH | AGENCY
	Credits      int      // ingestion credits
	StorageBytes int64
}

The Status + StatusStep enums are the single source of truth the UI polls. Don't try to be clever: your users want to see "Chunking page 12 of 88", not a spinner.
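The model above references SourceStatus without defining it. Here is a minimal sketch of how it could look, assuming a string-backed type. The `allowed` table and the `canTransition` helper are hypothetical names of mine, and the FAILED → PENDING retry rule is an assumption, not something taken from the Nexora repo.

```go
package main

import "fmt"

// SourceStatus mirrors the PENDING | PROCESSING | READY | FAILED values
// on KnowledgeSource. String-backed so rows stay readable in Postgres.
type SourceStatus string

const (
	StatusPending    SourceStatus = "PENDING"
	StatusProcessing SourceStatus = "PROCESSING"
	StatusReady      SourceStatus = "READY"
	StatusFailed     SourceStatus = "FAILED"
)

// allowed lists the transitions this sketch treats as legal.
// FAILED -> PENDING models a manual retry (an assumption).
var allowed = map[SourceStatus][]SourceStatus{
	StatusPending:    {StatusProcessing, StatusFailed},
	StatusProcessing: {StatusReady, StatusFailed},
	StatusFailed:     {StatusPending},
	StatusReady:      {},
}

// canTransition reports whether a source may move from one status to another.
func canTransition(from, to SourceStatus) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(StatusPending, StatusProcessing))
	fmt.Println(canTransition(StatusReady, StatusProcessing))
}
```

Checking transitions in one place keeps a late or retried worker from flipping a READY source back to PROCESSING while the UI is polling it.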
Step 2 — Per-agent Qdrant collections (the tenancy boundary)
When an agent is created, immediately create a Qdrant collection for it:
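// services/qdrant_service.go
func (s *QdrantService) EnsureAgentCollection(ctx context.Context, agentID string) error {
	name := fmt.Sprintf("agent_%s", agentID)
	// CreateCollection returns an error if the collection already exists,
	// so call this once, when the agent is created.
	return s.client.CreateCollection(ctx, &qdrant.CreateCollection{
		CollectionName: name,
		VectorsConfig: qdrant.NewVectorsConfig(&qdrant.VectorParams{
			Size:     1536, // text-embedding-3-small
			Distance: qdrant.Distance_Cosine,
		}),
	})
}

🎯 Why one collection per agent, not one collection per workspace with metadata filters? Because a filtered search has to evaluate the filter on every query, while a dedicated collection only ever holds that agent's vectors. At 100k chunks per agent and dozens of agents per workspace, the metadata-filter approach gets 10× slower. The price is per-collection overhead, which grows with the number of agents. Qdrant's own multitenancy guide recommends payload-based partitioning once tenant counts get large, so revisit this choice if you expect thousands of agents.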
Tenant isolation comes for free — collection name carries the agent ID, no way to accidentally bleed between customers.
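If you want the "ensure" call to be safe to repeat, one option is to check for the collection before creating it. The sketch below uses a hypothetical `collectionAPI` interface and an in-memory fake so it runs without Qdrant. It is not the Qdrant client's real interface, and context plumbing is left out for brevity.

```go
package main

import "fmt"

// collectionAPI is a hypothetical, minimal view of a vector store client.
// Adapt it to whatever client you actually use.
type collectionAPI interface {
	Exists(name string) (bool, error)
	Create(name string, dim uint64) error
}

// collectionName derives the per-agent collection name used in this post.
func collectionName(agentID string) string {
	return fmt.Sprintf("agent_%s", agentID)
}

// ensureCollection creates the agent's collection only when it is missing,
// so calling it twice for the same agent is harmless.
func ensureCollection(api collectionAPI, agentID string, dim uint64) error {
	name := collectionName(agentID)
	ok, err := api.Exists(name)
	if err != nil {
		return fmt.Errorf("check collection %s: %w", name, err)
	}
	if ok {
		return nil
	}
	if err := api.Create(name, dim); err != nil {
		return fmt.Errorf("create collection %s: %w", name, err)
	}
	return nil
}

// fakeStore is an in-memory stand-in used only to exercise the sketch.
type fakeStore struct {
	dims    map[string]uint64
	creates int
}

func (f *fakeStore) Exists(name string) (bool, error) {
	_, ok := f.dims[name]
	return ok, nil
}

func (f *fakeStore) Create(name string, dim uint64) error {
	if _, ok := f.dims[name]; ok {
		return fmt.Errorf("collection %s already exists", name)
	}
	f.dims[name] = dim
	f.creates++
	return nil
}

func main() {
	store := &fakeStore{dims: map[string]uint64{}}
	fmt.Println(ensureCollection(store, "a1", 1536))
	fmt.Println(ensureCollection(store, "a1", 1536))
	fmt.Println(store.creates)
}
```

Check-then-create is still racy across processes, so treat an "already exists" error from the real client as success too, or keep creating the collection eagerly at agent creation.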
Step 3 — The ingestion job (the part that does the work)
User POSTs a source → API saves KnowledgeSource{status:PENDING} and enqueues an Asynq task. The worker picks it up:
// jobs/workers.go
func (w *IngestWorker) Handle(ctx context.Context, t *asynq.Task) error {
	var p IngestPayload
	if err := json.Unmarshal(t.Payload(), &p); err != nil {
		return err
	}
	src, err := w.repo.GetSource(p.SourceID)
	if err != nil {
		return err
	}
	// Hard 5-minute deadline so a stuck job doesn't poison the queue
	ctx, cancel := context.WithTimeout(ctx, 5*time.Minute)
	defer cancel()

	// 1. Download / read content
	w.markStep(src, "downloading")
	text, err := w.parser.Extract(ctx, src)
	if err != nil {
		return w.fail(src, fmt.Errorf("extract: %w", err))
	}

	// 2. Chunk
	w.markStep(src, "chunking")
	chunks := w.chunker.Chunk(text, 2048, 200) // size, overlap

	// 3. Embed (batches of 20)
	w.markStep(src, "embedding")
	vectors := make([][]float32, 0, len(chunks))
	for i := 0; i < len(chunks); i += 20 {
		end := min(i+20, len(chunks))
		batchCtx, cancelB := context.WithTimeout(ctx, 60*time.Second)
		embs, err := w.embedder.EmbedBatch(batchCtx, chunks[i:end])
		cancelB()
		if err != nil {
			return w.fail(src, fmt.Errorf("embed batch %d: %w", i, err))
		}
		vectors = append(vectors, embs...)
	}

	// 4. Upsert into the agent's collection
	w.markStep(src, "indexing")
	if err := w.qdrant.UpsertPoints(ctx, src.AgentID, chunks, vectors, src.ID); err != nil {
		return w.fail(src, fmt.Errorf("upsert: %w", err))
	}

	// 5. Done
	return w.repo.MarkReady(src.ID, len(chunks))
}

Asynq handles retries, exponential backoff, and dead-lettering. Heavy lifting stays off the API thread.
Step 4 — The chunker (the part everyone underestimates)
Bad chunking is the #1 cause of "the answers are wrong even though the docs are loaded." The rule: chunks should be self-contained but overlap enough to not split a sentence.
// services/chunker.go
func (c *Chunker) Chunk(text string, size, overlap int) []string {
	sentences := splitSentences(text) // regex on . ! ? + capital
	var chunks []string
	var current strings.Builder
	var currentSentences []string
	fresh := 0 // sentences added since the last flush, not counting carried-over overlap
	flush := func() {
		// Nothing new since the last flush: skip, otherwise we would emit a
		// chunk that only repeats the previous chunk's overlap.
		if fresh == 0 {
			return
		}
		chunks = append(chunks, strings.TrimSpace(current.String()))
		// overlap: keep last few sentences for the next chunk
		keep := []string{}
		keptLen := 0
		for i := len(currentSentences) - 1; i >= 0; i-- {
			if keptLen+len(currentSentences[i]) > overlap {
				break
			}
			keep = append([]string{currentSentences[i]}, keep...)
			keptLen += len(currentSentences[i])
		}
		current.Reset()
		currentSentences = nil
		fresh = 0
		for _, s := range keep {
			current.WriteString(s + " ")
			currentSentences = append(currentSentences, s)
		}
	}
	// len() counts bytes, so size and overlap are exact character counts only
	// for ASCII text. A single sentence longer than size still becomes one
	// oversized chunk.
	for _, s := range sentences {
		if current.Len()+len(s) > size {
			flush()
		}
		current.WriteString(s + " ")
		currentSentences = append(currentSentences, s)
		fresh++
	}
	flush()
	return chunks
}

2048 chars + 200 char overlap is the sweet spot for text-embedding-3-small. Smaller chunks fragment context; larger chunks dilute the embedding. The 200-char overlap means the last sentence or two of each chunk is repeated at the start of the next, so a query that lands on a boundary can still match either chunk.
For Q&A pairs I skip chunking entirely — "Q: ${q}\nA: ${a}" is one chunk, embed as-is. For URLs, Colly crawls one page, extracts main content (no nav, no footer), then runs the same chunker on the result.
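The chunker leans on a splitSentences helper that isn't shown. Here is one way it could work, as a hypothetical stand-in rather than Nexora's actual code. Go's regexp package has no lookahead, so this version scans runes and splits after `.`, `!` or `?` when whitespace and an uppercase letter follow. `qnaChunk` is likewise an illustrative name for the one-chunk Q&A format.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// splitSentences ends a sentence at '.', '!' or '?' when the terminator is
// followed by whitespace and then an uppercase letter. Abbreviations such as
// "Dr. Smith" will still be split; this is a heuristic, not a parser.
func splitSentences(text string) []string {
	runes := []rune(text)
	var out []string
	start := 0
	for i, r := range runes {
		if r != '.' && r != '!' && r != '?' {
			continue
		}
		// Skip the whitespace that follows the terminator.
		j := i + 1
		for j < len(runes) && unicode.IsSpace(runes[j]) {
			j++
		}
		// Need at least one space and a capital letter to start a new sentence.
		if j == i+1 || j >= len(runes) || !unicode.IsUpper(runes[j]) {
			continue
		}
		if s := strings.TrimSpace(string(runes[start : i+1])); s != "" {
			out = append(out, s)
		}
		start = j
	}
	// Whatever is left is the final sentence.
	if s := strings.TrimSpace(string(runes[start:])); s != "" {
		out = append(out, s)
	}
	return out
}

// qnaChunk formats a Q&A pair as a single chunk that bypasses the chunker.
func qnaChunk(q, a string) string {
	return fmt.Sprintf("Q: %s\nA: %s", strings.TrimSpace(q), strings.TrimSpace(a))
}

func main() {
	text := "Plans renew monthly. Refunds take 3.5 days! Is that ok? yes it is."
	for _, s := range splitSentences(text) {
		fmt.Printf("%q\n", s)
	}
	fmt.Printf("%q\n", qnaChunk("Do you ship to Uganda?", "Yes."))
}
```

Requiring whitespace after the terminator keeps decimals like 3.5 intact, and requiring a capital letter avoids splitting on a question mark that is followed by lowercase text.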
Step 5 — Embeddings via the Vercel AI Gateway
The Go service hits the Gateway's OpenAI-compatible endpoint — no client SDK needed:
// services/embedding_service.go
type embedReq struct {
	Model string   `json:"model"`
	Input []string `json:"input"`
}

func (e *EmbeddingService) EmbedBatch(ctx context.Context, texts []string) ([][]float32, error) {
	body, err := json.Marshal(embedReq{
		Model: "openai/text-embedding-3-small",
		Input: texts,
	})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost,
		"https://ai-gateway.vercel.sh/v1/embeddings",
		bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+os.Getenv("AI_GATEWAY_API_KEY"))
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	// Without this check a 429 or 500 decodes into an empty result and the
	// caller silently gets zero vectors.
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("embeddings: unexpected status %s", resp.Status)
	}
	var out struct {
		Data []struct {
			Embedding []float32 `json:"embedding"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	if len(out.Data) != len(texts) {
		return nil, fmt.Errorf("embeddings: got %d vectors for %d inputs", len(out.Data), len(texts))
	}
	vectors := make([][]float32, len(out.Data))
	for i, d := range out.Data {
		vectors[i] = d.Embedding
	}
	return vectors, nil
}

That's the only OpenAI integration. Need to swap to Cohere embeddings? Change one string to "cohere/embed-english-v3.0" (and the dim count to match), then re-embed existing sources, because vectors from different models are not comparable. The Gateway abstracts the rest.
📚 If you haven't set up the Vercel AI Gateway yet, read the setup guide first: it covers the AI_GATEWAY_API_KEY, billing, and provider routing this whole platform stands on.
Step 6 — The chat handler (where the magic feels like magic)
// handlers/agent_chat.go
func (h *ChatHandler) Stream(c *gin.Context) {
	var req ChatRequest
	if err := c.BindJSON(&req); err != nil {
		c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
		return
	}
	// Resolve everything that can fail with a plain HTTP status before
	// switching the response to SSE.
	agent, err := h.repo.GetAgent(req.AgentID)
	if err != nil {
		c.JSON(http.StatusNotFound, gin.H{"error": "agent not found"})
		return
	}
	convo, err := h.repo.GetOrCreateConversation(req.ConversationID, req.AgentID)
	if err != nil {
		c.JSON(http.StatusInternalServerError, gin.H{"error": "conversation unavailable"})
		return
	}
	// Use the request context so a client disconnect cancels downstream calls.
	ctx := c.Request.Context()

	// SSE headers
	c.Writer.Header().Set("Content-Type", "text/event-stream")
	c.Writer.Header().Set("Cache-Control", "no-cache")
	c.Writer.Header().Set("Connection", "keep-alive")
	c.Writer.Header().Set("X-Accel-Buffering", "no") // disable nginx buffering

	// emit writes one SSE frame and flushes it. JSON-encoding the payload
	// keeps the data on a single line, which the SSE format requires.
	emit := func(event string, payload any) {
		fmt.Fprintf(c.Writer, "event: %s\ndata: %s\n\n", event, toJSON(payload))
		c.Writer.Flush()
	}

	// 1. emit session
	emit("session", map[string]string{"conversationId": convo.ID})

	// 2. embed query → retrieve top-K
	qVec, err := h.embedder.EmbedBatch(ctx, []string{req.Message})
	if err != nil || len(qVec) == 0 {
		emit("error", map[string]string{"message": "could not embed the query"})
		return
	}
	hits, err := h.qdrant.Search(ctx, agent.ID, qVec[0], 8)
	if err != nil {
		emit("error", map[string]string{"message": "retrieval failed"})
		return
	}

	// 3. emit retrieved chunks (for citations UI)
	emit("chunks", hits)

	// 4. assemble prompt
	var contextBlock strings.Builder
	for i, hit := range hits { // "hit", not "h", to avoid shadowing the handler
		fmt.Fprintf(&contextBlock, "[%d] %s\n\n", i+1, hit.Text)
	}
	messages := []ChatMessage{
		{Role: "system", Content: agent.SystemPrompt + `
Use ONLY the context below to answer. If the answer is not in the context, say "I don't have that information".
Cite chunks by their [number] when you use them.
CONTEXT:
` + contextBlock.String()},
	}
	messages = append(messages, convo.History...)
	messages = append(messages, ChatMessage{Role: "user", Content: req.Message})

	// 5. stream from Gateway
	err = h.llm.StreamChat(ctx, agent.Model, messages, func(token string) {
		emit("message", map[string]string{"token": token})
	})
	if err != nil {
		emit("error", map[string]string{"message": err.Error()})
		return
	}
	emit("done", struct{}{})
}

The five event types are the contract with the frontend:
- session: here's your conversation ID, store it
- chunks: here are the retrieved sources, render as citation chips
- message: token-by-token streaming
- done: wrap up, persist conversation
- error: the stream failed, stop rendering and surface the message
Putting chunks before tokens is a UX win: users see citations appear instantly, then the answer streams in. Feels grounded even before the answer arrives.
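Each event is a plain-text frame: an `event:` line, a `data:` line, then a blank line. A raw newline inside the data would end the field early, so it helps to JSON-encode every payload. Here is a small sketch of a frame formatter. `formatSSE` is an illustrative name of mine, not a function from the Nexora repo.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// formatSSE renders one Server-Sent Events frame. json.Marshal escapes
// newlines, so the payload always fits on a single "data:" line.
func formatSSE(event string, payload any) (string, error) {
	data, err := json.Marshal(payload)
	if err != nil {
		return "", fmt.Errorf("encode %s event: %w", event, err)
	}
	return fmt.Sprintf("event: %s\ndata: %s\n\n", event, data), nil
}

func main() {
	frames := []struct {
		event   string
		payload any
	}{
		{"session", map[string]string{"conversationId": "c_123"}},
		{"message", map[string]string{"token": "Hello\nworld"}},
		{"done", struct{}{}},
	}
	for _, f := range frames {
		s, err := formatSSE(f.event, f.payload)
		if err != nil {
			panic(err)
		}
		fmt.Print(s)
	}
}
```

Returning a string instead of writing to the response directly makes the framing easy to unit-test. The handler then only has to write the frame and flush.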
Step 7 — The system prompt that makes RAG actually grounded
Models love to fall back on training data. The system prompt fights that:
You are {agent.name}, a knowledge agent for {workspace.name}.
RULES (non-negotiable):
1. Use ONLY the information in the CONTEXT block below.
2. If the context does not contain the answer, reply:
"I don't have that information in my current knowledge base."
3. NEVER invent product names, prices, policies, dates, or people.
4. Cite the chunk number in brackets like [1], [3] whenever you use it.
5. If the user's question is off-topic for the agent's purpose, politely redirect.
CONTEXT:
[1] {chunk 1 text}
[2] {chunk 2 text}
...
That single "ONLY the context" line plus the explicit "I don't have that information" out reduces hallucinations dramatically. The citation requirement is also a forcing function — a model that has to cite [n] is far less likely to make things up.
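To show how the template gets filled at request time, here is a sketch of a prompt builder. `buildSystemPrompt` is a hypothetical helper, and the "(no matching chunks were retrieved)" placeholder for an empty result set is my assumption. The rule text is copied from the template above.

```go
package main

import (
	"fmt"
	"strings"
)

// buildSystemPrompt renders the grounding prompt with numbered chunks so the
// model can cite them as [1], [2], and so on.
func buildSystemPrompt(agentName, workspaceName string, chunks []string) string {
	var b strings.Builder
	fmt.Fprintf(&b, "You are %s, a knowledge agent for %s.\n", agentName, workspaceName)
	b.WriteString("RULES (non-negotiable):\n")
	b.WriteString("1. Use ONLY the information in the CONTEXT block below.\n")
	b.WriteString("2. If the context does not contain the answer, reply:\n")
	b.WriteString("\"I don't have that information in my current knowledge base.\"\n")
	b.WriteString("3. NEVER invent product names, prices, policies, dates, or people.\n")
	b.WriteString("4. Cite the chunk number in brackets like [1], [3] whenever you use it.\n")
	b.WriteString("5. If the user's question is off-topic for the agent's purpose, politely redirect.\n")
	b.WriteString("CONTEXT:\n")
	if len(chunks) == 0 {
		// Assumption: say so explicitly instead of leaving the block empty.
		b.WriteString("(no matching chunks were retrieved)\n")
	}
	for i, c := range chunks {
		fmt.Fprintf(&b, "[%d] %s\n", i+1, strings.TrimSpace(c))
	}
	return b.String()
}

func main() {
	fmt.Print(buildSystemPrompt("Ada", "Acme", []string{
		"Refunds are processed within 5 business days.",
		"Support hours are 9am to 5pm EAT.",
	}))
}
```

The numbering here has to match the order of the chunks event sent to the frontend, otherwise the citation chips point at the wrong sources.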
Step 8 — Plans, credits, and tenancy
Free-tier users can't be allowed to upload a 500MB PDF and run up your embedding bill. Two enforcement points: a credit charge per source and a per-plan size cap.
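// services/credits.go
const (
	EmbedCostPerKB = 1 // 1 credit per KB of source text
)

// ChargeForIngestion deducts credits up front. Run it inside a transaction
// (or use a conditional UPDATE) so two concurrent uploads cannot overspend.
func (s *CreditService) ChargeForIngestion(ws *Workspace, src *KnowledgeSource) error {
	// Round up so a source smaller than 1 KB still costs 1 credit.
	cost := int((src.Size+1023)/1024) * EmbedCostPerKB
	if ws.Credits < cost {
		return ErrInsufficientCredits
	}
	ws.Credits -= cost
	return s.repo.Save(ws)
}

| Plan | Monthly credits | Per-source size cap | Concurrent agents |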
|---|---|---|---|
| Free | 50 | 25 KB | 1 |
| Starter | 1,000 | 1 MB | 3 |
| Growth | 10,000 | 10 MB | 10 |
| Agency | Unlimited | 100 MB | Unlimited |
Charge before ingestion starts, refund on FAILED. That way users can't game it by uploading and aborting.
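Here is a sketch of how the table and the charge-then-refund rule fit together as pure functions. The plan values come from the table above. The `plan` struct, the -1 marker for "Unlimited", and treating 1 KB as 1024 bytes are assumptions of this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// plan holds the two limits that matter for ingestion.
type plan struct {
	MonthlyCredits int   // -1 means unlimited (an assumption of this sketch)
	MaxSourceBytes int64 // per-source size cap
}

var plans = map[string]plan{
	"FREE":    {MonthlyCredits: 50, MaxSourceBytes: 25 * 1024},
	"STARTER": {MonthlyCredits: 1000, MaxSourceBytes: 1 << 20},
	"GROWTH":  {MonthlyCredits: 10000, MaxSourceBytes: 10 << 20},
	"AGENCY":  {MonthlyCredits: -1, MaxSourceBytes: 100 << 20},
}

var (
	errTooLarge  = errors.New("source exceeds plan size cap")
	errNoCredits = errors.New("insufficient credits")
)

// ingestionCost charges 1 credit per started KB (rounding up).
func ingestionCost(sizeBytes int64) int {
	return int((sizeBytes + 1023) / 1024)
}

// charge checks the size cap, then the balance, and returns the new balance.
func charge(planName string, balance int, sizeBytes int64) (int, error) {
	p, ok := plans[planName]
	if !ok {
		return balance, fmt.Errorf("unknown plan %q", planName)
	}
	if sizeBytes > p.MaxSourceBytes {
		return balance, errTooLarge
	}
	if p.MonthlyCredits < 0 {
		return balance, nil // unlimited plans are never debited
	}
	cost := ingestionCost(sizeBytes)
	if balance < cost {
		return balance, errNoCredits
	}
	return balance - cost, nil
}

// refund returns the charged credits when ingestion ends in FAILED.
func refund(planName string, balance int, sizeBytes int64) int {
	if p, ok := plans[planName]; !ok || p.MonthlyCredits < 0 {
		return balance
	}
	return balance + ingestionCost(sizeBytes)
}

func main() {
	bal, err := charge("FREE", 50, 20*1024)
	fmt.Println(bal, err)
	fmt.Println(refund("FREE", bal, 20*1024))
}
```

Keeping the arithmetic free of database calls makes the pricing rules easy to test. The persistence layer then only has to apply the resulting balance atomically.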
Step 9 — Production lessons (the painful ones)
1. SSE breaks behind Cloudflare with caching on
Set Cache-Control: no-cache and a Cloudflare page rule "Bypass cache" for the chat endpoint, and X-Accel-Buffering: no for any Nginx in the chain. Miss any one and you get a 30-second hang followed by the entire answer dumped at once.
2. Qdrant collection creation is not atomic
If you create the collection lazily on first ingest, two concurrent uploads will race and one will 409. Create the collection eagerly when the agent is created.
3. PDF extraction is dirty work
ledongthuc/pdf works for ~85% of PDFs. For scanned PDFs you'll need OCR (Tesseract or a cloud OCR service). I default to Tesseract via gosseract and queue it as a separate ocr:source:id job.
4. URL crawls go infinite without bounds
Colly will happily crawl your entire site. Cap with MaxDepth: 2, Async: true, Parallelism: 4, and a hard 30-second timeout per page.
5. Embedding rate-limits hit fast
text-embedding-3-small allows ~3000 RPM on a fresh OpenAI account. Going through the Vercel AI Gateway your effective limit is usually higher, but don't rely on it: back off on 429s with exponential backoff plus jitter, not a fixed retry interval.
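Here is a sketch of what that retry could look like, using the "full jitter" variant where each delay is drawn uniformly between zero and an exponentially growing ceiling. `errRateLimited`, `backoffDelay`, `retryOn429` and `demoRetry` are illustrative names, and the base and maximum delays are placeholders to tune.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// errRateLimited stands in for "the embeddings call returned HTTP 429".
var errRateLimited = errors.New("rate limited")

// backoffDelay returns a random delay in [0, min(maxDelay, base*2^attempt)].
func backoffDelay(attempt int, base, maxDelay time.Duration) time.Duration {
	ceiling := base
	for i := 0; i < attempt && ceiling < maxDelay; i++ {
		ceiling *= 2
	}
	if ceiling > maxDelay {
		ceiling = maxDelay
	}
	return time.Duration(rand.Int63n(int64(ceiling) + 1))
}

// retryOn429 calls fn until it succeeds, fails with a different error,
// runs out of attempts, or ctx is cancelled.
func retryOn429(ctx context.Context, attempts int, base, maxDelay time.Duration, fn func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		err = fn()
		if err == nil || !errors.Is(err, errRateLimited) {
			return err
		}
		if i == attempts-1 {
			break // no point sleeping after the final attempt
		}
		select {
		case <-time.After(backoffDelay(i, base, maxDelay)):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, err)
}

// demoRetry fails twice with a 429, then succeeds, and reports the call count.
func demoRetry() (calls int, err error) {
	err = retryOn429(context.Background(), 5, time.Millisecond, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errRateLimited
		}
		return nil
	})
	return calls, err
}

func main() {
	calls, err := demoRetry()
	fmt.Println(calls, err) // prints "3 <nil>"
}
```

Jitter matters because a batch of workers that all hit a 429 at the same moment would otherwise retry in lockstep and trip the limit again.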
6. Cosine vs dot product
For OpenAI embeddings, both work because the vectors are normalised. Stick with cosine in Qdrant — it's the documented contract and any future model swap is more likely to be cosine-compatible.
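The claim is easy to check numerically. Cosine similarity is the dot product divided by both vector lengths, so for unit-length vectors the two are the same number. The toy 3-dimensional vectors below stand in for real 1536-dimensional embeddings.

```go
package main

import (
	"fmt"
	"math"
)

// dot returns the inner product of two equal-length vectors.
func dot(a, b []float64) float64 {
	var sum float64
	for i := range a {
		sum += a[i] * b[i]
	}
	return sum
}

// norm returns the Euclidean length of a vector.
func norm(a []float64) float64 {
	return math.Sqrt(dot(a, a))
}

// normalize returns a copy of a scaled to unit length.
func normalize(a []float64) []float64 {
	n := norm(a)
	out := make([]float64, len(a))
	for i, v := range a {
		out[i] = v / n
	}
	return out
}

// cosine returns the cosine similarity of two non-zero vectors.
func cosine(a, b []float64) float64 {
	return dot(a, b) / (norm(a) * norm(b))
}

func main() {
	rawA, rawB := []float64{3, 4, 0}, []float64{1, 2, 2}
	a, b := normalize(rawA), normalize(rawB)
	fmt.Printf("raw:        dot=%.4f cosine=%.4f\n", dot(rawA, rawB), cosine(rawA, rawB))
	fmt.Printf("normalised: dot=%.4f cosine=%.4f\n", dot(a, b), cosine(a, b))
}
```

On the raw vectors the dot product and cosine disagree. After normalisation they match, which is why the metric choice only starts to matter if you switch to a model that does not return unit-length vectors.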
7. Conversation history compresses
Don't dump 50 turns of convo.History into every prompt. Keep the last 10 verbatim, summarise older turns into one system message, and re-summarise every 20 turns. It costs almost nothing and accuracy goes way up.
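A sketch of the trimming step, assuming history is a flat slice of messages and that the summary comes from a caller-supplied function (for example a call to a cheap model). `compactHistory` and the summary prefix are illustrative, and this version counts messages rather than turns.

```go
package main

import "fmt"

// ChatMessage mirrors the role/content pairs used in the chat handler.
type ChatMessage struct {
	Role    string
	Content string
}

// compactHistory keeps the last `keep` messages verbatim and replaces
// everything older with one system message produced by summarise.
func compactHistory(history []ChatMessage, keep int, summarise func([]ChatMessage) string) []ChatMessage {
	if keep < 0 {
		keep = 0
	}
	if len(history) <= keep {
		return history // short conversations pass through untouched
	}
	cut := len(history) - keep
	older, recent := history[:cut], history[cut:]
	out := make([]ChatMessage, 0, keep+1)
	out = append(out, ChatMessage{
		Role:    "system",
		Content: "Summary of earlier conversation: " + summarise(older),
	})
	return append(out, recent...)
}

func main() {
	var history []ChatMessage
	for i := 0; i < 25; i++ {
		history = append(history, ChatMessage{Role: "user", Content: fmt.Sprintf("message %d", i)})
	}
	// Placeholder summariser; a real one would call a model.
	summarise := func(older []ChatMessage) string {
		return fmt.Sprintf("%d earlier messages omitted", len(older))
	}
	for _, m := range compactHistory(history, 10, summarise) {
		fmt.Printf("%s: %s\n", m.Role, m.Content)
	}
}
```

Caching the summary on the conversation row and regenerating it only every 20 turns, as described above, avoids paying for a summary call on every request.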
Use cases this pattern unlocks
| Vertical | Knowledge sources | What the agent does |
|---|---|---|
| Customer support | Help docs, past tickets, product FAQs | Auto-answer L1, escalate L2 with context |
| Internal IT | Wiki, runbooks, Slack history | "How do I get VPN access?" → answer + links |
| Sales enablement | Pitch decks, case studies, pricing | Reps query "Show me healthcare wins under $50k MRR" |
| Legal / compliance | Contracts, policies | "Is this clause GDPR-compliant?" with citations |
| Education | Course materials, past Q&A | Tutor agent grounded in the syllabus |
| Real estate | Listings, neighbourhood guides | "3 bed in Ntinda under $300k with parking" |
Anywhere your team currently CTRL-F's through a Notion / Confluence / Drive folder is a candidate.
Frequently asked questions
Why Go for the backend? Doesn't Next.js / Python do this fine?
Go for the API and workers because (a) the workers are long-running and CPU-heavy (chunking, parsing PDFs), (b) a single binary deploys cleanly to a VPS without dependency hell, and (c) goroutines let me parallelise embedding batches trivially. The frontend is still Next.js. Best tool per layer.
Why Qdrant over Pinecone / Weaviate / pgvector?
Qdrant Cloud's free tier covers small dev workspaces, the self-hosted Docker image is one container, and per-collection isolation is first-class. Pinecone is fine but more expensive and harder to self-host. pgvector works at small scale but doesn't compete on recall once you're past ~500k vectors per tenant.
Can I use Claude instead of OpenAI for embeddings?
Anthropic doesn't offer an embeddings endpoint for Claude. The Gateway routes embedding requests through providers that do, such as OpenAI, Cohere, and Voyage. For chat generation you can absolutely use Claude (it's my default for agent answers).
How do I deduplicate ingested content?
Hash each chunk (sha256(chunk.text)) and skip upsert if the hash already exists in the agent's collection. Cheap, no false positives, handles re-uploads gracefully.
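A sketch of the hashing step. The `seen` map stands in for whatever lookup you run against the agent's collection (for example a payload field holding the hash), which is an assumption of this sketch; `chunkHash` and `dedupe` are illustrative names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// chunkHash returns the hex-encoded SHA-256 of the chunk text.
func chunkHash(text string) string {
	sum := sha256.Sum256([]byte(text))
	return hex.EncodeToString(sum[:])
}

// dedupe returns the chunks whose hash is not yet in seen and records them,
// preserving the original order.
func dedupe(chunks []string, seen map[string]bool) []string {
	var fresh []string
	for _, c := range chunks {
		h := chunkHash(c)
		if seen[h] {
			continue
		}
		seen[h] = true
		fresh = append(fresh, c)
	}
	return fresh
}

func main() {
	seen := map[string]bool{}
	first := dedupe([]string{"Refunds take 5 days.", "We ship worldwide.", "Refunds take 5 days."}, seen)
	second := dedupe([]string{"We ship worldwide.", "Support is 24/7."}, seen)
	fmt.Println(len(first), len(second))
}
```

Hashing the exact text means a re-upload with different whitespace produces new hashes. Normalise the text before hashing if you want those treated as duplicates.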
How do I handle very long documents (500+ pages)?
Stream-parse the PDF instead of loading it into memory, chunk as you go, embed in 20-chunk batches as you produce them, and persist incremental progress. Don't try to hold the whole document in RAM — a 1000-page PDF can blow past 2GB after parsing.
What's the cheapest end-to-end setup for a side project?
- Qdrant: self-host on a $5/mo VPS (1GB RAM handles ~100k vectors)
- Vercel AI Gateway: pay-per-token, no minimums
- Postgres + Redis: same VPS via Docker
- Dokploy to orchestrate
Total: ~$10/mo + actual AI usage. Same architecture as Nexora's production setup, just denser packing.
Related reading
- Vercel AI Gateway complete setup guide — the model & embedding layer underneath
- AI Tool Calling with Custom UI Components — bolt tool calls onto your RAG agent for actions, not just answers
- From CRUD to MCP Server — expose your RAG as MCP tools so Claude Desktop can query your knowledge base
- Multi-model AI content pipelines — orchestrating multiple models in one workflow
- Complete Guide to AI Integration with Vercel AI SDK
Need help shipping a RAG agent platform?
I build RAG-backed agents for SaaS, internal tooling, and customer-support automation — ingestion pipelines, vector infra, streaming chat, tenancy.
- 📞 Book a session — design / code review / setup. Sessions from UGX 50,000.
- 💼 Hire Desishub — full RAG platform builds: desishub.com
- 📺 YouTube — practical AI engineering: @JBWEBDEVELOPER
- 💻 Reference repo: github.com/MUKE-coder/nexora
Resources
- Vercel AI Gateway: vercel.com/docs/ai-gateway
- Qdrant docs: qdrant.tech/documentation
- Asynq: github.com/hibiken/asynq
- OpenAI embeddings: platform.openai.com/docs/guides/embeddings
- Colly (Go web scraper): go-colly.org
- Gin web framework: gin-gonic.com

