Shipping an AI chat feature without blowing the budget
Streaming a model response with the Vercel AI SDK is one line. The engineering is in the rate limits, output caps, and step limits that keep a public chat bubble from costing a fortune.
- Published
- May 25, 2026
There is a chat bubble in the corner of this site. It took an afternoon to build with the Vercel AI SDK, and the hard part was not making it work. It was making sure it could not quietly cost us a fortune. Streaming a model response is easy now. The engineering is in the guardrails around it. Here is how we think about both.
The streaming part is genuinely simple
On the server, a route handler calls streamText with a model and the conversation, and returns a streaming response. On the client, useChat manages the message list and the streaming state. That is most of the feature.
import { streamText, convertToModelMessages, type UIMessage } from 'ai'

export async function POST(req: Request) {
  // useChat posts the conversation as an array of UI messages
  const { messages }: { messages: UIMessage[] } = await req.json()

  const result = streamText({
    model: 'openai/gpt-4o-mini',
    // awaited because convertToModelMessages returns a promise in v6
    messages: await convertToModelMessages(messages),
    maxOutputTokens: 800, // hard cap on the output you pay for
  })

  // Stream the result back in the format useChat expects
  return result.toUIMessageStreamResponse()
}

A note for anyone arriving from an older tutorial: the SDK is on version 6 now, and a few v4 habits will not compile. Tools use inputSchema, not parameters. On the client, useChat no longer owns the input state, so you manage the text field yourself, and you send with sendMessage rather than the old append. If your code looks like a 2024 blog post, that is why it is failing.
Tools, and the cap that prevents runaway loops
Tool calling is where this gets powerful and where the budget risk lives. You give the model functions it can call, each with a schema and an execute, and it decides when to use them. The danger is the agentic loop: the model calls a tool, sees the result, calls another, and without a limit it can spiral.
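import { streamText, tool, stepCountIs } from 'ai'

// Inside the route handler. messages is the converted conversation;
// OrderSchema and execute are defined elsewhere.
const result = streamText({
  model: 'openai/gpt-4o-mini',
  messages,
  tools: { lookupOrder: tool({ description: '...', inputSchema: OrderSchema, execute }) },
  stopWhen: stepCountIs(5), // at most five model calls per request
})

That stopWhen: stepCountIs(5) is not optional in our book. By default streamText stops after a single step, so the model never gets to read its own tool results. A multi-step loop only exists once you set stopWhen yourself, and whatever you put there is the only ceiling on a confused model that keeps calling tools. Five steps is plenty for a support assistant and it bounds the worst case.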
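To make "bounds the worst case" concrete, here is a rough sketch of the arithmetic. It assumes the output cap applies to each step separately, and it reuses the 800-token cap from the route handler above together with the five-step limit. The helper names are hypothetical, and the price is left as a parameter rather than quoted, since it depends on your model and provider.

```typescript
// Hypothetical helpers for estimating a per-request ceiling.
// Assumption: maxOutputTokens is enforced on every step of the loop.
interface Limits {
  maxSteps: number        // the n in stepCountIs(n)
  maxOutputTokens: number // per-step output cap
}

// Upper bound on output tokens a single request can generate.
function worstCaseOutputTokens(limits: Limits): number {
  return limits.maxSteps * limits.maxOutputTokens
}

// Upper bound on output cost in dollars, given a price per million
// output tokens that you look up for your own model.
function worstCaseOutputCost(limits: Limits, usdPerMillionOutputTokens: number): number {
  return (worstCaseOutputTokens(limits) / 1e6) * usdPerMillionOutputTokens
}

const chatLimits: Limits = { maxSteps: 5, maxOutputTokens: 800 }
console.log(worstCaseOutputTokens(chatLimits)) // 4000
```

This covers output tokens only. Input tokens also grow with each step, because every tool result is appended to the conversation, so treat the figure as a floor on the ceiling rather than a full cost model.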
The four guardrails that actually matter
Making the model talk is one line. Making it safe to expose to the public internet is the real work. Four things, in order of how much grief they save you:
Rate limit before the model call. Put a token bucket on the route, keyed by IP or user, and reject abuse at the edge. The cheapest request is the one you never send to the model. Ours returns a 429 long before any tokens are spent.
Cap the output. Set maxOutputTokens on every call. A missing cap is how a single prompt injection turns into a thousand-token essay you paid for.
Route by difficulty. Most questions do not need your most expensive model. Send the cheap, frequent work to a small model and reserve the big one for the cases that need it.
Always pass an abort signal. Forward the request's signal to streamText as abortSignal, so that when a user closes the tab mid-stream the generation is cancelled instead of running to completion on your bill.
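The first guardrail is the one most worth seeing in code. What follows is a minimal sketch of a token bucket, not the limiter this site runs: it assumes a single server process with in-memory state, and the class name, parameters, and injectable clock are all illustrative. A deployment with several instances would need shared storage instead of a Map.

```typescript
// Minimal in-memory token bucket, keyed by client (IP or user ID).
// Assumption: one server process. Names here are illustrative only.
interface Bucket {
  tokens: number    // tokens currently available
  updatedAt: number // timestamp in ms of the last refill
}

class TokenBucketLimiter {
  private buckets = new Map<string, Bucket>()
  private capacity: number
  private refillPerSecond: number
  private now: () => number

  constructor(capacity: number, refillPerSecond: number, now: () => number = () => Date.now()) {
    this.capacity = capacity               // burst size
    this.refillPerSecond = refillPerSecond // sustained rate
    this.now = now                         // injectable clock, useful in tests
  }

  // Returns true if the request may proceed, false if it should get a 429.
  allow(key: string): boolean {
    const t = this.now()
    const bucket = this.buckets.get(key) ?? { tokens: this.capacity, updatedAt: t }

    // Refill in proportion to elapsed time, never beyond capacity.
    const elapsedSeconds = (t - bucket.updatedAt) / 1000
    bucket.tokens = Math.min(this.capacity, bucket.tokens + elapsedSeconds * this.refillPerSecond)
    bucket.updatedAt = t

    // Spend one token per request if one is available.
    const allowed = bucket.tokens >= 1
    if (allowed) bucket.tokens -= 1
    this.buckets.set(key, bucket)
    return allowed
  }
}
```

In the route handler, the check would sit at the very top, before the request body is parsed: if allow(key) returns false, return a Response with status 429 and never reach streamText.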
One chokepoint for spend
The detail that ties it together: those string model IDs like 'openai/gpt-4o-mini' route through a gateway, which gives you a single place to set spend caps, watch usage, and fail over between providers without touching code. Swapping a model becomes a string change. For a small team, having one dashboard where the spending is visible and capped is worth more than any individual optimization.
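Because a model is just a string, routing by difficulty can be as small as a function that returns one. The sketch below is only an illustration of that idea: the function name, the length threshold, the keyword list, and the choice of 'openai/gpt-4o' as the larger model are all assumptions, not tuned values or what this site uses.

```typescript
// Hypothetical routing heuristic. Thresholds and keywords are illustrative.
const SMALL_MODEL = 'openai/gpt-4o-mini'
const LARGE_MODEL = 'openai/gpt-4o' // assumption: substitute your own larger model

// Phrases that, in this made-up example, suggest a question needs more care.
const HARD_HINTS = ['refund', 'legal', 'compare', 'step by step']

function pickModel(question: string): string {
  const text = question.toLowerCase()
  const looksHard = text.length > 500 || HARD_HINTS.some((hint) => text.includes(hint))
  return looksHard ? LARGE_MODEL : SMALL_MODEL
}

console.log(pickModel('What are your opening hours?')) // openai/gpt-4o-mini
```

The return value can be passed straight to the model option of streamText. A keyword heuristic is crude, but it is cheap, predictable, and easy to audit, which matters more here than routing accuracy.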
The feature is a day of work. The guardrails are what let you sleep after you ship it. Build the second part with the same care as the first, and a public chat bubble stops being a liability and becomes just another well-behaved endpoint.