Every LLM call costs real money. Many public LLM features accept anonymous input, forward it to a paid API, and return the result. That is not a chatbot. That is a wallet with a text box on top, and the internet can type.
Unbounded Consumption is the LLM version of a denial-of-service bug, except the resource being exhausted is your credit card. Without caps on input size, output size, and request rate, any public LLM feature is a pay-per-token pipe that anyone on the internet can open.
What your AI actually built
You wanted a public demo of your AI feature, so you skipped the signup wall. Visitors type a message, your server forwards it to Claude or GPT, the reply comes back. It was supposed to be a taste. It works as advertised.
Nothing on the path limits how long the prompt can be, how many requests a single IP can send, or how many tokens any one response can burn. The upstream model has a 200k context window, and your bill scales with it.
This is a classic denial-of-wallet bug. Attackers do not need a vulnerability; they just need your endpoint and a for-loop. The model is happy to process 200k-token prompts forever. You are the one paying for it.
How it gets exploited
A public 'try our AI' page with no account required. Each request is forwarded straight to a paid LLM API.
1. Find the endpoint. An attacker opens the network tab and sees POST /api/chat returning model output. No auth header, no CAPTCHA, no rate limit in sight.
2. Measure the cost. They send one big prompt (100k tokens of lorem ipsum) and the server happily forwards it. The response takes 40 seconds and the bill meter ticks.
3. Parallelize. A ten-line script opens 200 concurrent connections, each sending a new 100k-token prompt. The server fans them all out to the upstream API.
4. Let it run overnight. Eight hours later, your OpenAI dashboard shows $42,300 in usage. The attacker paid nothing. Your autopay succeeded.
A demo feature burned through a month of runway in a single night. No data was stolen — the damage was the invoice.
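The overnight figure holds up on simple arithmetic. The sketch below multiplies out the numbers from the steps above. It assumes input tokens are billed at $3 per million, which is an illustrative price and not a quote from any provider's current price list, and it ignores output tokens and any throttling the provider applies. The interface and function names are hypothetical.

```typescript
// Back-of-the-envelope cost of a sustained flood of large prompts.
// ASSUMPTION: only input tokens are counted, at a flat per-million price.
interface AttackScenario {
  concurrentConnections: number;    // parallel connections the script keeps open
  secondsPerRequest: number;        // how long each forwarded request takes
  hours: number;                    // how long the script runs
  inputTokensPerRequest: number;    // prompt size per request
  usdPerMillionInputTokens: number; // ASSUMPTION: check your provider's pricing
}

function estimateAttackCostUsd(s: AttackScenario): number {
  // Each connection sends requests back to back for the whole run.
  const requestsPerConnection = Math.floor((s.hours * 3600) / s.secondsPerRequest);
  const totalRequests = requestsPerConnection * s.concurrentConnections;
  const totalInputTokens = totalRequests * s.inputTokensPerRequest;
  return (totalInputTokens / 1000000) * s.usdPerMillionInputTokens;
}

// The scenario from the steps above: 200 connections, 40 s per request,
// 8 hours, 100k-token prompts.
const overnight = estimateAttackCostUsd({
  concurrentConnections: 200,
  secondsPerRequest: 40,
  hours: 8,
  inputTokensPerRequest: 100000,
  usdPerMillionInputTokens: 3,
});
console.log(overnight); // 43200
```

Under these assumptions the script sends 144,000 requests and the estimate comes to $43,200, the same order of magnitude as the dashboard figure in the scenario.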
Vulnerable vs Fixed
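Vulnerable:

// app/api/chat/route.ts
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

export async function POST(req: Request) {
  // No rate limit and no input cap: whatever arrives is forwarded upstream.
  const { message } = await req.json();
  const reply = await anthropic.messages.create({
    model: 'claude-3-5-sonnet',
    max_tokens: 4096,
    messages: [{ role: 'user', content: message }],
  });
  return Response.json({ reply });
}

Fixed:

// app/api/chat/route.ts
import Anthropic from '@anthropic-ai/sdk';
import { rateLimit } from '~/lib/rate-limit';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
const MAX_INPUT_CHARS = 4000;

export async function POST(req: Request) {
  // Cap 1: per-IP rate limit. x-forwarded-for can hold a comma-separated
  // proxy chain; use the first entry, and only trust it behind your own proxy.
  const ip = (req.headers.get('x-forwarded-for') ?? 'unknown').split(',')[0].trim();
  const ok = await rateLimit.check(ip, { max: 20, window: '1h' });
  if (!ok) return new Response('Too many requests', { status: 429 });

  // Cap 2: reject oversized (or non-string) input before it reaches the model.
  const { message } = await req.json();
  if (typeof message !== 'string' || message.length > MAX_INPUT_CHARS) {
    return new Response('Message too long', { status: 413 });
  }

  // Cap 3: bound the output tokens any single response can burn.
  const reply = await anthropic.messages.create({
    model: 'claude-3-5-sonnet',
    max_tokens: 512,
    messages: [{ role: 'user', content: message }],
  });
  return Response.json({ reply });
}

Three caps, all boring, all essential: a ceiling on input length, a ceiling on output tokens, and a per-IP rate limit. None of these makes the feature worse for real users; they just stop the feature from being a free backend for everyone else.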
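The fixed handler imports `rateLimit` from `~/lib/rate-limit`, but the article does not show that module. One way it could look is a fixed-window counter kept in memory, sketched below. Everything here is an assumption made for illustration: the class name `FixedWindowRateLimit`, the `hit` and `parseWindowMs` helpers, and the `'1h'`-style window format inferred from the call site. It is not the author's implementation.

```typescript
// Hypothetical stand-in for '~/lib/rate-limit' (not shown in the article).
// Fixed-window counter: each key gets `max` requests per window.

interface Bucket {
  count: number;   // requests seen in the current window
  resetAt: number; // epoch ms at which the window expires
}

// Parses '30s', '5m' or '1h' into milliseconds. The format is an assumption.
function parseWindowMs(window: string): number {
  const match = /^(\d+)([smh])$/.exec(window);
  if (!match) throw new Error(`Unsupported window: ${window}`);
  const unitMs = match[2] === 's' ? 1000 : match[2] === 'm' ? 60000 : 3600000;
  return Number(match[1]) * unitMs;
}

class FixedWindowRateLimit {
  private buckets = new Map<string, Bucket>();
  private now: () => number;

  // The clock is injectable so the limiter can be tested without waiting.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Synchronous core: returns true if the request is allowed.
  hit(key: string, max: number, windowMs: number): boolean {
    const t = this.now();
    const bucket = this.buckets.get(key);
    if (!bucket || t >= bucket.resetAt) {
      // First request for this key, or the old window expired: start fresh.
      this.buckets.set(key, { count: 1, resetAt: t + windowMs });
      return true;
    }
    if (bucket.count >= max) return false;
    bucket.count += 1;
    return true;
  }

  // Async wrapper matching the call shape used in the handler above.
  async check(key: string, opts: { max: number; window: string }): Promise<boolean> {
    return this.hit(key, opts.max, parseWindowMs(opts.window));
  }
}

// A real module would export this instance as `rateLimit`.
const rateLimit = new FixedWindowRateLimit();
```

An in-memory map only counts requests seen by one process, and it never evicts old keys. On serverless or multi-instance deployments the counters would need to live in a shared store, and a fixed window lets a client send up to twice `max` requests across a window boundary, so treat this as a starting point.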
A real case
A public AI demo racked up a five-figure bill overnight
A small team shipped an unauthenticated chat endpoint for their launch. By morning, a single scripted attacker had burned through tens of thousands in inference costs — no data stolen, just the invoice.