Engineering

Why We Use LLMs for Security Testing (And What They Actually Catch)

Traditional scanners match patterns. LLM-powered scanners read your app like a human would. Here's a side-by-side comparison of what each one finds — and misses — on the same endpoint.

Flowpatrol Team · Mar 28, 2026 · 13 min read

Same app, two scanners, very different results

Here's something we ran last month. We took a Next.js app built with Lovable — a basic invoice SaaS with Supabase on the backend — and pointed two scanners at it. One was OWASP ZAP, the most popular open-source security scanner in the world. The other was Flowpatrol's LLM-powered engine.

Same app. Same endpoints. Same five-minute window.

ZAP found 14 issues. Flowpatrol found 23. But the numbers aren't the interesting part. The types of findings are.

Here's the ZAP report, summarized:

ZAP Scan Results — invoice-app.lovable.app
============================================
HIGH
  - SQL Injection (reflected) — /api/search?q=
  - Cross-Site Scripting (reflected) — /api/search?q=

MEDIUM
  - Missing Content-Security-Policy header
  - Missing X-Frame-Options header
  - Missing Strict-Transport-Security header
  - Server leaks version info via X-Powered-By header
  - Cookie without SameSite attribute (3 instances)

LOW
  - X-Content-Type-Options header missing
  - Information disclosure — debug error messages
  - Timestamp disclosure — Unix timestamps in response

INFORMATIONAL
  - Modern web application detected
  - Non-storable content

Solid findings. The SQL injection is real. The XSS is real. The missing headers are real. ZAP did its job.

Now here's what Flowpatrol found on top of those same issues:

Flowpatrol Scan Results — invoice-app.lovable.app
===================================================
CRITICAL
  - Broken access control — GET /api/invoices/[id]
    Any authenticated user can read any invoice by
    changing the ID parameter. No ownership check.

  - Supabase RLS disabled on invoices table
    Service role key used in client bundle. All rows
    readable via direct Supabase query.

HIGH
  - Privilege escalation — PUT /api/users/[id]
    Users can modify other users' profiles, including
    the `role` field. Sending {"role": "admin"} works.

  - Business logic flaw — POST /api/invoices
    Negative line item amounts accepted. An invoice
    for -$500 triggers a refund to the creator.

  - Insecure direct object reference — DELETE /api/invoices/[id]
    Any authenticated user can delete any invoice.
    Sequential IDs make enumeration trivial.

MEDIUM
  - NEXT_PUBLIC_SUPABASE_SERVICE_ROLE_KEY in JS bundle
    Service role key exposed in client-side code.
    Bypasses all RLS even if RLS were enabled.

  [... plus the same header/cookie findings ZAP reported]

ZAP caught the injection and the XSS. Those are pattern-matching wins — send a payload, check the response, flag the match. That's what rule-based scanners are built for, and they're good at it.

But ZAP didn't touch the access control issues. It didn't notice that changing an invoice ID returns someone else's data — a textbook IDOR vulnerability. It didn't try sending a negative dollar amount. It didn't check whether the role field was writable. It couldn't — because those aren't patterns. They're logic.

[Figure: Side-by-side comparison of rule-based scanner findings versus LLM-powered scanner findings on the same application]


Why pattern matching has a ceiling

Traditional scanners work by doing the same thing over and over: send a known-bad input, check if the response looks like a known-bad output. SQL injection? Send ' OR 1=1 -- and see if the database errors or returns extra rows. XSS? Send <script>alert(1)</script> and see if it shows up in the response.

This works well for a specific category of bugs — the kind where the vulnerability is the input/output pattern. Send bad thing, get bad result, flag it.

Here's a simplified version of what a rule-based check looks like under the hood:

# Traditional scanner: SQL injection check (simplified)
import requests

def check_sqli(url, param):
    payloads = [
        "' OR '1'='1",
        "1; DROP TABLE users--",
        "' UNION SELECT null,null,null--",
        "1' AND SLEEP(5)--",
    ]
    for payload in payloads:
        response = requests.get(url, params={param: payload})
        body = response.text.lower()
        # Error-based detection: database error text leaked into the response
        if "sql" in body or "syntax error" in body:
            return Finding(severity="HIGH", type="SQL Injection")
        # Time-based detection: the SLEEP payload delayed the response
        if response.elapsed.total_seconds() >= 5:
            return Finding(severity="HIGH", type="SQL Injection (time-based)")
    return None

It's a loop over known payloads with pattern-matched responses. Effective, battle-tested, and fundamentally limited to what's in the list.

Now consider this endpoint:

// app/api/invoices/[id]/route.ts
import { NextResponse } from "next/server";

export async function GET(
  request: Request,
  { params }: { params: { id: string } }
) {
  const invoice = await db.invoice.findUnique({
    where: { id: params.id },
  });
  return NextResponse.json(invoice);
}

There's nothing to inject here. Prisma parameterizes the query automatically. No XSS vector in a JSON response. A rule-based scanner sends its payloads, gets clean responses, and moves on. Green check. Endpoint looks fine.

But the endpoint is wide open. Any logged-in user can read any invoice by changing the ID. The problem isn't in the input handling — it's in the missing logic. There's no ownership check. And a pattern-matching scanner can't flag the absence of something.
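The missing logic is a single ownership check before the record goes out the door. Here's a minimal sketch of that guard as a standalone function; the `Invoice` shape and the session lookup in the comment are assumptions for illustration, not Flowpatrol output:

```typescript
// Hypothetical ownership guard for an invoice read endpoint.
type Invoice = { id: string; userId: string; amount: number };

function canReadInvoice(sessionUserId: string, invoice: Invoice | null): boolean {
  // Treat "not found" and "not yours" identically, so the endpoint
  // doesn't leak which invoice IDs exist.
  return invoice !== null && invoice.userId === sessionUserId;
}

// In the route handler, the guard slots in before the JSON response:
//   const invoice = await db.invoice.findUnique({ where: { id: params.id } });
//   if (!canReadInvoice(session.user.id, invoice)) {
//     return NextResponse.json({ error: "Not found" }, { status: 404 });
//   }
```

Returning 404 rather than 403 for foreign IDs is a deliberate choice: it keeps the endpoint from doubling as an existence oracle.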


What changes when the scanner can read

An LLM-powered scanner doesn't just send payloads and match responses. It reads the page, understands what the application does, and reasons about what should and shouldn't be possible.

When Flowpatrol's engine encounters that invoice endpoint, the process looks more like this:

  1. It reads the API response. The endpoint returns an invoice with fields like userId, amount, lineItems, and billingAddress. The LLM recognizes this as user-owned financial data.

  2. It asks a question. "This invoice belongs to user usr_abc. I'm authenticated as user usr_xyz. Should I be able to see this?" The answer is obviously no — but an LLM is the first kind of scanner that can even formulate that question.

  3. It tests the hypothesis. It makes a second request using a different user's session and the same invoice ID. If the response comes back with data, that's a confirmed access control failure.

  4. It reports with context. Instead of "parameter reflection detected," the finding says: "Any authenticated user can read any other user's invoice by changing the ID in the URL. The endpoint returns billing address, payment details, and line items without verifying ownership."

The difference isn't just accuracy. It's the class of bugs that become findable.


Three things LLMs catch that rules can't

1. Business logic flaws

This is the big one. Business logic bugs aren't about bad input handling — they're about the application doing something it shouldn't, even when all inputs are perfectly valid.

Here's a real example. A SaaS app lets users create invoices with line items. Each line item has a description, quantity, and unit price. The API looks like this:

// POST /api/invoices
{
  "client_id": "client_123",
  "line_items": [
    { "description": "Consulting", "quantity": 10, "hours": true, "rate": 150 },
    { "description": "Expenses",   "quantity": 1,  "hours": false, "rate": -500 }
  ]
}

That negative rate on the second line item? The API accepts it. The invoice total comes out to $1,000 instead of $2,000. If the billing system processes refunds based on invoice totals, a user could generate a negative invoice and get money back.

A rule-based scanner never tries this. It doesn't know what a "rate" field means. It doesn't know that negative values are semantically wrong in this context. It sends SQL injection payloads to the rate field, gets back a clean response, and moves on.

An LLM reads the endpoint, recognizes it as an invoicing system, and thinks: "What happens if I send a negative amount?" That's not a pattern match. That's reasoning.
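On the defense side, the fix is semantic validation, not input sanitization: reject values that are perfectly valid numbers but impossible for an invoice. A minimal sketch, with field names taken from the payload above (the error format is an assumption):

```typescript
type LineItem = { description: string; quantity: number; hours: boolean; rate: number };

// Reject line items that are syntactically valid JSON but semantically
// impossible for an invoice: negative rates, non-positive quantities.
function validateLineItems(items: LineItem[]): string[] {
  const errors: string[] = [];
  items.forEach((item, i) => {
    if (item.rate < 0) errors.push(`line_items[${i}]: rate must be >= 0`);
    if (item.quantity <= 0) errors.push(`line_items[${i}]: quantity must be > 0`);
  });
  return errors;
}
```

The check lives server-side, before totals are computed, so a crafted client can't route around it.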

2. Multi-step access control failures

Some access control bugs only show up when you chain actions together. Consider this sequence:

# Step 1: Create a team (as team admin)
POST /api/teams
{"name": "My Team"}
# Returns: {"id": "team_abc", "role": "admin"}

# Step 2: Invite a member (as team admin)
POST /api/teams/team_abc/members
{"email": "member@example.com", "role": "viewer"}

# Step 3: The invited member changes their own role
PUT /api/teams/team_abc/members/me
{"role": "admin"}
# Returns: 200 OK — member is now admin

Each endpoint works correctly in isolation. The team creation is fine. The invitation is fine. The profile update is fine — it only updates the requesting user's own record, so it's not an IDOR in the traditional sense.

But the member endpoint doesn't check whether the role field should be writable by the member themselves. A viewer just promoted themselves to admin. This is a vertical privilege escalation through a legitimate endpoint, and it only becomes visible when you understand the relationship between roles, permissions, and who should be able to change what.

Rule-based scanners test endpoints in isolation. They don't maintain a mental model of "this user is a viewer, and viewers shouldn't be able to set their role to admin." LLMs do.
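The usual fix for this class of bug is an allow-list on the self-update endpoint: members can change cosmetic fields on their own record, while role changes go through a separate admin-guarded route. A sketch, where the editable field names are assumptions:

```typescript
// Fields a member may change on their own record. `role` is deliberately
// absent: only an admin-only endpoint may set it.
const SELF_EDITABLE_FIELDS = new Set(["name", "avatarUrl", "notifications"]);

// Drop any payload key that isn't explicitly self-editable,
// which also blocks mass assignment of fields added later.
function filterSelfUpdate(payload: Record<string, unknown>): Record<string, unknown> {
  return Object.fromEntries(
    Object.entries(payload).filter(([key]) => SELF_EDITABLE_FIELDS.has(key))
  );
}
```

An allow-list fails safe: a new sensitive column is protected by default, whereas a deny-list silently exposes it.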

3. Context-dependent data exposure

Some data is only sensitive in context. An API that returns {"email": "user@example.com"} might be fine on a public profile endpoint but a serious leak on a search endpoint where you can enumerate every user in the system.

# Public profile — this is expected
GET /api/users/usr_abc/profile
{"name": "Jane", "email": "jane@example.com", "bio": "Builder"}

# Search endpoint — this shouldn't return email
GET /api/users/search?q=jane
{"results": [{"name": "Jane", "email": "jane@example.com", "id": "usr_abc"}]}

A rule-based scanner sees two endpoints returning similar JSON. No red flags. An LLM recognizes that the search endpoint lets any user enumerate email addresses for every user in the system — and that's a finding.
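The fix here is per-context serialization: the same user record gets a different projection depending on which endpoint returns it. A sketch using the field names from the responses above:

```typescript
type User = { id: string; name: string; email: string; bio: string };

// Search results only need enough to identify a user. Omitting email
// stops the endpoint from becoming an enumeration oracle.
function toSearchResult(user: User) {
  return { id: user.id, name: user.name };
}

// The profile endpoint may legitimately expose more.
function toPublicProfile(user: User) {
  return { name: user.name, email: user.email, bio: user.bio };
}
```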

[Figure: How an LLM scanner reasons about endpoint context while a rule-based scanner treats each response identically]


"But can't I just write better rules?"

Yes, to a point. You can write a rule that checks for IDOR by making the same request with two different auth tokens and comparing responses. Some advanced scanners do this.

But the rules get complicated fast. You need to:

  • Know which endpoints are user-scoped vs. public
  • Know which fields in the response are sensitive
  • Know what "ownership" means for each resource type
  • Know which HTTP methods should require ownership vs. just authentication
  • Handle cases where the response format changes based on the requester's role

For every application, these rules are different. An invoice app has different ownership semantics than a social media app. A healthcare app has different sensitivity rules than an e-commerce store. Writing rules that generalize across all of these is the same problem as writing a program that "understands" what software does — which is exactly what LLMs are good at.

Here's the practical difference. To add IDOR detection to a rule-based scanner, you'd write something like:

# Rule-based IDOR check — brittle, limited
import requests

def check_idor(endpoint, user_a_token, user_b_token):
    # Get a resource as User A
    resp_a = requests.get(endpoint, headers={"Authorization": f"Bearer {user_a_token}"})
    if resp_a.status_code != 200:
        return None

    # Try the same resource as User B
    resp_b = requests.get(endpoint, headers={"Authorization": f"Bearer {user_b_token}"})

    # If User B gets the same data, flag it
    if resp_b.status_code == 200 and resp_b.json() == resp_a.json():
        return Finding(severity="HIGH", type="IDOR")
    return None

This catches the simple case. But it misses partial data leaks (where User B gets some of User A's fields), it doesn't test write operations, it can't detect that a search endpoint is leaking data that should be private, and it has no concept of why the access is wrong — just that two responses matched.
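Catching the partial-leak case means comparing field overlap rather than whole response bodies. A sketch of that refinement; the sensitive-field list is an assumption, and a real check would also recurse into nested objects:

```typescript
// Flag sensitive fields from User A's response that reappear,
// with identical values, in User B's response.
const SENSITIVE_FIELDS = ["email", "billingAddress", "lineItems", "amount"];

function leakedFields(
  respA: Record<string, unknown>,
  respB: Record<string, unknown>
): string[] {
  return SENSITIVE_FIELDS.filter(
    (field) =>
      field in respA &&
      field in respB &&
      JSON.stringify(respA[field]) === JSON.stringify(respB[field])
  );
}
```

Even this refinement still needs someone to curate which fields count as sensitive per application, which is exactly the part that doesn't generalize.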

The LLM approach doesn't need a pre-written rule for each pattern. It reads the response, understands what the data represents, and evaluates whether the access makes sense given the requester's identity. When the rules are "understand the application," an LLM is the right tool.


Where LLMs are not better

We'd be lying if we said LLMs are better at everything. They're not. Here's where traditional scanners still win:

Speed on known patterns. ZAP can blast through 10,000 SQLi payloads in the time it takes an LLM to reason about one endpoint. For known vulnerability signatures, brute force is faster.

Deterministic results. Run ZAP twice, get the same output. LLMs can produce slightly different results on repeated scans. We mitigate this with structured prompting and validation layers, but it's a real tradeoff.

Deep injection testing. Rule-based scanners carry decades of injection payloads — WAF bypasses, encoding tricks, polyglot payloads. LLMs know about these techniques conceptually, but a curated payload list is more thorough for pure injection testing.

This is why Flowpatrol doesn't throw out traditional scanning. We use both. Pattern matching handles the known signatures. The LLM handles the reasoning — the access control checks, the business logic testing, the "does this make sense" analysis that no rule can encode.


What this looks like in practice

When you paste a URL into Flowpatrol, here's what actually happens:

First, the engine reads your app. Not just the HTML — the navigation structure, the forms, the API endpoints, the authentication flow. It builds a map of what your app does and how it works.

Then it thinks about what could go wrong. For each endpoint, it generates test cases based on the specific context. An invoicing endpoint gets tested for negative amounts and cross-user access. A file upload endpoint gets tested for type bypass and path traversal. A user profile endpoint gets tested for mass assignment and privilege escalation.

Then it runs the tests and evaluates the results. Not against a list of known-bad patterns — against its understanding of what the application should do. If the response doesn't match what a secure application would return, it's a finding.

Finally, it writes a report in plain language. Not "parameter reflection in HTTP response." Instead: "User B can read User A's invoices by changing the ID in the URL. The endpoint at /api/invoices/[id] returns billing address, line items, and payment status without checking ownership."

The whole thing takes minutes. And it catches the stuff that would otherwise require a human pentester sitting down with your app for a day.


What you should do with this

You don't need to understand how LLMs work to benefit from this. Here's the practical takeaway:

  1. Don't rely on one type of scanner. If you've only run ZAP or Nikto against your app, you've checked for injections and misconfigurations. That's necessary but not sufficient. The access control and business logic bugs are still there, untested.

  2. Test with multiple users. The simplest security test you can run manually: log in as User A, copy a resource URL, log in as User B, paste the URL. If User B sees User A's data, you have a problem. Do this for every endpoint that returns user-specific data.

  3. Check your write endpoints too. Read access is one thing, but can User B modify User A's data? Can a regular user change their role to admin? Test your PUT, PATCH, and DELETE routes with the same two-user approach.

  4. Scan before you ship. Flowpatrol runs both pattern matching and LLM-powered analysis on every scan. Paste your URL, get a report that covers injections and logic flaws. Five minutes, and you know exactly where you stand.
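The two-user checks in steps 2 and 3 are easy to script. Here's a minimal sketch with an injectable fetcher so the logic stays testable; the URL and token handling are assumptions, and a real PUT would also need a request body:

```typescript
type Fetcher = (
  url: string,
  init: { method: string; headers: Record<string, string> }
) => Promise<{ status: number }>;

// Replay reads and writes against a resource owned by User A,
// authenticated as User B. Anything but a denial is a finding.
async function twoUserCheck(
  fetcher: Fetcher,
  resourceUrl: string,
  tokenB: string
): Promise<string[]> {
  const findings: string[] = [];
  for (const method of ["GET", "PUT", "DELETE"]) {
    const res = await fetcher(resourceUrl, {
      method,
      headers: { Authorization: `Bearer ${tokenB}` },
    });
    if (![401, 403, 404].includes(res.status)) {
      findings.push(`${method} ${resourceUrl} returned ${res.status}, expected a denial`);
    }
  }
  return findings;
}
```

Run it once per user-scoped resource type and you've covered the most common access control failures by hand.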

The tools you build with are getting smarter every month. The tools that test your builds should be too.
