We ran 300 white-box attacks against BabySea before launch

Date: May 3, 2026
By: Randy Aries Saputra


The WAF is the shield. The invariants are the fortress.


Before launch in March 2026, I wanted to know what would break.

Not from a checklist. Not from a scanner. Not from a happy-path test. I wanted to test BabySea the way it could fail in production.

BabySea is an execution control plane for generative media. Developers send image and video workloads through one API, and BabySea handles the lifecycle underneath: authentication, request validation, provider selection, failover, credit reservation, credit settlement, webhooks, file delivery, usage logs, health checks, and regional execution across US, EU, and APAC.

That means the dangerous bugs are not ordinary bugs.

A failure in BabySea could become a credit leak, a duplicate charge, a stuck reservation, a cross-account data exposure, a forged webhook, an SSRF hole, a broken region boundary, or provider abuse.

So I decided to attack it before customers could.

The night I asked Opus to break BabySea

Late at night, before launch, I did something uncomfortable.

I called Opus 4.6 and gave it one job: understand BabySea well enough to attack it. Not from the outside. Not like a random scanner. Not by guessing endpoints and throwing generic payloads at the API.

I wanted a white-box red-team exercise.

So first, I gave it time.

For around 30 minutes, Opus explored the BabySea codebase, UI, API routes, database schema, billing logic, generation lifecycle, regional setup, and infrastructure boundaries. I wanted it to understand the system the way an internal security engineer would understand it.

BabySea was no longer just a product experiment at that point. It had become infrastructure: API keys, async image and video workloads, provider failover, credit reservation, credit settlement, webhooks, file storage, usage logs, and regional deployments across US, EU, and APAC.

After the exploration, Opus came back with the line:

"I'm ready, Captain."

That was when I gave it the real instruction.

I told it to act as a senior offensive security engineer and red-team lead. The objective was to design a rigorous, authorized, 300-attempt security assessment against BabySea's live production infrastructure before launch.

The rules were strict:

  • Three regions: US, EU, and APAC.
  • Three real regional API keys.
  • Exactly 100 attacks per region.
  • The same methodology repeated across all three regions.
  • Real production API surfaces.
  • Real generation lifecycle paths.
  • Real credit and billing state.

I did not ask it to be gentle.

I asked it to test the things that actually break infrastructure: TOCTOU credit races, concurrent deduction bypasses, generation ID spoofing, HTTP request smuggling, protocol desync, strict schema bypasses, Zod type confusion, JSON smuggling, deeply nested payloads, SQL and NoSQL injection, cross-region token replay, JWT and HMAC forgery, BOLA, IDOR, SSRF through image URLs, webhook signature spoofing, DLQ abuse, cron endpoint probing, and multi-route DoS pressure.

The point was simple:

If BabySea could break, I wanted it to break in my own controlled test first.

Then Opus designed the attack suite.

It did not produce one random script. It produced a full attack harness.

At the center was an orchestrator that could run the complete 300-attack suite, or narrow execution by region or category. It generated structured JSON results and an executive summary, with support for dry runs, single-region tests, and single-category tests.

The attacks were divided into six categories.

What one safe failure looked like

The public report does not include raw payload scripts or reusable exploit automation, but one sanitized example from the original run is worth walking through.

CAT1-019 tested whether a caller could override account identity with X-Account-Id while using a valid API key. It returned 200, which can look strange if the only metric is "blocked or not blocked." It was actually the safe result: the header was ignored, and the response was scoped to the account derived from the API key.

Text
request credential -> bcrypt API key lookup
verified API key -> account_id
X-Account-Id -> ignored
account response -> verified account only
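
To make that concrete, here is a minimal sketch of that resolution order. It is an illustration, not BabySea's actual code: the table names, key layout, and use of node-postgres are all assumptions. The structural point is that identity comes only from the verified key, so a spoofed X-Account-Id header is never even read.

TypeScript
import bcrypt from "bcrypt";
import { Pool } from "pg";

const pool = new Pool(); // connection settings come from the environment (assumption)

// Hypothetical identity resolution sketch, not BabySea's implementation.
// Identity is derived only from the verified API key; caller-supplied
// headers such as X-Account-Id are never consulted.
async function resolveAccount(req: Request): Promise<string | null> {
  const auth = req.headers.get("authorization") ?? "";
  const key = auth.startsWith("Bearer ") ? auth.slice(7) : null;
  if (!key) return null;

  // Illustrative key layout: a public prefix locates the row, then the full
  // key is checked against a bcrypt hash stored in the database.
  const prefix = key.slice(0, 12);
  const res = await pool.query(
    "SELECT account_id, key_hash FROM api_keys WHERE key_prefix = $1 AND revoked_at IS NULL",
    [prefix],
  );
  if (res.rowCount !== 1) return null;

  const ok = await bcrypt.compare(key, res.rows[0].key_hash);
  return ok ? (res.rows[0].account_id as string) : null;
}

Because the spoofed header never enters the lookup, ignoring it and returning the verified account's own data is exactly the expected-safe 200 described above.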

The same ownership boundary showed up in the BOLA tests. CAT1-011, CAT1-012, and CAT1-013 attempted to access, delete, and cancel non-owned objects. Those returned 404 BSE2011, because the object was not visible inside the authenticated account scope.

CAT2-003 tested the financial version of the same idea: triple cancel/refund pressure. In APAC, the request reached application logic after the generation had already succeeded and returned 409 BSE2012; in US and EU, rate limiting stopped it earlier with 429. Both outcomes were safe because no duplicate refund was created.

That is the practical difference between "blocked" and "expected-safe." A request can reach application logic and still be safe if identity, ownership, lifecycle, and ledger invariants refuse to move into an invalid state.

CAT1: Authentication & Authorization

The first category attacked identity and tenant isolation.

It tested missing authentication, malformed Bearer tokens, SQL injection inside token strings, null-byte tricks, cross-region replay, random keys with valid prefixes, API keys in query params, BOLA against content lookup/delete/cancel, CRLF header injection, fake JWTs, spoofed account headers, IP spoofing, key-prefix timing, and oversized Authorization headers.

The question was not just whether BabySea returned 401 for a missing key.

The deeper question was:

Can anyone trick BabySea into trusting the wrong identity?

For a multi-tenant API, identity must not come from user-controlled headers. It must not come from route parameters. It must not come from a region-mismatched token. It must come from verified credential material, mapped to the right account, in the right region.

If a caller can influence account identity through X-Account-Id, a path parameter, a query string, or a replayed key from another region, the system is not multi-tenant. It is pretending.

That is why CAT1 mattered.

CAT2: Financial & State Logic

The second category attacked the part I cared about most: the credit and generation state machine.

It tested concurrent generation races, heavier TOCTOU pressure, triple cancel/refund races, spoofed generation IDs, cross-account cancel attempts, negative output counts, zero output counts, integer overflow pricing inputs, empty provider order, estimate/generate bait-and-switch, delete-then-cancel, concurrent estimate/generate timing, and invalid video duration or resolution.

The question was:

Can anyone break the ledger?

AI generation billing is not checkout. It is settlement.

A user starts a generation. Credits are reserved. The provider may succeed, fail, timeout, retry, call back late, or never call back. The user may cancel. The client may retry. Another provider may take over. The final cost may differ from the estimate. The state machine has to settle correctly anyway.

If credit reservation is implemented as read-then-write, concurrent requests can spend the same balance.
If refund logic is not idempotent, one generation can be refunded multiple times.
If cancel does not verify ownership, one account can manipulate another account's billing state.
If estimate and generate share mutable pricing state incorrectly, a customer can estimate cheap and execute expensive.

This is where many systems fail because billing logic looks simple until async execution, retries, and concurrency enter the room.
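
For the read-then-write race specifically, one common fix is to collapse the balance check and the deduction into a single conditional statement, and to let a uniqueness constraint carry idempotency. A minimal sketch, assuming hypothetical table and column names and node-postgres for illustration, not BabySea's actual schema or code:

TypeScript
import { Pool } from "pg";

const pool = new Pool(); // connection settings come from the environment (assumption)

// Hypothetical atomic credit reservation sketch, not BabySea's implementation.
// The balance check and the deduction happen in one statement, so two
// concurrent requests cannot both spend the same credits.
async function reserveCredits(
  accountId: string,
  generationId: string,
  cost: number,
): Promise<boolean> {
  const res = await pool.query(
    `UPDATE accounts
        SET balance = balance - $2
      WHERE id = $1 AND balance >= $2`,
    [accountId, cost],
  );
  if ((res.rowCount ?? 0) === 0) return false; // insufficient balance, nothing deducted

  // Assumed unique constraint on (generation_id, entry_type): a retried
  // reserve becomes a no-op instead of a second deduction.
  await pool.query(
    `INSERT INTO credit_ledger (generation_id, account_id, entry_type, amount)
     VALUES ($1, $2, 'reserve', $3)
     ON CONFLICT (generation_id, entry_type) DO NOTHING`,
    [generationId, accountId, cost],
  );
  return true;
}

Two concurrent requests can both pass an application-level balance read, but only one can win the conditional UPDATE; the other deducts nothing.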

For BabySea, this category was the real test.

CAT3: Input Validation & Injection

The third category attacked the request body boundary.

It tested SQL injection, NoSQL injection, stored XSS, SSTI, command injection, huge prompts, deeply nested JSON, 10,000-key payloads, null bytes, duplicate JSON keys, XML body confusion, form-encoded body bypasses, empty bodies, array bodies, prototype pollution, Zod type confusion, model path traversal, invalid provider names, provider order object confusion, oversized provider lists, RTL spoofing, oversized JSON, and Log4Shell-style probes.

The question was:

Can malformed input reach the execution layer?

For a generative-media API, inputs are not uniform. Every model has different parameters. Some take images. Some take arrays. Some take durations. Some take resolutions. Some take ratios. Some take provider order. Some take audio flags. Some take model-specific fields.

That makes validation a first-class security boundary.

If malformed input reaches provider execution, the system has already lost control.

The request boundary must reject unsupported types, oversized payloads, unknown fields, invalid provider order, schema confusion, path traversal, and body parser tricks before they reach billing or provider code.
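
As one illustration of that boundary, a strict Zod schema can reject unknown fields, wrong types, and out-of-range values before the request touches pricing or provider code. The field names and limits below are hypothetical, not BabySea's real request contract:

TypeScript
import { z } from "zod";

// Hypothetical request schema sketch (not BabySea's real contract).
// .strict() rejects unknown keys instead of silently dropping them,
// which closes off type-confusion and mass-assignment style tricks.
const generateRequest = z
  .object({
    model: z.string().min(1).max(128),
    prompt: z.string().min(1).max(10_000),
    outputs: z.number().int().min(1).max(8), // rejects zero and negative counts
    providerOrder: z.array(z.string().min(1)).max(5).optional(),
  })
  .strict();

// safeParse never throws; a failed parse stops the request before it can
// reach billing, reservation, or provider execution.
function parseGenerateRequest(body: unknown) {
  const parsed = generateRequest.safeParse(body);
  if (!parsed.success) {
    return { ok: false as const, status: 400, issues: parsed.error.issues };
  }
  return { ok: true as const, data: parsed.data };
}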

CAT3 was about proving that the execution layer was not accidentally trusting user input.

CAT4: Edge & Protocol Abuse

The fourth category attacked HTTP itself.

It tested CL.TE smuggling, TE.CL smuggling, TE.TE obfuscation, Content-Length mismatch, Content-Length zero with body, PUT/PATCH/TRACE on unsupported routes, method override headers, Host header injection, X-Forwarded-Host injection, OPTIONS enumeration, HTTP/1.0 downgrade, encoded traversal, and HEAD probing.

The question was:

Can protocol ambiguity confuse the edge or origin?

Most application tests focus on JSON bodies. Real production systems also get attacked through protocol behavior.

Request smuggling is dangerous because it targets disagreement between layers: proxy, edge, runtime, and application server. If one layer interprets request boundaries differently from another, an attacker may smuggle a second request through the first one.

Host header injection is dangerous because many applications use host information to build links, redirects, callbacks, and webhook URLs.

Method override is dangerous because a harmless-looking POST can become a destructive DELETE if middleware honors the wrong header.
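
For the method-override case specifically, the safest posture is usually to refuse the override headers outright. A small hedged sketch; the header list and function name are illustrative, not BabySea's middleware:

TypeScript
// Hypothetical middleware sketch, not BabySea's code. It refuses to honor
// method-override headers so a harmless-looking POST cannot be promoted
// to DELETE by any downstream handler that trusts them.
const OVERRIDE_HEADERS = [
  "x-http-method-override",
  "x-method-override",
  "x-http-method",
];

export function rejectMethodOverride(req: Request): Response | null {
  for (const name of OVERRIDE_HEADERS) {
    if (req.headers.get(name) !== null) {
      return new Response("method override not supported", { status: 400 });
    }
  }
  return null; // no override headers present; continue normal handling
}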

This category tested whether BabySea's edge and runtime behavior could be confused into reaching unintended application paths.

CAT5: SSRF & File Access

The fifth category attacked the media input surface.

It tested localhost URLs, 127.0.0.1, cloud metadata endpoints, internal IP ranges, file://, gopher://, parser confusion, path traversal in generation IDs, predictable generation ID access, and redirect-chain SSRF.

The question was:

Can user-supplied media URLs become internal network probes?

This is critical for generative-media infrastructure.

Image and video workloads often accept external input files. That means the platform may validate, inspect, proxy, fetch, or forward user-supplied URLs. If that path is not controlled, an image input can become SSRF.

The dangerous cases are not only obvious localhost URLs. Attackers also use parser confusion, redirects, private IP ranges, metadata endpoints, non-HTTP schemes, and path traversal attempts.

If file validation or provider preparation touches a URL, it must treat that URL as hostile.
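
One way to treat it as hostile, sketched under assumptions (Node's resolver, IPv4-only checks, a hypothetical function name), is to allow only HTTP and HTTPS, resolve the hostname yourself, and reject loopback, link-local, metadata, and private ranges before anything is fetched. Redirect chains then need the same check on every hop, for example by disabling automatic redirects and re-validating each Location.

TypeScript
import { isIP } from "node:net";
import { lookup } from "node:dns/promises";

// Hypothetical SSRF guard sketch, not BabySea's implementation.
// IPv6 and more exotic ranges are omitted for brevity; a real guard needs
// them, plus re-validation on every redirect hop.
const PRIVATE_V4 = [
  /^0\./, /^10\./, /^127\./, /^169\.254\./, /^192\.168\./,
  /^172\.(1[6-9]|2\d|3[01])\./,
];

async function assertSafeMediaUrl(raw: string): Promise<URL> {
  const url = new URL(raw); // throws on malformed input
  if (url.protocol !== "https:" && url.protocol !== "http:") {
    throw new Error("unsupported scheme"); // blocks file://, gopher://, etc.
  }
  const host = url.hostname;
  const addresses = isIP(host)
    ? [host]
    : (await lookup(host, { all: true })).map((a) => a.address);
  for (const addr of addresses) {
    if (addr.includes(":")) {
      throw new Error("IPv6 target rejected in this sketch");
    }
    if (PRIVATE_V4.some((re) => re.test(addr))) {
      throw new Error("private, loopback, or metadata-range address");
    }
  }
  return url;
}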

CAT5 tested whether BabySea did that.

CAT6: Webhook & Infrastructure

The sixth category attacked the async infrastructure layer.

It tested forged provider callbacks, empty signatures, missing signatures, spoofed provider completions, stale and future timestamps, massive webhook bodies, cron brute force, unauthenticated cron access, information disclosure probes, burst rate-limit tests, rate-limit scope checks, webhook SQL injection, DLQ stress, and concurrent blasts across multiple routes.

The question was:

Can anyone forge state transitions or exhaust the infrastructure?

For an async execution system, webhooks are not just notifications. They are part of the state machine.

A forged webhook could become a false success, false failure, false refund, false charge, or fake output file. A replayed webhook could apply old state to a new lifecycle. A weak cron endpoint could trigger cleanup or retention behavior. A DLQ without limits could become a storage exhaustion surface.
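
A hedged sketch of the signature-plus-timestamp check that blocks most of those cases. The header names, tolerance window, and signing scheme here are assumptions, not any provider's actual contract:

TypeScript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical webhook verification sketch, not BabySea's implementation.
const TOLERANCE_MS = 5 * 60 * 1000; // reject stale and future timestamps

function verifyWebhook(
  rawBody: string,
  signatureHex: string,
  timestamp: string, // assumed to be epoch milliseconds sent by the provider
  secret: string,
): boolean {
  const ts = Number(timestamp);
  if (!Number.isFinite(ts) || Math.abs(Date.now() - ts) > TOLERANCE_MS) return false;

  // Sign the timestamp together with the body so a captured body cannot be
  // replayed later with a fresh timestamp.
  const expected = createHmac("sha256", secret).update(`${timestamp}.${rawBody}`).digest();
  const provided = Buffer.from(signatureHex, "hex");
  if (provided.length !== expected.length) return false;
  return timingSafeEqual(provided, expected); // constant-time comparison
}

Replay inside the tolerance window still needs an idempotency check keyed by a delivery or event ID, so the same valid callback applied twice cannot move the lifecycle twice.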

This category tested the infrastructure paths around the main API, not only the public generation route.

That was the attack design. At that point, I was not trying to prove BabySea was safe.

I was trying to find where it was unsafe.

The result

The full assessment executed 300 attack attempts across US, EU, and APAC. Each region received 100 attempts using the same methodology.

The public result:

Metric                       Result
Total attack attempts        300
Regions tested               3
Attack categories            6
Directly blocked attempts    213
Expected-safe behavior       87
Unexpected results           0
Runtime errors               0
Critical findings            0
High findings                0
Medium findings              0
Low findings                 0

The category breakdown mattered because each class mapped to a different control boundary:

Category  Focus                              Blocked  Expected-safe  What held
CAT1      Authentication and authorization   51       9              API key identity, RLS, header ignore
CAT2      Financial and state logic          24       21             Rate limits, lifecycle checks, ledger safety
CAT3      Input validation and injection     48       27             Zod schemas, body parsing, model constraints
CAT4      Edge and protocol abuse            30       15             Method rejection, edge/runtime protocol rules
CAT5      SSRF and file access               30       0              URL and file-access controls
CAT6      Webhook and infrastructure abuse   30       15             Webhook boundaries and infrastructure guards

The assessment produced no unexpected findings within the tested scope.

That does not mean BabySea is perfectly secure. No serious security report should claim that. It means this attack suite did not produce a successful bypass, state corruption, cross-account access issue, invalid refund, SSRF exposure, forged webhook transition, or protocol-level unsafe behavior during the test.

The most important result was not the number 300.
The most important result was the ledger.

The ledger held

For BabySea, credit settlement is one of the core invariants.

Every successful generation should create a clean lifecycle:

Text
reserve ➜ provider execution ➜ charge

If the generation fails or is safely canceled, the system should resolve through refund logic without double refunding or leaking reserved credits.
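
Here is a minimal sketch of what "without double refunding" can look like at the database layer, again with hypothetical table, status, and constraint names rather than BabySea's schema. The lifecycle transition and the ledger entry both act as guards, so a triple cancel/refund race settles to at most one refund:

TypeScript
import { Pool } from "pg";

const pool = new Pool(); // connection settings come from the environment (assumption)

// Hypothetical idempotent refund sketch, not BabySea's implementation.
async function refundGeneration(generationId: string, amount: number): Promise<boolean> {
  // Only one caller can move the generation into 'refunded'; everyone else
  // sees zero affected rows and stops here.
  const moved = await pool.query(
    `UPDATE generations
        SET status = 'refunded'
      WHERE id = $1 AND status IN ('failed', 'canceled')`,
    [generationId],
  );
  if ((moved.rowCount ?? 0) === 0) return false; // already refunded, or not refundable

  // Assumed unique constraint on (generation_id, entry_type): a retried
  // refund insert becomes a no-op instead of a second credit.
  await pool.query(
    `INSERT INTO credit_ledger (generation_id, entry_type, amount)
     VALUES ($1, 'refund', $2)
     ON CONFLICT (generation_id, entry_type) DO NOTHING`,
    [generationId, amount],
  );
  return true;
}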

During the assessment, successful generations were not just mocked. The test created real workload pressure and then checked the resulting credit state.

The public summary:

Region  Test generations  Reserve entries  Charge entries  Ledger result
US      10                10               10              Balanced
EU      10                10               10              Balanced
APAC    11                11               11              Balanced

APAC completed one additional generation because a rate-limit timing window allowed one more request through before blocking. That was not a security issue. The important part was that the ledger still balanced.

Every successful generation had a matching reserve and charge.

No orphaned reservations.
No double charges.
No invalid refunds.
No ownership mismatch across account, API key, and generation state.

That result mattered more than a clean scanner score. It validated, within the tested scope, that hostile traffic did not corrupt the business-critical state machine.

Why expected-safe behavior is not the same as a vulnerability

In the report, not every attack was classified as "blocked." Some were classified as expected-safe behavior. That distinction matters.

For example, if a request reaches the application and returns 404 for a non-owned generation, that is not a vulnerability. That is the expected safe behavior. If a spoofed account header is ignored and the API returns the authenticated account's own data, that is also expected safe behavior.

Security testing is not just counting 403 responses.

Sometimes the safest response is:

  • 401 for missing or invalid auth
  • 404 for non-owned objects
  • 400 for invalid input
  • 409 for invalid lifecycle state
  • 429 for rate-limited pressure
  • method rejection for unsupported routes
  • timeout or close for malformed protocol behavior

The important question is whether the system entered an unsafe state.

In this assessment, it did not.

Cross-region consistency

BabySea runs across US, EU, and APAC. So I wanted the same attack methodology repeated across all three regions.

Multi-region systems can drift in subtle ways:

  • one region has a stale migration
  • one region has a different environment variable
  • one region has weaker route config
  • one region has a missing secret
  • one region has different rate-limit timing
  • one region has different provider behavior
  • one region has different edge behavior

The test found five timing-dependent inconsistencies.
They were not security-relevant.

Most came from rate-limit window alignment and generation completion timing. In some cases, one region blocked earlier at the rate-limit layer, while another region let the request reach deeper validation. When the request reached deeper validation, it was still rejected or safely handled.

That is acceptable.

Perfectly identical timing is not the goal.
Equivalent security posture is the goal.

The important result was that each region preserved the same invariants: identity, tenancy, validation, SSRF defense, webhook boundaries, and credit settlement.

The twist: Cloudflare was not fully enforcing yet

Two days after the assessment, I realized something uncomfortable.

The intended traffic path was:

Text
Cloudflare WAF/API Shield
  -> Vercel Edge
  -> Next.js App Router
  -> Application and database invariants

But during the test window, the production API traffic was not fully proxied through Cloudflare yet. The WAF and API Shield rules I expected to be in front of the assessment were not actually enforcing on that path.

The effective path was closer to:

Text
Vercel Edge
  -> Next.js App Router
  -> Application and database invariants

That mattered because the edge still did some platform-level work. Oversized authorization headers died at the Vercel edge with 494. Conflicting Transfer-Encoding behavior was rejected before normal request processing. Some malformed protocol cases never became application events at all.

But the Cloudflare WAF/API Shield layer was not the reason the assessment came back clean.

The blocker breakdown was more interesting:

Blocking layer or outcome   Count  Share
Application validation      63     21.0%
Application auth            81     27.0%
Method not allowed          57     19.0%
Edge rejection              6      2.0%
Client/protocol rejection   6      2.0%

The remaining 87 attempts were not vulnerabilities. They were expected-safe outcomes: requests that either returned safe application responses or reached controlled logic without causing an unsafe state transition.

That changed what the report actually proved. It was not a story about Cloudflare blocking everything before it reached BabySea. It was a story about the application, platform edge, rate limits, and database invariants holding when the custom Cloudflare layer was absent.

Cloudflare was not the reason the ledger balanced.
Cloudflare was not the reason BOLA attempts failed.
Cloudflare was not the reason cross-region keys were rejected.
Cloudflare was not the reason malformed provider order failed validation.
Cloudflare was not the reason forged webhooks failed.
Cloudflare was not the reason SSRF attempts did not expose internal resources.

Those defenses lived across BabySea's authentication, authorization, validation, rate limiting, Supabase RLS, webhook verification, Vercel edge behavior, protocol handling, and credit-settlement layers.

After that discovery, I finished the Cloudflare proxy path and hardened WAF/API Shield as an outer layer. But I did not treat it as the thing that made the system safe. The application and database still had to reject bad auth, preserve tenant isolation, prevent double spend and double refund, reject forged webhooks, and fail safely on malformed input.

That became the real lesson:

The WAF is the shield. The invariants are the fortress.

Defense-in-depth is not a slogan

A lot of teams say "defense in depth."

In practice, many systems still depend on one layer doing too much.

A WAF blocks bad payloads, so application validation gets weak.
Auth middleware checks identity, so database policies get lazy.
Application code checks balances, so SQL invariants are skipped.
Webhook handlers check duplicates in code, so database idempotency is missing.
Global rate limits exist, so per-account concurrency is ignored.

That is not defense-in-depth.
That is defense by hope.

For BabySea, each layer needs to own a separate guarantee.

API keys own identity

The caller does not get to declare who they are through headers or route parameters. Account identity comes from verified API key material.

Region boundaries own sovereignty

A key from one region should not authenticate against another region's data plane.

RLS owns tenant isolation

The application should scope queries correctly, but the database should still enforce account boundaries.

Validation owns request shape

Malformed input should not reach provider execution, credit logic, or storage logic.

Rate limits own request pressure

Burst traffic should not become uncontrolled API load.

Concurrency caps own in-flight workload pressure

A customer should not be able to create unbounded parallel generations simply because the requests are individually valid.

Credit functions own settlement

The database should enforce reserve, charge, refund, and idempotency behavior. Billing invariants should not live only in application memory.

Webhook verification owns state transitions

A callback should not be trusted because it hits the right URL. It must be signed, verified, scoped, and mapped to the correct execution state.

Edge controls own early rejection

The WAF, API Shield, CDN, and runtime edge should reduce abuse before it reaches the app. But they should not be the only line of defense.

That is the model I trust.

What this means for AI infrastructure

Generative AI infrastructure has a different failure shape from normal SaaS. A normal SaaS API often reads and writes relatively simple state.

A generative-media execution layer coordinates multiple unstable systems:

  • users
  • API keys
  • prompts
  • input files
  • model schemas
  • providers
  • provider queues
  • webhooks
  • storage
  • billing
  • rate limits
  • retries
  • regions
  • observability

Every provider behaves differently.
Every model has different parameters.
Every async job can fail in different ways.
Every retry can create duplicate pressure.
Every webhook can arrive late, twice, or never.
Every file input can be a security boundary.
Every billing event must settle correctly.

That is why BabySea exists as infrastructure.

The product is not just "one API for models." The deeper product is predictable execution.

When a developer sends a workload, the system should know how to validate it, estimate it, reserve credits, pick a provider, fail over, store outputs, deliver events, settle credits, expose logs, and recover safely.
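
As a small illustration of the "pick a provider, fail over" step, the shape is roughly this (types and names are hypothetical, not BabySea's internals): providers are tried in order, transient failures move to the next one, and the workload only becomes billable once a provider actually succeeds.

TypeScript
// Hypothetical failover sketch, not BabySea's provider router.
type Provider = {
  name: string;
  run: (prompt: string) => Promise<string>; // resolves to an output URL or ID
};

async function runWithFailover(
  providers: Provider[],
  prompt: string,
): Promise<{ provider: string; output: string }> {
  let lastError: unknown;
  for (const provider of providers) {
    try {
      const output = await provider.run(prompt);
      return { provider: provider.name, output }; // success: safe to settle the charge
    } catch (err) {
      lastError = err; // record the failure and try the next provider in order
    }
  }
  // No provider succeeded: the reservation should resolve through refund logic.
  throw new Error(`all providers failed: ${String(lastError)}`);
}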

That execution layer is where the hard problems live.

So that is where the test focused.

What I did not publish

I published a sanitized public report, not the raw internal artifact.

The public version includes:

  • assessment scope
  • methodology
  • categories
  • result summary
  • region summary
  • control mapping
  • credit-ledger integrity summary
  • limitations
  • defense-in-depth lessons

The public version does not include:

  • real API keys
  • real account IDs
  • real generation IDs
  • raw payload scripts
  • reusable exploit automation
  • internal database dumps
  • sensitive route brute-force details
  • secrets, tokens, or private operational identifiers

That distinction matters.

Security transparency should help customers understand posture. It should not hand attackers a replay guide.

The public result table

Here is the sanitized public summary.

The full sanitized public report is available in the BabySea security reports repository: 2026-03-13 White-Box Red-Team Assessment.

Category  Focus                              Attempts  Public result
CAT1      Authentication and authorization   60        No unexpected findings
CAT2      Financial and state logic          45        Credit ledger remained balanced
CAT3      Input validation and injection     75        Malformed inputs rejected or safely handled
CAT4      Edge and protocol abuse            45        No unsafe request desync reached application state
CAT5      SSRF and file access               30        SSRF and file-access abuse attempts blocked
CAT6      Webhook and infrastructure abuse   45        Forged callbacks and infra probes safely handled
Total                                        300       0 unexpected findings

And the region summary:

Region  Total attempts  Directly blocked  Expected-safe behavior  Unexpected  Errors
US      100             71                29                      0           0
EU      100             71                29                      0           0
APAC    100             71                29                      0           0
Total   300             213               87                      0           0

And the ledger summary:

Region  Reserves  Charges  Balance check  Result
US      10        10       0 difference   Balanced
EU      10        10       0 difference   Balanced
APAC    11        11       0 difference   Balanced

The public conclusion:

Within the tested scope, BabySea's execution layer handled the 300-attempt white-box assessment without unexpected findings. More importantly, the core state invariants held under adversarial pressure.

The real takeaway

I did not run this test to get a nice number. I ran it because BabySea sits in the execution path.

If the system fails, it can fail in ways that matter: credits, files, webhooks, providers, accounts, regions, and customer trust. The result gave me confidence, but not complacency. Security is not a one-time badge. Every new provider, model, route, billing feature, region, webhook event, or SDK behavior creates new surface area.

So the right posture is not "we are secure now."

The right posture is:

We know which invariants matter, and we keep testing them.

For BabySea, those invariants are clear:

  • identity must come from verified API keys
  • account boundaries must hold
  • regional replay must fail
  • malformed input must stop at validation
  • SSRF must not reach internal resources
  • provider callbacks must not be forgeable
  • retries must not create duplicate work
  • cancel/refund flows must be idempotent
  • credit settlement must remain balanced
  • edge controls must help, but not become the only defense

That is what the red-team exercise tested.
That is what held.

The WAF is the shield.
The invariants are the fortress.