How BabySea built a regional event spine on Confluent and AWS
Date: May 11, 2026
By: Randy Aries Saputra
TL;DR. BabySea moves execution facts through a regional Confluent + AWS event spine, but Postgres stays the source of truth and the API never waits on the stream. The result is an event-driven system that is not event-dependent.
This post is not a Kafka comparison, a Confluent advertisement, or a generic streaming tutorial. It is the architecture and boundary discipline behind one specific production decision: how an AI execution control plane can stream operational facts without making customer requests more fragile.
Event infrastructure is useful only if it makes the product more capable without making the request path more fragile.
BabySea is the execution control plane for generative media: one API across image and video models, inference providers, sovereign regions, credit settlement, webhook delivery, provider failover, and adaptive routing. Every generation creates operational facts. A request is accepted. A provider is tried. A timeout happens. Credits are reserved or refunded. A webhook is delivered. A ranking system learns from the outcome.
Those facts need to move.
But they should not decide whether a customer request can run.
That is why we built BabySea's regional event spine on Confluent and AWS, with regional Postgres as the operational source of truth, Confluent Cloud Kafka and Schema Registry as the event contract layer, Databricks Spark Structured Streaming and Delta Lake as the governed consumer path, and Upstash as the hot-path cache for provider-ranking artifacts.
Confluent moves schema-valid execution facts. It does not own product correctness.
That boundary is the whole architecture.
The source of truth stays in Postgres
The easiest way to make an event-driven system brittle is to let the stream become the source of truth for operational state.
BabySea does not do that.
When a customer starts a generation, the authoritative state lives in regional Postgres. That is where generation lifecycle, provider attempts, credit invariants, account boundaries, tenant ownership, and webhook intent are protected. Confluent receives facts after the operational transaction has produced durable intent.
The system contract is deliberately simple:
Postgres owns correctness.
Confluent Cloud Kafka transports facts.
Confluent Schema Registry enforces event contracts.
AWS runs the publisher outside the request path.
Databricks Spark Structured Streaming lands validated events in Delta Lake.
Upstash serves compact ranking artifacts.
The API keeps serving when downstream systems degrade.

That last line matters most. Confluent Kafka is allowed to lag. Confluent Schema Registry is allowed to be unavailable. The Databricks Structured Streaming job is allowed to pause. A cache export is allowed to be stale. None of those conditions should stop the API from accepting a valid generation request or settling its lifecycle correctly.
The event spine is allowed to be behind. It is not allowed to become the reason customer execution fails.
The outbox is the boundary between local correctness and distributed facts
BabySea does not publish directly to Kafka from the request handler.
Instead, the application writes event intent into a transactional outbox in the same regional database that commits the operational state. A separate publisher claims those rows, validates them, publishes to Confluent, and marks them published only after Kafka acknowledgement.
Conceptually, the path looks like this:
Customer request
-> API validates and commits business state
-> database records event intent in the outbox
-> AWS ECS publisher claims pending rows
-> publisher validates envelope, topic, key, type, and schema
-> row is marked published after acknowledgementThe transaction shape is the important part. The outbox write happens with the product write, not after it as a best-effort side effect:
begin;
-- The product row is the source of truth.
insert into public.file_assets (
id,
account_id,
model_identifier,
generation_status,
generation_region
) values (
:generation_id,
:account_id,
:model_identifier,
'processing',
:region
);
-- The event is durable intent to publish, not the authority.
select public.enqueue_event_outbox(
p_region := :region,
p_topic := :generation_topic,
p_event_type := 'generation.requested.v1',
p_partition_key := :generation_id,
p_payload := :safe_event_payload
);
commit;

The publisher then owns distribution. Its job is intentionally narrow:
for await (const batch of claimOutboxBatches(region)) {
for (const row of batch) {
try {
assertRegion(row, region);
assertAllowedTopic(row.topic);
assertSafePartitionKey(row.partition_key);
assertAllowedEventType(row.event_type);
assertNoSensitiveFields(row.payload);
const encoded = await schemaRegistry.encode(row.topic, row.payload);
await producer.send({
topic: row.topic,
messages: [
{
key: row.partition_key,
value: encoded,
},
],
});
await markEventOutboxPublished(row.id);
} catch (error) {
await markEventOutboxFailed(row.id, classifyPublishError(error));
}
}
}

If the publisher is down, rows remain pending. If Confluent is unavailable, rows retry later. If Schema Registry cannot be reached, the publisher fails the row safely and backs off. The request path has already committed the product state and does not wait on the telemetry plane.
The API produces durable intent. The publisher performs distribution. Those are different jobs.
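The claim step referenced above (claimOutboxBatches) is deliberately mundane. A minimal sketch of what it can look like, assuming a node-postgres client and illustrative table and column names rather than BabySea's actual schema:

```ts
import { Pool } from "pg";

const pool = new Pool(); // connection settings come from the environment

// Claim a batch of pending outbox rows for one region.
// FOR UPDATE SKIP LOCKED lets concurrent publisher instances claim disjoint
// rows without blocking each other. Table and column names are illustrative.
async function claimPendingOutboxRows(region: string, batchSize = 100) {
  const { rows } = await pool.query(
    `update event_outbox
        set status = 'claimed', claimed_at = now()
      where id in (
        select id
          from event_outbox
         where region = $1
           and status = 'pending'
         order by created_at
         limit $2
         for update skip locked
      )
      returning id, region, topic, event_type, partition_key, payload`,
    [region, batchSize],
  );
  return rows;
}
```

The SKIP LOCKED detail is what lets several publisher instances drain the same regional outbox without fighting over rows or blocking on each other.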
Regions are part of the product contract
BabySea operates across US, EU, and JP regional surfaces. The event system follows that boundary instead of collapsing everything into one global stream.
Each region has its own Confluent topic surface, Schema Registry subjects, publisher deployment, Databricks Structured Streaming job, Delta bronze table, and routing artifacts. That separation is not cosmetic. Provider behavior is regional. Latency is regional. Cost and reliability profiles are regional. Customer data boundaries are regional. A generation event produced in one region should not quietly become an input to another region's operational loop.
The regional pattern is straightforward:
US region -> US event spine -> US lakehouse catalog -> US ranking cache
EU region -> EU event spine -> EU lakehouse catalog -> EU ranking cache
JP region -> JP event spine -> JP lakehouse catalog -> JP ranking cache

Regional architecture is not a deployment detail for BabySea. It is part of the execution contract.
This is why our topics, schemas, ingestion jobs, and cache artifacts are region-scoped. The exact resource identifiers are an implementation detail; the invariant is the important part.
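To make the invariant concrete without exposing real identifiers, here is a sketch of region-scoped naming with hypothetical conventions; the real names differ, but the point is that region is part of every topic and Schema Registry subject name, not an afterthought:

```ts
type Region = "us" | "eu" | "jp";

// Hypothetical naming helpers: the actual identifiers differ, but region is
// baked into every topic and subject name the publisher and consumers use.
const generationTopic = (region: Region) => `${region}.generation.events`;
const schemaSubject = (region: Region, eventType: string) =>
  `${generationTopic(region)}-${eventType}`;

// A publisher deployed in the EU only ever sees EU names, for example:
// generationTopic("eu")                          -> "eu.generation.events"
// schemaSubject("eu", "generation.completed.v1") -> "eu.generation.events-generation.completed.v1"
```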
Schema Registry keeps the stream from becoming JSON entropy
An event stream without schema discipline eventually becomes distributed JSON entropy.
BabySea uses Confluent Schema Registry so event payloads evolve as contracts, not as ad hoc logs. The event families cover the parts of execution that downstream systems need to observe:
- generation lifecycle
- provider attempt outcomes
- credit reservation and settlement events
- webhook delivery outcomes
- provider health updates
- dead-letter records for contract failures
The envelope matters as much as the payload. Every event needs stable identity, region, type, source, timestamp, and partitioning semantics. The payload then carries product facts needed by analytics, routing intelligence, operational inspection, or downstream consumers.
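In code, that envelope is a small, stable type. A sketch in TypeScript, mirroring the fields of the example payload below:

```ts
// Envelope shape mirrored from the example payload below.
// Every event carries identity, region, type, source, time, and partitioning
// semantics; the product facts live under `data`.
interface EventEnvelope<TData> {
  id: string;              // stable event identity, used for deduplication
  type: string;            // versioned event type, e.g. "generation.completed.v1"
  source: string;          // producing system, e.g. "babysea.api"
  region: "us" | "eu" | "jp";
  occurred_at: string;     // ISO-8601 timestamp
  partition_key: string;   // ordering key, e.g. the generation id
  data: TData;             // schema-governed product facts
}
```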
A generation event payload is shaped like an operational contract, not a log dump:
{
"id": "evt_01hx_sanitized",
"type": "generation.completed.v1",
"source": "babysea.api",
"region": "us",
"occurred_at": "2026-05-10T21:42:00.000Z",
"partition_key": "gen_01hx_sanitized",
"data": {
"generation_id": "gen_01hx_sanitized",
"account_id": "acct_sanitized",
"model_identifier": "openai/gpt-image-1-mini",
"media_type": "image",
"status": "succeeded",
"provider_used": "openai",
"credit_outcome": "charged"
}
}

That example is intentionally boring. It is useful downstream because it is stable, typed, regional, and safe to process. It does not contain the customer prompt, generated files, signed URLs, API keys, webhook secrets, or provider credentials.
The security boundary is equally important. The event spine should not become a second uncontrolled customer-data lake.
BabySea's event payload rules are intentionally strict:
No prompts.
No raw files.
No output URLs.
No webhook secrets.
No API key values.
No unnecessary customer PII.

Those rules are enforced through schema design, publisher validation, review discipline, and production-readiness checks.
Observability should not require leaking customer intent or private artifacts into the event stream.
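The publisher-side payload check (assertNoSensitiveFields in the publisher loop above) is conceptually a recursive denylist over payload keys. A minimal sketch, with an illustrative denylist rather than the production rule set:

```ts
// Illustrative denylist; the production rules are broader and live alongside
// the schema definitions, but the shape of the check is the same.
const FORBIDDEN_KEYS = new Set([
  "prompt",
  "file",
  "output_url",
  "webhook_secret",
  "api_key",
]);

// Walk the payload and reject any event that carries a forbidden field,
// no matter how deeply it is nested.
function assertNoSensitiveFields(payload: unknown, path = "payload"): void {
  if (payload === null || typeof payload !== "object") return;
  for (const [key, value] of Object.entries(payload as Record<string, unknown>)) {
    if (FORBIDDEN_KEYS.has(key)) {
      throw new Error(`forbidden field at ${path}.${key}`);
    }
    assertNoSensitiveFields(value, `${path}.${key}`);
  }
}
```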
AWS runs the publisher outside the product API
The event publisher runs on AWS ECS Fargate, separate from the customer-serving API.
That separation gives us operational control. The publisher can be deployed, paused, scaled, rolled back, inspected, and hardened independently. It has one responsibility: drain the outbox safely.
The publisher validates before publish:
- region
- topic family
- partition key
- event type
- envelope shape
- schema compatibility
- payload safety
It only marks a row as published after Confluent acknowledges the write. Everything else remains retryable or dead-lettered according to the failure mode.
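The retry-versus-dead-letter decision lives in classifyPublishError from the publisher loop. A sketch of that classification; the error names here are hypothetical, but the split is the real rule: contract failures dead-letter, infrastructure failures retry.

```ts
type PublishFailure = "retryable" | "dead_letter";

// Contract failures (bad schema, forbidden payload, wrong region) can never
// succeed on retry, so they are dead-lettered. Infrastructure failures
// (broker unreachable, registry timeout) stay pending and retry later.
function classifyPublishError(error: unknown): PublishFailure {
  if (error instanceof Error) {
    const contractFailure =
      error.name === "SchemaValidationError" || // hypothetical error names
      error.message.startsWith("forbidden field") ||
      error.message.startsWith("wrong region");
    if (contractFailure) return "dead_letter";
  }
  // Default to retryable: a transient outage should never destroy an event.
  return "retryable";
}
```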
The AWS surface around that worker is conventional but important: container runtime, secrets, IAM boundaries, networking, logs, and encrypted staging/checkpoint storage for downstream processing. Those details matter operationally, but the public architecture does not depend on exposing exact names, IDs, or endpoints.
The important behavior is this:
API request succeeds.
Outbox row persists.
Publisher publishes when dependencies are healthy.
Downstream systems catch up.
Customer traffic continues.

That gives BabySea decoupling without hand-waving over reliability.
Confluent feeds Spark Structured Streaming, not the request path
Confluent's primary downstream consumer today is a Databricks Spark Structured Streaming job using the native Kafka source.
Regional Structured Streaming jobs read generation events from Confluent Cloud Kafka, decode Confluent JSON Schema Registry payloads, validate envelope fields, reject malformed or wrong-region records, deduplicate by event identity, and land validated rows into Delta Lake bronze tables.
The notebook logic is deliberately defensive. It treats the stream as an input that must prove it belongs in the regional lakehouse:
from pyspark.sql import functions as F
raw = (
spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", kafka_bootstrap)
.option("subscribe", generation_topic)
.option("startingOffsets", "latest")
.load()
)
decoded = decode_confluent_json_schema(raw, schema_registry_subject)
valid_generation_events = (
decoded
.where(F.col("region") == F.lit(REGION))
.where(F.col("type").isin(
"generation.requested.v1",
"generation.completed.v1",
"generation.failed.v1",
))
.where(F.col("id").isNotNull())
.where(F.col("data.generation_id").isNotNull())
.dropDuplicates(["id"])
)
(
valid_generation_events.writeStream
.format("delta")
.option("checkpointLocation", checkpoint_path)
.outputMode("append")
.toTable(f"{CATALOG}.streaming.bronze_generation_events")
)

Malformed events, wrong-region events, and non-generation records do not become bronze rows. That keeps the lakehouse ingestion path honest: streaming bronze is a validated event substrate, not a mirror of every byte that arrives on Confluent Kafka.
That streaming bronze layer gives the lakehouse a governed event substrate. It can power operational analytics, quality checks, and learning loops without asking the operational database to serve every downstream workload directly.
But the same rule still applies:
Databricks Spark Structured Streaming is on the learning path. It is not on the customer request path.
BabySea's adaptive routing loop learns from execution outcomes. The Databricks lakehouse computes provider-performance and ranking artifacts by model and region after Confluent has delivered validated events. Upstash stores the compact serving artifact. The API reads that cached artifact when a customer asks for fastest routing.
The hot path does not read Kafka. It does not call Databricks. It does not compute online features. It reads a cache value, validates it, filters it against configured providers, and falls back to deterministic provider order if anything is missing, stale, malformed, or unavailable.
That code path is intentionally small:
async function resolveProviderOrder(input: {
model: string;
region: DeploymentRegion;
providerOrder: string;
configuredProviders: ProviderName[];
}) {
const fallbackOrder = input.configuredProviders.join(', ');
if (input.providerOrder !== 'fastest') {
return input.providerOrder;
}
const ranking = await getCachedRanking(input.region, input.model);
const allowed = new Set(input.configuredProviders);
const rankedProviders = ranking?.providersRanked.filter((provider) =>
allowed.has(provider),
);
return rankedProviders?.length ? rankedProviders.join(', ') : fallbackOrder;
}

The cache parser is written with the same posture as the event ingestion path: validate structure, return null on contract failure, and let the caller fall back. A malformed ranking is not an exception path for customer traffic. It is a signal that the intelligence layer should be inspected.
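getCachedRanking itself is not shown above. A sketch of its posture, assuming the Upstash Redis client and a hypothetical key convention; the only contract is that it returns a validated ranking or null:

```ts
import { Redis } from "@upstash/redis";

interface ProviderRanking {
  providersRanked: string[];
  computedAt: string;
}

const redis = Redis.fromEnv(); // reads UPSTASH_REDIS_REST_URL / _TOKEN

// Key convention here is illustrative. The contract is the return type:
// a validated ranking, or null so the caller falls back to static order.
async function getCachedRanking(
  region: string,
  model: string,
): Promise<ProviderRanking | null> {
  try {
    const value = await redis.get<ProviderRanking>(`ranking:${region}:${model}`);
    if (!value || !Array.isArray(value.providersRanked)) return null;
    if (!value.providersRanked.every((p) => typeof p === "string")) return null;
    return value;
  } catch {
    // Cache unavailability is a fallback signal, never a request failure.
    return null;
  }
}
```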
That is the serving contract:
Confluent stream -> Databricks learning layer -> Upstash cache -> API routing

Not:
API request -> Confluent Kafka read -> Databricks query -> routing decision

The event spine helps the system learn. It does not make every request wait for the learning system.
That boundary is what makes adaptive infrastructure safe.
Dead letters are evidence, not shame
Real event systems see malformed records, incompatible changes, regional mismatches, poison messages, transient network errors, and consumer bugs.
Pretending otherwise is not engineering. It is optimism.
BabySea treats dead-letter records as operational evidence. The DLQ is not the first line of defense; validation, schema compatibility, outbox retry state, and lag visibility come first. The DLQ exists for records that cannot safely move forward after those mechanisms have done their job.
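The shape of a dead-letter record matters more than where it is stored. A sketch with illustrative field names; the goal is enough context to diagnose and replay once the cause is fixed, under the same payload safety rules as everything else on the spine:

```ts
// Illustrative dead-letter record shape, not BabySea's exact schema.
// The point is evidence: where the event came from, why it could not move
// forward, and enough identity to replay it later.
interface DeadLetterRecord {
  id: string;                  // original event id
  region: "us" | "eu" | "jp";
  original_topic: string;
  event_type: string;
  failure_class: string;       // e.g. "schema_incompatible", "wrong_region"
  failure_message: string;
  failed_at: string;           // ISO-8601 timestamp
  payload: unknown;            // already subject to the payload safety rules
}
```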
The desired behavior is boring:
Good events keep flowing.
Bad events are isolated.
Operators get evidence.
The API keeps serving.

That boring behavior is exactly what production systems need.
Readiness is checked, not remembered
Distributed systems drift.
One region gets a schema update. Another misses a notebook import. A legacy connector path accidentally returns. A fail-open fallback gets weakened during a refactor. A payload starts carrying a field that should never leave the operational database.
So BabySea treats production readiness as a checked property, not a shared memory.
Our guardrails verify the things that would hurt the system if they drifted:
- regional implementation parity after normalization
- DB-backed event payload fields match schema contracts
- removed connector paths stay removed
- Databricks Structured Streaming ingestion remains aligned with the deployed Delta tables
- schedules are not accidentally paused by default
- payload safety rules stay intact
- fail-open behavior remains true for downstream failures
In distributed systems, consistency is not a vibe. It has to be checked.
The goal is not to prove that Kafka exists. The goal is to prove that adding Kafka did not weaken the execution path.
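One concrete example of a checked property: the fail-open routing rule has a guardrail that forces the cache path to fail and asserts the deterministic order survives. A sketch in vitest style against the resolveProviderOrder shape shown earlier; the mock wiring is illustrative and elided:

```ts
import { describe, expect, it, vi } from "vitest";

// Hypothetical test double for the cache read used by resolveProviderOrder.
// The wiring (module mocking or dependency injection) is elided here; the
// invariant under test is the fail-open behavior itself.
const getCachedRanking = vi.fn();

describe("fastest routing fail-open", () => {
  it("falls back to configured provider order when the ranking cache is unavailable", async () => {
    getCachedRanking.mockResolvedValueOnce(null); // simulate a cache outage

    // resolveProviderOrder is the function shown earlier in this post.
    const order = await resolveProviderOrder({
      model: "openai/gpt-image-1-mini",
      region: "us",
      providerOrder: "fastest",
      configuredProviders: ["openai", "fal"],
    });

    // The request never sees the outage: deterministic order wins.
    expect(order).toBe("openai, fal");
  });
});
```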
What we validated end to end
We validated the implementation across US, EU, and JP with image and video workloads, successful and failed generation states, outbox publication, Confluent topics and schemas, Databricks Spark Structured Streaming ingestion, provider-ranking export, and cache-backed fastest routing.
The validation flow was intentionally ordered instead of synthetic. Each region ran the same media plan: image workloads, video workloads, fastest routing, successful executions, and insufficient-credit failures where the account state made that path valid. No failed request was retried, because retrying would have hidden the failure invariant we wanted to observe.
The high-level result:
| Check | Result |
|---|---|
| Regions | 3 |
| Ordered generation requests | 21 |
| Successful generations | 17 |
| Clean insufficient-credit failures | 4 |
| Generation outbox rows | 14 per region |
| Confluent topic families present | 6 per region |
| Schema subjects present | 5 per region |
| Streaming bronze generation rows | 14 per region |
| Failed request retries | 0 |
The most useful validation cases were not only the happy paths.
One set of requests failed cleanly because the account did not have enough credits. Those requests still produced source-of-truth failure state and failure telemetry, but they did not reach provider execution, did not reserve credits, and did not charge credits.
Another provider path timed out in a separate fail-open check, and the system failed over to the next eligible provider.
Those are the tests that matter. Not just whether events move, but whether the product remains correct when execution takes the messy path.
The validation logic we care about looks more like invariants than screenshots:
expect(summary.requestsTotal).toBe(21);
expect(summary.succeededTotal).toBe(17);
expect(summary.failedTotal).toBe(4);
expect(summary.failedRequestRetryCount).toBe(0);
for (const region of ['us', 'eu', 'jp'] as const) {
expect(regionState[region].confluent.topicsOk).toBe(6);
expect(regionState[region].confluent.subjectsOk).toBe(5);
expect(regionState[region].outbox.publishedGenerationRows).toBe(14);
expect(regionState[region].databricks.bronzeRows).toBe(14);
}
for (const failed of insufficientCreditFailures) {
expect(failed.providerExecutionStarted).toBe(false);
expect(failed.creditReserved).toBe(false);
expect(failed.creditCharged).toBe(false);
expect(failed.failureEventPublished).toBe(true);
}

The failure modes we exercised all reduce to the same fail-open rule:

Insufficient credits -> failure telemetry, no provider execution, no charge
Publisher unavailable -> outbox waits, API continues
Confluent unavailable -> publisher retries, API continues
Structured Streaming paused -> Confluent retains events and the stream catches up later, API continues
Upstash unavailable -> static provider order, API continues
Provider timeout -> failover, API continues when another provider succeeds

The architecture is only useful if it works when something else is having a bad day.
Why Confluent is the right layer
Confluent gives BabySea a managed regional Kafka and Schema Registry surface for durable event movement. That matters because the platform has multiple event families, multiple regional boundaries, and multiple downstream consumers. The operational database should not be responsible for serving every analytics, streaming, validation, and learning workload directly.
Confluent is the right middle layer because it gives us:
- durable regional event transport
- governed schema evolution
- replayable event history within retention
- producer and consumer isolation
- a stable ingestion path into Databricks
- operational visibility around the event spine
But Confluent is valuable only because the boundary is clear.
If the API synchronously depended on Kafka publish, the system would be fragile. If Confluent became the credit or generation source of truth, the system would be confused. If Databricks became a request-path dependency, the learning loop would become a reliability risk. If an Upstash cache miss could break routing, the cache would have become a hard dependency rather than an optimization.
The value is not that BabySea added Kafka.
The value is that BabySea added a regional event spine without weakening the core execution path.
The best infrastructure layers make the system more capable without making the product more fragile.
The broader point
AI infrastructure is becoming a distributed systems problem faster than most teams expected.
Calling models is the easy part. The hard part is controlling execution across providers, regions, failures, credits, webhooks, customer workloads, and learning loops. Every request produces operational knowledge. Every provider attempt is a signal. Every timeout teaches the system something.
But learning from execution requires moving facts safely.
That is what the Confluent spine gives BabySea: a governed, regional, schema-valid way to move execution facts out of the operational core and into the systems that can learn from them.
The product database remains authoritative. Confluent distributes facts. Databricks Spark Structured Streaming lands those facts into Delta Lake. The lakehouse turns them into intelligence. Upstash serves the learned artifact. The API owns request correctness.
That separation is the architecture.
And it is why BabySea can keep improving provider selection, operational analytics, and execution quality without letting the intelligence plane become a liability for customer traffic.
The future of AI infrastructure belongs to systems that can learn continuously and fail safely.
That is what we are building.
Acknowledgement
Thanks to the Confluent team for accepting BabySea into the Confluent for Startups program. The regional event spine described above runs on Confluent Cloud Kafka and Schema Registry, and the program has been a meaningful part of how this layer of BabySea was designed and validated.