How BabySea built adaptive provider selection on Databricks
Date: April 30, 2026
By: Randy Aries Saputra
BabySea is the execution control plane for generative media: one API across 80+ image and video models from 12+ AI labs, served by 8+ inference providers, running in three sovereign regions (US, EU, and APAC). The hard problem is not exposing models. The hard problem is execution: deciding, per request, which provider should actually run a given model in a given region under real latency, reliability, and cost conditions. Static routing breaks down quickly in that environment, and Databricks is the intelligence layer that makes our provider selection adaptive instead of static.
At BabySea, every customer execution produces an attempt record (provider, model, submitted/resolved timestamps, outcome, estimated cost, failover order, and cancellation state), and that attempt stream lands first in our regional Supabase/Postgres systems, because Postgres is the operational source of truth for the product. The deployment region is kept as a regional stack boundary and written into the Databricks Bronze/Gold outputs. From there, Databricks runs the learning loop end to end in production. The architectural decision that matters most is this: we did not build a detached analytics estate beside the product; we made Databricks part of the control loop that improves execution outcomes while keeping the customer-serving path operationally safe.
Databricks is on the learning path, not the request path
That boundary is the whole design. A common failure mode in "smart routing" systems is to let the intelligence plane become a hard dependency for serving traffic, so the first time the ranking service degrades, the customer-visible path degrades with it. We built the opposite system: Databricks computes the ranking, the API consumes a cached export of that ranking in milliseconds, and the API always retains deterministic fallback behavior if Databricks, Model Serving, or cache infrastructure degrades.
Adaptive selection is an enhancement, never a dependency.
The runtime chain is straightforward: customer traffic produces provider-attempt telemetry in regional Supabase/Postgres; Databricks reads that telemetry directly and transforms it into provider-performance and provider-ranking artifacts; those rankings are exported into Upstash; the API reads the exported ranking for generation_provider_order: "fastest" while circuit-breaker logic and failover stay in place; and later customer traffic generates new telemetry that closes the loop. That loop is the core of the design, and every workload makes later workloads smarter.
Here is what the API actually does when a customer asks for "fastest". This is the hot path, not pseudocode:
const deploymentRegion = getDeploymentRegion();
if (providerOrder === 'fastest') {
const ranking = await getCachedRanking(model, deploymentRegion);
const configuredProviders = new Set(Object.keys(providerLookup));
const fromCache = ranking?.providersRanked.filter((p) =>
configuredProviders.has(p),
);
if (fromCache && fromCache.length > 0) {
resolvedOrder = fromCache.join(', ');
providerOrderSource = 'cached-ranking';
} else {
resolvedOrder = converters.providerDisplayOrder
.filter((p) => configuredProviders.has(p))
.join(', ');
providerOrderSource = 'fallback';
}
}
Notice what is not in that code. There is no Databricks SDK call, no MLflow client, and no Mosaic AI Model Serving fetch on the default path. The hot path reads one Upstash Redis-protocol key, filters it against the providers configured for the requested model, and moves on. Everything Databricks does happens upstream of this line.
Zero-ETL ingress with Unity Catalog and Lakehouse Federation
The first Databricks decision was to avoid copying operational data unless we had a compelling reason to. We use Unity Catalog Lakehouse Federation to connect each regional Databricks workspace to its matching regional Supabase/Postgres database through a PostgreSQL foreign connection. That gives us a governed entry point into operational routing telemetry without introducing another ETL system, another CDC estate, or another set of consistency problems between the product database and the learning system. This is not just a convenience feature, it changes the operating model.
Because the federated source sits under Unity Catalog, governance starts at ingress. Federated source columns carry governed tags such as gdpr_status, tenant_boundary, and ai_telemetry, and those tags matter because the routing system is not just reading "logs", it is reading multi-tenant, region-sensitive execution data that will feed training, ranking, analytics, and operational validation. That means the same control plane that reads provider telemetry also knows which fields are tenant-isolating keys, which fields are compliance-sensitive, and which fields are cost or inference telemetry, with lineage originating at the operational source rather than at a copied derivative. For a multi-region AI infrastructure company, that is a cleaner foundation than shipping raw operational tables into an ungoverned analytics sink and reconstructing meaning later.
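To make that concrete, here is a minimal sketch of how such a federated ingress can be declared from a Databricks notebook. The connection name, catalog name, secret scope, host, and tagged column are all illustrative rather than our production values; the shape follows standard Lakehouse Federation DDL.
# Hypothetical names throughout; `spark` is the notebook's SparkSession.
# 1. Declare a PostgreSQL connection to the regional Supabase database.
spark.sql("""
  CREATE CONNECTION IF NOT EXISTS supabase_us
  TYPE postgresql
  OPTIONS (
    host 'db.example-project.supabase.co',
    port '5432',
    user secret('babysea', 'supabase_user'),
    password secret('babysea', 'supabase_password')
  )
""")

# 2. Expose the operational database as a governed foreign catalog.
spark.sql("""
  CREATE FOREIGN CATALOG IF NOT EXISTS babysea_us_federated
  USING CONNECTION supabase_us
  OPTIONS (database 'postgres')
""")

# 3. Governance starts at ingress: tag compliance-sensitive columns.
#    (If your workspace does not support tags on federated columns,
#    the same statement applies to the Bronze copy instead.)
spark.sql("""
  ALTER TABLE babysea_us_federated.public.provider_cost_log
  ALTER COLUMN tenant_id SET TAGS ('tenant_boundary' = 'true')
""")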
Medallion design for routing intelligence, not generic analytics
From the federated source, we refine the data through a Lakeflow Spark Declarative Pipeline into a Bronze, Silver, Gold medallion on Delta Lake. This is not medallion architecture as ceremony: each layer has a specific role in the routing loop.
Bronze
Bronze lands the federated operational tables into Delta-backed materialized views. Its job is not business logic, its job is isolation and reproducibility: isolate the product database from downstream transformations, preserve lineage, and establish a Delta-native substrate for analytics, training, and export. In our production stack, the most important Bronze input is the attempt-log path that eventually becomes routing intelligence.
Silver
Silver is where the routing contract becomes explicit. Provider attempts are cleaned, typed, normalized, and enriched with routing-relevant fields such as latency, success, wasted-attempt semantics, time buckets, and other derived fields needed for aggregation and training. Silver is where "application logs" stop being logs and become a feature-bearing execution dataset, and if you want reliable model training, reliable provider-performance analytics, or reliable ranking generation, this is the layer that has to be correct.
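As a deliberately simplified illustration of those two layers, here is a sketch of the Bronze and Silver tables in a declarative pipeline. It assumes the dlt Python module, the provider_cost_log source contract named in the open-source section below, and illustrative column derivations ("succeeded" outcome values and the wasted-attempt rule are stand-ins); the production pipeline carries more fields than this.
import dlt
from pyspark.sql import functions as F

# Hypothetical federated source; in production this is the attempt-log table.
SOURCE = "babysea_us_federated.public.provider_cost_log"

@dlt.table(comment="Bronze: land federated rows as-is; isolation, not business logic.")
def bronze_provider_attempts():
    return spark.read.table(SOURCE)

@dlt.table(comment="Silver: typed, enriched execution dataset for ranking and training.")
def silver_provider_attempts():
    attempts = dlt.read("bronze_provider_attempts")
    return (
        attempts
        # Latency from the submitted/resolved timestamps on the attempt record.
        .withColumn(
            "latency_ms",
            (F.col("resolved_at").cast("double") - F.col("submitted_at").cast("double")) * 1000,
        )
        .withColumn("is_success", F.col("outcome") == "succeeded")  # outcome value is illustrative
        # "Wasted" semantics are illustrative here: a failed attempt that forced a failover.
        .withColumn("is_wasted", (F.col("outcome") != "succeeded") & (F.col("failover_order") > 0))
        .withColumn("hour_of_day", F.hour("submitted_at"))
        .withColumn("day_of_week", F.dayofweek("submitted_at"))
    )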
Gold
Gold produces artifacts the rest of the platform can consume. That includes provider-performance aggregates, but the most important output is gold_provider_ranking_by_model, which is the bridge between Databricks and the API.
That Gold table is not a BI table. It is a runtime contract.
gold_provider_ranking_by_model: the bridge between data and execution
gold_provider_ranking_by_model emits one row per model and region, with ranked provider lists, scores, attempt totals, ranking-window metadata, and a computation timestamp. In the current production implementation, the ranking shape combines success rate, wasted-attempt rate, and latency normalization into a single score over a 24-hour window, which gives us a deterministic, explainable, continuously refreshed routing artifact even before any optional real-time model scoring enters the picture.
Conceptually, the production scoring shape is:
score = success_rate * 1.0
- wasted_rate * COST_PENALTY
- latency_p95_norm * LATENCY_PENALTY
with current constants:
COST_PENALTY = 0.5
LATENCY_PENALTY = 0.3
RANKING_WINDOW_HOURS = 24
The reason this design works is simple. First, it gives the serving layer a compact, explicit object to consume: the API does not need raw execution logs, ad hoc joins, or online feature synthesis; it needs a ranking. Second, it lets the ML path and the non-ML path converge on the same abstraction, so whether the ranking is produced from direct aggregates, a registered predictive model, or both, the serving contract remains stable: here is the ordered provider set for this model in this region. That is what makes the system evolvable without destabilizing the API surface.
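Continuing the Silver sketch above into Gold, the aggregation behind gold_provider_ranking_by_model can be approximated as follows. Column names follow the Silver sketch, the per-model latency normalization is one plausible choice rather than the exact production formula, and the region literal stands in for the regional stack boundary.
import dlt
from pyspark.sql import Window, functions as F

COST_PENALTY = 0.5
LATENCY_PENALTY = 0.3
RANKING_WINDOW_HOURS = 24
REGION = "us"  # each regional workspace runs its own copy of the pipeline

@dlt.table(name="gold_provider_ranking_by_model")
def gold_provider_ranking_by_model():
    # Only attempts inside the 24-hour ranking window contribute to the score.
    recent = dlt.read("silver_provider_attempts").where(
        F.expr(f"submitted_at >= current_timestamp() - INTERVAL {RANKING_WINDOW_HOURS} HOURS")
    )

    per_provider = recent.groupBy("model", "provider").agg(
        F.avg(F.col("is_success").cast("double")).alias("success_rate"),
        F.avg(F.col("is_wasted").cast("double")).alias("wasted_rate"),
        F.expr("percentile_approx(latency_ms, 0.95)").alias("latency_p95"),
        F.count("*").alias("attempts"),
    )

    # Normalize p95 latency within each model so the latency penalty is comparable
    # across models with very different baseline latencies (illustrative choice).
    per_model = Window.partitionBy("model")
    scored = per_provider.withColumn(
        "latency_p95_norm", F.col("latency_p95") / F.max("latency_p95").over(per_model)
    ).withColumn(
        "score",
        F.col("success_rate")
        - F.col("wasted_rate") * F.lit(COST_PENALTY)
        - F.col("latency_p95_norm") * F.lit(LATENCY_PENALTY),
    )

    # One row per model: providers ordered by descending score, plus window metadata.
    return (
        scored.groupBy("model")
        .agg(
            F.expr(
                "transform(array_sort(collect_list(named_struct('neg_score', -score, 'provider', provider))), s -> s.provider)"
            ).alias("providers_ranked"),
            F.map_from_entries(F.collect_list(F.struct("provider", "score"))).alias("scores"),
            F.sum("attempts").alias("attempts_total"),
        )
        .withColumn("window_hours", F.lit(RANKING_WINDOW_HOURS))
        .withColumn("computed_at", F.current_timestamp())
        .withColumn("deployment_region", F.lit(REGION))
    )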
MLflow, Models in Unity Catalog, and regional model lifecycles
On top of the medallion, we train a predictive routing model nightly with MLflow and register the resulting artifact in Models in Unity Catalog on a per-region basis (babysea_us.ml.predictive_routing, babysea_eu.ml.predictive_routing, babysea_apac.ml.predictive_routing). In the implementation snapshot we documented, the latest READY model versions were US v8, EU v3, and APAC v3.
That regional separation is not cosmetic. We operate across sovereign regions, and provider behavior, customer demand shape, cost conditions, and product constraints do not collapse neatly into a single global policy, so treating US, EU, and APAC as distinct learning domains is cleaner technically and safer operationally.
The current production training notebook uses a value-prediction approach with a scikit-learn GradientBoostingRegressor, not XGBoost. The serving features are intentionally compact (provider, model, hour_of_day, day_of_week) and the output is one scalar predicted value per provider candidate where higher means better. That score is not exposed directly to customers, it is one input into execution ordering, and the customer-facing abstraction remains stable even while the predictive layer evolves.
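For readers who want the shape of that notebook, a hedged sketch is below. The feature set, regressor, and per-region registry name come straight from the description above; the training table, the derivation of the scalar value label, and the encoder choice are illustrative stand-ins rather than our exact production code.
import mlflow
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

mlflow.set_registry_uri("databricks-uc")  # register into Models in Unity Catalog

FEATURES = ["provider", "model", "hour_of_day", "day_of_week"]

# Hypothetical training frame: one row per Silver attempt with a scalar "value"
# label (higher is better), e.g. success minus wasted-attempt and latency penalties.
df = spark.table("babysea_us.ml.training_attempts").toPandas()
X, y = df[FEATURES], df["value"]

pipeline = Pipeline([
    ("encode", ColumnTransformer(
        [("categorical", OneHotEncoder(handle_unknown="ignore"), ["provider", "model"])],
        remainder="passthrough",
    )),
    ("regressor", GradientBoostingRegressor()),
])

with mlflow.start_run():
    pipeline.fit(X, y)
    mlflow.sklearn.log_model(
        pipeline,
        artifact_path="model",
        input_example=X.head(5),
        # Per-region model names, as described above.
        registered_model_name="babysea_us.ml.predictive_routing",
    )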
Mosaic AI Model Serving as an optional refinement layer
We provision regional Mosaic AI Model Serving endpoints for controlled rollout, which gives us a low-latency scoring path that can re-score candidate providers after cached ranking and health-based reordering when explicitly enabled. But the important point is not that we can call Model Serving, the important point is that we do not need to. The API code makes that explicit, and three independent kill switches gate every Model Serving call:
const ENABLED = process.env.DATABRICKS_PREDICTIVE_ROUTING_ENABLED === 'true';
const TRAFFIC_PCT = clamp(
Number(process.env.DATABRICKS_PREDICTIVE_ROUTING_TRAFFIC_PCT ?? '0'),
0,
100,
);
const TIMEOUT_MS = Number(
process.env.DATABRICKS_PREDICTIVE_ROUTING_TIMEOUT_MS ?? '50',
);
export async function reorderByPredictedValue(sequence, ctx) {
if (sequence.length <= 1) {
return { reordered: sequence, decision: 'noop' };
}
if (!ENABLED) {
return { reordered: sequence, decision: 'disabled' };
}
if (TRAFFIC_PCT < 100 && Math.random() * 100 >= TRAFFIC_PCT) {
return { reordered: sequence, decision: 'sampled-out' };
}
// timeout-bounded fetch to Mosaic AI Model Serving
// fallback on any error
}
The call is disabled by environment variable, sampled out by traffic percentage, and bounded by timeout; any failure is absorbed and the original sequence is returned.
The difference between ML-enhanced infrastructure and ML-fragile infrastructure is whether the model endpoint is a hard dependency for every request. We do not let it become one.
This is the architectural discipline that keeps the system safe.
Lakeflow Jobs as unattended orchestration
The control loop is operationalized through Lakeflow Jobs. In production, the daily schedule is 02:30 UTC to export the Gold ranking into Upstash and 03:00 UTC to retrain and re-register the predictive model. Across three regions, that gives us six unattended jobs coordinating the routing loop: export and training in US, EU, and APAC.
This schedule is not arbitrary. The export cadence, ranking window, and cache TTL are chosen as a system: a 24-hour Gold ranking window, a 48-hour Upstash TTL, and daily export cadence. That 48-hour TTL being deliberately longer than the 24-hour refresh cadence is what gives the system fail-open resilience, because yesterday's valid ranking can continue to serve even if today's job does not run. This is a subtle but important lesson in production ML systems: freshness is not the only variable that matters, graceful degradation matters just as much.
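For context, the 02:30 UTC export job is conceptually small. A hedged sketch of its shape follows, using the Gold table and the key and payload format described in the next section; the Redis client setup, environment variable names, and catalog path are illustrative.
import json
import os

import redis  # Upstash speaks the Redis protocol, so a standard client works

r = redis.Redis(
    host=os.environ["UPSTASH_HOST"],          # illustrative variable names
    port=int(os.environ.get("UPSTASH_PORT", "6379")),
    password=os.environ["UPSTASH_PASSWORD"],
    ssl=True,
)

REGION = os.environ.get("DEPLOYMENT_REGION", "us")
TTL_SECONDS = 48 * 3600  # deliberately longer than the daily refresh cadence

# Hypothetical catalog path; one Gold row per model in this region.
rows = spark.table("babysea_us.gold.gold_provider_ranking_by_model").collect()

for row in rows:
    payload = {
        "providers_ranked": list(row["providers_ranked"]),
        "scores": dict(row["scores"]),
        "attempts_total": row["attempts_total"],
        "window_hours": row["window_hours"],
        "computed_at": row["computed_at"].isoformat(),
    }
    # SET with EX is what gives the 48-hour fail-open window described above.
    r.set(
        f"predictive:ranking:{REGION}:{row['model']}",
        json.dumps(payload),
        ex=TTL_SECONDS,
    )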
Upstash export and the hot-path contract
The API never calls Databricks on the customer request path. Instead, a Lakeflow Job exports each Gold ranking row into Upstash using a key of the form predictive:ranking:<region>:<model>, and the value includes providers_ranked, scores, attempts_total, window_hours, and computed_at. Example shape:
{
"providers_ranked": ["cloudflare", "replicate", "fal"],
"scores": {
"cloudflare": 0.81,
"replicate": 0.42,
"fal": -0.05
},
"attempts_total": 137,
"window_hours": 24,
"computed_at": "2026-04-29T02:31:15Z"
}
That payload is not an implementation detail, it is a contract between systems written in different languages and operating at different latencies. The TypeScript cache reader mirrors the shape exactly and is written to fail open by construction:
function parseRanking(raw: unknown): CachedProviderRanking | null {
if (!raw || typeof raw !== 'object') return null;
const r = raw as RawRanking;
if (!Array.isArray(r.providers_ranked)) return null;
const providersRanked = r.providers_ranked.filter(
(p): p is ProviderName => typeof p === 'string' && p.length > 0,
);
if (providersRanked.length === 0) return null;
// validate scores
// return null on any structural issue
}
Every structural failure returns null rather than throwing, and null is the signal to fall back to deterministic provider order.
The intelligence layer is allowed to be absent. It is not allowed to take down a request.
Operationally, the split is clean: Databricks is responsible for learning, Upstash's Redis-protocol cache is responsible for serving the learned artifact cheaply, and the API is responsible for request execution and last-mile resilience.
That separation lets each subsystem do one thing well.
Databricks SQL on the same governed substrate
Databricks SQL serverless warehouses sit on the same governed catalogs and materialized outputs that power training and export. We use them for internal analytics, validation, and operational inspection, which means the numbers engineering sees, the numbers product sees, and the numbers the routing system acts on are not computed from disconnected pipelines, they are different consumers of the same governed substrate. For infrastructure teams, that is one of the real advantages of using the lakehouse model properly: analytics, orchestration, governance, and model lifecycle can share one set of contracts instead of spawning parallel stacks.
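As a flavor of what that looks like day to day, here is the kind of freshness check such inspection might involve; the query is ordinary SQL over the Gold contract described earlier, and the catalog path and thresholds are illustrative.
# Illustrative operational check: flag per-model rankings that are stale or thin.
# The same SQL runs unchanged on a serverless SQL warehouse or in a notebook,
# because both read the same governed Gold table.
stale = spark.sql("""
    SELECT model, deployment_region, attempts_total, computed_at
    FROM babysea_us.gold.gold_provider_ranking_by_model
    WHERE computed_at < current_timestamp() - INTERVAL 24 HOURS
       OR attempts_total < 20   -- illustrative minimum-sample threshold
    ORDER BY computed_at
""")
display(stale)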
Fail-open as a first-class systems property
The most important design choice in the stack is probably not the model, the ranking function, or the federation layer, it is fail-open behavior. If Upstash is unavailable, the system falls back to deterministic provider order. If the ranking key is missing or malformed, it falls back. If Mosaic AI Model Serving is disabled, slow, or returning bad responses, it falls back. If Databricks compute is paused entirely, the API still serves traffic until the cache expires, and after that continues to serve through deterministic order and the normal provider failover loop.
Customer traffic is never hostage to the intelligence layer.
This sounds obvious, but in practice many teams do not build it this way: they succeed at making routing smarter, but fail at making it safe. For us, adaptive provider selection only makes sense if it improves execution outcomes without increasing execution fragility.
The actual effect: regional, governed, compounding execution quality
Once the loop is live, the effect is simple to describe and difficult to reproduce without the right architecture: every customer workload can make later workloads faster, cheaper, and more reliable. Provider attempts do not disappear after execution, they become training data; training data does not remain trapped in notebooks, it becomes ranked artifacts; ranked artifacts do not remain trapped in dashboards, they alter later execution decisions. All of it remains regionalized, governed, lineage-aware, and operationally safe. That is what "Databricks as the intelligence layer" means in our stack: not analytics beside the product, but intelligence inside the execution loop.
Why we open-sourced "adaptive-island"
We also open-sourced the engine under Apache 2.0 as adaptive-island, the reusable pattern behind this architecture. The repository packages the production-grade v0.1 contract we can support publicly: a Supabase provider_cost_log source contract or adapter view, Bronze/Silver/Gold Lakeflow pipelines on Delta Lake, the gold_provider_ranking_by_model scoring contract, a Gold-to-Upstash export job, cache-first Python and TypeScript SDKs, optional MLflow value-model training, an optional Mosaic AI Model Serving entry point, a Databricks Asset Bundle deploy path, and a real-stack smoke harness for Databricks, Supabase, and Upstash. But the open-source project is not a repo dump of our internal production code, it is a generalization of the pattern for the industry.
The OSS boundary is explicit. adaptive-island does not ship request-path Databricks calls, stochastic online exploration, propensity logging, inverse-propensity or SNIPS promotion gates, alternate production caches, alternate warehouses, queues, search indexes, or a managed hosted service. If a capability cannot be traced to the BabySea Supabase ➜ Databricks ➜ Upstash provider-ranking loop and is not implemented in the OSS, it stays out of the public contract. An automated MLflow offline eval and promotion gate is tracked separately in the BabySea implementation guide; it is not a shipped adaptive-island feature today.
What the OSS packages is the production-derived pattern: real feedback from Supabase attempt logs, governed Databricks medallion transforms, deterministic Gold ranking over success, wasted-attempt, and latency behavior, Upstash cache serving, SDK payload validation, provider allowlist filtering, and deterministic fallback. Most teams building multi-provider AI systems independently rediscover the same constraints (heterogeneous APIs, unstable latency distributions, inconsistent cost surfaces, provider-specific failure modes, no principled feedback loop from real outcomes back into routing) and then over-index on hand-tuned routing logic, brittle failover code, or static preference matrices. adaptive-island packages a better default: treat provider selection as a governed learning system with a cache-served serving contract and explicit fail-open behavior. That is why we believe the OSS matters, it helps the industry skip a category of avoidable reinvention.
What we kept, and what we gave away
Open-sourcing the pattern does not give away the moat.
The pattern is reusable, and it should be reusable. The workload data is not. Neither are the execution graph, customer demand shape, provider mix, product constraints, or regional traffic behavior.
We kept the data flywheel and the customer-facing product, and the community gets the architecture, the contracts, and the operating model. That is the right boundary for this kind of infrastructure company. The proprietary advantage is not that we know medallion architecture exists, or that MLflow can register models, or that rankings can be cached, the proprietary advantage is the workload flowing through the loop and the compound execution quality that loop produces over time. adaptive-island makes the pattern legible and deployable, BabySea keeps the production flywheel.
The broader point
There is a larger lesson here for AI infrastructure. As the industry matures, the winning systems will not be the ones with the most wrappers or the largest configuration files, they will be the ones that convert real execution outcomes into governed, compounding control loops without making the serving path fragile. That is the problem we used Databricks to solve, the pattern we open-sourced, and the direction we think the industry should move.