Why cost-aware compute routing is the next feature your data platform needs
There's a pattern showing up across the data platform teams I pay attention to. Zepto built an internal DataPortal that routes Databricks jobs between Spark clusters and SQL warehouses based on workload metadata. Netflix built Data Bridge, a unified control plane that abstracts execution engines away from users entirely. Uber built data access proxies that route Presto, Spark, and Hive queries to different clusters based on query weight and data location.
Three companies, three different scales, three different stack choices. Same architectural idea: put a routing layer between users and compute, and let the platform decide where each job runs.
This isn't a coincidence. It's what happens when data infrastructure scales past the point where a single compute configuration makes sense for all workloads. And most companies hit that point much earlier than they realize.
This post explores the pattern, what it solves, when you need it, and what it actually takes to implement.
The default: one compute tier for everything
When a data team starts out, they pick one compute setup and run everything on it. A Snowflake warehouse sized for their biggest query. A Databricks cluster configured for their most complex pipeline. An Airflow instance with a fixed worker pool. This is the right call early on. Simplicity has value, and the waste from running small queries on oversized compute is invisible when you have 30 pipelines.
The problem creeps in over months, not days. You go from 30 pipelines to 300. Then to 3,000. The workload mix diversifies. You now have lightweight SQL reporting queries sitting next to heavy PySpark ETL jobs, sitting next to ML feature pipelines, sitting next to real-time aggregations. They all share the same compute, and most of them are getting the wrong amount of resources.
The light queries are over-provisioned (paying for distributed compute they don't need), and the heavy jobs are sometimes under-provisioned (fighting for resources with everything else). You see this in practice as slow morning dashboards, unexpectedly expensive Snowflake bills, or Airflow task queues that back up during peak hours.
Most teams respond with one of two strategies, neither of which scales.
Strategy 1: Manual right-sizing. An engineer audits the top 20 most expensive jobs and adjusts their compute allocation. This works for a quarter, then workloads change, new pipelines get added with default settings, and the problem returns. Manual right-sizing is a recurring operational tax that nobody enjoys paying.
Strategy 2: Cost dashboards. The team deploys a dashboard that shows spend by pipeline, team, or warehouse. Everyone looks at it in the monthly review. Nobody does anything. Cost dashboards create awareness, but awareness without automation doesn't change behavior at scale. I've talked to dozens of data teams about this, and the pattern is consistent: the dashboard exists, the numbers are visible, and the pipelines still run on the same compute tier they were configured with on day one.
The routing pattern
The alternative is compute routing: a layer in your data platform that evaluates each job's characteristics and sends it to the appropriate execution engine automatically.
Zepto's version is built into their Airflow-based DataPortal. When a pipeline runs, the routing system evaluates query pattern, write mode, historical runtime, and data volume. SQL reporting jobs get sent to Databricks SQL Serverless Warehouses with Photon. Complex transformations stay on Spark job clusters. The result was a 78% median runtime reduction on migrated jobs and $35k in annualized cost savings.
Netflix's version is more ambitious. Data Bridge is a full control plane that sits between users and execution engines. Users describe what they want (move data from A to B on this schedule with this transform) and Data Bridge selects the how. The execution layer is hot-swappable, meaning the platform team can upgrade or replace the underlying engine without touching user workflows. At 300,000 jobs per week across 36+ source-destination pairs, manual compute selection would be operationally impossible.
Uber's approach predates both. Their data access proxies route Presto queries to different clusters based on query analysis. Heavy queries go to high-resource clusters. Light queries go to smaller ones. Their Presto router even checks Hive Metastore to determine whether queried tables are on-prem or in the cloud, routing accordingly. At Uber's scale (petabytes daily, tens of thousands of Flink jobs and Kafka topics), they also implemented tiered job priority with different resource guarantees and charge-back rates per tier.
The common elements across all three:
Routing decisions based on job metadata, not user choice. Users don't pick their compute tier. The platform picks based on what the job actually needs.
Separation of user intent from execution details. Users say "what." The platform handles "how." This decoupling is what enables transparent engine migrations and upgrades.
Feedback loops from execution history. Routing rules aren't static. They incorporate past performance data to refine decisions over time. Zepto uses historical runtime. Uber tracks cost-per-query and efficiency metrics through their DataCentral observability platform.
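To make these elements concrete, here's a minimal sketch of what a metadata-driven routing function could look like. The JobProfile fields, tier names, and thresholds are my illustrative assumptions, not any of these companies' actual rules; the point is that the decision comes from job metadata and execution history, never from the user:

```python
from dataclasses import dataclass

@dataclass
class JobProfile:
    """Metadata the router sees per job. All fields are illustrative."""
    query_pattern: str         # e.g. "sql_report", "spark_etl"
    median_runtime_min: float  # feedback loop: pulled from execution history
    input_gb: float            # scanned data volume

def route(job: JobProfile) -> str:
    """Pick a compute tier from job metadata, not user choice."""
    # Small SQL reporting work fits a serverless SQL warehouse.
    if job.query_pattern == "sql_report" and job.input_gb < 50:
        return "sql_warehouse_serverless"
    # Long-running or large jobs keep a dedicated Spark cluster.
    if job.median_runtime_min > 30 or job.input_gb >= 500:
        return "spark_cluster_large"
    return "spark_cluster_small"
```

The feedback loop lives in `median_runtime_min`: because it's refreshed from real run history, a job that grows over time migrates tiers without anyone editing a config.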
When you need this (and when you don't)
Not every team needs compute routing. Here's how I'd think about it.
You probably need it if: You have 500+ pipelines with meaningfully different compute profiles. Or your Snowflake/Databricks bill has grown 2x+ in the last year without a corresponding increase in business value. Or your team spends more than a few hours per quarter manually adjusting warehouse sizes and cluster configs. Or you're planning a compute engine migration (say, Spark 3.x to 4.x, or Snowflake to Databricks) and dreading the per-job migration work.
You probably don't need it if: Your workloads are homogeneous (all dbt models, all roughly the same size). Or your total pipeline count is under 100 and your compute bill is predictable. Or you haven't yet standardized on an orchestrator. Fix that first. Routing on top of chaos is just more chaos.
The threshold is lower than most teams think. You don't need Uber-scale data volumes to benefit from routing. A team with 200 Databricks pipelines where 60% are simple SQL reporting will see meaningful savings from routing those to SQL warehouses. The economics work at mid-market scale, not just hyperscaler scale.
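A back-of-envelope calculation shows why the mid-market math works. Every number below is a hypothetical placeholder, not real pricing; substitute per-run costs from your own billing data:

```python
# Hypothetical back-of-envelope for the 200-pipeline example above.
# All rates and runtimes are made-up placeholders.
pipelines = 200
sql_share = 0.60            # fraction that is simple SQL reporting
runs_per_month = 30         # daily schedule

spark_cost_per_run = 2.50      # oversized Spark job cluster (assumed)
warehouse_cost_per_run = 0.60  # right-sized SQL warehouse (assumed)

sql_jobs = int(pipelines * sql_share)  # 120 pipelines worth rerouting
monthly_savings = sql_jobs * runs_per_month * (
    spark_cost_per_run - warehouse_cost_per_run
)
print(f"${monthly_savings:,.0f}/month")  # prints "$6,840/month"
```

Even with modest assumed per-run costs, rerouting the SQL majority pays for a few weeks of routing work within a quarter.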
What it takes to build
There are three levels of investment, roughly corresponding to three levels of maturity.
Level 1: Manual audit with policy enforcement. No routing layer. An engineer audits the top 100 most expensive pipelines, reclassifies them by workload type, and migrates the obvious wins. Then you set a policy: new SQL reporting jobs must use SQL warehouses (on Databricks) or small warehouses (on Snowflake). Existing jobs stay where they are until the next audit.
This takes a week of engineering time. It captures maybe 60% of the savings. The remaining 40% requires automation because the manual audit goes stale within a few months.
Level 2: Metadata-based routing in your orchestrator. You build routing logic into your Airflow DAGs (or Dagster jobs, or Prefect flows). Before execution, a routing function evaluates the job's metadata (from a YAML config, from query analysis, or from a tagging system) and selects the appropriate compute tier.
This is roughly what Zepto built. It requires a few weeks of engineering work to implement and ongoing maintenance to keep the routing rules current as your compute options evolve. The payoff scales with pipeline count: every new pipeline automatically gets the right compute without manual configuration.
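At its simplest, that routing step is a config lookup that runs before the task is built. The map keys, tier names, and the workload_type tag below are assumptions for illustration; the important property is the safe default for untagged pipelines:

```python
# Illustrative orchestrator-side wiring. COMPUTE_MAP, the tier names,
# and the "workload_type" tag are hypothetical.
COMPUTE_MAP = {
    "sql_report": {"engine": "databricks_sql", "warehouse": "serverless_small"},
    "spark_etl":  {"engine": "spark", "cluster": "job_cluster_m"},
}
# Untagged or unknown pipelines fall back to the cheapest Spark tier.
DEFAULT_COMPUTE = {"engine": "spark", "cluster": "job_cluster_s"}

def resolve_compute(pipeline_config: dict) -> dict:
    """Resolve a compute tier from the pipeline's declared workload tag."""
    return COMPUTE_MAP.get(pipeline_config.get("workload_type"), DEFAULT_COMPUTE)
```

In a DAG factory, the dict returned here would parameterize the operator (which warehouse ID or cluster spec the task submits to), so every new pipeline inherits routing for free.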
Level 3: Full control plane with execution abstraction. You build a Data Bridge-style control plane that completely abstracts the execution layer. Users interact with an intent-based API. The platform handles connector selection, engine configuration, monitoring, and lifecycle management. The execution layer is hot-swappable.
This is months of engineering investment and requires a dedicated platform team. It makes sense when you have thousands of pipelines, multiple execution engines, and a need for transparent infrastructure migrations. Most companies don't need this. Netflix and Uber do. Most mid-market teams will get 80% of the value from Level 2.
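The user-facing surface at this level is an intent spec rather than a compute config. The field names and the planner's selection rule below are hypothetical, purely to show the shape of the what/how split:

```python
# Hypothetical intent a Level 3 control plane might accept.
# The user describes "what"; no engine or cluster is ever named.
intent = {
    "source": "db.orders",
    "destination": "warehouse.orders_daily",
    "transform": "dedupe_and_aggregate",
    "schedule": "0 6 * * *",
    "freshness_sla_minutes": 60,
}

def plan_execution(intent: dict) -> dict:
    """Platform-side planner: derives the "how" from the intent.
    The selection rule is a stand-in for real cost/capacity policies."""
    # A loose freshness SLA tolerates batch; a tight one needs streaming.
    engine = "spark" if intent.get("freshness_sla_minutes", 1440) > 30 else "streaming"
    connector = f"{intent['source'].split('.')[0]}_reader"
    return {"engine": engine, "connector": connector}
```

Because users never touch the returned plan, the platform team can swap the engine behind `plan_execution` without breaking a single workflow, which is exactly the hot-swappable property Data Bridge is built around.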
The governance caveat
I want to be direct about something: compute routing optimizes performance and cost. It does not fix governance.
A query that computes revenue incorrectly will compute revenue incorrectly faster on a SQL Serverless Warehouse than on a Spark cluster. A pipeline without data quality checks will ship bad data to dashboards regardless of which engine ran it. A table without documentation will be just as confusing whether it refreshes in 40 minutes or 4.
I've talked to teams that conflate "our data platform" with "our Databricks/Snowflake setup." The platform is more than compute. It's governance (who owns this table, what tests validate it, what happens when it breaks), quality (freshness checks, schema drift detection, row count anomalies), and discoverability (can an analyst find the right table without asking three engineers). Routing is the compute optimization layer. It sits alongside governance and quality, not above them.
If your biggest problem is that three teams compute revenue differently and nobody knows which dashboard to trust, compute routing won't help. Fix the governance gap first. Then optimize the compute.
Where this is going
Two trends are pushing compute routing from "nice to have" to "table stakes" over the next few years.
Engine proliferation is already here. Five years ago, most data teams had one compute engine. Now it's common to see Spark, Databricks SQL, DuckDB, Snowflake, and sometimes StarRocks or Trino in the same org. Each engine has a sweet spot. DuckDB for local development and small transforms. Snowflake or Databricks SQL for interactive analytics. Spark for large-scale ETL. StarRocks for real-time OLAP. The more engines you have, the more valuable routing becomes, because picking the wrong one wastes more money and time.
Then there's AI-assisted pipeline generation. As more teams use LLMs or AI tools to generate pipeline code (dbt models, SQL transforms, Airflow DAGs), the volume of pipelines will increase faster than teams can manually configure compute for each one. Routing becomes the safety net that ensures AI-generated pipelines don't accidentally run on the most expensive compute tier by default.
Both of these are happening now. Engine proliferation is obvious if you look at any modern data team's stack. And AI-assisted pipeline generation is already in production at companies building with tools like Zingle AI. The teams that build routing into their platforms now will have an easier time absorbing both trends. The teams that don't will be doing manual audits and migration sprints for the foreseeable future.
