How Zepto's DataPortal cut Databricks costs by routing 1,000+ jobs to the right compute
Zepto is an Indian quick-commerce company that promises 10-minute grocery delivery. That promise depends on data. Demand forecasting, inventory optimization, rider routing, and dozens of daily dashboards for business teams all run on top of their data infrastructure. At the scale they operate (millions of orders, 200+ TB processed daily), even small inefficiencies in how compute resources get allocated compound quickly.
Late last year, Zepto's data engineering team noticed something that should have been obvious but wasn't: they were running straightforward SQL reporting queries on full Spark job clusters. Driver nodes, worker nodes, distributed shuffle infrastructure, the whole apparatus. For a query that amounts to "give me yesterday's GMV by city," that's an enormous amount of overhead.
They didn't fix this by manually re-configuring jobs or writing migration scripts. They built a platform. It's called DataPortal, and it handles compute routing automatically.
This post covers what they built, how the routing works at a technical level, and what the results look like on a real Databricks deployment.
What DataPortal actually is
DataPortal is Zepto's internal no-code data platform. It's not just a compute routing engine. It handles the full pipeline lifecycle: creation, orchestration, execution, governance, and monitoring. Analysts, engineers, and business users can build and manage data pipelines without writing DAG code or configuring clusters manually.
Under the hood, every pipeline created in DataPortal gets translated into an Airflow DAG. The platform handles scheduling, retry logic, dependency management (both pipeline-level and workflow-level), and alerting. Compute runs on Databricks-managed Spark clusters with auto-terminating ephemeral instances to keep costs low.
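The post doesn't show DataPortal's internals, but the spec-to-DAG translation step can be pictured as a small function that turns a declarative pipeline definition into an ordered task graph. Everything here is a hypothetical sketch: the field names (`source`, `transform`, `sink`) and the task layout are assumptions, not DataPortal's actual schema.

```python
# Hypothetical sketch of DataPortal's pipeline-spec -> DAG translation.
# Field names and task layout are assumptions; the real schema isn't public.

def pipeline_to_dag(spec: dict) -> dict:
    """Turn a declarative pipeline spec into an ordered task graph."""
    tasks = [
        {"id": "extract", "op": spec["source"]},
        {"id": "transform", "op": spec["transform"]},
        {"id": "load", "op": spec["sink"]},
    ]
    # Linear dependency chain: extract -> transform -> load.
    edges = [(a["id"], b["id"]) for a, b in zip(tasks, tasks[1:])]
    return {
        "dag_id": spec["name"],
        "schedule": spec["schedule"],
        "tasks": tasks,
        "edges": edges,
        "retries": spec.get("retries", 2),  # platform-managed retry logic
    }

dag = pipeline_to_dag({
    "name": "daily_gmv_by_city",
    "schedule": "0 5 * * *",
    "source": "kafka://orders",
    "transform": "sql://gmv_by_city.sql",
    "sink": "gsheet://business-dashboard",
})
```

In the real system the output of a step like this would be materialized as an Airflow DAG, with scheduling, dependencies, and alerting attached by the platform rather than hand-written by the pipeline author.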
The platform also includes bidirectional connectors between systems like Databricks, Google Sheets, S3, Kafka, Slack, StarRocks, and internal databases. So a pipeline might pull from Kafka, transform in Databricks, and land results in Google Sheets for a business team. DataPortal manages the full chain.
The important part for this post: DataPortal has a routing layer that decides where each job executes. And that routing layer is what produced the headline results.
The problem: 1,000+ SQL jobs on Spark clusters
When Zepto's data infrastructure was smaller, every Databricks job ran on a standard Spark job cluster. That was fine when they had a few dozen pipelines. Spark handles anything you throw at it. The issue is that "handles anything" and "handles this specific thing well" are not the same.
As the pipeline count grew past 6,000, with more than 1,000 of those being scheduled reporting jobs, a pattern became obvious. The majority of those reporting jobs were pure SQL. No Python. No Scala. No complex multi-stage transformations. Just SELECT statements with joins, aggregations, and some basic window functions, running against Delta Lake tables.
These jobs were spinning up Spark job clusters every time they ran. That means:
- Cluster provisioning time: minutes of waiting before any actual computation starts.
- Full JVM initialization for driver and worker nodes.
- Distributed execution overhead for queries that could easily run on a single node.
- Spark's generalized query planner instead of a SQL-optimized one.
The morning dashboards were the most visible symptom. Business teams were waiting 40+ minutes for yesterday's numbers. Not because the underlying data was slow to arrive, but because the compute infrastructure was massively oversized for the workload.
How the routing works
Instead of manually migrating each job, Zepto's team built intelligent routing directly into the DataPortal/Airflow layer. When a pipeline triggers, the routing system evaluates several properties of the job before selecting an execution engine:
Query pattern: Is this a SQL-only job, or does it involve Python/Scala? SQL-only jobs are candidates for SQL warehouse routing.
Write mode: The system checks whether the job uses append, merge (upsert), or partition overwrite. This matters because Databricks SQL warehouses don't support schema evolution for partition overwrite mode. Jobs with partition overwrite stay on Spark clusters.
Historical runtime and metadata: The system considers past execution performance to validate that the routing decision will actually improve things.
Data volume: Extremely large-scale transformations that benefit from distributed shuffle stay on Spark.
SQL-reporting jobs that pass these checks get routed to Databricks SQL Serverless Warehouses with Photon enabled. Everything else stays on standard Spark job clusters.
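Taken together, the checks above amount to a small decision function. The sketch below is illustrative, not Zepto's actual code: the job fields, the data-volume threshold, and the history flag are all assumptions.

```python
# Sketch of the routing decision described above. Field names and the
# 500 GB volume threshold are illustrative assumptions, not DataPortal's code.

SQL_WAREHOUSE = "sql_serverless_warehouse"   # Photon enabled by default
SPARK_CLUSTER = "spark_job_cluster"

def route(job: dict) -> str:
    # 1. Only pure-SQL jobs are candidates for the SQL warehouse.
    if job["language"] != "sql":
        return SPARK_CLUSTER
    # 2. SQL warehouses don't support schema evolution for partition
    #    overwrite, so those jobs stay on Spark.
    if job["write_mode"] == "partition_overwrite":
        return SPARK_CLUSTER
    # 3. Very large transformations benefit from distributed shuffle.
    if job["input_gb"] > 500:                # threshold is a guess
        return SPARK_CLUSTER
    # 4. Historical runtime metadata must suggest the move actually helps.
    if not job.get("history_favors_warehouse", True):
        return SPARK_CLUSTER
    return SQL_WAREHOUSE
```

A typical reporting job (`{"language": "sql", "write_mode": "append", "input_gb": 20}`) lands on the SQL warehouse; anything failing a check falls back to a standard Spark job cluster.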
One technical detail that matters: they used Databricks SQL Scripting to dynamically manage the migration across hundreds of jobs. This kept the migration consistent without requiring manual intervention on each pipeline.
The team was explicit about scoping the migration to what would actually work. They didn't try to route partition-overwrite jobs to SQL warehouses just to inflate the migration numbers. That kind of discipline is why the results held up.
Why Photon and serverless made the difference
The performance gains weren't just from avoiding Spark overhead. Two specific Databricks features did most of the heavy lifting.
Photon engine. SQL Warehouses on Databricks come with Photon enabled by default. Photon is a vectorized query engine that runs native C++ code instead of the JVM-based Spark SQL execution. For read-heavy SQL workloads, the difference is massive. Photon processes data in columnar batches rather than row-by-row, which is exactly how most reporting queries scan data. Spark job clusters don't get Photon by default, so the same SQL query can run several times faster just by switching the execution engine.
Serverless instant-on compute. Standard Spark job clusters have a cold-start penalty. Provisioning nodes, initializing the JVM, loading libraries, and setting up the driver can take several minutes before the first query starts executing. SQL Serverless Warehouses eliminate most of that. They spin up in seconds, and they auto-scale and auto-terminate based on demand. For a scheduled job that runs a 30-second query, the difference between a 5-minute cold start and a 10-second one is the difference between a pipeline that "takes 6 minutes" and one that "takes 40 seconds."
The pricing nuance is important too. SQL Serverless Warehouses charge a higher per-DBU rate than Spark job clusters. That looks more expensive on paper. But DBU cost is rate multiplied by time. When Photon cuts a query from 40 minutes to 4, you're paying a higher rate for a tenth of the duration. The math works out clearly in Zepto's favor for these workloads.
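The rate-versus-duration trade-off is easy to sanity-check with rough numbers. Every figure below is a made-up placeholder for illustration; consult actual Databricks pricing for real DBU rates.

```python
# Back-of-envelope DBU math: cost = rate x DBUs/hour x hours.
# All rates and DBU counts are placeholders, not real Databricks pricing.

def job_cost(dbu_rate: float, dbus_per_hour: float, runtime_minutes: float) -> float:
    return dbu_rate * dbus_per_hour * (runtime_minutes / 60)

# Spark job cluster: lower rate, but 40-minute runtime plus ~5 min cold start.
spark = job_cost(dbu_rate=0.15, dbus_per_hour=8, runtime_minutes=40 + 5)

# SQL serverless warehouse: higher rate, ~4-minute Photon runtime,
# seconds of startup (negligible here).
serverless = job_cost(dbu_rate=0.70, dbus_per_hour=2, runtime_minutes=4)

print(f"spark ~= ${spark:.2f}, serverless ~= ${serverless:.2f}")
```

With these placeholder numbers the serverless run costs roughly a tenth of the Spark run despite the higher per-DBU rate, which is the shape of the trade-off the post describes.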
This wouldn't hold for every job type. Long-running CPU-bound Python transformations or jobs with large shuffles wouldn't benefit from SQL warehouses. Some might even cost more. That's precisely why routing matters. You need the system to know the difference.
Results
After migrating 1,000+ scheduled reporting jobs through the routing layer:
- Median job runtime dropped 78%. The typical reporting job went from roughly 40 minutes to 4-5 minutes.
- Annualized cost savings of over $35k (combined DBU and EC2), specifically from the x-small warehouse tier.
- Dashboard freshness improved significantly. Zepto's team noted that the recurring "why is yesterday's data not ready yet?" Slack messages stopped.
The 78% figure is a median, meaning half the migrated jobs improved by even more than that. And the $35k in savings came from just the initial wave. As more jobs get routed and as the team optimizes warehouse sizing, the savings will compound.
Worth noting: DataPortal now manages over 6,000 active pipelines processing 200+ TB daily. The routing layer is a piece of a much larger platform, but it's the piece with the most immediate, measurable impact on both cost and speed.
What other Databricks teams can learn from this
You don't need to build a full DataPortal to capture similar wins. The core lesson is simple: audit your Databricks workloads by execution type, not by business domain.
Pull your job cluster usage from the last 30 days. Look at the query profiles. How many of your scheduled jobs are pure SQL? How many of those are running on Spark job clusters with Photon disabled? That's your savings estimate.
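An audit like this can start as a simple classification pass over exported job metadata. The input format below is an assumption; adapt it to however you export job runs (for example, from Databricks system tables or the Jobs API), and treat the 50% savings factor as a rough guess, not a prediction.

```python
# Sketch of a workload audit over exported job metadata. The record
# format and the 50% savings factor are illustrative assumptions.

def audit(jobs: list[dict]) -> tuple[list[dict], float]:
    """Flag pure-SQL jobs running on Spark job clusters without Photon."""
    candidates = [
        j for j in jobs
        if j["language"] == "sql"
        and j["compute"] == "job_cluster"
        and not j.get("photon", False)
    ]
    # Crude estimate: assume roughly half the spend on flagged jobs
    # is recoverable by moving them to a SQL warehouse.
    est_monthly_savings = sum(j["monthly_cost_usd"] for j in candidates) * 0.5
    return candidates, est_monthly_savings

jobs = [
    {"name": "gmv_by_city", "language": "sql", "compute": "job_cluster",
     "photon": False, "monthly_cost_usd": 120.0},
    {"name": "feature_gen", "language": "python", "compute": "job_cluster",
     "photon": False, "monthly_cost_usd": 800.0},
]
flagged, savings = audit(jobs)
```

The point isn't the exact numbers; it's that the flagged list gives you a concrete, prioritized migration backlog instead of a vague suspicion.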
Some specific things to look for:
Simple SQL reporting on job clusters. If a job is a SQL SELECT with basic joins and aggregations, running it on a SQL Serverless Warehouse with Photon will almost certainly be faster and cheaper.
All-purpose clusters running scheduled jobs. All-purpose clusters charge a higher DBU rate than both job clusters and SQL warehouses. If you're running production scheduled jobs on all-purpose clusters (this is more common than people admit), that's the most expensive possible option.
Long idle times between queries. If a cluster spins up, runs a 30-second query, sits idle for the auto-termination window (usually 5-10 minutes), and then shuts down, you're paying for 5-10 minutes of idle compute. SQL Serverless Warehouses scale down much faster.
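The idle-time waste is easy to quantify. With illustrative numbers (a 30-second query and a 10-minute auto-termination window):

```python
# Illustrative idle-compute math for a short scheduled query on a
# job cluster with a 10-minute auto-termination window.

query_s = 30            # actual work
idle_s = 10 * 60        # cluster sits idle until auto-termination
billed_s = query_s + idle_s   # you pay for the idle window too

useful_fraction = query_s / billed_s   # share of billed time doing work
```

In this scenario less than 5% of the billed compute time does useful work; the rest is the auto-termination window. A serverless warehouse that scales down in seconds mostly eliminates that tail.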
The hard part is doing this systematically rather than one job at a time. Zepto's answer was to build routing into their platform so the decision happens automatically. If you don't have a platform layer like DataPortal, you can start with a manual audit and a policy change: "new SQL reporting jobs must use SQL warehouses." That doesn't fix the existing jobs, but it stops the bleeding.
The longer-term answer is what Zepto built: a layer between users and compute that makes the right choice by default. That takes investment, but the payoff scales with every pipeline you add.
What this doesn't address
DataPortal's routing optimizes where jobs execute. It doesn't optimize what they execute. A query that joins six tables when it only needs two will still be slow on any engine. A pipeline without data quality tests will still produce wrong numbers, just faster.
Zepto's platform includes governance and monitoring features (RBAC, audit logs, alerting), which helps. But the compute routing layer itself is purely a performance and cost optimization. If your team's biggest problem is data quality or governance rather than compute efficiency, routing won't fix that.
Also, the routing constraints evolve with the platform. When Zepto started, they excluded partition overwrite jobs from SQL warehouse routing because of schema evolution limitations. Databricks may fix that constraint in a future release, and the routing rules would need to update. Someone has to own that maintenance. The routing layer isn't a set-and-forget system.
