Why your data pipelines collapse at scale (and nobody warns you)
Last quarter, I was on a call with a data lead at a Series E fintech. His team had 4,000+ data pipelines running in Airflow. I asked him how many of those were still serving active dashboards. Long pause. "At least half of them," he said. "Maybe."
That conversation keeps coming back to me because it's the same story everywhere. Not because these are bad teams. They're usually very good teams that grew fast and didn't have time to stop and organize the mess accumulating behind them.
This post is about that mess: why data pipeline complexity doesn't grow linearly with your company, why the tools you already have won't fix it on their own, and what actually breaks first when things get bad.
Five pipelines are easy. Five hundred are a different job entirely.
When your data team is small, everything is manageable. You have a handful of dbt models, one Airflow instance, a few Fivetran connectors pulling into Snowflake. One or two engineers know how all of it works. If something breaks at 2am, they fix it from muscle memory.
Then the company grows. Marketing wants attribution data. Finance needs revenue reconciliation. Product is running experiments and needs event pipelines. The ML team wants feature stores. Each request is reasonable on its own. Each one adds a pipeline, a schedule, a set of assumptions about upstream data.
At 50 pipelines, you start needing documentation. At 200, you need someone whose full-time job is understanding dependencies. At 500, nobody has the full picture anymore.
A 2025 Fivetran survey found that 67% of large, data-centralized enterprises spend more than 80% of their data engineering resources just maintaining existing pipelines. Not shipping anything new. Just keeping the lights on. I find that number genuinely bleak. These are expensive engineers doing work that probably shouldn't require this much of their time.
The cost side is equally grim. An Integrate.io survey from the same year reported 44% of companies spending $25,000 to $100,000 per month on their data stack, with only 12% saying they feel good about the ROI. The biggest complaint was lack of cost visibility, which tracks - you can't optimize what you can't see, and at 500 pipelines, nobody can see all of it.
Metrics stop meaning what you think they mean
Pipeline sprawl is annoying. The governance failures it causes are actually dangerous.
An analyst in the growth team writes a SQL query to calculate monthly active users. It works fine. They schedule it in Airflow. Six months later, a product analyst needs the same metric but with slightly different filters. They don't know the first query exists, or they find it but don't trust it, so they write their own. Now you have two MAU numbers. Eventually someone puts both in a board deck and your CFO starts asking questions nobody can answer quickly.
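To make the drift concrete, here's a toy illustration in Python. The event data and the specific filter choices are invented, but the shape of the problem is the one described above: two defensible definitions of "monthly active users" that quietly return different numbers.

```python
from datetime import date

# Toy event log: (user_id, event_type, is_internal_user, event_date).
# Invented data purely to illustrate how the two definitions diverge.
events = [
    ("u1", "page_view", False, date(2025, 3, 2)),
    ("u2", "page_view", False, date(2025, 3, 5)),
    ("u3", "api_call",  False, date(2025, 3, 9)),
    ("u4", "page_view", True,  date(2025, 3, 12)),  # internal employee
    ("u5", "login",     False, date(2025, 3, 20)),
]

def mau_growth_team(events, year, month):
    """Growth's definition: any event from any user counts."""
    return len({
        user for user, _, _, d in events
        if d.year == year and d.month == month
    })

def mau_product_team(events, year, month):
    """Product's definition: exclude internal users and count only
    'core' product events (here, page views and API calls)."""
    core = {"page_view", "api_call"}
    return len({
        user for user, event, internal, d in events
        if d.year == year and d.month == month
        and not internal and event in core
    })

print(mau_growth_team(events, 2025, 3))   # 5
print(mau_product_team(events, 2025, 3))  # 3 -- same month, different "MAU"
```

Both definitions are reasonable on their own. The problem is that nothing in the stack forces them to be reconciled, so both end up in dashboards with the same label.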
I see this pattern at almost every mid-market company I talk to.
Airbnb dealt with a version of the same thing at much larger scale. As their data warehouse grew, different teams built downstream pipelines off the same core tables but ended up with duplicate and diverging metrics. Trust in the data eroded. Data scientists started second-guessing their own numbers. The engineering team's postmortem was blunt: table standardization alone wasn't enough, because users don't consume tables - they consume metrics. That realization led them to build Minerva, their metrics platform, which now manages over 12,000 metrics across 200+ data producers. It took them years and a dedicated team. Most companies don't have that luxury.
Netflix ran into a related problem on the infrastructure side. Their operational context for thousands of database instances was scattered across emails, spreadsheets, Google Docs, and ad-hoc scripts. They eventually built a centralized system called Eunomia to get a single view of their fleet. Before that, the team was essentially guessing at the state of their own infrastructure.
These are two of the most data-mature companies in the world. If they couldn't manage pipeline sprawl with just tools and talent, a 15-person data team probably can't either. That's not a criticism. It's just the math.
Why your existing tools won't save you
When things start breaking, the instinct is to buy tools. Better orchestration, better monitoring, a data catalog. And those tools do help. Dagster gives you better lineage than vanilla Airflow. Monte Carlo gives you observability. Atlan or DataHub give you a catalog.
But I keep watching teams stack five or six tools on top of each other, and the complexity just moves. Instead of pipeline chaos, you get tooling chaos. Your dbt models are versioned in Git but your Airflow DAGs are configured separately. Your data quality checks run in one system but your alerting runs in another. Your catalog knows what tables exist but has no idea which pipelines matter.
The tools are fine. They just don't produce governed outcomes on their own. They produce options. And when you have 40 analysts writing SQL across Snowflake and BigQuery, options without constraints are how you end up with inconsistency.
A 2025 MuleSoft study found that organizations average 897 applications with only 29% integrated. Data teams have the same problem with different tool names. A dozen tools, each doing its job, none of them connected in a way that enforces any kind of shared standard.
What's missing is an opinionated layer on top of the stack that governs how pipelines get built, not just how they get monitored. I'll get more specific about what that looks like in a later post. The short version is: more tools on top of a broken process just makes the process more expensive.
What this doesn't solve (and what it actually requires)
I want to be honest about the hard parts, because the data tooling market has a bad habit of promising clean solutions to messy organizational problems.
The biggest one is change management. Any governance layer requires buy-in from the people who currently write SQL however they want. Analysts who've been scheduling queries directly in Snowflake for two years will push back hard when you tell them to start opening PRs. That's months of work. Not weeks. I've seen this stall out completely at companies where leadership didn't back the transition visibly and repeatedly.
Governance at the pipeline level also only works if your definitions are actually right. If nobody has agreed on what "active user" means, no platform will prevent three different versions from existing. You need a data modeling conversation before you need infrastructure.
On costs - I've watched teams build beautiful cost dashboards and then change nothing about their behavior for six months. Visibility tells you where the waste is. Somebody still has to make the call to turn off pipelines that other teams are attached to. That's an organizational negotiation, and most teams underestimate how much time it takes.
And there's a size threshold below which none of this is worth it. A three-person data team at a Series A should be moving fast and accepting some mess. Governance infrastructure starts paying for itself somewhere around the point where you have a hundred-plus pipelines, a dozen or more people producing data, and real pressure from finance or compliance on accuracy. Before that, the overhead costs more than the chaos does.
Where this leaves you
Data pipeline complexity grows faster than your team's ability to manage it by hand. Every company I talk to that's crossed the hundred-pipeline mark has some version of this problem. The ones that handle it well - Airbnb, Netflix - all ended up building an opinionated layer between their tools and their users.
If you're starting to feel this, the first useful thing you can do is audit what you actually have. How many pipelines are running? How many serve active use cases versus sitting there because nobody wants to be the person who deletes them? Where are the metrics that different teams are calculating differently?
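If your orchestrator happens to be Airflow, here's a rough sketch of a first pass at that audit using Airflow 2.x's stable REST API. The base URL, credentials, and the 90-day "active" threshold are placeholders; adapt them to your deployment and to whatever "active" means for your team.

```python
import requests
from datetime import datetime, timedelta, timezone

# Placeholders -- point these at your own Airflow deployment.
AIRFLOW_URL = "http://localhost:8080/api/v1"
AUTH = ("admin", "admin")
CUTOFF = (datetime.now(timezone.utc) - timedelta(days=90)).isoformat()

def list_dags():
    """Page through every DAG registered in Airflow."""
    dags, offset = [], 0
    while True:
        resp = requests.get(f"{AIRFLOW_URL}/dags",
                            params={"limit": 100, "offset": offset},
                            auth=AUTH)
        resp.raise_for_status()
        batch = resp.json()["dags"]
        dags.extend(batch)
        if len(batch) < 100:
            return dags
        offset += 100

def has_recent_run(dag_id):
    """True if the DAG has at least one run started in the last 90 days."""
    resp = requests.get(f"{AIRFLOW_URL}/dags/{dag_id}/dagRuns",
                        params={"limit": 1, "start_date_gte": CUTOFF},
                        auth=AUTH)
    resp.raise_for_status()
    return len(resp.json()["dag_runs"]) > 0

dags = list_dags()
stale = [d["dag_id"] for d in dags
         if d["is_paused"] or not has_recent_run(d["dag_id"])]

print(f"{len(dags)} pipelines registered, "
      f"{len(stale)} paused or idle for 90+ days")
```

Keep in mind what this does and doesn't tell you: a run in the last 90 days means the pipeline is executing, not that anyone reads its output. Cross-referencing against query logs from your warehouse or BI tool is the second, harder half of the audit, and it's where the "serving active use cases" question actually gets answered.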
That audit will tell you whether you have a manageable mess or something that needs structural intervention. Either way, you'll know.
