Databricks at $62B: The Open-Source Bait-and-Switch Is the Best Business Model in Enterprise Software

Databricks gave away Apache Spark, Delta Lake, and MLflow for free. Then it built the governance layer on top and charged enterprises $2.4B a year for the privilege of managing their own data. Snowflake's pivot to open formats is the clearest admission yet: Databricks won the architecture war.

By Erik Sundberg, Developer Tools · Mar 25, 2026 · 14 min read

In 2009, three UC Berkeley PhD students began work on a distributed computing framework called Spark. They open-sourced it in 2010. Within four years, many of the largest technology companies in the world were running it in production. Within eight years, it had become one of the most widely used data processing engines in the industry.

The three researchers — Matei Zaharia, Patrick Wendell, and Reynold Xin — never intended Spark to be a business. They intended it to be a research contribution. But when they founded Databricks in 2013, they had an insight that would turn free software into one of the most valuable companies in enterprise history: you do not need to own the engine. You need to own the dashboard.

Databricks is currently valued at $62 billion. It is generating approximately $2.4 billion in annualized revenue. It grew roughly 50% year-over-year in 2025. And it got there by giving away, for free, the software that almost every major data infrastructure stack in the world runs on.

This is not a coincidence. It is the most deliberate enterprise go-to-market strategy of the last decade.

The Four-Layer Playbook

To understand how Databricks built a $62B company on the back of free software, you need to understand the architecture of its open-source strategy. This was not one open-source bet. It was four, executed sequentially, each one expanding the surface area for monetization.

Layer 1: Apache Spark (2009–2013)

Spark was the proof of concept. The open-source release generated global developer adoption, enterprise deployment at thousands of companies, and an ecosystem of tooling, documentation, and expertise that money cannot buy.

By the time Databricks was founded in 2013, Spark was already embedded in the data pipelines of Google, Netflix, Airbnb, and virtually every data-intensive company on earth. This gave Databricks something that almost no enterprise software startup has: a massive installed base of production users before the commercial product existed.

The business model question was straightforward: what do enterprises need that the open-source Spark cluster does not provide? The answer was everything around the cluster: managed infrastructure, security, collaboration, support, and the reliability guarantees that regulated enterprises require. Databricks built that. It called the product the Databricks Unified Analytics Platform. It charged a significant premium to manage the thing enterprises were already running for free.

Layer 2: Delta Lake (2019)

By 2019, Databricks recognized that Spark's primary limitation as a business was that it was stateless. It processed data but did not store it in a Databricks-controlled format. Customers could run Spark on any cloud, with any data, and leave at any time.

Delta Lake changed the equation. Released as open source in 2019, Delta Lake is a storage layer that adds ACID transactions, schema enforcement, and time-travel capabilities to data lakes. It is technically superior to the alternatives — Parquet files without a transaction layer are notoriously fragile — and it is architecturally significant because it introduces a Databricks-controlled metadata layer into the storage architecture.
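To make the transaction-layer idea concrete, here is a minimal sketch in Python. It is a toy in-memory model, not Delta Lake's implementation or API: the `ToyTable` class and its `commit` and `read` methods are hypothetical names chosen for illustration. It assumes a table is nothing more than an append-only log of validated batches, which is enough to show all-or-nothing commits, schema enforcement, and reading an older version.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Hypothetical in-memory table backed by an append-only commit log."""
    schema: dict                                  # column name -> Python type
    log: list = field(default_factory=list)       # one entry per committed version

    def commit(self, rows):
        """Validate every row, then append them as one new version (all or nothing)."""
        for row in rows:
            if set(row) != set(self.schema):
                raise ValueError(f"columns {sorted(row)} do not match schema")
            for col, expected in self.schema.items():
                if not isinstance(row[col], expected):
                    raise ValueError(f"{col!r} must be {expected.__name__}")
        self.log.append(list(rows))               # reached only if all rows are valid
        return len(self.log) - 1                  # version number of this commit

    def read(self, version=None):
        """Replay the log up to `version` (latest when omitted)."""
        if version is None:
            version = len(self.log) - 1
        return [row for batch in self.log[: version + 1] for row in batch]


table = ToyTable(schema={"id": int, "amount": float})
v0 = table.commit([{"id": 1, "amount": 9.5}])
v1 = table.commit([{"id": 2, "amount": 3.0}, {"id": 3, "amount": 7.25}])

try:
    # The second row has the wrong type, so the whole batch is rejected.
    table.commit([{"id": 4, "amount": 1.0}, {"id": 5, "amount": "oops"}])
except ValueError:
    pass

latest = table.read()               # rows from both successful commits
as_of_v0 = table.read(version=v0)   # "time travel" back to the first commit
```

The rejected batch leaves no partial rows behind, and the earlier version stays readable after later commits. A bare directory of Parquet files gives neither guarantee on its own.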

Delta Lake was not a lock-in mechanism in the traditional sense. The format is genuinely open and portable. But it was a dependency-deepening mechanism: once an enterprise's petabytes of data are stored in Delta format, optimized for Spark, with years of transaction history built up, the friction of moving to a different platform increases dramatically. Data gravity is real, and Delta Lake was designed to exploit it.

The open-source release was critical to adoption. Snowflake and AWS both added support for reading Delta Lake tables, which expanded the ecosystem while entrenching Delta as one of the leading open table formats.

Metric	Pre-Delta Lake (2018)	Post-Delta Lake (2022)
Databricks ARR	~$200M	~$800M
Enterprise customers	~1,200	~5,000
Delta Lake GitHub stars	N/A	6,200+
Competing table formats	Parquet (dominant)	Delta, Iceberg, Hudi

The revenue acceleration tracked Delta Lake adoption closely. This was not mere correlation: Delta Lake created the data gravity that made Databricks stickier than pure-compute alternatives.

Layer 3: Unity Catalog (2022)

Delta Lake made data sticky. Unity Catalog made the enterprise impossible to leave.

Unity Catalog is Databricks' unified governance layer — a single platform for managing access policies, data lineage, audit trails, and compliance across all data assets. It was released in 2022 and immediately became the center of Databricks' enterprise sales motion.

Here is why Unity Catalog is the real lock-in play. Governance metadata is not like compute. You can migrate a Spark workload to a new cluster in hours. You can convert Delta Lake tables to Iceberg with a command. But governance metadata (the answer to "who has access to what data, under what policies, with what audit history, tagged with what semantic labels, connected to what lineage graph") is accumulated organizational knowledge. It takes years to build and cannot be easily exported.

When an enterprise deploys Unity Catalog, it is not just deploying a feature. It is encoding its data governance strategy into the Databricks platform. Every policy, every role assignment, every lineage connection, every compliance annotation becomes a node in a governance graph that lives inside Databricks. The switching cost is not technical. It is organizational. Leaving Databricks means rebuilding years of governance decisions from scratch on a new platform.
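The shape of that governance graph can be sketched in a few lines of Python. This is an illustration only, not Unity Catalog's data model or API: the `GovernanceGraph` class, its methods, and the table and team names are all hypothetical. It assumes governance metadata reduces to three kinds of records (grants, tags, and lineage edges) and shows why the total grows with every decision an organization makes.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class GovernanceGraph:
    """Hypothetical container for the kinds of metadata the text lists."""
    grants: dict = field(default_factory=dict)    # table -> set of (principal, privilege)
    tags: dict = field(default_factory=dict)      # table -> set of labels
    lineage: dict = field(default_factory=dict)   # upstream table -> set of downstream tables

    def grant(self, table, principal, privilege):
        self.grants.setdefault(table, set()).add((principal, privilege))

    def tag(self, table, label):
        self.tags.setdefault(table, set()).add(label)

    def derive(self, upstream, downstream):
        self.lineage.setdefault(upstream, set()).add(downstream)

    def downstream_of(self, table):
        """Breadth-first walk of the lineage edges starting at `table`."""
        seen, queue = set(), deque([table])
        while queue:
            for child in self.lineage.get(queue.popleft(), ()):
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
        return seen

    def record_count(self):
        """Number of individual decisions a migration would have to recreate."""
        stores = (self.grants, self.tags, self.lineage)
        return sum(len(values) for store in stores for values in store.values())


g = GovernanceGraph()
g.grant("raw.payments", "risk_team", "SELECT")
g.grant("mart.fraud_scores", "analysts", "SELECT")
g.tag("raw.payments", "pii")
g.derive("raw.payments", "clean.payments")
g.derive("clean.payments", "mart.fraud_scores")

affected = g.downstream_of("raw.payments")   # every table fed by raw.payments
total = g.record_count()                      # 2 grants + 1 tag + 2 lineage edges
```

Even this five-record example has dependencies between records: the `pii` tag on `raw.payments` matters for both downstream tables. At enterprise scale, it is those cross-references, more than the raw record count, that make the work hard to replay on another platform.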

This is the pattern that Microsoft used to build its enterprise dominance: make Active Directory the single source of truth for enterprise identity. Every application that relies on AD becomes an argument for staying in the Microsoft ecosystem. Unity Catalog is Databricks' Active Directory.

Layer 4: Mosaic ML / DBRX (2023–2025)

The $1.3 billion acquisition of Mosaic ML in 2023 added the final layer: AI training.

Mosaic ML's core product was a training platform for large language models — the tooling that lets enterprises fine-tune foundation models on their own data, at lower cost, with better performance than naive fine-tuning approaches. The acquisition gave Databricks LLM training and fine-tuning capabilities that could slot directly into its existing data infrastructure.

The strategic logic is a perfect replay of the Spark-to-Delta Lake playbook. Enterprises running data workloads on Databricks can now train and fine-tune models on the same platform, using the same Unity Catalog governance layer, without exporting their data to an external AI vendor. The data — already governed by Unity Catalog, already stored in Delta Lake — becomes the training corpus. The model — trained and served by Mosaic ML's infrastructure — becomes another workload managed by Databricks.

Databricks also open-sourced DBRX, its own foundation model, in March 2024. The pattern was predictable and deliberate: open-source the model, monetize the training infrastructure. Give away the engine. Charge for the dashboard.

The Numbers Behind the Strategy

Databricks' revenue trajectory is the strongest evidence that the open-core playbook works at scale.

Year	ARR	YoY Growth	Key Open-Source Release
2019	~$200M	~80%	Delta Lake open-sourced
2020	~$350M	~75%	Delta Lake ecosystem expansion
2021	~$600M	~71%	MLflow hits 10M downloads
2022	~$1.0B	~67%	Unity Catalog launched
2023	~$1.6B	~60%	Mosaic ML acquired
2024	~$1.6B	—	DBRX open-sourced
2025	~$2.4B	~50%	AI/BI platform expansion

For comparison, here is how Databricks' trajectory compares with that of Snowflake, the company most often cited as its primary competitor:

Metric (FY2026 est.)	Databricks	Snowflake
ARR	~$2.4B	~$4.1B
Revenue Growth (YoY)	~50%	~29%
Gross Margin	~75%	~67%
Net Revenue Retention	~150%+	~127%
Customers >$1M ARR	~600	~510
Valuation	$62B	~$42B (public)

The growth rate differential is the most important number. Snowflake's ARR is still about 1.7 times that of Databricks, but it is growing at a little over half the rate. If both growth rates held constant, Databricks would pass Snowflake's revenue in roughly three and a half years, and sooner if the gap in growth keeps widening. Databricks already trades at a significant premium, one that the market is awarding specifically because of the architecture war that Snowflake appears to be losing.
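The crossover timing is sensitive to the growth assumptions, so it is worth working through. The sketch below takes the ARR and growth estimates from the table above and assumes, purely for illustration, that both growth rates stay constant. The function name `years_to_crossover` is hypothetical.

```python
import math


def years_to_crossover(small, small_growth, large, large_growth):
    """Years until `small` catches `large`, assuming both growth rates stay constant.

    Solves small * (1 + g_s)**t == large * (1 + g_l)**t for t.
    """
    if small_growth <= large_growth:
        raise ValueError("the smaller company must be growing faster to catch up")
    return math.log(large / small) / math.log((1 + small_growth) / (1 + large_growth))


# Inputs are the table's estimates, in billions of dollars of ARR.
years = years_to_crossover(small=2.4, small_growth=0.50, large=4.1, large_growth=0.29)
print(f"{years:.1f} years, or about {years * 12:.0f} months")
```

With these inputs, constant rates put the crossover at roughly three and a half years. A shorter timeline requires the growth gap to widen, for example through further deceleration at Snowflake.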

Why Snowflake's Iceberg Pivot Is a Concession Letter

To understand what Snowflake's Apache Iceberg pivot actually means, you need to understand what Snowflake's business was built on: proprietary storage.

Snowflake's performance advantage, through most of its history, came from storing data in its own internal format, optimized for its own query engine. This format was not portable. If you wanted to query Snowflake data with a non-Snowflake tool, you exported it — a friction-generating, expensive process that made leaving harder. Snowflake's lock-in was architecturally embedded in the storage layer.

Databricks' Delta Lake attacked this directly. Delta Lake offered the performance that enterprises needed while storing data in an open format that any tool could read. Enterprises began choosing Delta Lake specifically because they did not want to be locked into a proprietary format. CIOs who had lived through the Oracle database lock-in era were viscerally allergic to the pattern Snowflake was offering.

Snowflake's announcement of native Iceberg support — completed in 2024 and now a core feature — was an admission that data format portability had become a sales requirement. Enterprises were rejecting proprietary storage on principle. Snowflake had to adopt an open format or lose deals to Databricks on architecture grounds alone.

But the Iceberg pivot created a problem that Snowflake has not resolved. If your data is stored in Iceberg format — which any tool can read — the premium performance justification for Snowflake's pricing becomes harder to defend. You are paying Snowflake to query data that could, theoretically, be queried by any compatible engine. The switching cost that made Snowflake defensible was the proprietary format. The open format preserves optionality for the customer in a way that is structurally bad for Snowflake's retention economics.

Snowflake adopted Iceberg because it had to. Databricks forced the architecture war into territory where the open format was the only viable answer. That is the definition of winning a strategic battle even before the financial metrics fully reflect it.

The MLflow Effect: Why Open Source Creates Distribution That Money Cannot Buy

The story of MLflow illustrates why the open-core model generates distribution advantages that no marketing budget can replicate.

MLflow was released by Databricks as open source in 2018. It is a platform for managing the machine learning lifecycle: experiment tracking, model versioning, and deployment management. By 2023, it was being downloaded more than 17 million times per month. Every major cloud provider supports it. Every major ML framework integrates with it. It is the de facto standard for ML experiment tracking at organizations that take ML seriously.

Databricks created MLflow and still leads its development. It never locked MLflow to the Databricks platform: you can run MLflow anywhere, on any infrastructure. But the engineers who use MLflow at their companies are the same engineers who evaluate Databricks' commercial platform when their company needs managed ML infrastructure. The brand association is pre-loaded. The trust is pre-built.

This is the distribution flywheel that is impossible to replicate through paid channels:

1. Open-source a tool that solves a real problem
2. Developers adopt it because it is free and technically excellent
3. Developers advocate for it internally because they are already using it
4. Enterprises pay for managed versions because their developers are already embedded in the ecosystem
5. Enterprises cannot easily replace the tool because their developers built their workflows around it

The customer acquisition cost for enterprises that arrive through this flywheel is a fraction of what an outbound sales motion costs. The contract value is identical to enterprise deals acquired through traditional sales motions. The margin difference is permanent.

Databricks' estimated sales and marketing spend as a percentage of revenue — approximately 32% in 2025 — is materially lower than Snowflake's 38% and significantly below the 45-55% typical for high-growth enterprise SaaS. The open-source distribution advantage is showing up directly in the unit economics.

The Governance Tax and Why It Sticks

Critics of the open-core model focus on the fork risk: if the open-source layer is good enough, a community could fork it and build a competitor that undercuts the commercial provider on price. This happened to MySQL (MariaDB), Redis (Valkey), and Elasticsearch (OpenSearch). It is a real risk.

But Databricks has structured its open-core strategy to minimize fork risk through a specific mechanism: the governance layer.

You can fork or repackage Spark. Multiple companies ship their own managed distributions, including Google (Dataproc), Amazon (EMR), and a dozen independent vendors. You cannot fork Unity Catalog in any meaningful sense, because Unity Catalog's value is not the software. It is the accumulated metadata, the organization-specific policies, and the years of lineage data that the software manages.

This is the crucial insight that distinguishes Databricks' open-core strategy from less successful implementations. The open-source layer (Spark, Delta Lake, MLflow, DBRX) is the commodity that drives adoption. The proprietary layer (Unity Catalog, the managed compute platform, the enterprise support and compliance infrastructure) is the moat that drives retention.

The governance tax is real, and enterprise customers understand they are paying it. A large financial institution that has spent 18 months mapping its data lineage in Unity Catalog, defining access policies for 200 data assets, and building compliance reporting against those policies has made a rational economic calculation: the cost of rebuilding that governance work on a different platform exceeds the premium Databricks charges. The tax is the moat.

The resentment is also real. Enterprise IT teams regularly complain in analyst surveys about Databricks' pricing leverage. Gartner's 2025 Magic Quadrant for Cloud Database Management Systems noted that "customers consistently cite high cost and pricing complexity as primary concerns with Databricks." This resentment is the inevitable consequence of a successful lock-in strategy. The customers who complain loudest are also the ones who renew.

The Imitators: Why This Playbook Is Everywhere but Rarely Executed as Well

The open-core playbook is now the dominant go-to-market template in developer-facing enterprise software. The list of companies following variants of the Databricks model is extensive:

Company	Open-Source Layer	Monetized Layer	2025 ARR
Elastic	Elasticsearch	Elastic Cloud, Security	~$1.2B
Confluent	Apache Kafka	Confluent Cloud, Stream Governance	~$900M
MongoDB	MongoDB Community	Atlas, Enterprise Advanced	~$2.0B
HashiCorp	Terraform, Vault	HCP, Terraform Cloud	~$700M
Grafana	Grafana OSS, Loki	Grafana Cloud, Enterprise Stack	~$300M
dbt Labs	dbt Core	dbt Cloud	~$150M

The pattern is consistent: open-source the data layer or compute layer, monetize the management and governance layer. But the execution quality varies dramatically, and the gap between Databricks and the imitators reveals what makes the playbook work at scale.

The first differentiator is sequence. Databricks did not open-source Spark and immediately try to sell Unity Catalog. It spent a decade building an ecosystem — developers, documentation, integrations, enterprise familiarity — before layering proprietary governance on top. Companies that try to accelerate the sequence find that enterprises are not willing to pay governance premiums for open-source projects without sufficient adoption depth.

The second differentiator is the stickiness gradient. Databricks' four-layer architecture creates increasing stickiness at each level: Spark is highly portable, Delta Lake is moderately portable, Unity Catalog is minimally portable, and AI training workflows built on Mosaic ML are effectively non-portable. This gradient ensures that enterprises enter the ecosystem at the low-friction, high-trust open-source layer and migrate toward the high-friction, high-value proprietary layers over time.

The third differentiator is technical excellence at the open-source layer. Spark genuinely was the best distributed computing framework when it was released. Delta Lake genuinely improved upon the alternatives. Companies that open-source mediocre software and expect the monetization layer to carry the business fail because the ecosystem never develops in the first place.

Confluent is the closest comparable to Databricks in execution quality, having built a $900M ARR business on the same foundation of an Apache project (Kafka) that it contributed to and still largely governs. But Confluent's lock-in mechanism — the cloud-native managed Kafka service plus Schema Registry and Stream Governance — is less structurally sticky than Unity Catalog because event streaming data governance is inherently less complex than general-purpose data governance.

The AI Training Bet: Can the Playbook Scale One More Time?

The $1.3B Mosaic ML acquisition was Databricks' bet that the open-source playbook can be extended to the AI era. The thesis deserves scrutiny.

The data infrastructure market that Databricks has dominated has specific characteristics: the data is organizational, large in volume, and effectively permanent. Once an enterprise's transaction data, clickstream data, and operational data are in Delta Lake, they stay there because moving them is expensive and risky. The switching costs compound over time.

AI training data has different characteristics. Training datasets are often assembled specifically for a training run, curated from multiple sources, and may not represent an ongoing organizational asset in the same way that operational data does. The stickiness of AI training infrastructure may be more dependent on model quality and infrastructure performance than on data gravity.

Databricks is betting that enterprise fine-tuning (training models on proprietary organizational data that already lives in Delta Lake) will be the dominant AI training use case, and that Unity Catalog's governance of that training data will create the same lock-in dynamic that it created for analytical data. This is a coherent thesis. The evidence from early enterprise AI deployments suggests that fine-tuning on proprietary data is, in fact, where most enterprise AI value is captured.

The risk is that the AI infrastructure market consolidates around cloud providers — AWS SageMaker, Google Vertex AI, Azure ML — rather than independent platforms. Enterprise AI training at scale requires GPU infrastructure that cloud providers can supply more cheaply than Databricks, which operates on top of the same clouds. Mosaic ML's training efficiency improvements may be durable competitive advantages or temporary ones as cloud providers close the gap.

The DBRX open-sourcing in March 2024 is the clearest signal of Databricks' strategic intent. DBRX was among the most capable open-weight models at the time of its release — surpassing several comparable-size models on key benchmarks. Making it free was a deliberate replication of the Spark strategy: build developer trust through open-source excellence, then monetize the infrastructure required to deploy and fine-tune the model at enterprise scale.

If this bet works, Databricks at $62B will look cheap. The AI training market is projected to reach $75-100B by 2030, and a company that owns the data governance layer for AI training data is structurally positioned to capture a significant fraction of that market.

What Comes After $62B

The open-core model has a ceiling, and Databricks is approaching it.

The ceiling is not revenue — $2.4B growing at 50% has significant runway. The ceiling is ecosystem saturation. Every major enterprise data infrastructure buyer has evaluated Databricks. The growth from new customer acquisition is slowing relative to expansion revenue from existing customers. The next phase of Databricks' growth depends on two vectors: winning more wallet share from existing customers through the AI training expansion, and defending the Unity Catalog governance moat against credible challengers.

Challengers to the governance moat are emerging. AWS Lake Formation and Google Dataplex are Amazon's and Google's answers to Unity Catalog, backed by cloud provider distribution and pricing power that independent platforms cannot match. Microsoft Purview is aggressively expanding its governance capabilities. None of these is yet as capable as Unity Catalog for Databricks-centric environments, but the gap is narrowing.

The IPO question is the most frequently discussed variable. Databricks has been preparing for a public offering for two years, and the $62B valuation reflects the expectation of a listing. An IPO would provide liquidity for early investors and employees, validate the business model in the public markets, and give Databricks the currency to make acquisitions. It would also subject the company to quarterly reporting requirements that would make the revenue trajectory publicly visible.

The more important question for the enterprise software industry is whether the playbook Databricks perfected can be applied to the next layer of the stack. AI governance — the equivalent of Unity Catalog for AI models rather than data — is a nascent market that follows the exact same logic. Open-source the model, open-source the evaluation tooling, then charge enterprises for the governance layer that manages model versions, tracks AI output lineage, enforces AI access policies, and provides the audit trails that regulators are beginning to require.

That market does not yet have a Databricks. The company that executes the open-core playbook for AI governance — with the same rigor and sequencing that Databricks applied to data governance — will build the next $62B company.

The Enduring Principle

The Databricks story is ultimately about a strategic insight that is simple to state and difficult to execute: in enterprise software, the product that users love does not need to be the product you sell. You just need to own the layer that sits between the thing users love and the organization that needs to govern it.

Users love Spark. Enterprises need governance of Spark clusters. Users love Delta Lake. Enterprises need access controls and lineage on Delta Lake tables. Developers love MLflow. Enterprises need audit trails and model versioning at scale.

Databricks gave engineers the tools they wanted and sold CIOs the compliance they needed. The engineers made the deployment decision. The CIOs made the contract decision. By satisfying both audiences simultaneously — but with different products at different price points — Databricks created a sales motion where the adoption and the monetization reinforce each other rather than competing.

The open-source bait-and-switch is not a deception. The open-source software is genuinely valuable and genuinely free. The enterprise features are genuinely worth paying for. The "switch" is not from free to paid on the same product — it is from a problem you did not know you had to a solution you cannot build yourself.

Snowflake built a business by solving the same problem — enterprise data management — with a proprietary approach. It got to $4B in revenue before the architecture war caught up with it. The Iceberg pivot is Snowflake's admission that the open-core model won. The question is whether Snowflake's pivot came in time to remain competitive, or whether the data gravity that Databricks has accumulated over a decade has already decided the outcome.

At $62B, the market has a view. Databricks gave away the engine. It owns the dashboard. That trade, executed across four product layers over twelve years, turned free software into one of the most defensible businesses in enterprise technology.

Frequently Asked Questions

How did Databricks reach a $62 billion valuation?

Databricks reached a $62 billion valuation through a combination of rapid revenue growth (approximately $2.4B in annualized revenue as of early 2026, up from $1.6B in 2024), a defensible open-core business model, and the strategic acquisition of Mosaic ML in 2023 for $1.3B. The company's open-source contributions (Apache Spark, Delta Lake, MLflow) created massive developer adoption at near-zero acquisition cost, and then Databricks monetized the governance and management layers that enterprises require on top of those open-source foundations. The $62B valuation reflects approximately 26x current annualized revenue ($62B divided by $2.4B), a multiple at the high end for high-growth enterprise data infrastructure companies.

What is the open-core business model and why is it effective?

The open-core model involves open-sourcing the foundational compute or runtime layer of a software product — which eliminates switching costs and drives bottom-up developer adoption — while charging for proprietary management, governance, security, and support layers on top. The model works because: (1) open-source adoption provides zero-cost distribution at scale, (2) enterprises that adopt the open-source layer inevitably need the enterprise features that only the original vendor provides, and (3) the governance and metadata layers are structurally stickier than the compute layer. Databricks executed this across four successive layers: Spark, Delta Lake, Unity Catalog, and Mosaic ML, each time expanding the surface area of monetizable enterprise features.

Why is Unity Catalog more important than Databricks' compute platform?

Unity Catalog is the metadata and governance layer that sits across all of Databricks' compute. Once an enterprise maps its data assets, access policies, lineage, and compliance rules into Unity Catalog, switching away from Databricks requires not just migrating compute workloads but re-building the entire governance architecture. This makes Unity Catalog dramatically stickier than the Spark or Delta Lake layers, which are technically portable. Governance metadata — data lineage, access policies, audit trails, semantic tags — is organizational knowledge that cannot be easily exported or replicated on another platform. It is the enterprise equivalent of a CRM's contact history: the accumulation is the moat.

What does Snowflake's pivot to Apache Iceberg mean for the competitive landscape?

Snowflake's announcement that it would natively support Apache Iceberg — the open table format that competes with Databricks' Delta Lake — is a strategic concession. It acknowledges that data gravity is shifting toward open formats that customers own and control, rather than proprietary formats that lock data inside a vendor's platform. Snowflake adopted Iceberg because it was losing deals to Databricks on architecture grounds: enterprises were choosing Delta Lake specifically because it is open and portable. By supporting Iceberg, Snowflake validated the open-format thesis. But it also complicated its own lock-in story, since the primary reason to pay Snowflake's premium was proprietary performance on proprietary storage. The Iceberg pivot buys Snowflake table-stakes parity; it does not change the strategic momentum in Databricks' favor.

How does the Mosaic ML acquisition position Databricks for AI?

The $1.3B Mosaic ML acquisition in 2023 gave Databricks LLM training and fine-tuning capabilities (specifically, the MPT model series and the MosaicML training platform) that slot directly into the enterprise data workflow. The strategic logic is a replay of the Spark-to-Delta Lake playbook: enterprises already running data workloads on Databricks can now train and fine-tune models on the same platform, using the same data governance layer (Unity Catalog), without moving data to an external AI vendor. This eliminates the data-export step that most enterprise AI projects require and positions Databricks as the single platform for data engineering, analytics, and AI model training. As AI training workloads scale, Databricks captures a larger share of enterprise compute spend without any additional customer acquisition cost.