DeepSeek Spent $5.6M Training a Model That Rivals GPT-4. The AI Cost Curve Just Broke.
A 150-person team in Hangzhou trained a 671-billion-parameter model for less than the cost of a Series A. NVIDIA lost $589 billion in a single day. Open-source models now match frontier performance at 1/100th the cost. The entire AI industry's margin thesis just got rewritten -- and the Jevons Paradox says demand will only accelerate.
On January 20, 2025, a company most of the Western tech world had never heard of released an AI model that matched or exceeded GPT-4 on several major benchmarks -- for roughly 1/14th the training cost. Seven days later, NVIDIA lost $589 billion in market capitalization in a single trading session, the largest single-day loss for any company in US stock market history.
The company was DeepSeek. The model was R1. The reported compute bill for the final training run of its base model, V3, was $5.6 million.
That number -- $5.6 million -- broke something fundamental in the AI industry's economic assumptions. Not because it was cheap. Because it was cheap and good. DeepSeek R1 scored 90.8% on MMLU versus GPT-4's 87.2%. It scored 79.8% on the AIME 2024 math competition versus GPT-4's 9.3%. It scored 97.3% on MATH-500. A 150-person team in Hangzhou, funded by a hedge fund, trained a 671-billion-parameter model that outperformed a model backed by over $13 billion in Microsoft investment.
This is the story of how the AI cost curve broke, what it means for every company building on foundation models, and why the economic consequences are the opposite of what most investors initially assumed.
The DeepSeek Origin Story: A Hedge Fund's Side Project
DeepSeek was founded by Liang Wenfeng, co-founder and chief executive of High-Flyer, a Chinese quantitative hedge fund managing approximately $8 billion in assets. High-Flyer had been accumulating Nvidia GPUs for years to run quantitative trading models. When the large language model wave hit in 2023, Liang redirected a portion of that compute toward building foundation models.
The organizational structure is unusual by Silicon Valley standards. DeepSeek operates with roughly 150-200 employees total. The core model team that built R1 comprised just 63 people, according to the R1 technical report's author list. There is no massive go-to-market apparatus. No enterprise sales team. No $200 million Series C. The company's 2025 revenue was $13.4 million -- less than what most frontier AI labs spend on a single training run.
But Liang wasn't optimizing for revenue. He was optimizing for research output per dollar. And the results suggest he found something the rest of the industry missed.
The Architecture: 671 Billion Parameters, 37 Billion Active
DeepSeek R1's headline parameter count is 671 billion. But the model uses a Mixture-of-Experts (MoE) architecture that activates only 37 billion parameters per token. This is the single most important technical detail in the entire DeepSeek story, because it explains how the economics work.
In a dense model, every parameter is active for every token, so every forward pass through the network requires computation across the full parameter space. (GPT-4, estimated at 1.8 trillion parameters, is itself widely reported to use a mixture of experts, though OpenAI has not confirmed its architecture.) In an MoE model, specialized "expert" sub-networks handle different types of inputs, and a learned routing mechanism selects which experts to activate for each token. The result: you get the knowledge capacity of a 671B-parameter model with roughly the per-token compute cost of a 37B-parameter model, although all 671 billion parameters must still be held in memory. The savings are not incremental. They are structural -- baked into the architecture itself.
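The routing idea can be sketched in a few lines of plain Python. This is a toy illustration, not DeepSeek's implementation: the expert count, the `top_k` value, the gate scores, and the function names `softmax` and `route_token` are all assumptions chosen for readability. Only the 671B and 37B figures come from the text.

```python
import math

def softmax(scores):
    """Convert raw gate scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(gate_scores, top_k):
    """Pick the top_k experts for one token and renormalize their weights.

    gate_scores: one raw score per expert (hypothetical router output).
    Returns a list of (expert_index, weight) pairs.
    """
    probs = softmax(gate_scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

# Toy example: 8 experts, 2 active per token (illustrative sizes only).
gate_scores = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3]
selected = route_token(gate_scores, top_k=2)

# Fraction of parameters active per token, using the figures in the text.
TOTAL_PARAMS_B = 671
ACTIVE_PARAMS_B = 37
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

print(selected)
print(f"active fraction: {active_fraction:.1%}")
```

Only the selected experts run for a given token, so per-token compute tracks the active parameter count (about 5.5% of the total here) rather than the headline count.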
DeepSeek also introduced several engineering innovations that compounded the efficiency advantage. Multi-head latent attention reduced the key-value cache during inference, lowering memory requirements. A novel load-balancing strategy across experts minimized wasted computation. FP8 mixed-precision training squeezed maximum throughput from each GPU hour. None of these techniques were individually revolutionary. Combined, they produced a training pipeline that extracted dramatically more capability per dollar of compute than any comparable system.
DeepSeek V3 -- the base model that R1 was built on -- was trained on 14.8 trillion tokens over approximately two months using 2,048 Nvidia H800 GPUs. The total compute cost for the final training run was $5.576 million, based on 2.788 million H800 GPU hours at an estimated $2 per GPU hour. R1 itself was then trained on top of V3 using reinforcement learning, adding additional cost but still keeping the total budget far below what any Western lab has spent on a frontier model.
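The headline figure is simple arithmetic on the numbers quoted above, and it can be checked directly. The sketch below uses only those numbers; the wall-clock estimate additionally assumes that all 2,048 GPUs ran continuously for the whole run, which is a simplification.

```python
GPU_HOURS = 2_788_000          # H800 GPU hours quoted for the final V3 run
RATE_USD_PER_GPU_HOUR = 2.0    # estimated rental rate quoted in the text
NUM_GPUS = 2_048               # cluster size quoted in the text

# Compute cost is GPU hours multiplied by the hourly rate.
compute_cost_usd = GPU_HOURS * RATE_USD_PER_GPU_HOUR

# Wall-clock time, assuming every GPU is busy for the entire run.
wall_clock_days = GPU_HOURS / NUM_GPUS / 24

print(f"compute cost: ${compute_cost_usd / 1e6:.3f}M")
print(f"wall-clock time: {wall_clock_days:.1f} days")
```

The result lands just under 57 days, consistent with "approximately two months". The figure covers rented-compute cost for one run only; it excludes salaries, earlier experiments, and hardware purchases.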
For context, here is what that looks like against the rest of the industry:
| Model | Estimated Training Cost | Organization |
|---|---|---|
| GPT-4 | $78-100M+ | OpenAI |
| GPT-5 | $500M per run, $1.25-2.5B total | OpenAI |
| Gemini Ultra | $30-50M (estimated) | Google |
| Llama 3.1 405B | $60-100M (estimated) | Meta |
| DeepSeek V3/R1 | $5.6M | DeepSeek |
That is not a marginal cost advantage. It is an order-of-magnitude structural break.
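To make the comparison concrete, the multiples can be computed from the table. This is a rough sketch using the table's own estimates; the dictionary layout and the helper name `cost_multiple` are illustrative choices, and the underlying figures are estimates rather than audited costs.

```python
DEEPSEEK_COST_M = 5.6  # USD millions, from the table

# (low, high) estimates in USD millions, copied from the table above.
estimates_m = {
    "GPT-4": (78, 100),
    "GPT-5 (per run)": (500, 500),
    "Gemini Ultra": (30, 50),
    "Llama 3.1 405B": (60, 100),
}

def cost_multiple(low_m, high_m, baseline_m=DEEPSEEK_COST_M):
    """Return how many times larger each estimate is than the baseline."""
    return low_m / baseline_m, high_m / baseline_m

multiples = {
    name: cost_multiple(low, high) for name, (low, high) in estimates_m.items()
}

for name, (low, high) in multiples.items():
    print(f"{name}: {low:.1f}x to {high:.1f}x the DeepSeek figure")
```

On these estimates the multiple ranges from roughly 5x (the low end for Gemini Ultra) to roughly 90x (a single GPT-5 run), with GPT-4 at about 14-18x.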
The Benchmark Results: What $5.6 Million Buys
The benchmark performance is what turned DeepSeek from a curiosity into a crisis for incumbent AI labs. The numbers, drawn from DeepSeek's technical report and independent evaluations:
MMLU (Massive Multitask Language Understanding): DeepSeek R1 scored 90.8%. GPT-4 scored 87.2%. This is the standard benchmark for broad knowledge and reasoning across 57 academic subjects.
AIME 2024 (American Invitational Mathematics Examination): R1 scored 79.8%. GPT-4 scored 9.3%. This is not a typo. On a competition-level math exam, DeepSeek outperformed GPT-4 by over 70 percentage points.
MATH-500: R1 scored 97.3%, demonstrating near-perfect performance on a comprehensive mathematics benchmark.
The subsequent model, DeepSeek V3.2-Speciale, pushed the frontier further. It scored 96.0% on AIME -- beating GPT-5-High's score of 94.6% on the same benchmark. A Chinese open-source model, built by a company of roughly 150-200 people, was outperforming OpenAI's flagship next-generation model on competitive mathematics.
These results are not cherry-picked for favorable benchmarks. R1 matches or exceeds GPT-4 across reasoning, coding, and general knowledge tasks. On coding specifically, R1 achieved a 2,029 Elo rating on Codeforces -- placing it in the top tier of competitive programmers and well above GPT-4's performance on equivalent coding benchmarks. On LiveCodeBench, which tests real-world coding ability, R1 also outperformed GPT-4o.
The areas where R1 trails closed models -- certain creative writing tasks, nuanced instruction following, and multilingual edge cases -- are precisely the areas where benchmark measurement is weakest and where subjective human preference plays the largest role. For the use cases that enterprise customers care about most -- data analysis, code generation, mathematical reasoning, and structured information extraction -- DeepSeek R1 is not just competitive. It is, by the numbers, superior to a model that cost 14-18x more to build.
The DeepSeek Shock: $589 Billion in a Day
January 27, 2025, was a Monday. It was the first US trading day after DeepSeek R1 went viral over the weekend. By market close, NVIDIA had fallen approximately 17%, wiping out $589 billion in market capitalization -- the largest single-day loss for any US company in history.
The total damage to US tech stocks that day was roughly $1 trillion. Broadcom dropped 17.4%. ASML fell 7%. The Nasdaq Composite dropped 3.1%. Siemens Energy, which had rallied on AI data center power demand, fell 20%. The sell-off was concentrated in the AI infrastructure complex -- the companies whose valuations depended on the assumption that training frontier models required billions of dollars in compute.
The logic behind the panic was straightforward: if DeepSeek could train a GPT-4-class model for $5.6 million, then the $100+ billion in planned AI infrastructure spending by Microsoft, Google, Amazon, and Meta might be dramatically overstated. Why would hyperscalers spend $60 billion each on GPU clusters if the models could be trained for 1/100th the price? Analysts at Bernstein called it "AI's Sputnik moment." SoftBank's Masayoshi Son compared it to the shock Japan felt when China first demonstrated advanced semiconductor capabilities.
But the panic was wrong. Or rather, it was asking the wrong question. The right question was not "will companies spend less on AI infrastructure?" It was "what happens when AI becomes 100x cheaper to deploy?"
The Recovery: Why NVIDIA Hit $5 Trillion Anyway
NVIDIA recovered its entire loss in less than a month. By October 2025, NVIDIA's market cap reached $5.03 trillion, making it the world's most valuable company. The stock didn't just recover -- it went on a historic run.
The reason is a concept that Jensen Huang articulated repeatedly in the weeks after the crash: the Jevons Paradox. Named after the 19th-century economist William Stanley Jevons, who observed in 1865 that improvements in steam engine efficiency increased total coal consumption rather than decreasing it, the paradox states that when a resource becomes cheaper to use, total demand rises faster than per-unit consumption falls.
Applied to AI: if training costs drop 100x, you don't get 100x less spending on training. You get 100x more models being trained. If inference costs drop 280x, you don't get 280x less spending on inference. You get inference embedded in every application, every workflow, every device -- consuming orders of magnitude more total compute.
Huang pointed out that reasoning models consume 100x more compute than standard inference. A standard chatbot query might generate 500-1,000 tokens. A chain-of-thought reasoning query generates 10,000-50,000 tokens. A multi-agent workflow orchestrating several models might generate 100,000+ tokens to complete a single task. When inference is cheap enough to run these architectures at scale -- when a 100,000-token reasoning chain costs $0.007 instead of $2.00 -- developers build systems that were previously economically impossible. Total demand does not decrease. It explodes.
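The per-query figures follow from per-million-token prices. The sketch below assumes the prices implied by the $2.00 and $0.007 figures ($20 and $0.07 per million tokens) and uses the token counts from the paragraph above; the workload labels and the helper name `query_cost_usd` are illustrative.

```python
def query_cost_usd(tokens, price_per_million_usd):
    """Cost of generating `tokens` tokens at a given per-million-token price."""
    return tokens / 1_000_000 * price_per_million_usd

OLD_PRICE = 20.00  # USD per million tokens, implied by the $2.00 figure
NEW_PRICE = 0.07   # USD per million tokens, implied by the $0.007 figure

# Token counts taken from the ranges quoted in the text (upper ends).
workloads = {
    "chatbot reply": 1_000,
    "reasoning chain": 50_000,
    "multi-agent task": 100_000,
}

for name, tokens in workloads.items():
    old = query_cost_usd(tokens, OLD_PRICE)
    new = query_cost_usd(tokens, NEW_PRICE)
    print(f"{name}: ${old:.4f} before, ${new:.5f} after")
```

At the old price a single multi-agent task costs dollars; at the new price it costs a fraction of a cent, which is what makes running such workloads at scale plausible.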
The macro numbers point the same way. AI is projected to consume 20% of US electricity by 2030, up from approximately 4% today. Data center construction in the US alone reached $28 billion in 2024, with Goldman Sachs projecting $35-45 billion annually through 2028, an increase of roughly 25-60%. You do not quintuple AI's share of electricity consumption and keep raising infrastructure spending if cheaper AI reduces demand.
The market understood this within weeks. The DeepSeek Shock was not a demand destruction event. It was a demand creation event. Every dollar saved on training was a dollar that could fund ten new experiments. Every 10x reduction in inference cost opened up a new category of application. The cost curve broke downward, and the demand curve broke upward. That is the Jevons Paradox in action.
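Whether cheaper compute raises or lowers total spending depends on how strongly demand responds to price. One simple way to illustrate this is a constant-elasticity demand curve. This is a textbook simplification introduced here for illustration, not a model from Huang or Jevons, and the elasticity values are assumptions rather than measurements.

```python
def total_spend_multiplier(price_drop_factor, elasticity):
    """Change in total spend when the unit price falls by price_drop_factor.

    Assumes constant-elasticity demand: quantity scales as price ** (-elasticity).
    A result above 1.0 means total spending rises despite the lower price.
    """
    price_ratio = 1.0 / price_drop_factor
    quantity_ratio = price_ratio ** (-elasticity)
    return price_ratio * quantity_ratio

# Hypothetical elasticities: inelastic, unit-elastic, and elastic demand.
for elasticity in (0.5, 1.0, 1.5):
    multiplier = total_spend_multiplier(10, elasticity)
    print(f"elasticity {elasticity}: spend changes by {multiplier:.2f}x")
```

Under this model the Jevons outcome appears only when elasticity exceeds 1: a 10x price cut then increases total spend, while with elasticity below 1 the same cut shrinks it. The argument in this section amounts to a claim that demand for AI compute is in the elastic regime.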
The Inference Cost Collapse: 280x in Two Years
The DeepSeek story fits into a broader cost collapse that has been accelerating since 2022. Between November 2022 and October 2024, the cost of LLM inference dropped approximately 280x -- from roughly $20 per million tokens to $0.07 per million tokens. Over those 23 months, that works out to an annualized decline of roughly 17-19x, ahead of even a 10x-per-year pace and far outpacing Moore's Law.
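The annualized rate can be derived from the two price points and the elapsed time. The sketch below assumes a smooth exponential decline between the endpoints, which is a simplification; real prices fell in steps as new models launched.

```python
START_PRICE = 20.00  # USD per million tokens, November 2022
END_PRICE = 0.07     # USD per million tokens, October 2024
MONTHS_ELAPSED = 23  # November 2022 to October 2024

# Total price drop over the period.
total_drop = START_PRICE / END_PRICE

# Equivalent constant per-year factor, assuming exponential decline.
annual_factor = total_drop ** (12 / MONTHS_ELAPSED)

print(f"total drop: {total_drop:.0f}x")
print(f"annualized: {annual_factor:.1f}x per year")
```

On these endpoints the annualized factor comes out in the high teens, which is faster than a 10x-per-year rule of thumb; a sustained 10x per year would yield only about a 100x drop over two years.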
Current API pricing tells the story:
| Provider | Model | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) |
|---|---|---|---|
| DeepSeek | V3 | $0.28 | $0.42 |
| OpenAI | GPT-5.2 | $1.75 | $3.50 |
| Anthropic | Claude Opus | $5.00 | $25.00 |
| OpenAI | GPT-4o | $2.50 | $10.00 |
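On the prices above, DeepSeek V3 is roughly 6-18x cheaper than the closed models on input tokens and roughly 8-60x cheaper on output tokens. For a company processing 100 million input tokens per day (about 3 billion per month), that is the difference between an $840 monthly inference bill on DeepSeek V3 and a $15,000 one on Claude Opus. At enterprise scale, the margin impact is existential.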
This cost collapse is not just about DeepSeek. It reflects a structural trend: open-source and open-weight models are commoditizing the inference layer. The decline follows a predictable curve -- roughly 10x per year -- driven by algorithmic improvements, hardware efficiency gains, quantization techniques, and competitive pressure from open-source alternatives. When any developer can deploy a GPT-4-class model on their own infrastructure for pennies per query, the value shifts from the model to the application layer -- the workflow, the data, the user experience built on top.
For enterprise buyers, the pricing implications are immediate and measurable. A mid-size SaaS company processing 500 million tokens per month would pay approximately $140 using DeepSeek's API, $875 using GPT-5.2, and $2,500 using Claude Opus, counting every token at the input rate. At 5 billion tokens per month -- typical for a company with AI features embedded across multiple products -- the gap widens to $1,400 versus $8,750 versus $25,000. Output-heavy workloads widen it further, because the output-price gap is larger. These are not rounding errors. They are the difference between AI features being a profit center and a cost center.
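These bills can be reproduced, and adapted to a different input/output mix, with a small calculator. The prices are copied from the pricing table; the function name `monthly_bill_usd` and the decision to treat all 500 million tokens as input are assumptions made to match the figures quoted above.

```python
# (input, output) prices in USD per 1M tokens, copied from the pricing table.
PRICES = {
    "DeepSeek V3": (0.28, 0.42),
    "GPT-5.2": (1.75, 3.50),
    "Claude Opus": (5.00, 25.00),
    "GPT-4o": (2.50, 10.00),
}

def monthly_bill_usd(model, input_tokens, output_tokens=0):
    """Monthly API bill for a given number of input and output tokens."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# 500 million tokens per month, all billed at the input rate.
for model in PRICES:
    bill = monthly_bill_usd(model, input_tokens=500_000_000)
    print(f"{model}: ${bill:,.0f} per month")
```

Passing a nonzero `output_tokens` shows how quickly the gap grows for generation-heavy workloads, since output prices differ by a wider margin than input prices.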
The Open-Source Convergence: 89.6% of Closed Performance
The most strategically significant finding in the past 18 months is how fast open-source models are converging with closed frontier models. The data is unambiguous:
- Open-source models now average 89.6% of closed-model performance across standard benchmarks
- On MMLU, the gap between the best open and closed models shrank from 17.5 points to 0.3 points in a single year
- The average time for an open-source model to match a new closed-model benchmark result dropped from 27 weeks to 13 weeks
- Alibaba's Qwen model family has surpassed 700 million downloads on Hugging Face with over 113,000 derivative models built on top
- Chinese-origin models overtook US-origin models in total Hugging Face downloads by summer 2025
This convergence has a compounding dynamic. Every time an open-source model achieves a new capability, thousands of developers fine-tune it, distill it, and deploy it. The 113,000+ derivative models built on Qwen represent 113,000 experiments in optimization that feed back into the broader ecosystem. Closed-model labs cannot match this distributed R&D effort at any price.
DeepSeek R1 itself is the proof case. As an open-weight model, it has been fine-tuned for legal analysis, medical diagnosis, financial modeling, and dozens of other vertical applications within weeks of release. Each derivative model makes the open ecosystem more valuable -- and makes the premium that closed-model providers can charge harder to justify.
The speed of this convergence has stunned even optimistic open-source advocates. In January 2024, the best open-source model (Mixtral 8x7B) trailed GPT-4 by double digits on most benchmarks. By January 2025, DeepSeek R1 had closed -- and in some cases reversed -- that gap entirely. The implication for closed-model providers is stark: every new capability you ship becomes an open-source capability within one quarter. Your research budget is, in effect, an R&D subsidy for the entire ecosystem.
Meta's Reversal: The Limits of Open Source at Scale
If open source is winning, why did Meta reverse course?
In mid-2025, after the disappointing reception of Llama 4, Meta began developing a proprietary model internally codenamed "Avocado." Mark Zuckerberg reportedly authorized compensation packages exceeding $100 million to recruit top AI researchers from Google DeepMind and OpenAI.
The shift reflects a hard truth about the economics of open-source AI at the frontier. Meta spent an estimated $60-100 million training Llama 3.1 405B. It received significant goodwill, developer adoption, and ecosystem benefits. But it did not receive revenue. When competitors like DeepSeek can match your open-source output at 1/10th the cost, the strategic value of releasing models openly starts to diminish. You are subsidizing an ecosystem that benefits everyone except your shareholders.
Meta's pivot does not invalidate the open-source convergence thesis. It validates it. If open-source models from DeepSeek, Qwen, and others are reaching frontier performance without Meta's subsidy, then Meta's open-source investment is no longer a competitive differentiator. The rational response is to go proprietary where you have unique advantages -- data, distribution, integration with 3.9 billion monthly active users -- and let the open-source ecosystem commoditize the base layer on its own.
The Geopolitical Dimension: Export Bans and Chip Smuggling
DeepSeek's success has a geopolitical dimension that cannot be separated from the technical story.
The Biden administration banned the export of Nvidia H800 GPUs to China in October 2023. The H800 was itself a downgraded version of the H100, designed specifically to comply with earlier export controls. DeepSeek trained R1 on H800 GPUs that were acquired before the ban took effect -- High-Flyer had been stockpiling hardware for its quantitative trading operations.
The Trump administration reversed the ban in December 2025, citing concerns that export controls were accelerating Chinese self-sufficiency in chip design rather than constraining it. The DeepSeek models served as Exhibit A: the ban was supposed to prevent China from building competitive AI systems, and instead China produced models that outperformed American ones on key benchmarks.
DeepSeek is reportedly under investigation for potential chip smuggling -- specifically, whether H100 or A100 GPUs banned under export controls were used in training. The company has denied this. Singapore-based intermediaries and cloud providers have also faced scrutiny for potentially facilitating access to restricted chips.
Regardless of the investigation's outcome, the strategic implication is clear: export controls did not prevent China from reaching frontier AI capability. They may have accelerated the efficiency innovations that made DeepSeek possible by forcing Chinese labs to extract maximum performance from constrained hardware. When you cannot buy the top-tier chip, you build better software to compensate. DeepSeek's MoE architecture, its FP8 training pipeline, and its memory-efficient attention mechanisms all bear the fingerprints of a team engineering around hardware constraints rather than throwing compute at the problem.
The Data Wall: Where Efficiency Meets Its Limit
The efficiency gains that made DeepSeek possible may face a natural ceiling. Epoch AI projects that high-quality text data -- the raw material for pre-training large language models -- will be substantially exhausted between 2026 and 2028. The internet generates enormous quantities of text daily, but the subset that is high-quality, diverse, and suitable for training is finite and increasingly picked over.
This data wall affects all model developers, open and closed. But it disproportionately affects companies pursuing the "scale is all you need" strategy -- training ever-larger models on ever-larger datasets. If the data runs out, scaling laws hit a ceiling, and the returns to additional compute diminish sharply.
DeepSeek's approach -- achieving frontier performance through architectural efficiency rather than brute-force scale -- may prove prescient. The MoE architecture, aggressive distillation, and optimization techniques that produced R1 are data-efficient strategies. They extract more capability per training token. If the data wall arrives on schedule, the labs that optimized for efficiency rather than scale will have a structural advantage.
The industry is already responding. Synthetic data generation -- using existing models to create training data for new models -- has emerged as a partial solution. But synthetic data introduces its own risks: model collapse, where training on AI-generated text degrades output quality over successive generations. The labs that navigated this challenge most effectively in 2025 were, again, the ones focused on efficiency -- extracting more signal from less data, rather than drowning the problem in volume.
High-Flyer's Returns: The Hedge Fund Connection
The financial returns to DeepSeek's parent company tell their own story. High-Flyer's quantitative hedge funds surged 57% in 2025, a performance that coincides with -- and is likely partially driven by -- access to frontier AI models for trading strategy development.
This creates a unique funding model. Most AI labs burn cash: OpenAI's annual expenses exceed $8.5 billion, and Anthropic has raised over $15 billion in venture capital. DeepSeek's parent company generates its own capital through fund returns. The AI lab is effectively self-funding, with a hedge fund as the cash flow engine and the AI models serving dual purposes -- commercial API revenue ($13.4 million in 2025) and proprietary trading edge.
It is a model that no Silicon Valley AI lab can replicate, because no Silicon Valley AI lab is attached to an $8 billion hedge fund that benefits directly from the models it builds. The misalignment between investor expectations and research timelines that plagues companies like OpenAI and Stability AI does not exist at DeepSeek. The research pays for itself through a different revenue stream entirely.
What This Means for the AI Industry's Margin Structure
The DeepSeek shock rewrites three assumptions that underpinned the AI industry's financial model:
Assumption 1: Frontier AI requires frontier capital. DeepSeek proved this wrong. $5.6 million in compute, 63 researchers, and architectural innovation produced a model that rivals systems built with 100x the budget. The implication: the barrier to entry for building competitive AI models is collapsing. The number of organizations capable of training frontier-class models is about to expand dramatically.
Assumption 2: Closed-model providers can sustain premium pricing indefinitely. When open-source models deliver 89.6% of closed-model performance at a small fraction of the price (roughly 1/6th to 1/60th on the API prices listed earlier, depending on the model and on whether input or output tokens are compared), the pricing power of closed-model APIs erodes. OpenAI's revenue ($12.7 billion annualized as of late 2025) depends on enterprise customers paying premium prices for marginal performance advantages. As the open-source gap shrinks from 10% to 5% to 2%, the willingness to pay that premium will shrink with it. The analogy is cloud computing in the 2010s: early cloud providers charged substantial premiums, but commoditization drove margins down relentlessly. The same dynamic is now playing out in AI model APIs, just faster -- compressed from a decade to 18 months.
Assumption 3: AI infrastructure spending is a bubble. This is the assumption the market made on January 27, 2025, when it wiped $1 trillion from US tech stocks. And it was the assumption the market reversed within weeks. The Jevons Paradox is real. Cheaper AI does not mean less infrastructure spending. It means more AI deployed in more places, consuming more total compute. The infrastructure buildout is not a bubble -- it is an underestimate.
The 13-Week Countdown
Perhaps the most consequential number in this entire analysis is 13. That is the average number of weeks it now takes for an open-source model to match a newly released closed-model benchmark. Down from 27 weeks just a year earlier. Shrinking every quarter.
This number should be alarming to every closed-model provider. It means that any proprietary advantage a closed-model lab establishes is now a depreciating asset with a half-life of roughly three months. OpenAI releases GPT-5 in September. By December, open-source alternatives match its performance on most benchmarks. By March, they exceed it on several. The $500 million you spent on that training run bought you a 90-day head start -- and the head start is getting shorter.
The dynamic is asymmetric in a way that favors open source structurally. When OpenAI or Anthropic publishes a technical paper describing a new technique -- or when independent researchers reverse-engineer a capability improvement through benchmark analysis -- the open-source community can implement that technique across dozens of model families simultaneously. One research insight from a closed lab becomes a capability improvement across hundreds of open-source models. The closed lab gets a brief lead. The ecosystem gets a permanent upgrade.
This is already visible in the data. DeepSeek V3.2-Speciale, scoring 96.0% on AIME, did not just match GPT-5 -- it beat GPT-5-High's 94.6%. The response from the open-source community was not surprise. It was expectation. The 13-week countdown had, in that case, compressed to less than 8 weeks.
What Comes Next
The implications for competitive strategy are severe and immediate. If your moat is model performance, you have 13 weeks of runway -- and that window is closing. If your moat is data, distribution, workflow integration, or user trust, you have something more durable. The companies that survive the cost curve break will be those that treat model intelligence as an input -- a commodity utility, like electricity or bandwidth -- and build differentiated value in the layers above it.
OpenAI's pivot to consumer products (ChatGPT as a platform, with memory, plugins, and agentic features) is one response. Anthropic's focus on safety and enterprise trust is another. Google's integration of Gemini across Search, Workspace, and Cloud is a third. Each is an acknowledgment that the model alone is not enough.
The DeepSeek story is not just about one model from one Chinese lab. It is about the structural economics of intelligence becoming a commodity -- and the race to build defensible businesses on top of a layer that is rapidly approaching zero marginal cost. A 150-person team in Hangzhou spent $5.6 million and produced a model that rivaled the output of organizations spending 100x more. The gap between what is possible and what it costs to achieve it has never been wider -- and it is widening every quarter.
The cost curve did not bend. It broke. And the companies that understand the Jevons Paradox -- that cheaper intelligence creates more demand for intelligence, not less -- will be the ones that capture the value on the other side.
Frequently Asked Questions
What is DeepSeek R1 and who made it?
DeepSeek R1 is a 671-billion-parameter large language model released on January 20, 2025, by DeepSeek, an AI lab based in Hangzhou, China. The company was founded by Liang Wenfeng, co-founder of High-Flyer, a quantitative hedge fund managing approximately $8 billion in assets. DeepSeek operates with roughly 150-200 employees and a core model team of just 63 people. R1 uses a Mixture-of-Experts (MoE) architecture that activates only 37 billion parameters per token, making it far cheaper to run per token than a dense model of comparable size. Its base model, DeepSeek V3, was trained on 2,048 Nvidia H800 GPUs for approximately 2.788 million GPU hours.
How much did DeepSeek R1 cost to train?
The final training run of DeepSeek V3, the base model for R1, cost approximately $5.6 million in compute, based on 2.788 million H800 GPU hours at an estimated $2 per GPU hour. R1's reinforcement learning stage added further cost on top of that. For comparison, GPT-4 is estimated to have cost $78-100 million or more to train, and GPT-5 reportedly cost $500 million per training run with total development costs of $1.25-2.5 billion. That makes the DeepSeek figure roughly 14-18x lower than GPT-4's, about 90x lower than a single GPT-5 training run, and roughly 220-450x lower than GPT-5's total development cost. The low training cost was achieved through the MoE architecture, aggressive engineering optimization, and the fact that DeepSeek's parent company High-Flyer had already accumulated significant GPU resources before the US export ban on H800 chips.
How does DeepSeek compare to GPT-4 on benchmarks?
DeepSeek R1 outperforms GPT-4 on several major benchmarks. On MMLU (Massive Multitask Language Understanding), R1 scores 90.8% versus GPT-4's 87.2%. On AIME 2024 (a competitive mathematics exam), R1 scores 79.8% compared to GPT-4's 9.3% -- a gap of over 70 percentage points. On MATH-500, R1 scores 97.3%. The subsequent DeepSeek V3.2-Speciale model scored 96.0% on AIME, beating even GPT-5-High's 94.6%. These results demonstrate that a model trained for $5.6 million can match or exceed models that cost 10-100x more to develop.
What was the DeepSeek stock market crash?
On January 27, 2025 -- the first trading day after DeepSeek R1 gained viral attention -- NVIDIA's stock fell approximately 17%, erasing $589 billion in market capitalization in a single session. This was the largest single-day market cap loss for any company in US stock market history. The broader US tech sector lost roughly $1 trillion in value that day, as investors recalculated whether the massive capital expenditures planned for AI infrastructure were justified if models could be trained at a fraction of the assumed cost. However, NVIDIA recovered fully in less than a month and went on to reach a $5.03 trillion market cap by October 2025, as the market concluded that cheaper AI would drive more demand, not less.
What is the Jevons Paradox in AI?
The Jevons Paradox, originally observed by economist William Stanley Jevons in 1865, states that when a resource becomes more efficient to use, total consumption of that resource increases rather than decreases. In AI, this means that as model training and inference costs decline -- inference costs fell 280x from $20 to $0.07 per million tokens between November 2022 and October 2024 -- total AI compute demand grows dramatically. Jensen Huang has noted that reasoning models consume 100x more compute than standard inference. AI is projected to consume 20% of US electricity by 2030. Cheaper models do not reduce infrastructure spending; they expand the addressable market for AI applications, creating net new demand that exceeds the efficiency gains.
Is open-source AI catching up to closed models?
Yes, and the gap is closing rapidly. Open-source models now average 89.6% of closed-model performance across standard benchmarks. On MMLU specifically, the gap between the best open and closed models shrank from 17.5 points to just 0.3 points in a single year. The average time for an open-source model to match a new closed-model benchmark dropped from 27 weeks to 13 weeks. Alibaba's Qwen family has surpassed 700 million downloads on Hugging Face with over 113,000 derivative models, and Chinese-origin models overtook US-origin models in total Hugging Face downloads by summer 2025. DeepSeek R1 itself, as an open-weight model, demonstrated that frontier-level performance no longer requires frontier-level budgets.