San Francisco, CA - July 17, 2025 - For years, the steady pursuit of scale (larger models, more data, greater compute) has been the main driver of advances in artificial intelligence, especially large language models (LLMs). This scaling paradigm, guided by predictable "scaling laws," has become so entrenched that early triumphs at leading AI labs such as OpenAI were ascribed to an almost fanatical belief in its power. Yet recent breakthroughs and mounting scrutiny suggest this era of predictable, compute-driven performance gains may be giving way to something new, prompting a reassessment of how AI development will proceed.
Scaling laws, at their heart, describe a power-law relationship between an LLM's test loss (or related performance metrics) and quantities such as model parameters, dataset size, or training compute: as those quantities grow, loss falls along a smooth, predictable curve. As laid out in seminal research, "With enough training data, scaling of validation loss should be approximately a smooth power law as a function of model size" [4]. This predictability has enabled strategic bets on ever more ambitious training runs, producing models with stronger capabilities. Early analysis by Kaplan et al. [1] showed that LLM performance, measured via test loss on datasets like WebText2, improves smoothly across eight orders of magnitude in compute, six in model size, and two in dataset size. This foundational work established the principle that optimal performance arises when model size, data, and compute are scaled together; fixating on any single factor yields diminishing returns.
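Concretely, [1] fits a separate power law for each bottleneck. Written out, with the approximate constants reported there for WebText2, they take the form:

```latex
% Power-law fits from Kaplan et al. [1] (approximate fitted constants, WebText2)
L(N) = \left( \frac{N_c}{N} \right)^{\alpha_N}, \qquad
  \alpha_N \approx 0.076,\;\; N_c \approx 8.8 \times 10^{13} \text{ non-embedding parameters}

L(D) = \left( \frac{D_c}{D} \right)^{\alpha_D}, \qquad
  \alpha_D \approx 0.095,\;\; D_c \approx 5.4 \times 10^{13} \text{ tokens}

L(C_{\min}) = \left( \frac{C_c}{C_{\min}} \right)^{\alpha_C}, \qquad
  \alpha_C \approx 0.050,\;\; C_c \approx 3.1 \times 10^{8} \text{ PF-days}
```

Each law holds only while the other two quantities are not the bottleneck, which is precisely why [1] emphasizes scaling all three together.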
A common misunderstanding of these scaling laws is the notion that LLM quality rises exponentially with logarithmic increases in compute. The opposite is closer to the truth: plotted against scale on a linear axis, the quality curve flattens rapidly, and each further increment of performance demands exponentially more resources. As noted in [1], "Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence." In other words, larger LLMs can reach comparable performance with less data, and training to full convergence may be sub-optimal. For example, a roughly eight-fold increase in model size demands only a five-fold rise in training data to avoid overfitting.
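A quick numeric illustration makes the flattening concrete. Using the compute exponent from [1] (the normalization constant only sets the absolute numbers), every ten-fold increase in compute multiplies loss by the same factor of 10^(-0.05) ≈ 0.89, so the absolute improvement shrinks each decade:

```python
# Diminishing returns under a compute power law, L(C) = (C_c / C)^alpha_C.
# alpha_C ~ 0.05 per Kaplan et al. [1]; C_c is the fitted constant from [1]
# and matters here only for the absolute loss values, not the trend.

ALPHA_C = 0.050
C_C = 3.1e8  # PF-days

def loss(compute_pf_days: float) -> float:
    """Test loss predicted by the power-law fit (illustrative only)."""
    return (C_C / compute_pf_days) ** ALPHA_C

# Each 10x increase in compute buys a smaller absolute loss reduction.
prev = None
for exp in range(0, 7):
    c = 10.0 ** exp
    l = loss(c)
    delta = "" if prev is None else f"  (improvement: {prev - l:.4f})"
    print(f"compute = 1e{exp} PF-days  ->  loss = {l:.4f}{delta}")
    prev = l
```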
A pivotal re-evaluation of this scaling dynamic arrived with the "Chinchilla" paper [6] in 2022. Its authors argued that compute-optimal training requires scaling model size and data size in equal proportion, and their results suggested many existing LLMs were "undertrained" for their scale; the scaling laws fitted in the study indicated that a model like Gopher would have benefited from a dataset twenty times larger. To test this, they trained Chinchilla, a 70-billion-parameter LLM, on 1.4 trillion tokens and showed that, despite being four times smaller than Gopher, it consistently outperformed it. This highlighted a crucial shift: the "amount of training data that is projected to be needed is far beyond what is currently used to train large models" [6]. Chinchilla's findings have since become a reference point in AI research for optimizing pre-training efficiency.
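Chinchilla's fits reduce to a convenient rule of thumb: roughly 20 training tokens per parameter, with model and data grown in equal proportion as compute grows. A minimal sketch, using the standard C ≈ 6ND approximation for dense-transformer training FLOPs (a common heuristic, not specific to [6]):

```python
# Compute-optimal data budget under the Chinchilla heuristic [6]:
# scale parameters and tokens together, roughly 20 tokens per parameter.

TOKENS_PER_PARAM = 20  # approximate ratio implied by the Chinchilla fits

def optimal_tokens(n_params: float) -> float:
    """Training tokens suggested by the ~20 tokens/parameter heuristic."""
    return TOKENS_PER_PARAM * n_params

def training_flops(n_params: float, n_tokens: float) -> float:
    """Standard C ~= 6 * N * D approximation for dense transformers."""
    return 6 * n_params * n_tokens

# Chinchilla itself: 70B parameters -> ~1.4T tokens, matching the paper.
n = 70e9
d = optimal_tokens(n)
print(f"{n:.0e} params -> {d:.1e} tokens, ~{training_flops(n, d):.2e} FLOPs")
```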
Even with these established principles, the latter half of 2024 saw rising skepticism about the continued potency of scaling, with some positing that AI research, and scaling laws in particular, might be hitting fundamental limits; major media outlets reflected this mood.
This doubt was amplified by prominent voices in the AI community. Ilya Sutskever, a co-founder of OpenAI, famously said during his NeurIPS '24 test-of-time award speech that "pretraining as we know it will end." In contrast, leaders like Anthropic CEO Dario Amodei and OpenAI CEO Sam Altman have publicly asserted that scaling is likely to persist. As analyst Nathan Lambert observed, "Both narratives can be true: Scaling is still working at a technical level. The rate of improvement for users is slowing." This tension suggests that while the underlying technical mechanics of scaling may still function, translating them into tangible, exponential user-facing gains is becoming increasingly tough.
A key element behind this perceived slowdown is the gap between pre-training test loss and downstream task performance. Scaling laws mainly track the smooth reduction of test loss during pre-training. Yet, as [7] notes, "Practitioners often use downstream benchmark accuracy as a proxy for model quality and not loss on perplexity evaluation sets." The direct impact of marginally lower test loss on an LLM's real-world abilities, especially in general chat or complex downstream tasks, remains unclear. Researchers at top labs, for example, often chase highly specific, advanced capabilities, such as writing a PhD thesis or solving intricate mathematical problems, which may not map neatly onto modest test-loss improvements.
Another pressing worry is the "data death" hypothesis put forward by Sutskever, who argues that while compute capacity soars, the supply of high-quality, novel pre-training data, largely harvested from web scraping, is not keeping up. His conclusion, that "we have achieved peak data," is prompting exploration of alternatives like synthetically generated data that could extend scaling by "several orders of magnitude."
Against this backdrop, innovative approaches are surfacing to reignite progress. Researchers in [7] are crafting "practical scaling laws" intended to predict LLM performance directly on downstream benchmarks, offering a more direct gauge of practical progress than test loss alone.
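The exact functional forms used in [7] are beyond this article's scope, but the general idea can be sketched: fit a saturating curve that maps log-compute directly to benchmark accuracy on smaller runs, then extrapolate before paying for the big one. The sigmoid form and the data points below are illustrative assumptions, not the fits from [7]:

```python
# Sketch: a "practical scaling law" predicting downstream benchmark
# accuracy from training compute, rather than predicting test loss.
import numpy as np
from scipy.optimize import curve_fit

def benchmark_curve(log_c, acc_max, midpoint, slope):
    # Accuracy saturates toward acc_max as compute grows; random-chance
    # floors and other refinements are omitted for brevity.
    return acc_max / (1.0 + np.exp(-slope * (log_c - midpoint)))

# Hypothetical (compute, accuracy) observations from smaller training runs.
log_compute = np.log10([1e20, 1e21, 1e22, 1e23])
accuracy = np.array([0.28, 0.41, 0.62, 0.74])

params, _ = curve_fit(benchmark_curve, log_compute, accuracy, p0=[0.9, 22, 1.0])

# Extrapolate to a larger training run before committing to it.
print("predicted accuracy at 1e25 FLOPs:", benchmark_curve(25, *params))
```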
Moreover, the architectural landscape is shifting rapidly. DeepSeek-v3, a 671-billion-parameter Mixture-of-Experts (MoE) model pre-trained on 14.8 trillion tokens, exemplifies this, matching or surpassing leading models like GPT-4o and Claude-3.5-Sonnet on many benchmarks. Its breakthroughs include an optimized MoE design carried over from DeepSeek-v2, a novel auxiliary-loss-free load-balancing strategy, multi-token prediction training, and distillation of reasoning abilities from long-chain-of-thought models (akin to OpenAI's o1). Such models underscore that scaling now hinges not just on raw parameter count but also on architectural innovation and training techniques. Successfully training models of this magnitude demands not only more GPUs but a sophisticated, multidisciplinary engineering effort. As Ege Erdil notes, "At every order-of-magnitude scale-up, different innovations have to be found."
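DeepSeek-v3's precise routing and balancing scheme is described in its technical report; as a generic illustration of why MoE decouples parameter count from per-token compute, here is a minimal top-k routed MoE layer in PyTorch. All sizes and the routing details are illustrative, not DeepSeek-v3's:

```python
# Generic sparse Mixture-of-Experts layer with top-k routing.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model)
        scores = self.router(x)                            # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)     # each token picks k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                      # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

# Only top_k of n_experts FFNs run per token, so total parameter count can
# grow far faster than per-token compute.
layer = MoELayer(d_model=512, d_ff=2048)
print(layer(torch.randn(16, 512)).shape)  # torch.Size([16, 512])
```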
OpenAI's "o1" reasoning model marks another major departure, concentrating on boosting complex reasoning via reinforcement learning. The o1 model "thinks before it answers," generating an internal chain of thought prior to responding. This approach has yielded striking results, placing it in the 89th percentile on Codeforces competitive programming questions and surpassing human PhD candidates on graduate-level physics, biology, and chemistry problems (GPQA). The subsequent "o3" model, likely a scaled version of o1 with more compute invested in reinforcement learning, has achieved unprecedented scores on benchmarks such as ARC-AGI (87.5% accuracy, above the 85% human-level baseline), SWE-Bench Verified (71.7% accuracy), and Codeforces (an Elo of 2727). This represents a "completely new scaling paradigm," where progress derives not only from pre-training scale but also from "more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute)" [22].
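OpenAI has not published o1's training recipe, but the test-time-compute idea can be illustrated with a simple public technique: self-consistency sampling, which draws several independent chains of thought and majority-votes the final answer. The `generate` stub below is a hypothetical placeholder for any LLM sampling API:

```python
# Sketch of one public way to spend more test-time compute: self-consistency.
# This is not o1's actual (unpublished) mechanism, only an illustration of
# trading extra inference compute for accuracy.
from collections import Counter

def generate(prompt: str, temperature: float = 0.8) -> str:
    """Hypothetical LLM call returning a completion ending in 'ANSWER: <x>'."""
    raise NotImplementedError("plug in your model API here")

def extract_answer(completion: str) -> str:
    return completion.rsplit("ANSWER:", 1)[-1].strip()

def self_consistency(prompt: str, n_samples: int = 16) -> str:
    # More samples = more test-time compute = (often) higher accuracy.
    answers = [extract_answer(generate(prompt)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```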
Alongside these developments, multimodal LLMs (MLLMs) are rapidly gaining ground. These models fuse multiple data types, or "modalities"-such as text, images, audio, and video-into a single processing framework. Though still emerging, MLLMs open doors to sophisticated applications like image captioning, meme explanation, and extracting information from complex documents.
Architecturally, MLLMs mainly follow two approaches. The first is a unified embedding-decoder design, in which a vision encoder's outputs are projected into the same embedding space as the text tokens, so an otherwise unmodified decoder processes one combined sequence. The second is a cross-modality attention design, in which image features are injected into the language model through dedicated cross-attention layers. A minimal sketch of the first approach appears below.
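In this sketch of the unified embedding-decoder idea, all dimensions are illustrative; real systems pair a pretrained vision encoder (such as a ViT) with a pretrained LLM backbone:

```python
# Unified embedding-decoder approach: image patch features are linearly
# projected into the LLM's token-embedding space and prepended to the text
# embeddings, so a standard decoder consumes one combined sequence.
import torch
import torch.nn as nn

class UnifiedMultimodalInput(nn.Module):
    def __init__(self, vision_dim: int = 768, llm_dim: int = 4096, vocab: int = 32000):
        super().__init__()
        self.projector = nn.Linear(vision_dim, llm_dim)  # patches -> "image tokens"
        self.token_emb = nn.Embedding(vocab, llm_dim)

    def forward(self, patch_feats: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
        img_tokens = self.projector(patch_feats)   # (n_patches, llm_dim)
        txt_tokens = self.token_emb(text_ids)      # (n_text, llm_dim)
        # One sequence; the decoder itself needs no architectural changes.
        return torch.cat([img_tokens, txt_tokens], dim=0)

prep = UnifiedMultimodalInput()
seq = prep(torch.randn(196, 768), torch.randint(0, 32000, (12,)))
print(seq.shape)  # torch.Size([208, 4096])
```

In the cross-attention alternative, the projected image features would instead feed added cross-attention layers inside the decoder blocks, leaving the text embedding path untouched.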
Recent work showcases the breadth and rapid evolution of MLLM development across both of these designs.
The architectural components and training methods across MLLMs differ markedly, making direct performance comparisons difficult due to divergent benchmarks and pervasive data contamination. Nonetheless, the overarching takeaway is that multimodal LLMs can be successfully built through a wide array of designs, reflecting a field rich with innovation beyond the classic scaling of text-only models.
The current debate over scaling laws signals a maturation of the AI sector. While investments in massive pre-training will undoubtedly persist, the nature of "progress" is diversifying. The "natural decay in scaling laws" and the "high variance in expectations of LLM capabilities" underscore that merely adding scale is no longer enough. The latency inherent in "large-scale, interdisciplinary engineering efforts" required to achieve the next order of magnitude in core model capability further highlights this. These dynamics, however, do not herald the end of scaling but rather a redirection. Progress will become "exponentially harder over time" along established axes, demanding greater focus on alternative paths such as agents, reasoning models, and the fusion of varied data modalities via advanced multimodal architectures. The fundamental question is shifting from whether AI will keep scaling to what aspects of AI will be scaled next.