
Beyond GPT: Multimodal AI & OpenAI's o1 Redefine Scaling

San Francisco, CA - July 17, 2025: AI development shifts past traditional scaling laws, with DeepSeek-v3, OpenAI's o1 model, and new multimodal LLMs like Llama 3.2 driving progress through architectural innovation and reasoning focus.

October 17, 2025, 09:41
11 min read


The Shifting Tides of AI: Beyond Scaling Laws to Multimodal Reasoning

San Francisco, CA - July 17, 2025 - For years, the steady pursuit of scale - larger models, more data, greater compute - has acted as the main driver behind advances in artificial intelligence, especially large language models (LLMs). This scaling paradigm, often guided by predictable "scaling laws," has become so entrenched that early triumphs at leading AI labs such as OpenAI were ascribed to an almost fanatical belief in its power. Yet, recent breakthroughs and mounting scrutiny indicate this era of predictable, compute-driven performance gains may be entering a substantial transformation, urging a reassessment of future AI development paths.

Scaling laws, at their heart, describe how an LLM's test loss (or related performance metrics) falls as a power law as quantities like model parameters, dataset size, or training compute grow. As laid out in seminal research, "With enough training data, scaling of validation loss should be approximately a smooth power law as a function of model size" [4]. This predictability has enabled strategic bets on ever more ambitious training runs, producing models with stronger capabilities. Early analysis, such as that by Kaplan et al. [1], showed that LLM performance, measured via test loss on datasets like WebText2, improves smoothly across eight orders of magnitude in compute, six orders in model size, and two orders in dataset size. This foundational work established that optimal performance arises when model size, data, and compute are scaled together, because focusing on a single factor yields diminishing returns.
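Concretely, the fits in [1] take a simple power-law form; the constants below are the paper's approximate reported values and should be read as illustrative rather than exact:

```latex
% Power-law fits from Kaplan et al. [1]; constants are approximate.
\begin{align*}
L(N) &= \left(\tfrac{N_c}{N}\right)^{\alpha_N}, & \alpha_N &\approx 0.076, & N_c &\approx 8.8\times10^{13}\ \text{(params)}\\
L(D) &= \left(\tfrac{D_c}{D}\right)^{\alpha_D}, & \alpha_D &\approx 0.095, & D_c &\approx 5.4\times10^{13}\ \text{(tokens)}\\
L(C_{\min}) &= \left(\tfrac{C_c}{C_{\min}}\right)^{\alpha_C}, & \alpha_C &\approx 0.050
\end{align*}
```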

A common misunderstanding that arises from these scaling laws is the notion that LLM quality rises exponentially with logarithmic increases in compute. In fact, the curve is better read the other way around: loss decays toward a floor as resources grow linearly, so each further increment of quality demands a multiplicative increase in compute, and extracting additional performance gains becomes exponentially harder. As noted in [1], "Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence." This efficiency implies larger LLMs can hit comparable performance levels with less data, and that training to full convergence may be sub-optimal. For example, a roughly eight-fold increase in model size demands only a five-fold rise in training data to avoid overfitting.
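To see the diminishing returns numerically, here is a minimal sketch that evaluates the compute power law above; the constants are again the approximate values from [1], so the absolute numbers are illustrative only:

```python
# Illustrative only: evaluate the Kaplan-style compute power law
# L(C) = (C_c / C)**alpha to show why linear quality gains demand
# exponentially more compute. Constants are approximate values from [1].
ALPHA_C = 0.050
C_C = 3.1e8  # PF-days, the fitted compute constant (approximate)

def loss(compute_pf_days: float) -> float:
    """Test loss predicted by the power-law fit at a given compute budget."""
    return (C_C / compute_pf_days) ** ALPHA_C

for c in [1e0, 1e1, 1e2, 1e3, 1e4]:
    print(f"compute = {c:8.0e} PF-days -> loss = {loss(c):.3f}")

# Each 10x jump in compute shaves off a *constant multiplicative* factor
# (10**-0.05 ~ 0.89), so equal absolute improvements get exponentially
# more expensive as the loss approaches its floor.
```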

The Chinchilla Paradigm: A Reassessment of Proportionality

A pivotal re-evaluation of this scaling dynamic arrived with the "Chinchilla" paper [6] in 2022. Its authors argued that optimal compute-efficient training requires proportional scaling of both model and data size. Their results suggested many existing LLMs were "undertrained" for their scale. For instance, specific scaling laws fitted in their study indicated that models like Gopher would have benefited from a dataset twenty times larger. To test this, they trained Chinchilla, a 70-billion-parameter LLM, on 1.4 trillion tokens, showing that despite being four times smaller than Gopher, Chinchilla consistently outperformed it. This highlighted a crucial shift: the "amount of training data that is projected to be needed is far beyond what is currently used to train large models" [6]. Chinchilla's findings have since become a reference point in AI research for optimizing pre-training efficiency.
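The practical upshot of Chinchilla is often reduced to two rules of thumb: training compute is roughly C ≈ 6ND FLOPs for N parameters and D tokens, and compute-optimal training uses on the order of 20 tokens per parameter. A minimal sketch under those approximations (the numbers are illustrative, not the paper's exact fits):

```python
# A minimal sketch of the Chinchilla-style compute-optimal split [6],
# using the common approximations C ~ 6*N*D training FLOPs and the
# ~20-tokens-per-parameter rule of thumb.
def compute_optimal(budget_flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into model size N and token count D.

    Solves C = 6*N*D with D = tokens_per_param * N, i.e.
    N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = (budget_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Roughly Gopher's training budget (~280B params * 300B tokens * 6 FLOPs):
budget = 6 * 280e9 * 300e9
n, d = compute_optimal(budget)
print(f"optimal: ~{n/1e9:.0f}B params on ~{d/1e12:.1f}T tokens")
# -> roughly a 65B model on ~1.3T tokens, close to Chinchilla's 70B / 1.4T
```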

The "Death" of Scaling Laws: A Narrative of Diminishing Returns

Even with these established principles, the latter half of 2024 saw rising skepticism about the continued potency of scaling, with some positing that AI research-and scaling laws in particular-might be hitting fundamental limits. Major media outlets reflected this mood:

  • Reuters reported that OpenAI was tweaking its product strategy because of a perceived plateau in current scaling methods.
  • The Information noted a slowdown in improvement rates for GPT models.
  • Bloomberg highlighted the obstacles faced by frontier labs as they strive to build more capable AI.
  • TechCrunch claimed that scaling was delivering diminishing returns.
  • Time offered a nuanced look at various factors feeding a narrative of slowing AI progress.

This doubt was amplified by prominent voices in the AI community. Ilya Sutskever, a co-founder of OpenAI, famously said during his NeurIPS '24 test-of-time award speech that "pretraining as we know it will end." In contrast, leaders like Anthropic CEO Dario Amodei and OpenAI CEO Sam Altman have publicly asserted that scaling is likely to persist. As analyst Nathan Lambert observed, "Both narratives can be true: Scaling is still working at a technical level. The rate of improvement for users is slowing." This tension suggests that while the underlying technical mechanics of scaling may still function, translating them into tangible, exponential user-facing gains is becoming increasingly tough.

A key element behind this perceived slowdown is the gap between pre-training test loss and downstream task performance. Scaling laws mainly track the smooth reduction of test loss during pre-training. Yet, as noted in [7], "Practitioners often use downstream benchmark accuracy as a proxy for model quality and not loss on perplexity evaluation sets." The direct impact of marginally lower test loss on an LLM's real-world abilities - especially for general chat or complex downstream tasks - remains vague. Researchers at top labs, for example, often chase highly specific, advanced capabilities such as writing a PhD thesis or solving intricate mathematical problems, which may not map neatly onto modest test-loss improvements.

Another pressing worry is the "data death" hypothesis put forward by Sutskever, who argues that while compute capacity soars, the supply of high-quality, novel pre-training data-largely harvested from web scraping-is not keeping up. This leads to the conclusion that "we have achieved peak data," prompting exploration of alternatives like synthetically generated data to enable future scaling by "several orders of magnitude."

New Frontiers: Practical Scaling Laws and Reasoning Models

Against this backdrop, innovative approaches are surfacing to reignite progress. Researchers in [7] are crafting "practical scaling laws" intended to predict LLM performance directly on downstream benchmarks, offering a more immediate gauge of utilitarian advancement.
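The exact functional form used in [7] is not reproduced here; the sketch below only illustrates the general idea, fitting a hypothetical sigmoid link from pre-training loss to benchmark accuracy on made-up measurements from smaller runs:

```python
# A hedged sketch of the "practical scaling law" idea from [7]: instead
# of predicting test loss, fit a mapping from an easily extrapolated
# quantity (here, pre-training loss) to downstream benchmark accuracy.
# The sigmoid form and the data points are illustrative assumptions,
# not values from the paper.
import numpy as np
from scipy.optimize import curve_fit

def loss_to_accuracy(loss, acc_max, k, loss_mid):
    """Sigmoid link: accuracy saturates at acc_max as loss falls."""
    return acc_max / (1.0 + np.exp(k * (loss - loss_mid)))

# Hypothetical (loss, accuracy) pairs measured on smaller training runs.
losses = np.array([3.2, 2.9, 2.6, 2.4, 2.2, 2.1])
accs   = np.array([0.18, 0.27, 0.41, 0.52, 0.63, 0.68])

params, _ = curve_fit(loss_to_accuracy, losses, accs, p0=[0.9, 5.0, 2.5])
print("predicted accuracy at loss 1.9:", loss_to_accuracy(1.9, *params))
```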

Moreover, the architectural landscape is shifting rapidly. DeepSeek-v3, a 671-billion-parameter Mixture-of-Experts (MoE) model pretrained on 14.8 trillion tokens, exemplifies this, posting benchmark results that rival or exceed leading models like GPT-4o and Claude-3.5-Sonnet. DeepSeek-v3's breakthroughs include an MoE design refined from DeepSeek-v2, a novel auxiliary-loss-free load-balancing strategy, multi-token prediction training, and distillation of reasoning abilities from long-chain-of-thought models (akin to OpenAI's o1). Such models underscore that scaling now hinges not just on raw parameter count but also on architectural innovation and training techniques. Successfully training models of this magnitude demands not only more GPUs but a sophisticated, multidisciplinary engineering effort. As Ege Erdil notes, "At every order-of-magnitude scale-up, different innovations have to be found."
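At the heart of DeepSeek-v3's design is MoE routing: each token is processed by only a few of many expert feed-forward networks, so parameter count grows without a proportional rise in per-token compute. The toy top-k router below illustrates the generic mechanism only; it deliberately omits DeepSeek's auxiliary-loss-free load balancing and other refinements:

```python
# A toy Mixture-of-Experts layer with top-k routing. Generic illustration;
# not DeepSeek-v3's actual implementation.
import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # scores each expert
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, d_model)
        gates = self.router(x).softmax(dim=-1)
        weights, idx = gates.topk(self.top_k, dim=-1)   # pick top-k experts
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                  # route each token
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

print(ToyMoE()(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```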

OpenAI's "o1" reasoning model marks another major departure, concentrating on boosting complex reasoning via reinforcement learning. The o1 model "thinks before it answers," generating an internal chain of thought prior to responding. This approach has yielded striking results, placing it in the 89th percentile on Codeforces competitive programming questions and surpassing human PhD candidates on graduate-level physics, biology, and chemistry problems (GPQA). The subsequent "o3" model, likely a scaled version of o1 with more compute invested in reinforcement learning, has achieved unprecedented scores on benchmarks such as ARC-AGI (87.5 % accuracy, beating human-level performance of 85 %) and SWE-Bench Verified (71.7 % accuracy, an Elo of 2727 on Codeforces). This represents a "completely new scaling paradigm," where progress derives not only from pre-training size but also from "more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute)" [22].

The Rise of Multimodal LLMs: Integrating Diverse Modalities

Alongside these developments, multimodal LLMs (MLLMs) are rapidly gaining ground. These models fuse multiple data types, or "modalities"-such as text, images, audio, and video-into a single processing framework. Though still emerging, MLLMs open doors to sophisticated applications like image captioning, meme explanation, and extracting information from complex documents.

Architecturally, MLLMs mainly follow two approaches:

  1. Unified Embedding Decoder Architecture (Method A): This method converts varied inputs (e.g., images) into tokens that share the same embedding space as text, allowing a standard decoder-only LLM (like GPT-2 or Llama 3.2) to handle them sequentially; a minimal sketch follows this list. Examples include Fuyu, which directly maps image patches into its embedding space without an intermediate pretrained image encoder.
  2. Cross-Modality Attention Architecture (Method B): This strategy employs cross-attention mechanisms, merging image and text embeddings directly within attention layers. It is often seen as more compute-efficient because it avoids overloading the input context with extra image tokens, and it preserves text-only performance when the LLM parameters are frozen during training. The Llama 3.2 multimodal models, with 11-billion and 90-billion parameters, adopt this cross-attention design, opting to update the image encoder while keeping the language model frozen to retain text-only abilities (see the sketch after the model survey below).
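As a concrete illustration of Method A, the sketch below projects flattened image patches into the text embedding space and concatenates them with text tokens, the Fuyu-style recipe. All dimensions are illustrative assumptions:

```python
# Method A (unified embedding decoder) in miniature: image patches become
# "tokens" in the same embedding space as text. Real models add position
# handling, special separator tokens, and more.
import torch
import torch.nn as nn

d_model, patch = 512, 16
patch_proj = nn.Linear(3 * patch * patch, d_model)   # image patch -> token
tok_embed = nn.Embedding(32000, d_model)             # ordinary text embedding

image = torch.randn(1, 3, 224, 224)
# Cut the image into flattened 16x16 patches: (1, 196, 768)
patches = image.unfold(2, patch, patch).unfold(3, patch, patch) \
               .permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch * patch)
image_tokens = patch_proj(patches)                   # (1, 196, 512)

text_ids = torch.randint(0, 32000, (1, 12))
text_tokens = tok_embed(text_ids)                    # (1, 12, 512)

# One interleaved sequence for a standard decoder-only LLM:
sequence = torch.cat([image_tokens, text_tokens], dim=1)
print(sequence.shape)  # torch.Size([1, 208, 512])
```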

Recent work showcases the breadth and rapid evolution of MLLM development:

  • Molmo (Multimodal Open Language Model), from the Allen Institute for AI, provides an open-source, Method A solution, leveraging pretrained image encoders like CLIP and allowing joint training of all parameters.
  • NVLM (NVIDIA) explores both Method A (NVLM-D) and Method B (NVLM-X), plus a hybrid (NVLM-H), showing NVLM-X's superior efficiency for high-resolution images and NVLM-D's strength on OCR tasks. They use instruction-tuned LLMs and a multilayer perceptron for projection.
  • Qwen2-VL introduces a "Naive Dynamic Resolution" technique to handle images of varying sizes, employing a modified Vision Transformer with 2D-RoPE.
  • Pixtral 12B from Mistral AI, a Method A model, discards pretrained image encoders in favor of training one from scratch and natively supports variable image sizes.
  • MM1.5 provides practical insights into multimodal LLM fine-tuning, including an MoE variant, built on Method A.
  • Aria, a 24.9-billion-parameter MoE model, uses a cross-attention approach and trains its LLM backbone from the ground up.
  • Baichuan-Omni, a 7-billion-parameter Method A system, follows a three-stage training regime with sequential component unfreezing and incorporates an "AnyRes" module for high-resolution image handling.
  • Emu3 from BAAI offers a novel transformer-based decoder for image generation, trained from scratch and aligned with human preferences via Direct Preference Optimization (DPO).
  • Janus presents a framework that unifies multimodal understanding and generation within a single LLM backbone, decoupling visual encoding for broader use.

The architectural components and training methods across MLLMs differ markedly, making direct performance comparisons difficult due to divergent benchmarks and pervasive data contamination. Nonetheless, the overarching takeaway is that multimodal LLMs can be successfully built through a wide array of designs, reflecting a field rich with innovation beyond the classic scaling of text-only models.

Conclusion: A Maturing Field

The current debate over scaling laws signals a maturation of the AI sector. While investments in massive pre-training will undoubtedly persist, the nature of "progress" is diversifying. The "natural decay in scaling laws" and the "high variance in expectations of LLM capabilities" underscore that merely adding scale is no longer enough. So does the long lead time of the "large-scale, interdisciplinary engineering efforts" required to reach the next order of magnitude in core model capability. These dynamics, however, do not herald the end of scaling but rather a redirection. Progress will become "exponentially harder over time" along established axes, demanding greater focus on alternative paths such as agents, reasoning models, and the fusion of varied data modalities via advanced multimodal architectures. The fundamental question is shifting from whether AI will keep scaling to what aspects of AI will be scaled next.
