BEIJING — Chinese artificial intelligence firm Moonshot AI has introduced the Kimi K2 large language model series, an open-source offering that claims to surpass benchmark results previously held by proprietary models from companies such as Google, Anthropic, and OpenAI. Released on July 16, 2025, the flagship model, specifically the Kimi-K2-Instruct variant, reportedly demonstrates superior performance in critical domains such as coding, reasoning, and mathematics, according to the developer.
The Kimi K2 series, developed by the Moonshot AI team, represents a significant development in the burgeoning field of open-source large language models. The primary Kimi K2 is a mixture-of-experts (MoE) model, featuring 32 billion activated parameters within a total of 1 trillion parameters. This architecture is designed for efficiency and performance at scale.
According to technical documentation provided by Moonshot AI, the Kimi K2-Instruct model exhibits a notable edge across multiple evaluation metrics. In coding tasks, the model achieved a Pass@1 score of 53.7 on the LiveCodeBench v6 benchmark (August 2024 - May 2025 data), outperforming GPT-4.1 (44.7), Claude Sonnet 4 (48.5), and Claude Opus 4 (47.4). On SWE-bench Verified (agentless coding, single patch without tests), Kimi K2-Instruct scored 51.8%, falling just short of Claude Opus 4's 53.0% while significantly surpassing GPT-4.1's 40.8%. On SWE-bench Verified (agentic coding), it achieved 65.8% single-attempt accuracy, below Claude Sonnet 4's 72.7%.
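The Pass@1 figures cited above are typically computed with the unbiased pass@k estimator popularised by the HumanEval paper. A minimal sketch of that estimator (illustrative only, not Moonshot's evaluation code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    solutions drawn from n generations (c of which are correct) passes."""
    if n - c < k:
        return 1.0  # fewer incorrect samples than k, so a correct one is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 generations per problem and 4 passing, pass@1 reduces to c/n:
print(pass_at_k(10, 4, 1))  # 0.4
```

For k = 1 the estimator is simply the fraction of correct samples, which is why benchmarks often average it over many sampled generations per problem.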
Beyond coding, Kimi K2-Instruct has demonstrated strong capabilities in specific STEM and general reasoning tasks. It recorded a score of 69.6 on the AIME 2024 (Avg@64) and 97.4% accuracy on the MATH-500 benchmark. For general tasks, its MMLU (Massive Multitask Language Understanding) score was 89.5, with an MMLU-Redux score of 92.7. While these scores are competitive, some proprietary models like Claude Opus 4 have achieved higher on certain MMLU variants (e.g., 92.9 on MMLU).
Moonshot AI attributes Kimi K2's performance to large-scale training practices, involving pre-training on 15.5 trillion tokens, and the application of their bespoke MuonClip Optimizer. The company states this optimizer was instrumental in resolving training instabilities at an unprecedented scale. Furthermore, Kimi K2 has been specifically engineered for "agentic intelligence," emphasising tool use, reasoning, and autonomous problem-solving. This focus is demonstrated through its reported ability to plan trips using 17 in-browser tools and build complete web games, including a Minecraft clone, in a single operational sequence.
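Moonshot describes the stabilising component of MuonClip, often called qk-clip, as rescaling the query/key projection weights whenever a head's peak attention logit exceeds a threshold. A single-head NumPy sketch of that idea follows; the threshold value, tensor shapes, and even split of the rescaling between queries and keys are illustrative assumptions, not Moonshot's implementation:

```python
import numpy as np

def qk_clip(W_q, W_k, X, tau=50.0):
    """Single-head sketch of qk-clip: if the peak pre-softmax attention
    logit exceeds tau, shrink the query/key projections so the peak is
    pulled back to tau. Illustrative only, not Moonshot's code."""
    d = W_q.shape[1]                        # head dimension
    Q, K = X @ W_q, X @ W_k
    s_max = np.max(Q @ K.T) / np.sqrt(d)    # largest attention logit
    if s_max > tau:
        gamma = np.sqrt(tau / s_max)        # split the shrink evenly over Q and K
        W_q *= gamma
        W_k *= gamma
    return W_q, W_k

rng = np.random.default_rng(0)
X = 10 * rng.normal(size=(16, 64))          # activations with an exaggerated scale
W_q, W_k = rng.normal(size=(64, 32)), rng.normal(size=(64, 32))
W_q, W_k = qk_clip(W_q, W_k, X, tau=50.0)   # peak logit is now at most 50
```

Because the rescaled logits are exactly (tau / s_max) times the originals, the peak lands back at the threshold rather than merely below it, which is what keeps training stable without distorting heads that were already well behaved.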
The Kimi K2 line consists of two main variants:
- Kimi-K2-Base: the foundation model, intended for researchers and builders who want full control for fine-tuning and custom solutions.
- Kimi-K2-Instruct: the post-trained model, suited for drop-in, general-purpose chat and agentic use.
Moonshot AI has made Kimi K2 available for public access at kimi.ai, a move that could significantly democratise access to advanced AI capabilities. The token costs for Kimi K2 are claimed to be five times cheaper than those of comparable models such as Claude Sonnet or Gemini 2.5 Pro. The model is also stated to handle "100K-token research with clean visualizations," indicating robust context window capabilities.
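The "five times cheaper" claim is straightforward arithmetic to sanity-check. The per-million-token rates below are hypothetical placeholders chosen only to illustrate the ratio, not published prices:

```python
def job_cost(tokens: int, price_per_million: float) -> float:
    """Dollar cost of processing `tokens` tokens at a given $/M-token rate."""
    return tokens / 1e6 * price_per_million

# Hypothetical placeholder rates, assumed solely to illustrate the 5x ratio:
rival_rate = 15.0            # $/M tokens, assumed
kimi_rate = rival_rate / 5   # the claimed 5x saving

print(job_cost(100_000, rival_rate))  # 1.5
print(job_cost(100_000, kimi_rate))
```

At any absolute price level, a 100K-token job like the "research with clean visualizations" workload mentioned above would cost one fifth as much under the claimed ratio.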
The release of Kimi K2 underscores a growing trend where Chinese AI firms are not only catching up but, in some areas, potentially setting new benchmarks for open-source AI development. This move could intensify competition in the global AI landscape, particularly as the open-source community gains access to models previously only achievable by well-resourced private entities. The full impact of Kimi K2 will depend on its adoption by developers and its performance in real-world applications beyond benchmark evaluations.
The original article claims that MoonshotAI's Kimi K2 is a "trillion-parameter model that beats Claude, Gemini, and even GPT-4.1" and is "fully open" and "free and open to everyone." It also lists several performance claims, such as matching Claude 4 in reasoning, outperforming DeepSeek v3, Qwen, and GPT-4.1, and being 5x cheaper than competitors.
Upon examination of the external source, the GitHub repository for Moonshot AI's Kimi-K2, the claims are largely substantiated but require nuanced interpretation. Kimi K2 is indeed a mixture-of-experts (MoE) model with "1 trillion total parameters" and "32 billion activated parameters." The repository states that it's designed for tool use, reasoning, and autonomous problem-solving. This aligns with the original article's claims about its capabilities, such as building web games and planning trips.
The performance claims are detailed in the GitHub source's benchmark tables. For "Coding Tasks," Kimi K2 Instruct frequently achieves the highest scores (marked bold as global SOTA) on benchmarks like LiveCodeBench v6, OJBench, SWE-bench Verified (Agentless Coding), and SWE-bench Verified (Agentic Coding). However, for other coding benchmarks like MultiPL-E and Aider-Polyglot, Claude Opus 4 or Qwen3-235B-A22B achieve higher scores, respectively. This indicates that while Kimi K2 performs exceptionally well in many coding scenarios, it does not universally "beat" all mentioned models across all coding tasks.
In "Tool Use Tasks," Kimi K2 Instruct leads in Tau2 retail, airline, and telecom benchmarks. However, for AceBench, GPT-4.1 shows a higher accuracy score.
For "Math & STEM Tasks," Kimi K2 Instruct consistently performs strongly, often achieving the global SOTA in AIME 2024, AIME 2025, MATH-500, HMMT 2025, PolyMath-en, ZebraLogic, and GPQA-Diamond. Yet, in CNMO 2024, Gemini 2.5 Flash achieves a slightly higher score, and in Humanity's Last Exam (Text Only), Claude Opus 4 leads.
Regarding "General Tasks," Kimi K2 Instruct leads in MMLU and MMLU-Redux. However, Claude Opus 4 surpasses it in MMLU-Pro and SimpleQA (where GPT-4.1 also scores higher). IFEval shows Kimi K2 Instruct with the highest score, while Livebench also shows Kimi K2 Instruct in the lead.
Crucially, the external source indicates that Kimi K2 has two model variants, "Kimi-K2-Base" and "Kimi-K2-Instruct," and that the benchmark results provided pertain primarily to the Instruct model. The claim of being "fully open" and "free and open to everyone" is broadly consistent with the GitHub repository, which releases the model weights and provides code examples for local inference and tool calling.
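Tool calling of the kind the repository documents generally follows the OpenAI-compatible function-calling format. The sketch below builds such a request body locally; the model identifier, tool name, and parameter schema are illustrative assumptions rather than values taken from the repository, and nothing is sent over the network:

```python
import json

# Tool definition in the OpenAI-compatible function-calling schema; the
# tool name and parameters here are illustrative assumptions.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Request body only; an actual deployment would POST this to a serving endpoint.
payload = {
    "model": "Kimi-K2-Instruct",   # assumed model identifier
    "messages": [{"role": "user", "content": "What's the weather in Beijing?"}],
    "tools": tools,
    "tool_choice": "auto",
}
print(json.dumps(payload, indent=2)[:80])
```

With `tool_choice` set to `"auto"`, the model decides whether to answer directly or emit a structured call to `get_weather`, which the client then executes and feeds back as a tool message.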
The claim about token costs is not directly addressed in the GitHub source. As an open-weights model, however, Kimi K2 can be self-hosted or served by competing providers, which plausibly supports the cost-saving assertion, even though exact pricing cannot be verified from the repository.
Overall, the original article's assertion that Kimi K2 "beats" competitors is an oversimplification. While it demonstrates superior performance in numerous specific benchmarks across various categories (coding, tool use, math, general tasks), it does not universally outperform all mentioned models in every single metric. The model is indeed a significant open-source release with impressive capabilities, but the phrasing of beating other models is somewhat sensationalized given the detailed benchmark data.
October 29, 2025