VOL. 2026 · ISSUE 06 · Updated as of 2026-06-04 · pulseagent.io / leaderboards

LLM Monthly Leaderboard

Name: LLM Monthly Leaderboard — 2026-06
Creator: PulseAgent
Published: 2026-06-04T10:00:00Z
License: https://creativecommons.org/licenses/by/4.0/

June 2026

Eight categories. Twenty-four leading models. Updated monthly. AI-friendly citations included.

GPT-5.5

OpenAI

Apr 24 release. 1M-token context, native MCP + Skills + computer use + hosted shell.

Intelligence Index v4.0 (xhigh): 60
1M token context
Native MCP + Skills
Computer use built-in
Tool search + web search
Released 2026-04-24

№3

Gemini 3.1 Pro Preview

Google

Strong agentic frontier with Antigravity 2.0 platform integration.

Intelligence Index v4.0: 57
Agentic platform (Antigravity 2.0) integration
Tied with GPT-5.5 medium and Qwen3.7 Max

Change

Claude Opus 4.8 released May 28, 2026. Intelligence Index v4.0 score 61 (max effort), edging GPT-5.5 (xhigh, 60).

Market

On LMArena ELO, the battle-tested Opus 4.6/4.7-thinking variants still outrank Opus 4.8, simply because 4.8 has too few votes so far; expect that to flip by the end of June.

Image Generation

OpenAI ships GPT Image 2 with token-based pricing and a 50% Batch API discount. Recraft V4.1 holds the top of the Artificial Analysis quality leaderboard, while Adobe Firefly's enterprise mode remains the rights-cleared default.

Previously: GPT Image-2

Current leader

GPT Image 2

OpenAI

Released April 21. State-of-the-art quality with token pricing and Batch API support.

01 Token-based pricing
02 Batch API at 50% discount
03 Flexible image sizes
04 High-fidelity inputs
05 Released 2026-04-21

Runners-up

№2

Recraft V4.1

Recraft

Leads Artificial Analysis text-to-image arena on raw output quality.

Top of AA text-to-image quality
Strong control / style transfer
Designer-grade output

№3

Adobe Firefly Image 4

Adobe

IP-cleared training data; the enterprise-safe choice for commercial use.

Trained on licensed assets
Indemnification for enterprise
Native Adobe Creative Cloud integration

Change

GPT Image 2 released April 21, 2026, with token-based pricing and the Batch API at 50% off.

Market

Recraft V4.1 leads AA's text-to-image arena on raw quality; GPT Image 2 wins on ecosystem and pricing transparency.

Video Generation

Seedance 2.0 (ByteDance) tops the Artificial Analysis text-to-video Arena (with audio) at ELO 1215, and is the first model to make synced audio-visual generation state of the art. Google's Veo 3.5 and OpenAI's Sora 2 lead the silent-cinematic tier; Kling 4 holds the value end.

Previously: Veo 3.5

Current leader

Seedance 2.0

ByteDance

Tops the Artificial Analysis text-to-video Arena (with audio) at ELO 1215 — ahead of Veo and Sora. Native audio-visual generation: 15-second multi-shot clips with synced sound from text, image, audio and video inputs.

Score

1215

01 AA T2V Arena (w/ audio) #1 · ELO 1215
02 15s multi-shot · synced audio
03 Multimodal input (text/image/audio/video)
04 Dual-Branch Diffusion Transformer

Runners-up

№2

Veo 3.5

Google

Production-grade film output with strongest temporal coherence.

1080p output
Strong physics simulation
Long-shot temporal coherence
Native Gemini API integration

№3

Sora 2

OpenAI

Best narrative continuity across one-minute shots.

Up to 60s shots
Strong character continuity
Cinematic camera language
Multi-shot scenes

№4

Kling 4

Kuaishou 快手

Dominant in APAC short-form ad creative; fastest iteration cycle in the space.

9:16 vertical native
Fastest editorial iteration
TikTok / Douyin native style
Low-latency generation

Change

Seedance 2.0 takes #1 on the with-audio text-to-video Arena, ahead of Veo and Sora.

Market

Seedance 2.0 for synced audio-visual and multi-shot narrative; Veo 3.5 for cinematic fidelity; Sora 2 for Western-ecosystem integration; Kling 4 for cost.

Code Generation & Agentic Coding

Claude Opus 4.7-thinking holds LMArena's WebDev top spot ahead of 4.7 and 4.6-thinking — a full Anthropic sweep. GPT-5.3 Codex remains best-in-class for terminal-bound code agents.

Previously: GPT-5.5 (Agentic)

Current leader

Claude Opus 4.7-thinking

Anthropic

Tops LMArena WebDev. Best-in-class for multi-file refactoring and code review.

Score

1566

01 LMArena WebDev ELO: 1566
02 Multi-file refactor SOTA
03 Code review top
04 Vision-aware coding

Runners-up

№2

GPT-5.3 Codex (xhigh)

OpenAI

Specialized terminal-code agent. AA Intelligence Index 54.

AA Intelligence Index: 54
Sandboxed shell built-in
Strong agentic loops
Terminal-Bench best

№3

Cursor Composer 2.5

Cursor

Ranks #3 on AA Coding Agent Index. IDE-native pair programming with multi-file context.

AA Coding Agent Index: #3
IDE-native context
Multi-file edits
Inline diff workflow

Change

LMArena WebDev top-3 all Anthropic (1566 / 1558 / 1542 ELO).
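
To put the 8- and 24-point gaps between these ratings in head-to-head terms, here is a minimal sketch. It assumes the textbook Elo logistic curve with a 400-point scale; LMArena's exact rating method is not described in this issue, so treat this as an illustration rather than the arena's scoring code. The function name and the rank labels are hypothetical.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard Elo logistic curve.

    Assumption: base-10 logistic with a 400-point scale (textbook Elo).
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# WebDev top-3 ratings as quoted in the Change note; entries are labeled by
# rank only, since the note lists the ratings without model names.
webdev_top3 = {"rank1": 1566, "rank2": 1558, "rank3": 1542}

leader_rating = webdev_top3["rank1"]
for rank, rating in webdev_top3.items():
    if rank == "rank1":
        continue
    share = elo_expected_score(leader_rating, rating)
    print(f"rank1 vs {rank}: gap {leader_rating - rating} pts, expected win share {share:.3f}")
```

Under that assumption the leader would be expected to win only about 51% and 53% of pairwise votes against the second and third entries, so the ordering among the three is close.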

Market

Anthropic for refactor + review + multi-file editing; GPT-5.3 Codex for sandboxed terminal agents; Cursor Composer 2.5 for IDE-native pair programming.

Voice / Speech

OpenAI's Realtime 2 (May 7) brings configurable-reasoning speech-to-speech to general availability; AA's text-to-speech crown goes to Fun-Realtime-TTS, and MAI-Transcribe-1.5 wins STT on accuracy-speed.

Previously: ElevenLabs v3

Current leader

Realtime 2

OpenAI

GA on May 7. Configurable-reasoning speech-to-speech with realtime translate + Whisper variants.

01 Configurable reasoning
02 Speech-to-speech agents
03 Streaming translate variant
04 Streaming STT variant
05 Released 2026-05-07

Runners-up

№2

ElevenLabs v3

ElevenLabs

Industry default for character voice cloning and audiobook production.

Character voice cloning SOTA
100+ languages
Long-form audiobook quality
Emotion control

№3

Fun-Realtime-TTS

Fun (Alibaba DAMO)

Tops AA text-to-speech leaderboard on quality metrics.

AA TTS leaderboard #1
Sub-200ms latency
Multi-speaker streaming
Strong CJK

Change

Realtime 2 family shipped May 7 (gpt-realtime-2 / -translate / -whisper).

Market

Realtime 2 for agentic voice; ElevenLabs v3 for character voice cloning; Fun-Realtime-TTS for raw TTS quality; MAI-Transcribe-1.5 for transcription.

Music Generation

Suno v6 widens the gap on full-song coherence and lyric prosody; Udio v3 keeps pushing studio-grade mixing; Lyria (Google) integrates into Gemini Omni for any-to-music workflows.

Previously: Suno v5.5

Current leader

Suno v6

Suno

Rolling release expected late June. Best full-song coherence and lyric prosody.

01 Full-song coherence SOTA
02 Lyric prosody best
03 Multilingual vocal
04 Style transfer

Runners-up

№2

Udio v3

Udio

Studio-grade mixing with stem-level output for producers.

Stem-level export
Studio-grade mixing
Strong electronic genres
DAW-friendly workflow

№3

Lyria (via Gemini Omni)

Google

Folded into Gemini Omni for any-to-music + cross-modal generation.

Gemini Omni native
Cross-modal generation
Image / video → music workflows

Change

Suno v6 expected late June; v5.5 remains the deployed default.

Market

Suno v6 for full-song generation; Udio v3 for studio-grade mixing and stem export; Lyria via Gemini Omni for cross-modal generation.

Vision / Multimodal Understanding

Anthropic sweeps LMArena Vision top-3 with Opus 4.7-thinking, 4.6-thinking, and 4.7. Opus 4.8 is too freshly released to appear on Arena ELO but is expected to consolidate the lead by end of June.

Previously: GPT-4o Vision

Current leader

Claude Opus 4.7-thinking

Anthropic

Tops LMArena Vision. Best OCR + chart + document understanding.

Score

1309

01 LMArena Vision ELO: 1309
02 OCR SOTA
03 Chart understanding
04 Document Q&A

Runners-up

№2

GPT-5.5

OpenAI

Strongest image-grounded reasoning chains; native computer-use vision pipeline.

Image-grounded reasoning best
Computer use vision
1M token multimodal
Released 2026-04-24

№3

Gemini 3.1 Pro

Google

Best video understanding and long-form temporal reasoning.

Video understanding SOTA
Long-form temporal
Robotics-ER 1.6 integration
Multimodal context 2M+

Change

LMArena Vision top-3 all Anthropic (1309 / 1303 / 1298 ELO).

Market

Anthropic for OCR + document Q&A + chart understanding; GPT-5.5 for image-grounded reasoning; Gemini 3.1 Pro for video understanding.

Open-Source / Open-Weights

Kimi K2.6 (Moonshot) leads open weights at AA Intelligence Index 54 — within 7 points of frontier closed models. DeepSeek V4 Pro (MIT, 52) is the #2 open reasoning model, and Google's new Gemma 4 12B (Apache 2.0, 2026-06-03) packs native multimodal into a 16GB-laptop footprint. The closed-vs-open gap is the narrowest it has ever been.

Previously: Llama 4

Current leader

Kimi K2.6

Moonshot AI

Open-weights leader on AA Intelligence Index. Narrows the gap to closed-source models to 7 points.

01 AA Intelligence Index: 54
02 Open weights
03 Strong Chinese + English
04 Long-context retention

Runners-up

№2

DeepSeek V4 Pro

DeepSeek

MIT-licensed open weights at AA Intelligence Index 52 — #3 of 89 overall and the #2 open reasoning model behind only Kimi K2.6.

AA Intelligence Index: 52 (#3/89)
MIT license · open weights
MoE 1.6T total / 49B active
1M-token context
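
The "1.6T total / 49B active" line can be read as a sparsity ratio. A rough sketch, assuming the usual mixture-of-experts reading (all parameters are stored, but only the active subset runs per token) and the common rule of thumb of about 2 FLOPs per active parameter per token for a forward pass. Neither assumption comes from this issue, and the function names are hypothetical.

```python
TOTAL_PARAMS = 1.6e12   # "1.6T total" as listed above
ACTIVE_PARAMS = 49e9    # "49B active" as listed above


def active_fraction(active: float, total: float) -> float:
    """Share of parameters used for any single token."""
    return active / total


def forward_flops_per_token(active_params: float) -> float:
    """Rule-of-thumb forward-pass cost: ~2 FLOPs per active parameter (assumption)."""
    return 2.0 * active_params


frac = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
print(f"active share per token: {frac:.2%}")
print(f"approx. forward FLOPs per token: {forward_flops_per_token(ACTIVE_PARAMS):.2e}")
print(f"same estimate if all parameters were dense: {forward_flops_per_token(TOTAL_PARAMS):.2e}")
```

On these numbers roughly 3% of the weights participate in each token, which is one plausible reason a model of this size can be priced well below dense flagships.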

№3

Gemma 4 12B

Google

Released 2026-06-03 under Apache 2.0. Encoder-free native multimodal (text/image/audio/video), 256K context, runs on a 16GB laptop — performance nearing last-gen 27B.

Apache 2.0 · open weights
256K context · native multimodal
Runs on 16GB VRAM
MMLU-Pro 77.2 · GPQA-Diamond 78.8
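
Whether a 12B model fits in 16GB depends heavily on weight precision, which the listing does not state. A back-of-the-envelope sketch, assuming exactly 12 billion parameters (inferred from the "12B" name) and counting weights only; KV cache, activations, and runtime overhead are ignored, so real requirements will be higher.

```python
# Bytes per parameter for common weight formats (illustrative selection).
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}

GEMMA_PARAMS = 12e9   # assumption: "12B" taken at face value
BUDGET_GB = 16.0      # the 16GB figure quoted above


def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Weights-only footprint in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


for precision, bytes_per_param in BYTES_PER_PARAM.items():
    gb = weight_memory_gb(GEMMA_PARAMS, bytes_per_param)
    verdict = "fits" if gb <= BUDGET_GB else "does not fit"
    print(f"{precision}: {gb:.1f} GB of weights, {verdict} in {BUDGET_GB:.0f} GB")
```

Under these assumptions unquantized 16-bit weights alone would exceed 16GB, so the claim most likely presumes 8-bit or lower quantization.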

№4

Qwen3.7 Plus

Alibaba

Best open-source for Chinese-language self-host deployment.

AA Intelligence Index: 53
Best Chinese open-source
Strong tool use
Open weights

Change

Kimi K2.6 (54) tops open weights; DeepSeek V4 Pro (52) #2; Gemma 4 12B brings 16GB-laptop multimodal.

Market

Kimi K2.6 for general-purpose open deployment; DeepSeek V4 Pro for cheap frontier-grade reasoning; Gemma 4 12B for on-device multimodal; Qwen3.7 Plus for Chinese self-host.

Intelligence per Dollar

Cost-Effectiveness / Value

The 2026 value war is led by China's open-source camp. DeepSeek V4 Flash delivers near-flagship intelligence (Index 47) at roughly a tenth of the price — a blended cost near $0.06 per 1M tokens. The leaderboard makes one warning explicit: 'Flash' and 'mini' branding does not mean cheap. Gemini 3.5 Flash scores a strong 55 but costs over 20× more per full Intelligence-Index run.

Current leader

DeepSeek V4 Flash

DeepSeek

The intelligence-per-dollar king. AA Intelligence Index 47 at $0.14/$0.28 per 1M tokens — about a tenth of comparable Flash flagships, with the lowest cache-hit price of any 2026 frontier model.

01 AA Intelligence Index: 47
02 $0.14 in / $0.28 out per 1M
03 Blended ≈ $0.06 / 1M
04 MIT open weights · 1M context

Runners-up

№2

DeepSeek V4 Pro

DeepSeek

Best balance in the high-intelligence + low-price quadrant. Intelligence Index 52 (top-3 overall) at $0.435/$0.87, a fraction of the price of same-tier flagships like GPT-5.5 and Claude Opus.

AA Intelligence Index: 52 (#3/89)
$0.435 in / $0.87 out per 1M
Top-3 intelligence overall
MIT open weights

№3

Qwen3.7 Plus

Alibaba

Keeps the value race from being a single-vendor story. Intelligence Index 53 at $0.40 input — a higher score than V4 Pro; the trade-off is pricier output and slower generation.

AA Intelligence Index: 53
$0.40 in / $1.16 out per 1M
Highest score in the value tier
Strong Chinese + tool use

Change

New category. DeepSeek V4 Flash leads intelligence-per-dollar; open-source models sweep the value tier.

Market

DeepSeek V4 Flash for highest-volume low-cost workloads; V4 Pro when you need stronger reasoning but still want to save; Qwen3.7 Plus to diversify away from a single vendor.
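
The ranking in this category can be reproduced from the listed prices. A minimal sketch, assuming a blended price that weights input and output tokens 3:1 and no cache discounts; the weighting, the `ValueEntry` type, and the "index points per blended dollar" metric are illustrative assumptions, not the leaderboard's published method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValueEntry:
    name: str
    index: int       # AA Intelligence Index as listed above
    usd_in: float    # USD per 1M input tokens
    usd_out: float   # USD per 1M output tokens


def blended_price(entry: ValueEntry, input_parts: float = 3.0, output_parts: float = 1.0) -> float:
    """Weighted average list price per 1M tokens (assumed 3:1 input:output mix)."""
    total_parts = input_parts + output_parts
    return (input_parts * entry.usd_in + output_parts * entry.usd_out) / total_parts


def index_per_dollar(entry: ValueEntry) -> float:
    """Index points per blended dollar; higher means better value."""
    return entry.index / blended_price(entry)


VALUE_TIER = [
    ValueEntry("DeepSeek V4 Flash", 47, 0.14, 0.28),
    ValueEntry("DeepSeek V4 Pro", 52, 0.435, 0.87),
    ValueEntry("Qwen3.7 Plus", 53, 0.40, 1.16),
]

ranked = sorted(VALUE_TIER, key=index_per_dollar, reverse=True)
for entry in ranked:
    print(f"{entry.name}: blended ${blended_price(entry):.3f}/1M, "
          f"{index_per_dollar(entry):.0f} index points per dollar")
```

With these assumptions the order matches the one above: V4 Flash, then V4 Pro, then Qwen3.7 Plus. Note that a 3:1 blend of V4 Flash list prices comes to about $0.175 per 1M tokens, not the ≈ $0.06 quoted, so the quoted figure presumably uses a different mix or includes cache-hit pricing that is not listed here.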

Editorial · 07 observations

What changed this month

What changed across the AI model landscape this month — distilled from the data above.

Anthropic Sweeps Reasoning + Vision + Code

Opus 4.8 takes AA Intelligence Index #1; Opus 4.7-thinking holds LMArena Vision, WebDev, and Document arenas. First time a single lab has held all four leaderboards simultaneously since GPT-4-era OpenAI.

GPT-5.5 Brings 1M Context + Native MCP/Skills

OpenAI's April 24 GPT-5.5 ships with 1M-token context, native MCP, Skills, hosted shell, computer use, tool search, and web search — turning the API itself into an agent runtime.

Google's Gemini Omni — Any-to-Any Generation

May launch of Gemini Omni unifies image / audio / video generation; Antigravity 2.0 platform turns Gemini 3.5 into an agentic substrate. Google's bet: not the smartest single model, but the most integrated stack.

Open-Source Closes to a 7-Point Gap

Kimi K2.6 (54) is within 7 points of Claude Opus 4.8 (61) on AA Intelligence Index. Meta's muse-spark cracks LMArena top-5 at 1489. Closed-source's moat is the smallest it has ever been.

Five Chinese Models in AA Top 15

Qwen3.7 Max (Alibaba, 57), MiniMax-M3 (55), Kimi K2.6 (Moonshot, 54), MiMo-V2.5-Pro (Xiaomi, 54), and Qwen3.7 Plus (Alibaba, 53) all sit in the AA Intelligence Index top 15. Chinese labs are no longer 'catching up'; they are inside the frontier.

Sub-200ms Voice Agents Are Now Commodity

OpenAI Realtime 2 (May 7) + Gemini 3.1 Flash TTS (April) + Fun-Realtime-TTS push real-time voice agents from research to production. Speech-to-speech with reasoning is now a checkbox feature.

The Value War Is a Chinese Open-Source Story

DeepSeek V4 Flash delivers Intelligence Index 47 at a blended ~$0.06 per 1M tokens — open-weights models from DeepSeek and Qwen now sweep the intelligence-per-dollar top tier, while 'Flash'/'mini'-branded closed models like Gemini 3.5 Flash cost over 20× more per Index run.

Sources

[01] Artificial Analysis · benchmark · 2026-06-04
[02] LMArena Leaderboard · community leaderboard · 2026-06-04
[03] Hugging Face Open LLM Leaderboard · community leaderboard · 2026-06-04
[04] OpenAI Changelog · official changelog · 2026-06-04
[05] Anthropic News · official changelog · 2026-06-04
[06] Google DeepMind Blog · official changelog · 2026-06-04
[07] DeepSeek API Pricing · official changelog · 2026-06-04
[08] Google Gemma 4 Launch · official changelog · 2026-06-04
[09] Artificial Analysis Video Arena · benchmark · 2026-06-04