ERNIE 5.0 Tries to Solve Multimodal AI by Treating Everything Like Text

Written by aimodels44 | Published 2026/02/06
Tech Story Tags: ai | ernie-5.0 | multimodal-ai | ernie-5.0-technical-report | unified-multimodal-model | multimodal-tokenization | text-image-audio-video-model | modality-agnostic-routing

TL;DR: ERNIE 5.0 unifies text, images, audio, and video with one autoregressive objective, plus sparse MoE routing, elastic training, and RL at scale.

This is a Plain English Papers summary of a research paper called ERNIE 5.0 Technical Report. If you like this kind of analysis, join AIModels.fyi or follow us on Twitter.

The multimodal prediction problem

We've gotten remarkably good at building specialized models. GPT-4 handles text beautifully. Vision transformers excel at images. But these models live in isolation. Ask them to understand a video with audio and narration simultaneously, to reason across all these modalities, and you run into a wall. Each model speaks its own language. They don't coordinate.

This fragmentation creates two practical headaches. First, it forces developers to string together separate systems, creating latency bottlenecks and integration complexity. Second, it prevents the kind of deep multimodal reasoning that should be possible when you actually have all the information at once. When a single model understands text, vision, video, and audio in a truly unified way, the modalities can inform each other. A caption becomes richer when grounded in actual visual context. A description becomes more accurate when guided by audio cues.

The core challenge is conceptual, not just technical. How do you define a single training objective that makes sense across such fundamentally different modalities? Text unfolds sequentially. Images exist in 2D space. Audio has its own temporal dynamics. Video combines spatial and temporal structure. Each has its own compression schemes, its own natural granularity.

ERNIE 5.0 answers this by asking a simpler question: what if they're all just sequences?

One objective to rule them all

Autoregressive prediction is deceptively powerful. A language model predicts the next word given everything that came before. A vision model can predict the next visual patch. An audio model can predict the next audio frame. The mathematical structure is identical, even though the "next token" looks completely different depending on modality.

This insight unlocks everything. ERNIE 5.0 serializes all modalities into a unified token stream and trains on a single objective: predict the next group of tokens, regardless of what modality they represent. Text gets tokenized normally. Images are broken into patches, compressed through a hybrid CNN-ViT representation, and merged via attention-based patch merging. Audio gets decomposed into residual levels and combined into unified audio tokens. Video combines frame-level and patch-level tokenization to capture both temporal and spatial structure.

ERNIE 5.0 architecture: text, vision, and audio are encoded and serialized under a unified autoregressive paradigm that integrates multimodal understanding and generation.
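
To make the idea concrete, here's a minimal sketch of a mixed-modality token stream feeding one next-token objective. Everything in it is illustrative rather than taken from the report: the vocabulary split, the BOI/EOI marker tokens, and the toy model are placeholders, and it uses plain next-token prediction where ERNIE 5.0 predicts groups of tokens.

```python
import torch
import torch.nn.functional as F

VOCAB = 70_000             # one shared vocabulary covering text, image, and audio tokens
BOI, EOI = 69_000, 69_001  # hypothetical begin/end-of-image marker tokens

def serialize(text_ids, image_ids, audio_ids):
    """Flatten different modalities into one ordered token sequence."""
    return torch.cat([text_ids, torch.tensor([BOI]), image_ids,
                      torch.tensor([EOI]), audio_ids])

def next_token_loss(model, seq):
    """One objective for everything: predict token t+1 from tokens <= t."""
    inputs, targets = seq[:-1].unsqueeze(0), seq[1:].unsqueeze(0)
    logits = model(inputs)                                    # (1, T-1, VOCAB)
    return F.cross_entropy(logits.reshape(-1, VOCAB), targets.reshape(-1))

# Toy stand-in for the transformer: an embedding plus a linear head.
model = torch.nn.Sequential(torch.nn.Embedding(VOCAB, 64), torch.nn.Linear(64, VOCAB))

seq = serialize(torch.randint(0, 30_000, (12,)),       # text tokens
                torch.randint(30_000, 60_000, (16,)),  # image patch tokens
                torch.randint(60_000, 69_000, (8,)))   # audio tokens
loss = next_token_loss(model, seq)
```

The loss never asks which modality a position belongs to; it only asks for the next token.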

The unification goes deeper than just format conversion. By making text, vision, audio, and video all flow through the same prediction objective, the model naturally learns to coordinate between them. When you ask it to caption an image, it's not executing a separate "image-to-text" pathway. It's continuing the same sequence prediction task. Cross-modal reasoning isn't engineered in; it emerges from the training signal.

This is where true multimodal understanding becomes possible. The model learns that certain visual patterns correlate with certain linguistic patterns. Audio context can disambiguate what text should come next. The unified objective forces genuine integration.

Understanding and generation both work through this same framework. For vision understanding, the system extracts visual features and compresses them into tokens that feed the sequence model. For vision generation, the sequence model outputs tokens that get decoded back into images. The same architecture that processes incoming images can produce them.

Vision understanding and generation use a hybrid CNN-ViT representation with an attention-based patch merger for compression.

Audio follows the same pattern. Multiple residual levels get combined to form audio token representations for understanding. For generation, the model outputs audio tokens that get synthesized back into waveforms.

Audio understanding and generation combine embeddings from multiple residual levels into a unified token representation.
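
As a rough illustration, combining residual levels might look like the sketch below, assuming an RVQ-style codec with one codebook per level. The level count, codebook size, and simple summation are assumptions for clarity; the report's actual codec and combination scheme may differ.

```python
import torch

NUM_LEVELS, CODEBOOK_SIZE, DIM = 4, 1024, 64  # illustrative sizes, not ERNIE 5.0's

# One embedding table per residual level (hypothetical).
level_embeddings = torch.nn.ModuleList(
    torch.nn.Embedding(CODEBOOK_SIZE, DIM) for _ in range(NUM_LEVELS)
)

def unified_audio_embedding(level_codes):
    """level_codes: (T, NUM_LEVELS) integer codes from the audio codec.
    Returns one embedding per audio frame by summing the per-level embeddings."""
    return sum(level_embeddings[l](level_codes[:, l]) for l in range(NUM_LEVELS))

codes = torch.randint(0, CODEBOOK_SIZE, (50, NUM_LEVELS))  # 50 audio frames
tokens = unified_audio_embedding(codes)                    # (50, DIM)
```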

Smart routing through a sparse crowd

A trillion-parameter model only works if you can actually run it. Here's the constraint that makes sparse routing essential: activating all parameters on every input would make inference prohibitively expensive and slow. Mixture-of-experts architecture solves this by organizing the model into specialized components and routing each input through only the relevant ones.

ERNIE 5.0 uses ultra-sparse routing, meaning each token might activate only a few billion parameters out of a trillion total. The routing decision determines which experts see which tokens. In traditional MoE, you might have separate expert groups for different modalities. ERNIE 5.0 does something harder: modality-agnostic experts. The same expert pool handles text, images, video, and audio.
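
Here's a stripped-down sketch of what modality-agnostic top-k routing looks like. The expert count, top-k value, and expert MLPs are illustrative, not ERNIE 5.0's actual configuration, and a production implementation would batch this far more efficiently.

```python
import torch

NUM_EXPERTS, TOP_K, DIM = 64, 4, 128

router = torch.nn.Linear(DIM, NUM_EXPERTS, bias=False)
experts = torch.nn.ModuleList(
    torch.nn.Sequential(torch.nn.Linear(DIM, 4 * DIM), torch.nn.GELU(),
                        torch.nn.Linear(4 * DIM, DIM))
    for _ in range(NUM_EXPERTS)
)

def moe_layer(tokens):
    """tokens: (N, DIM) hidden states from any modality, already in one stream."""
    scores = router(tokens)                                  # (N, NUM_EXPERTS)
    weights, chosen = scores.softmax(-1).topk(TOP_K, dim=-1)
    weights = weights / weights.sum(-1, keepdim=True)        # renormalize over top-k
    out = torch.zeros_like(tokens)
    for k in range(TOP_K):                                   # loops for clarity, not speed
        for e in range(NUM_EXPERTS):
            mask = chosen[:, k] == e
            if mask.any():
                out[mask] += weights[mask, k, None] * experts[e](tokens[mask])
    return out

hidden = torch.randn(32, DIM)   # 32 tokens; text, image, and audio tokens mix freely
output = moe_layer(hidden)
```

Because the router never sees a modality label, any specialization has to emerge from the token statistics themselves.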

This creates an interesting optimization problem. The router has to learn which computational patterns help across all modalities, not just within one. Through training, the system discovers which experts are universal pattern-matchers (useful for everything) versus which specialize in coordinating between specific modalities.

The early layers show broad expert activation patterns, handling raw signal diversity. Middle layers activate more selectively, specializing in cross-modal coordination. Later layers return to broader activation, synthesizing diverse signals into coherent output.

Expert utilization patterns show which experts activate frequently across modalities and tasks in early, middle, and final layers.

When you measure expert overlap between modalities using intersection-over-union, a striking pattern emerges. The top experts for text overlap substantially with the top experts for images, which overlap with audio experts. But the overlap isn't complete. This balance is precisely what you'd want: enough shared representation for genuine multimodal reasoning, enough specialization to handle modality-specific structure.

Expert collaboration patterns measured through intersection over union of the top 25% most frequently activated experts across modalities.
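
The metric itself is simple to compute. Here's a small sketch that derives it from per-expert activation counts, with random counts standing in for the routing statistics measured in the paper.

```python
import numpy as np

def top_expert_set(activation_counts, fraction=0.25):
    """Return the indices of the top `fraction` most activated experts."""
    k = max(1, int(len(activation_counts) * fraction))
    return set(np.argsort(activation_counts)[-k:])

def expert_iou(counts_a, counts_b, fraction=0.25):
    a, b = top_expert_set(counts_a, fraction), top_expert_set(counts_b, fraction)
    return len(a & b) / len(a | b)

rng = np.random.default_rng(0)
text_counts = rng.integers(0, 1000, size=64)    # per-expert activation counts (placeholder)
image_counts = rng.integers(0, 1000, size=64)
print(f"text/image expert IoU: {expert_iou(text_counts, image_counts):.2f}")
```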

This matters beyond just efficiency. By forcing experts to serve multiple modalities, the model can't fall back on modality-specific hacks. It has to learn generalizable patterns. The unified routing becomes the mechanism through which true multimodal understanding actually happens.

The elastic training revolution

Once ERNIE 5.0 exists in full form, how do you handle deployment? Cloud datacenters have utterly different constraints than edge devices. Some users need raw performance. Others need to minimize latency. Still others are constrained by memory.

The traditional solution is wasteful: train models of different sizes separately. Or worse, train once and hope pruning works. Elastic training inverts this problem. During a single pre-training run, ERNIE 5.0 learns to operate at multiple depths (fewer layers), widths (fewer experts per layer), and sparsity levels (activating fewer experts). It's one training run that produces a family of sub-models.

The elastic training framework supports elastic depth (varying the number of active layers), elastic width (varying the number of experts per layer), and elastic sparsity (varying how many experts activate).

The mechanism works through randomization during training. Sometimes the model uses only 64 of its 112 layers. Other times 80. Other times all 112. It learns that good predictions are possible at different depths, not just at full capacity. Similarly, the number of active experts varies throughout training.
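
Here's a hedged sketch of the elastic-depth piece of that idea: each training step samples how many layers to run, so shallower sub-models receive their own gradient signal. The toy layers, loss, and optimizer are placeholders; only the depth choices (64, 80, 112) come from the example above, and elastic width and sparsity would be sampled the same way.

```python
import random
import torch

FULL_DEPTH, DIM = 112, 64
DEPTH_CHOICES = [64, 80, 112]           # depths mentioned above

layers = torch.nn.ModuleList(torch.nn.Linear(DIM, DIM) for _ in range(FULL_DEPTH))
head = torch.nn.Linear(DIM, 1)
optimizer = torch.optim.AdamW(list(layers.parameters()) + list(head.parameters()))

def elastic_step(x, target):
    depth = random.choice(DEPTH_CHOICES)     # sample a sub-model depth for this step
    h = x
    for layer in layers[:depth]:             # run only the first `depth` layers
        h = torch.relu(layer(h))
    loss = torch.nn.functional.mse_loss(head(h), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return depth, loss.item()

for _ in range(3):
    print(elastic_step(torch.randn(8, DIM), torch.randn(8, 1)))
```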

The resulting sub-models aren't degraded versions of the full model. They're optimized in their own right, having received diverse training signals about what works well at their particular scale. When deployment time comes, you extract the sub-model that matches your constraints. No retraining needed. The optimization already happened.

This solves one of AI's growing pains: model fragmentation. Right now, companies maintain separate small, medium, and large versions of popular models. Elastic training means training once and deploying many times. It's also more efficient during training itself, because the gradient signal from each sampled configuration flows into parameters shared by the whole family of sub-models.

Teaching models through reinforcement learning at scale

Autoregressive prediction is a foundation, but it's not sufficient for high-quality generation. A model trained purely on "predict the next token" learns statistical patterns, but not necessarily what humans actually want. Reinforcement learning bridges that gap by using feedback on generation quality to refine the model.

Applying RL to a sparse, multimodal trillion-parameter model creates novel challenges. Traditional RL training assumes stable batch sizes and consistent computation. But sparse MoE means different inputs take different paths through the network. Some queries are computationally cheap. Others route through expensive experts. During synchronized training, slow sequences hold up fast ones. Long-tail queries (the weird, hard ones) become bottlenecks.

ERNIE 5.0 introduces three techniques to solve this. The Unbiased Replay Buffer ensures that every query type, even rare ones, gets fairly represented in RL updates. It assigns unique indices to queries and tracks replay probability explicitly, preventing edge cases from being forgotten.

The unbiased replay buffer ensures fair representation of long-tail queries compared to standard synchronous RL approaches.
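
In simplified form, such a buffer might look like the sketch below: every query gets a stable index and an explicit replay count, and sampling down-weights queries that have already been replayed often so rare ones keep showing up. The inverse-count weighting is an assumption for illustration, not the report's exact scheme.

```python
import random

class UnbiasedReplayBuffer:
    def __init__(self):
        self.queries = []        # query payloads, addressed by index
        self.replay_counts = []  # how many times each query has been sampled

    def add(self, query):
        self.queries.append(query)
        self.replay_counts.append(0)
        return len(self.queries) - 1          # unique index for this query

    def sample(self, k):
        # Weight each query inversely to how often it has already been replayed.
        weights = [1.0 / (1 + c) for c in self.replay_counts]
        picked = random.choices(range(len(self.queries)), weights=weights, k=k)
        for i in picked:
            self.replay_counts[i] += 1
        return [self.queries[i] for i in picked]

buffer = UnbiasedReplayBuffer()
for q in ["common math query"] * 8 + ["rare multimodal edge case"]:
    buffer.add(q)
print(buffer.sample(4))
```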

For stability under sparse routing variability, the framework introduces IcePop, a technique that smoothly transitions between mixed training objectives and gradient-based policy optimization. This keeps training dynamics stable even when different sequences have wildly different computational costs.

IcePop training dynamics show improved stability through a smooth transition between mixed and GSPO objectives during RL training.

The third innovation addresses sparse reward problems. When the reward is sparse and binary (like "is this answer correct?"), providing hints about intermediate reasoning helps. Adaptive Hint-based Reinforcement Learning uses think-skeletons, where the model reasons through steps and gets guidance on whether each step makes sense. This accelerates learning on hard queries where the final reward signal alone doesn't provide enough gradient.

Adaptive hint-based reinforcement learning introduces think skeletons to guide hard queries and mitigate sparse rewards.
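
In spirit, the mechanism can be sketched as a gate on hint injection: queries the model rarely solves get a think-skeleton prepended to the prompt, and easy ones are left alone. The threshold and skeleton format below are invented for illustration.

```python
def build_prompt(query, success_rate, skeleton, threshold=0.2):
    """Attach step-by-step hints only when the model rarely solves the query."""
    if success_rate < threshold:
        hints = "\n".join(f"Step {i + 1}: {step}" for i, step in enumerate(skeleton))
        return f"{query}\n\nUse this outline:\n{hints}"
    return query

prompt = build_prompt(
    query="How many liters of paint cover the described wall?",
    success_rate=0.05,                       # hard query: almost never solved
    skeleton=["compute the wall area", "divide by coverage per liter"],
)
print(prompt)
```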

Together, these techniques let you apply human feedback and task-specific optimization to massive multimodal models without triggering training instability. This is where the model goes from statistically reasonable to actually useful.

What the model actually learned

After training completes, a natural question emerges: did the unified objective and modality-agnostic routing actually create genuine multimodal understanding? Or did the model just learn separate pathways under the hood, with different experts specializing in different modalities despite the unified training?

The answer lies in visualization. Expert activation patterns reveal the model's internal structure. Early layers show frequent activation across diverse experts, suggesting they handle raw signal diversity without strong specialization. Middle layers show more selective patterns, indicating specialization in specific transformations. Later layers broaden again, coordinating diverse signals into coherent output.

What's notable is the consistency across modalities. The same expert frequently activates whether processing text, images, audio, or video. But not all experts activate equally for all modalities. Instead, a clear structure emerges: certain experts are universal pattern-matchers that help everything. Others specialize in specific cross-modal reasoning tasks. The model found a real balance.

Expert collaboration metrics deepen this picture. By measuring the overlap in top-activated experts between modalities, you can see how much coordination happens. The overlap is substantial, confirming that the model built shared representations. But it's not complete, meaning the model preserved room for modality-specific optimization. This isn't collapse to false universality, but genuine multimodal coordination.

This visualization confirms what the unified training objective predicted: true multimodal understanding emerged from the architecture, not from clever prompt engineering or explicit knowledge distillation. The model learned that reasoning about text informs vision understanding and vice versa. Cross-modal patterns became features the model actively used.

Production-scale unification

ERNIE 5.0 represents a meaningful shift in how foundation models can be built. For years, the industry assumed specialization was necessary: GPT-4 for language, separate models for vision. ERNIE 5.0 demonstrates that unified multimodal training at scale is viable and practically useful.

The elastic training paradigm addresses a real pain point in deployment. As companies scale AI systems, they spend nearly as much engineering effort adapting models to different hardware as they do on research itself. Training once and deploying many times cuts that burden substantially.

The RL techniques open new possibilities for multimodal alignment. You can now apply human feedback to a unified model handling all modalities simultaneously, creating coordination that training RL on separate models could never achieve.

Related work on sparse MoE systems like TeleChat3 has explored similar routing efficiency gains in language models, while research on temporal-textual multimodal frameworks has tackled cross-modal coordination problems from different angles. ERNIE 5.0 combines these insights within a production system and provides detailed empirical analysis of what actually happens when you unify modalities at this scale.

The detailed visualizations of expert routing provide something rare: interpretability into how massive models organize themselves. You can see which patterns matter for which tasks, and how the model balances universal versus specialized computation. This offers concrete guidance for scaling unified models further and understanding whether true multimodal reasoning actually emerges during training.

For practitioners, ERNIE 5.0 proves that unified multimodal models are deployable today. For researchers, the empirical analysis of elastic training and modality-agnostic routing provides a foundation for thinking about the next generation of foundation models.


Written by aimodels44 | Among other things, launching AIModels.fyi ... Find the right AI model for your project - https://aimodels.fyi