
This New Language Could Kill NVIDIA's GPU Monopoly

By Thomas Cherickal, 29 min read, 2025/07/11

TL;DR

Multi-Level Intermediate Representation (MLIR) and the Mojo programming language represent a fundamental paradigm shift in how we design, optimize, and deploy software for any hardware.


All images were generated by the author, free of charge, with NightCafe Studio; see the link in the footer.

The era of high-performance computing has been defined by a single name: CUDA.

NVIDIA's platform unlocked the power of the GPU and became the de facto standard.

For over a decade, programming a GPU has meant programming in CUDA.

But this dominance has created a cage, locking progress to a single vendor.

But today, in mid-2025, things are changing.

The computing world is now undergoing a radical transformation towards heterogeneity.

We are seeing a proliferation of specialized hardware:

  • Intel Gaudi Series:

    Intel's Gaudi processors are designed specifically for deep learning training and inference, offering a competitive alternative to Nvidia's GPUs.


  • AMD Instinct MI Series:

    AMD's MI series of GPUs is designed for high-performance computing and AI workloads, providing an alternative to Nvidia's data center GPUs.


  • Groq Tensor Streaming Processor (TSP):

    Groq's TSP architecture is designed for low-latency inference and high throughput, particularly for large language models.


  • Google TPUs (Tensor Processing Units):

    Google's TPUs are custom-designed chips optimized for machine learning workloads, particularly in Google's cloud infrastructure.


  • AWS Trainium:

    AWS Trainium is a chip designed for machine learning training, offering high performance and cost-effectiveness.

And more and more startups are building custom silicon every day.

This new, diverse landscape demands a new programming philosophy.

Enter the Multi-Level Intermediate Representation (MLIR) and the Mojo programming language.

Arcane glyphs? Pretty sure that's not Mojo code...

These are not just more competitors; they represent a fundamental paradigm shift.

This is a revolution in how we design, optimize, and deploy software for any hardware.

This article will deeply explore the architectural chasm between CUDA and MLIR.


  1. We will use complete, working code examples to provide concrete, practical comparisons.
  2. We will explore why MLIR is a breakthrough beyond its respected predecessor, LLVM.
  3. We will argue that Mojo is the superior long-term solution.
  4. We will analyze why this new stack is a game-changer for cost and speed.

The impact extends to critical emerging fields such as Generative AI, Quantum Computing, and even Blockchain.

We will also look to the future, covering mining ASICs, Neuromorphic Computing, and specialized hardware for the sparse data flows that GPUs struggle with.

This is the story of the end of one era and the dawn of a new one.

To understand the scale of this shift, we must first understand the four key players.

1. CUDA: The Powerful, Proprietary Incumbent


CUDA stands for Compute Unified Device Architecture.

It is NVIDIA's parallel computing platform and programming model.

It lets developers write C++-like code, called kernels, that runs on NVIDIA GPUs.

CUDA's Strengths:

Its ecosystem of libraries is mature and unmatched:

  • Mathematical Libraries:
    • cuBLAS: For basic linear algebra subprograms (BLAS).
    • cuRAND: For random number generation.
    • cuFFT: For Fast Fourier Transforms.
    • cuSPARSE: For sparse matrix operations.
    • cuTENSOR: For tensor operations.
    • cuSOLVER: For dense and sparse direct solvers.
  • Parallel Algorithm Libraries:
    • nvGRAPH: For graph algorithms.
    • Thrust: For parallel algorithms and data structures.
  • Communication Libraries:
    • NVSHMEM: For partitioned global address space (PGAS) programming.
    • NCCL: For multi-GPU and multi-node collective communication.
  • Deep Learning Libraries:
    • cuDNN: For deep neural network computations.
    • TensorRT: For optimized deep learning inference.
    • Riva: For conversational AI.
    • DALI: For data loading and augmentation for deep learning.

It gives experts direct, low-level control over the hardware, enabling very high performance.

Its long history has built a massive community with vast documentation and support.

CUDA's Fatal Flaw: The Cage

Vendor Lock-In: CUDA code runs only on NVIDIA GPUs.

This locks developers, and entire industries, into a single, expensive hardware vendor.

It stifles competition and limits the freedom to choose the best hardware for the job.

The Two-Language Problem: A Major Barrier in AI and Scientific Computing

Researchers prototype in a high-level language like Python for its simplicity and speed of iteration.

But for production, performance-critical code must be completely rewritten in low-level C++/CUDA.

This creates a painful and expensive disconnect that slows the path from research to deployment.

Programming Complexity:

CUDA is powerful, but notoriously complex and verbose.

Developers are forced to manage memory manually, transferring data between the CPU (host) and the GPU (device).

Developers must also act as hardware programmers, managing thread blocks, grids, and synchronization.

This complexity means a steep learning curve and is a constant source of subtle bugs.

2. LLVM: The Foundation and Its "Semantic Gap"

The LLVM project is a collection of modular and reusable compiler technologies.

At its core is the LLVM Intermediate Representation (IR), a low-level, assembly-like language.

LLVM became the standard for modern compiler backends, especially for CPUs.

A compiler frontend (such as Clang for C++) translates source code into LLVM IR.

The LLVM backend then optimizes this IR and converts it into machine code for a specific CPU.

This modularity was revolutionary for its time.

However, LLVM was designed for a CPU-centric world.

Its IR is too low-level for the new world of heterogeneous hardware.


It loses crucial high-level information from the source code, a problem known as the "semantic gap."


For example, when compiling a TensorFlow model, the knowledge that an operation is a Convolution is lost.


LLVM IR only sees a generic collection of loops and arithmetic instructions.


This prevents the compiler from performing powerful, domain-specific optimizations.


It no longer understands the programmer's high-level intent.


This is the essence of the “semantic gap problem.”
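To make the semantic gap concrete, here is a minimal, hypothetical C++ sketch (not taken from TensorFlow or any real framework). At the framework level the operation is a single named node, conv2d; by the time LLVM IR sees it, only the anonymous loops below remain, so the compiler can no longer apply convolution-specific tricks such as Winograd, FFT-based convolution, or a vendor-tuned kernel.

(C++)

// Hypothetical illustration of the "semantic gap".
// At the framework level this was ONE named node: conv2d(input, kernel).
// After lowering, LLVM IR sees only the loop nest and float math below;
// the fact that this is a convolution has been erased.
void conv2d_lowered(const float* input, const float* kernel, float* output,
                    int H, int W, int K) {
    const int outW = W - K + 1;
    for (int y = 0; y + K <= H; ++y) {          // generic loop
        for (int x = 0; x + K <= W; ++x) {      // generic loop
            float acc = 0.0f;
            for (int ky = 0; ky < K; ++ky)      // generic loop
                for (int kx = 0; kx < K; ++kx)  // generic multiply-add
                    acc += input[(y + ky) * W + (x + kx)] * kernel[ky * K + kx];
            output[y * outW + x] = acc;
        }
    }
}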


And this problem is exactly what MLIR solves.

3. MLIR: The Universal Translator for Hardware

MLIR was born at Google from the need to compile TensorFlow for CPUs, GPUs, and its TPUs.

The team realized that LLVM's single, low-level IR was not enough.

MLIR's breakthrough is a unified infrastructure for defining and composing many IRs.

These composable IRs are called dialects.

MLIR acts like a universal translator, fluent in everything from high-level concepts down to low-level machine details.

High-level dialects can represent domain-specific concepts directly.

For example, a "TensorFlow dialect" has an operation for tf.conv2d.

A "Linear Algebra dialect" has an operation for linalg.matmul.

This preserves the crucial semantic information that LLVM throws away.

This enables a powerful compilation strategy called progressive lowering.

  1. The compiler starts with a high-level dialect representation.
  2. It performs high-level, domain-specific optimizations on that dialect.
  3. It then progressively "lowers" the code through a series of intermediate dialects.
  4. Each intermediate dialect performs its own specialized optimizations.
  5. Finally, it reaches a low-level dialect, such as the LLVM IR dialect, for final machine code generation.

This process preserves high-level context for as long as possible.

This enables vastly superior optimizations for any hardware target.
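As a rough mental model, and only a toy sketch (plain C++, not real MLIR syntax, passes, or APIs), progressive lowering can be pictured as a series of rewrites over a list of named operations, where each stage gives up only as much high-level information as it must:

(C++)

// Toy model of progressive lowering. Real MLIR operations carry types,
// attributes, and regions; a "dialect.name" pair is enough to show the idea.
#include <iostream>
#include <string>
#include <vector>

struct Op { std::string dialect, name; };
using IR = std::vector<Op>;

// Stage 1: domain-level optimization. Because the ops are still named things
// like "linalg.matmul", the pass can recognize and fuse them.
IR optimize_high_level(IR ir) {
    for (std::size_t i = 0; i + 1 < ir.size(); ++i)
        if (ir[i].name == "matmul" && ir[i + 1].name == "add") {
            ir[i].name = "matmul_add_fused";   // fuse while the intent is still visible
            ir.erase(ir.begin() + i + 1);
        }
    return ir;
}

// Stage 2: lower named tensor ops into explicit loop nests.
IR lower_to_loops(const IR& ir) {
    IR out;
    for (const Op& op : ir)
        if (op.dialect == "linalg")
            out.insert(out.end(), {{"loop", "for.i"}, {"loop", "for.j"},
                                   {"loop", "for.k"}, {"arith", "mul_add"}});
        else
            out.push_back(op);
    return out;
}

// Stage 3: lower everything to a machine-level dialect for code generation.
IR lower_to_llvm(const IR& ir) {
    IR out;
    for (const Op& op : ir) out.push_back({"llvm", op.name});
    return out;
}

int main() {
    IR ir = {{"linalg", "matmul"}, {"linalg", "add"}};
    ir = optimize_high_level(ir); // high-level intent still available here
    ir = lower_to_loops(ir);      // now generic loops
    ir = lower_to_llvm(ir);       // now machine-level; the "matmul" fact is gone
    for (const Op& op : ir) std::cout << op.dialect << "." << op.name << "\n";
}

The point of the staging is that the fusion in stage 1 is only possible because the op is still recognizably a matmul; once stage 2 has run, that opportunity is gone forever.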

MLIR is the missing link between high-level languages and diverse silicon.


4. Mojo: The User-Friendly Face of MLIR's Power

If MLIR is the powerful, complex engine, Mojo is the sleek, intuitive user interface.

Mojo was created by Chris Lattner, the original architect of LLVM and the Swift language.

It is designed from first principles to be the perfect language for the MLIR era.

In this regard, it is the most technologically advanced language today.

Even Rust is based on LLVM and has all of LLVM’s shortcomings.

Mojo is the only major programming language today based on MLIR.

Mojo's Key Features:

A Superset of Python:

  • Mojo is designed to be fully compatible with the existing Python ecosystem.
  • This is a killer feature!
  • It lets developers import and use any Python library, such as NumPy, Pandas, or Matplotlib.
  • It completely bypasses the "cold start" problem that new languages face by tapping into Python's vast ecosystem.

True Systems Programming Features:

  • Unlike Python, Mojo is a compiled language with strong static typing.
  • This eliminates entire classes of runtime errors and enables C++-level performance optimizations.
  • It introduces modern memory management concepts like ownership and borrowing (from Rust) for memory safety without the overhead of a garbage collector.

First-Class MLIR Integration:

  • Mojo exposes the full power of MLIR directly to the developer.
  • Programmers can write high-level, Pythonic code for most of their application.
  • When maximum performance is needed, they can drop down to specific MLIR dialects and write low-level kernels.
  • Crucially, this can all be done in the same file, in the same language.

Mojo elegantly solves the "two-language problem."

Full Code Examples and Analysis

Theory is one thing; practice is another.

The following complete, working code examples demonstrate the profound difference between the two paradigms.

Example 1: Matrix Multiplication

This is the "Hello, World!" of high-performance computing, and it clearly reveals each platform's core philosophy.

The Full CUDA Implementation

This is a complete, compilable CUDA program for matrix multiplication.

(CUDA C++)

// Filename: matmul.cu
// To compile: nvcc matmul.cu -o matmul_cuda

#include <iostream>
#include <vector>
#include <cuda_runtime.h>

// Helper to check for CUDA errors
#define CUDA_CHECK(err) { \
    cudaError_t err_code = err; \
    if (err_code != cudaSuccess) { \
        std::cerr << "CUDA Error: " << cudaGetErrorString(err_code) << " at line " << __LINE__ << std::endl; \
        exit(EXIT_FAILURE); \
    } \
}

// CUDA Kernel for Matrix Multiplication (Device Code)
__global__ void matrixMulKernel(float* C, const float* A, const float* B, int N) {
    // Calculate the global row and column index of the element
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    // Boundary check to avoid accessing out-of-bounds memory
    if (row < N && col < N) {
        float p_value = 0.0f;
        // Each thread computes one element of the result matrix C
        for (int k = 0; k < N; ++k) {
            p_value += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = p_value;
    }
}

// Main function (Host Code)
int main() {
    const int N = 256;
    const int size = N * N * sizeof(float);

    // Step 1. Allocate host memory
    std::vector<float> h_A(N * N);
    std::vector<float> h_B(N * N);
    std::vector<float> h_C(N * N);

    // Initialize host matrices
    for (int i = 0; i < N * N; ++i) {
        h_A[i] = static_cast<float>(rand()) / RAND_MAX;
        h_B[i] = static_cast<float>(rand()) / RAND_MAX;
    }

    // Step 2. Allocate device memory
    float *d_A, *d_B, *d_C;
    CUDA_CHECK(cudaMalloc((void**)&d_A, size));
    CUDA_CHECK(cudaMalloc((void**)&d_B, size));
    CUDA_CHECK(cudaMalloc((void**)&d_C, size));

    // Step 3. Copy matrices from host to device
    std::cout << "Copying data from host to device..." << std::endl;
    CUDA_CHECK(cudaMemcpy(d_A, h_A.data(), size, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_B, h_B.data(), size, cudaMemcpyHostToDevice));

    // Step 4. Define kernel launch configuration
    // Use 16x16 threads per block, a common choice
    dim3 threadsPerBlock(16, 16);
    // Calculate the number of blocks needed in each dimension
    dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x, (N + threadsPerBlock.y - 1) / threadsPerBlock.y);

    // Step 5. Launch the kernel on the device
    std::cout << "Launching kernel..." << std::endl;
    matrixMulKernel<<<numBlocks, threadsPerBlock>>>(d_C, d_A, d_B, N);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize()); // Wait for the kernel to finish

    // Step 6. Copy the result matrix back from device to host
    std::cout << "Copying result from device to host..." << std::endl;
    CUDA_CHECK(cudaMemcpy(h_C.data(), d_C, size, cudaMemcpyDeviceToHost));

    // Step 7. Free device memory
    CUDA_CHECK(cudaFree(d_A));
    CUDA_CHECK(cudaFree(d_B));
    CUDA_CHECK(cudaFree(d_C));

    std::cout << "CUDA Matrix Multiplication finished successfully." << std::endl;
    // (Optional: Add verification step here)

    return 0;
}

Analysis of the CUDA Code:

The code is dominated by boilerplate and low-level management.

Steps 1, 2, 3, 6, and 7 are purely for managing memory across the CPU/GPU boundary.

It is tedious, error-prone, and obscures the core algorithm.

The __global__ keyword, blockIdx, threadIdx, and the <<<...>>> syntax are CUDA-specific hardware abstractions.

This code is fundamentally and permanently tied to NVIDIA's hardware architecture.

The actual algorithm, three nested loops, is a tiny fraction of the total code.

The programmer's mental overhead is spent on hardware management, not on the problem itself.

The Full Mojo Implementation

This Mojo version achieves the same result with breathtaking simplicity and power.

(Mojo)

# Filename: matmul.mojo
# To run: mojo matmul.mojo

from memory import DType, Tensor
from random import rand
from time import now

fn matmul_naive(C: Tensor[DType.float32], A: Tensor[DType.float32], B: Tensor[DType.float32]):
    """A naive, high-level implementation of matrix multiplication."""
    let N = A.dim(0)
    let M = A.dim(1)
    let P = B.dim(1)

    for i in range(N):
        for j in range(P):
            var sum: Float32 = 0.0
            for k in range(M):
                sum += A.load(i, k) * B.load(k, j)
            C.store(i, j, sum)

fn main():
    let N = 256
    
    # 1. Allocate and initialize tensors.
    # Mojo's Tensor handles memory allocation automatically.
    # The compiler will place it in the most appropriate memory space.
    var A = Tensor[DType.float32](N, N)
    var B = Tensor[DType.float32](N, N)
    var C = Tensor[DType.float32](N, N)

    for i in range(N):
        for j in range(N):
            A.store(i, j, rand[DType.float32]())
            B.store(i, j, rand[DType.float32]())

    print("Starting Mojo Matrix Multiplication...")
    
    let start_time = now()
    
    # 2. Call the function.
    # The MLIR-based compiler optimizes this high-level code.
    # It can automatically tile, vectorize, and parallelize this code
    # for the target hardware (CPU, GPU, etc.).
    matmul_naive(C, A, B)

    let end_time = now()
    let duration_ms = (end_time - start_time) / 1_000_000.0

    print("Mojo Matrix Multiplication finished successfully.")
    print("Execution time:", duration_ms, "ms")
    # (Optional: Print a corner of the result matrix to verify)
    print("Result C[0,0]:", C.load(0,0))

And that is all!

The Mojo Approach is Far Superior

Programmability and Focus:

  • The Mojo code is clean and expresses the algorithm directly.
  • The programmer focuses on the what (the math), not the how (memory transfers).
  • There is no manual cudaMalloc, cudaMemcpy, or cudaFree.
  • That entire class of errors is gone.

Abstraction with Performance:

  • The simple nested loops are not what actually gets executed.
  • The MLIR-based compiler performs sophisticated transformations.
  • It transforms this simple code into a highly optimized kernel.
  • It can apply tiling, vectorization, and parallelization automatically.
  • The programmer can add hints (like @vectorize or @parallelize) to guide the compiler, achieving control without complexity.

Portability (The Ultimate Advantage):

  • This is the key point.
  • The same matmul.mojo file can be recompiled to run on an NVIDIA GPU, an AMD GPU, an Intel CPU with AVX-512, or a Google TPU.
  • The logic stays the same; only the compiler backend changes.
  • The CUDA code would require a complete, costly rewrite for each new hardware target.
  • Mojo offers "performance portability," breaking vendor lock-in and future-proofing the code.

MLIR-based Mojo is undeniably set to replace LLVM-based CUDA, and developers will enjoy the change!

Example 2: Gen AI and the Transformer Attention Mechanism

The "attention" mechanism is the heart of models like GPT-4 and is a major computational bottleneck.

Optimizing it is critical.
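As a quick reminder of the math involved, standard scaled dot-product attention computes

\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \]

For a sequence of length N, the score matrix QK^T has N x N entries, so a naive implementation must materialize and re-read an enormous intermediate matrix; that memory traffic, far more than the arithmetic, is the bottleneck FlashAttention attacks.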

The CUDA Implementation (Conceptual FlashAttention)

FlashAttention is a landmark algorithm that manually and expertly orchestrates data movement between the GPU's slow main memory (HBM) and its fast on-chip memory (SRAM) to reduce this bottleneck.

The real code is thousands of lines long and incredibly complex.

The links to the components of the full algorithm implementation are given below:

https://github.com/Dao-AILab/flash-attention/blob/main/csrc/flash_attn/src/flash_fwd_kernel.h

https://github.com/Dao-AILab/flash-attention/blob/main/csrc/flash_attn/flash_api.cpp

Together, they are almost 3000 lines long.

The repository contains thousands of files.

The learning curve and the onboarding effort are both steep.

A simplified (AI-generated) version looks like this:

(CUDA C++)

// This is a simplified conceptual view of a FlashAttention-style CUDA kernel.
// The actual implementation is far more complex.

template<typename Kernel_traits>
__global__ void flash_attention_fwd_kernel(Flash_fwd_params params) {

    // 1. Incredibly complex setup code
    // Calculates dozens of pointers and indices for HBM and shared memory (SRAM)
    const int block_row_idx = blockIdx.x;
    const int head_idx = blockIdx.y;
    // ... many more calculations ...

    // 2. Explicitly allocate shared memory tiles for Q, K, V
    // The developer must manage this limited resource manually.
    extern __shared__ char smem[];
    float* sQ = (float*)smem;
    float* sK = sQ + kTileM * kTileK;
    float* sV = sK + kTileN * kTileK;

    // 3. Main loop over the sequence, manually loading blocks
    for (int k_block_idx = 0; k_block_idx < params.k_num_blocks; ++k_block_idx) {

        // Manually orchestrate asynchronous loads from HBM into SRAM
        // to hide memory latency. This is extremely difficult to get right.
        load_qkv_block_from_hbm(params, ...);
        __syncthreads(); // Hard synchronization barrier

        // Manually perform matrix multiplication in fast SRAM
        compute_sram_matmul(sQ, sK, ...);

        // Recompute softmax "online" to avoid writing the huge intermediate
        // attention score matrix back to slow HBM. This is the core trick.
        compute_online_softmax(...);
        __syncthreads();

        // Update the output block
        update_output_block(sV, ...);
    }

    // 4. Manually write the final output block back to HBM
    store_output_to_hbm(params, ...);
}

Analysis of the CUDA/FlashAttention Approach:

  • It is a masterpiece of manual, hardware-specific engineering.
  • It achieves incredible performance by treating the GPU as a manually programmed machine.
  • But it leaves the code nearly unreadable, unmaintainable, and unportable.
  • Only a handful of world-class experts can write or modify code like this.
  • It represents the peak of performance inside a closed ecosystem, but also the peak of complexity and rigidity.

The Conceptual Mojo Implementation

The Mojo version expresses the same algorithmic ideas (tiling, online softmax) at a high level, delegating the hardware orchestration to the MLIR compiler.

(Mojo)

from memory import DType, Tensor
from algorithm import parallelize

struct AttentionParams:
    var is_causal: Bool
    # ... other model parameters

# This function is a high-level, portable description of the FlashAttention algorithm.
fn flash_attention[T: DType](Q: Tensor[T], K: Tensor[T], V: Tensor[T], params: AttentionParams) -> Tensor[T]:
    # Define problem dimensions from input tensors
    let num_batches = Q.dim(0)
    let num_heads = Q.dim(2)
    let seqlen_q = Q.dim(1)
    let seqlen_k = K.dim(1)
    
    # Define tunable tiling parameters. The compiler can use these as hints.
    alias BLOCK_M: Int = 128
    alias BLOCK_N: Int = 64

    # The output tensor
    var O = Tensor[T](Q.dims)

    # The @parallelize decorator tells the compiler to map this function
    # over the available hardware parallelism (e.g., CUDA thread blocks or CPU cores).
    @parallelize(num_batches * num_heads)
    fn compute_head(batch_idx: Int, head_idx: Int):
        
        # Define per-worker accumulators. The compiler will map these
        # to the fastest available memory (e.g., registers or SRAM).
        var o_i = Tensor[T](seqlen_q, V.dim(3))
        var l_i = Tensor[T](seqlen_q) # Stores the denominator of the softmax
        var m_i = Tensor[T](seqlen_q) # Stores the max of each row for stable softmax
        o_i.zero()
        l_i.fill(0.0)
        m_i.fill(-50000.0) # A large negative number standing in for negative infinity

        # Loop over blocks of the Key/Value sequence
        for j in range(0, seqlen_k, BLOCK_N):
            # 1. Load tiles of K and V.
            # The compiler is responsible for generating the optimal code
            # to move this data from main memory to fast memory.
            let k_j = K.load_tile[BLOCK_N](batch_idx, j, head_idx)
            let v_j = V.load_tile[BLOCK_N](batch_idx, j, head_idx)
            
            # Loop over blocks of the Query sequence
            for i in range(0, seqlen_q, BLOCK_M):
                # 2. Load tile of Q.
                let q_i = Q.load_tile[BLOCK_M](batch_idx, i, head_idx)
                
                # 3. Compute attention scores for the tile. This is a simple matmul.
                let s_ij = q_i @ k_j.transpose()
                
                # Causal masking for decoder models like GPT
                if params.is_causal:
                    # Algorithmic logic, no hardware specifics
                    apply_causal_mask(s_ij, i, j)

                # 4. Perform the "online softmax" update.
                # This is pure mathematical logic, not memory management.
                let m_ij = row_max(s_ij)
                let p_ij = exp(s_ij - m_ij)
                let l_ij = row_sum(p_ij)
                
                let m_new = max(m_i, m_ij)
                let l_new = exp(m_i - m_new) * l_i + exp(m_ij - m_new) * l_ij

                # Update output tile
                o_i = (l_i / l_new * exp(m_i - m_new)) * o_i + (exp(m_ij - m_new) / l_new) * (p_ij @ v_j)

                # Update softmax stats
                l_i = l_new
                m_i = m_new

        # 5. Store the final output. The compiler manages the write-back.
        O.store_tile(batch_idx, head_idx, o_i)
    
    compute_head()
    return O

One file.

Under 100 lines of code.

No brain-racking dependencies.

Of course, this is just the algorithm, but in the repository, the same algorithm took nearly 3000 LOC with CUDA!
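For reference, the "online softmax" bookkeeping that both implementations perform (the m_i, l_i, o_i updates in the Mojo sketch above) can be written as a per-tile recurrence:

\[ m_{\text{new}} = \max(m_i, m_{ij}), \qquad l_{\text{new}} = e^{\,m_i - m_{\text{new}}}\, l_i + e^{\,m_{ij} - m_{\text{new}}}\, l_{ij} \]

\[ o_{\text{new}} = \frac{l_i\, e^{\,m_i - m_{\text{new}}}}{l_{\text{new}}}\, o_i + \frac{e^{\,m_{ij} - m_{\text{new}}}}{l_{\text{new}}}\, (P_{ij} V_j), \qquad P_{ij} = e^{\,S_{ij} - m_{ij}},\; l_{ij} = \mathrm{rowsum}(P_{ij}),\; m_{ij} = \mathrm{rowmax}(S_{ij}) \]

This running, rescaled softmax is what lets the kernel produce exactly the same result as the naive version without ever writing the full N x N score matrix out to slow HBM.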

So now you understand the difference:

Mojo is Game-Changing for AI:

Separation of Concerns:

  • The Mojo code describes the algorithm.
  • The CUDA code describes a manual hardware implementation.
  • This is a profound difference.
  • The Mojo programmer can focus on improving the algorithm,
  • while the MLIR compiler focuses on mapping it to silicon.

Research Velocity and Maintainability:

  • An AI researcher can easily understand and modify this Mojo code to test a new idea.
  • Modifying the CUDA version would be a massive, time-consuming engineering project requiring rare skills.
  • This dramatically accelerates the research and development cycle.

Hardware Freedom (The Most Important):

  • This Mojo code is not tied to NVIDIA.
  • It can be compiled to run on:
    • AMD GPUs
    • Google TPUs
    • Intel Gaudi
    • Custom AI chips.
    • Any architecture there is!
  • MLIR's dialects can be extended to support any new hardware,
  • making the Mojo code truly future-proof.

This breaks the NVIDIA monopoly on high-performance AI and will drive down costs.

Specialized Hardware and Future Domains

I said I wanted a futuristic image. The AI art generator delivered. Cool!

The limitations of the CUDA model become even more apparent when we look beyond traditional dense workloads to the future of computing.

MLIR/Mojo is designed for this future.

Blockchain, Mining, and ASICs

Proof-of-work blockchains like Bitcoin demand immense hashing power.

The goal is to find a "nonce" that, when hashed together with the block data, produces a result below a certain target.

It is a brute-force search, perfectly suited to parallel hardware.

At first, miners used CPUs, then moved to GPUs for their superior parallelism.

The CUDA code for a SHA-256 miner is low-level, focused on bitwise and integer operations.
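To give a flavor of that style (a minimal, hypothetical sketch, not production mining code; sha256d_leading_zero_bits() is a stand-in for a real device-side double SHA-256, which is exactly the kind of hand-unrolled bitwise code described above):

(CUDA C++)

// Hypothetical brute-force nonce search. Every thread tests one candidate
// nonce: an embarrassingly parallel workload, which is why GPUs beat CPUs here.
#include <cstdint>

__device__ int sha256d_leading_zero_bits(const uint8_t* header, unsigned int nonce) {
    // Placeholder only. A real miner computes SHA-256(SHA-256(header || nonce))
    // with unrolled rounds of 32-bit rotations, XORs, and additions.
    return (nonce % 1000003u == 0u) ? 32 : 0;  // fake difficulty check for illustration
}

__global__ void search_nonce(const uint8_t* header, unsigned int start_nonce,
                             int target_zero_bits, unsigned int* found_nonce) {
    unsigned int nonce = start_nonce + blockIdx.x * blockDim.x + threadIdx.x;
    if (sha256d_leading_zero_bits(header, nonce) >= target_zero_bits) {
        atomicMin(found_nonce, nonce);  // report the lowest winning nonce
    }
}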

However, for a stable, unchanging algorithm like SHA-256, the ultimate hardware is an ASIC.


An ASIC (Application-Specific Integrated Circuit) is a chip designed for one single purpose - to implement an algorithm in hardware.

An SHA-256 ASIC has the hashing logic literally baked into the silicon.

It is thousands of times more power-efficient than a GPU for that one task.

This is where the CUDA story ends, but the MLIR/Mojo story gets even more interesting.

The process of turning high-level code into a chip design is called High-Level Synthesis (HLS).

HLS tools convert a high-level description of an algorithm into a low-level hardware description language (like Verilog or VHDL) used to fabricate the chip.


MLIR, through projects like CIRCT (Circuit IR for Compilers and Tools), is designed to be the backbone of next-generation HLS.


  1. A developer could write a hashing algorithm in Mojo.
  2. For GPU mining, they would compile it using the GPU backend.
  3. To create an ASIC, they could compile the exact same Mojo code using an HLS backend.
  4. The MLIR infrastructure would lower the high-level Mojo logic to Verilog.

This unifies the entire stack, from high-level software to custom silicon design.

It allows for rapid prototyping and deployment of new algorithms onto the most efficient hardware possible, be it a GPU or a brand new ASIC.

CUDA has no answer to this.

It is a software-only solution for a single vendor's programmable hardware.

Neuromorphic Computing and Sparse Data

NVIDIA GPUs are masters of SIMT: Single Instruction, Multiple Thread.

This means they are incredibly efficient when thousands of threads are all executing the same instruction on different data (e.g., a matrix multiplication).

However, they are extremely inefficient on workloads with heavy branching or irregular data access.

The reason is "thread divergence."

If threads in a group (a "warp") take different branches of an if/else statement, the hardware must execute both paths serially, with threads in the inactive path simply turned off.

This kills performance for many important problems.
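A tiny, hypothetical kernel makes the issue visible: when neighbouring threads in the same 32-thread warp take different branches, the hardware runs the two paths one after the other, with half the warp idle each time.

(CUDA C++)

// Illustration of warp divergence. Even- and odd-numbered threads sit in the
// same warp but take different branches, so paths A and B execute serially
// and throughput within the warp is roughly halved.
__global__ void divergent_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if ((i & 1) == 0) {
        data[i] = data[i] * 2.0f;   // path A: half the warp active
    } else {
        data[i] = data[i] + 1.0f;   // path B: the other half active
    }
}

For dense matrix math this rarely matters, because every thread follows the same path; for spiking, graph, or sparse workloads, branches like this are the common case.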

Neuromorphic Computing:

This is a computing paradigm inspired by the brain.

Neuromorphic chips, such as Intel's Loihi, are not based on clocks and dense matrix math.

They are event-driven.

"Neurons" fire a "spike" only when their input potential crosses a threshold.

These spikes travel across "synapses" and may, in turn, cause other neurons to fire.

It is an extremely sparse, branch-heavy, asynchronous process.

Trying to simulate this on a GPU is horrifically inefficient due to constant thread divergence.


MLIR is the perfect solution for this.


  1. A "neuromorphic dialect" could be created in MLIR.
  2. This dialect would have first-class operations for Spike, Synapse, NeuronUpdate.
  3. A developer could write a neuromorphic algorithm in Mojo using these high-level concepts.
  4. The MLIR compiler, with a backend for a specific neuromorphic chip like Loihi, would translate these concepts into the chip's native, event-driven instructions.

This enables a portable, high-level programming model for an entirely unconventional form of computing.
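As a toy sketch of why this workload is so hostile to SIMT hardware (plain C++, a simplified, hypothetical integrate-and-fire style update, not Loihi's actual programming model): work happens only for the small, constantly changing set of neurons that receive a spike.

(C++)

// Toy event-driven spiking update. Work is proportional to the number of
// spike events, not the number of neurons: sparse, branchy, and irregular,
// which is the opposite of what SIMT hardware wants.
#include <algorithm>
#include <vector>

struct Synapse { int target; float weight; };

std::vector<int> step(std::vector<float>& potential,
                      const std::vector<std::vector<Synapse>>& fanout,
                      const std::vector<int>& spiking_now,
                      float threshold) {
    // 1. Deliver events only along the synapses of neurons that just fired.
    std::vector<int> touched;
    for (int src : spiking_now)
        for (const Synapse& s : fanout[src]) {
            potential[s.target] += s.weight;
            touched.push_back(s.target);
        }

    // 2. Only the touched neurons can newly cross the firing threshold.
    std::sort(touched.begin(), touched.end());
    touched.erase(std::unique(touched.begin(), touched.end()), touched.end());

    std::vector<int> spiking_next;
    for (int n : touched)
        if (potential[n] > threshold) {
            spiking_next.push_back(n);
            potential[n] = 0.0f;    // reset after firing
        }
    return spiking_next;
}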

The CUDA model is not relevant in this domain.

Sparse and Graph Data:

Many real-world problems involve sparse data: social networks, recommendation engines, and scientific simulations.

Representing these as dense matrices is wasteful.

Processing them on GPUs leads to irregular memory access patterns, which defeats the GPU's memory coalescing optimizations and cripples performance.

Again, MLIR provides the answer.

  1. A "graph dialect" or a "sparse tensor dialect" can represent these data structures natively.
  2. The compiler can apply specialized optimizations for sparsity.
  3. For example, it can reorder nodes to improve memory locality or use compressed storage formats, as in the sketch below.
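As a concrete example of a compressed storage format, here is the standard Compressed Sparse Row (CSR) layout in a plain C++ sketch (independent of any particular dialect or framework). Note the indirect load x[col_idx[j]]: this data-dependent access pattern is precisely what defeats the GPU's memory coalescing.

(C++)

// Compressed Sparse Row (CSR): store only the nonzeros plus index bookkeeping.
#include <vector>

struct CsrMatrix {
    std::vector<int>   row_ptr;  // size rows+1; row i spans [row_ptr[i], row_ptr[i+1])
    std::vector<int>   col_idx;  // column index of each stored nonzero
    std::vector<float> values;   // the nonzero values themselves
};

// Sparse matrix-vector product y = A * x, touching only stored entries.
std::vector<float> spmv(const CsrMatrix& A, const std::vector<float>& x) {
    std::vector<float> y(A.row_ptr.size() - 1, 0.0f);
    for (std::size_t i = 0; i + 1 < A.row_ptr.size(); ++i)
        for (int j = A.row_ptr[i]; j < A.row_ptr[i + 1]; ++j)
            y[i] += A.values[j] * x[A.col_idx[j]];   // irregular, indirect access
    return y;
}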

This allows a high-level algorithm written in Mojo to be efficiently compiled for sparse data on any hardware.

This is something that is extremely difficult today.

And it is next to impossible with CUDA.

Quantum Computing Simulation

Simulating a quantum computer on a classical computer is essential for developing and testing quantum algorithms.

The most common method is state-vector simulation.

The state of an N-qubit system is represented by a vector of 2^N complex numbers.

For just 50 qubits, this vector has 2^50 (over a quadrillion) elements, requiring petabytes of memory.

A quantum algorithm is a sequence of "gates."

Applying each gate is equivalent to multiplying the enormous state vector by a very large, very sparse matrix.

This is a workload that is both computationally intensive and memory-bandwidth bound.
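To see why, here is a minimal C++ sketch of what a state-vector simulator does for a single one-qubit gate (a simplified illustration, not cuQuantum's API): every gate is a full sweep over all 2^N amplitudes, so memory bandwidth, not arithmetic, sets the pace.

(C++)

// Apply a 2x2 unitary gate to qubit q of an N-qubit state vector.
// The gate mixes pairs of amplitudes whose indices differ only in bit q,
// so every gate application streams through all 2^N complex numbers.
#include <complex>
#include <vector>

using Amp = std::complex<double>;

void apply_single_qubit_gate(std::vector<Amp>& state, int q, const Amp gate[2][2]) {
    const std::size_t stride = std::size_t(1) << q;
    for (std::size_t base = 0; base < state.size(); base += 2 * stride)
        for (std::size_t i = base; i < base + stride; ++i) {
            const Amp a0 = state[i];           // amplitude with qubit q = 0
            const Amp a1 = state[i + stride];  // amplitude with qubit q = 1
            state[i]          = gate[0][0] * a0 + gate[0][1] * a1;
            state[i + stride] = gate[1][0] * a0 + gate[1][1] * a1;
        }
}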

NVIDIA has invested heavily here with cuQuantum, its high-performance CUDA-based library.

cuQuantum is extremely fast on NVIDIA GPUs, but it has the classic CUDA limitations:

  1. Vendor lock-in: your quantum simulations are tied to NVIDIA hardware.
  2. Low-level optimization: the compiler only sees matrix-vector multiplications.
  3. No domain advantage: built on LLVM (the semantic gap), it has no notion of quantum mechanics to optimize against.

The MLIR/Mojo Advantage for Quantum Simulation:

The MLIR approach allows a far higher level of intelligence in the compiler.

  1. A "quantum dialect" could be defined in MLIR.
  2. This dialect would not represent gates as matrices; it would represent them as quantum objects: Hadamard, CNOT, Toffoli.
  3. A developer would write their quantum circuit in Mojo using these high-level objects.
  4. The MLIR compiler could then perform quantum-specific optimizations before any matrix is ever generated.


For instance, the compiler would know that applying a Hadamard gate (H) twice in a row is an identity operation and can be eliminated entirely.

It would know that certain sequences of gates can be "fused" into a single, more efficient gate.


This is an entire class of optimization that is invisible to the CUDA compiler, which only sees generic matrices, thanks to LLVM.
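Concretely, the Hadamard gate is the unitary

\[ H = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \qquad H \cdot H = I \]

so a gate-aware compiler can simply delete the pair, while a matrix-level library would dutifully perform two full sweeps over the 2^N-element state vector.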

After performing these high-level algorithmic simplifications, the MLIR compiler would lower the simplified circuit into a sequence of sparse matrix operations optimized for the target hardware.

Because this is all built on MLIR, the same high-level quantum circuit written in Mojo could be compiled to run on an NVIDIA GPU, an AMD GPU, or a CPU cluster.

This delivers both higher performance (from smarter optimizations) and complete hardware freedom.

Nvidia is investing heavily in quantum simulation hardware and the software stack.

But its CUDA-Q platform is still built on LLVM.

MLIR-based Mojo offers not just more advanced optimization; it also offers simpler programming.

Final Verdict: Today vs. The Inevitable Future

Futurism is the in-thing!

The Verdict Today (2025):

  1. CUDA is king of the hill, and the hill is enormous.
  2. Its mature ecosystem, extensive libraries, and massive community are powerful assets.
  3. For teams already invested in NVIDIA hardware who need to ship today, CUDA remains the pragmatic choice.
  4. The inertia of a decade of dominance is a powerful force.
  5. Mojo is still young.
  6. Its ecosystem is growing at an incredible rate, but it cannot yet match the breadth of CUDA's battle-tested libraries.

The Verdict for the Long Run:

  1. The future is heterogeneous.
  2. This is not speculation; it is reality.
  3. The rise of custom AI silicon and the competition from AMD and Intel make vendor lock-in an unacceptable business and technical risk.
  4. The problems of the future (sparse data, neuromorphic AI, blockchain mining, and quantum computing) do not fit the rigid SIMT model of today's GPUs.
  5. MLIR is the only existing, industry-supported architecture designed to solve this problem.
  6. Its adoption by Google, Apple, Intel, AMD, and ARM is a clear signal of its central role in the future of compilation.
  7. Mojo is, so far, the only language built to harness that power.

Mojo:

  • It solves the two-language problem.
  • It combines usability with performance.
  • It provides a gateway to the entire MLIR ecosystem.

The transition from CUDA to an MLIR-based world will be gradual, but it is inevitable.

It is a fundamental shift from a closed, hardware-centric model to an open, software-defined future.

Mojo's Drawbacks

  1. Mojo is still under development.
  2. It does not yet have classes.
  3. Its third-party libraries are few, though they are growing at an incredible rate.
  4. It interoperates with Python everywhere, but it will need to keep evolving alongside Python.
  5. The full language is not yet open source, though experts say that will change soon.
  6. It does not support Windows (yet).
  7. It still needs to be ported to Android, iOS, and edge IoT systems.

But will it win in the long run?

I believe it will, and developers will be happier with Mojo than CUDA.

Conclusion

CUDA built the impressive palace of today's high-performance computing.

But it is a cage.

MLIR and Mojo are handing every developer the key to unlock it and build the future on any foundation they choose.

That foundation will be MLIR and Mojo.

The simplest reason - the budget.

And that is why, unless Nvidia opens up, and soon:

This will be the end of the dominance of Nvidia - unless they embrace MLIR as well!

I asked for professional wear - does the AI think engineers work in medical labs? Crazy!


References

Official Project Pages

  • MLIR (Multi-Level Intermediate Representation)
    • The official homepage for the MLIR project, hosted by LLVM. This is the canonical source for documentation, talks, and the project's overall mission statement.
    • https://mlir.llvm.org/
  • Mojo Programming Language
    • The official documentation for the Mojo programming language from Modular, the company that created it. This is the primary resource for learning the language.
    • https://docs.modular.com/mojo/
  • NVIDIA CUDA Toolkit
  • LLVM Compiler Infrastructure Project
    • The main homepage for the LLVM project, which provides an overview of the entire ecosystem, including Clang, LLDB, and other sub-projects. MLIR is a part of this larger project.
    • https://llvm.org/
  • Chris Lattner's Homepage
    • The personal homepage of Chris Lattner, the creator of LLVM, Clang, Swift, MLIR, and Mojo. It provides his work history and links to his talks and papers, offering direct insight into the creation of these technologies.
    • https://nondot.org/sabre/

AI and Attention Mechanism (FlashAttention)

  • FlashAttention Original Paper (arXiv)
    • The original scientific paper, "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness," which introduced the algorithm. This is the primary source for understanding the technical details and performance benefits.
    • https://arxiv.org/abs/2205.14135
  • FlashAttention-2 Paper (arXiv)
    • The follow-up paper describing FlashAttention-2, which details further optimizations for parallelism and work partitioning to achieve even greater speedups on modern GPUs.
    • https://arxiv.org/abs/2307.08691
  • FlashAttention GitHub Repository

Quantum Computing Simulation

Specialized Hardware (Neuromorphic & ASICs)

  • Intel Neuromorphic Computing Overview
  • CIRCT (Circuit IR Compilers and Tools) Project
    • The official homepage for the CIRCT project, an LLVM/MLIR incubator looking to apply compiler technology to hardware design, including High-Level Synthesis (HLS) for FPGAs and ASICs.
    • https://circt.llvm.org/
  • CIRCT GitHub Repository
    • The official GitHub repository for the CIRCT project, containing the source code, dialects, and tools for hardware compiler design.
    • https://github.com/llvm/circt

Google AI Studio was used for the outline and research of this article; you can find it here:

https://aistudio.google.com/

All images were generated free of charge by the author with NightCafe Studio, available at the link below:

https://creator.nightcafe.studio/

