This New Language Could Kill NVIDIA's GPU Monopoly

All images AI-generated by the author for free with NightCafe Studio; see the footer for links.

The era of high-performance computing has been defined by a single name: CUDA.

NVIDIA's platform unlocked the power of GPUs and became the de facto standard.

For more than a decade, programming a GPU meant programming in CUDA.

This dominance, however, has created a cage, locking progress to a single vendor.

But today, in mid-2025, things are changing.

The computing world is now undergoing a radical transformation towards heterogeneity.

We are seeing the rise of specialized hardware:

Intel Gaudi Series:

Intel's Gaudi processors are designed specifically for deep learning training and inference, offering a competitive alternative to NVIDIA's GPUs.

AMD Instinct MI Series:

AMD's MI series of GPUs is designed for high-performance computing and AI workloads, providing an alternative to NVIDIA's data center GPUs.

Groq Tensor Streaming Processor (TSP):

Groq's TSP architecture is designed for low-latency inference and high throughput, particularly for large language models.

Google TPUs (Tensor Processing Units):

Google's TPUs are custom-designed chips optimized for machine learning workloads, particularly in Google's cloud infrastructure.

AWS Trainium:

AWS Trainium is a chip designed for machine learning training, offering high performance and cost-effectiveness.

More startups building custom silicon chips appear every day.

This new, diverse landscape demands a new programming model.

Enter Multi-Level Intermediate Representation (MLIR) and the Mojo programming language.

These are not just more competitors; they represent a fundamental paradigm shift.

This is a revolution in how we design, optimize, and deploy software for any hardware.

This article will deeply explore the architectural chasm between CUDA and MLIR.

We will use complete, working code examples to provide concrete, practical comparisons.
We will examine why MLIR is a breakthrough over its venerable predecessor, LLVM.
We will argue that Mojo is the best long-term solution.
We will analyze why this new stack is a game-changer for cost and speed.

This impact extends to key growth areas such as Generative AI, Quantum Computing, and even Blockchain.

We also look to the future, covering mining ASICs, Neuromorphic Computing, and specialized hardware for the sparse data workloads that GPUs handle poorly.

This is the end of one era and the dawn of a new one.

To understand the magnitude of this shift, we must first understand the four key players.

1. CUDA: The Powerful, Proprietary Incumbent

CUDA stands for Compute Unified Device Architecture.

It is NVIDIA's parallel computing platform and programming model.

It lets developers write C++-like code, called kernels, that runs on NVIDIA GPUs.

CUDA's Strengths:

Its ecosystem of libraries is mature and unmatched:

Mathematical Libraries:
- cuBLAS: For basic linear algebra subprograms (BLAS).
- cuRAND: For random number generation.
- cuFFT: For Fast Fourier Transforms.
- cuSPARSE: For sparse matrix operations.
- cuTENSOR: For tensor operations.
- cuSOLVER: For dense and sparse direct solvers.
Parallel Algorithm Libraries:
- nvGRAPH: For graph algorithms.
- Thrust: For parallel algorithms and data structures.
Communication Libraries:
- NVSHMEM: For partitioned global address space (PGAS) programming.
- NCCL: For multi-GPU and multi-node collective communication.
Deep Learning Libraries:
- cuDNN: For deep neural network computations.
- TensorRT: For optimized deep learning inference.
- Riva: For conversational AI.
- DALI: For data loading and augmentation for deep learning.

It provides direct, low-level control over the hardware, enabling peak performance for experts.

Its long history has built a massive community with vast documentation and support.

CUDA's Fatal Flaw: The Cage

Vendor Lock-In: CUDA code runs only on NVIDIA GPUs.

This ties developers and entire industries to a single, expensive hardware vendor.

It restricts the freedom to choose the best hardware for the job.

The Two-Language Problem: A Major Bottleneck in AI and Scientific Computing

Researchers prototype in a high-level language like Python for its simplicity and speed of iteration.

But for production, performance-critical code must be completely rewritten in low-level C++/CUDA.

This creates a painful and expensive disconnect, slowing the path from research to deployment.

Programming Complexity:

CUDA is powerful but notoriously complex and verbose.

The developer must manually manage data transfers between the CPU (host) and the GPU (device).

The programmer must also act as a hardware scheduler, managing thread blocks, grids, and synchronization.

This complexity creates a steep learning curve and is a frequent source of subtle bugs.

2. LLVM: The Foundation and Its "Semantic Gap"

The LLVM project is a collection of modular and reusable compiler technologies.

At its core is a language called the LLVM Intermediate Representation (IR).

LLVM has become the standard for modern compiler backends, especially for CPUs.

A compiler front-end such as Clang for C++ translates source code into LLVM IR.

The LLVM backend then optimizes this IR and converts it into machine code for a specific CPU.

This modularity was revolutionary for its time.

However, LLVM was designed for a CPU-centric world.

Its IR is too low-level for the new world of heterogeneous hardware.

It loses crucial high-level information from the source code, a problem known as the "semantic gap."

For example, when compiling a TensorFlow model, the knowledge that an operation is a Convolution is lost.

LLVM IR only sees a generic collection of loops and arithmetic instructions.

This prevents the compiler from performing powerful, domain-specific optimizations.

It no longer understands the programmer's high-level intent.

This is the essence of the “semantic gap problem.”

And this is the problem that MLIR sets out to solve.

3. MLIR: The Universal Translator for Hardware

MLIR was born at Google out of the need to compile TensorFlow for CPUs, GPUs, and Google's own TPUs.

They realized that LLVM's single, low-level IR was not enough.

MLIR's breakthrough is a unified infrastructure for defining and composing multiple different IRs.

These composable IRs are called dialects.

MLIR is like a universal translator, fluent in everything from high-level concepts to low-level machine details.

High-level dialects can represent domain-specific concepts directly.

For example, a "TensorFlow dialect" has an operation for tf.conv2d.

A "Linear Algebra dialect" has an operation for linalg.matmul.

This preserves the critical semantic information that LLVM discards.

This enables a powerful compiler optimization strategy called progressive lowering:

The compiler starts with a high-level dialect representation.
It performs high-level, domain-specific optimizations on this dialect.
It then progressively "lowers" the code through a series of intermediate dialects.
Each intermediate dialect performs its own specific optimizations.
Finally, it reaches a low-level dialect, such as the LLVM IR dialect, for final machine code generation.

This process preserves high-level context for as long as possible.

This enables vastly superior optimizations for any hardware target.
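To make progressive lowering concrete, here is a toy sketch in Python. It is not MLIR and does not use any MLIR API: the `Op` class, the dialect names, and both passes are hypothetical stand-ins chosen for illustration. It shows one high-level rewrite that is only possible while the operations are still recognizable, followed by a lowering step after which that knowledge is gone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """One operation in a toy IR. Dialect and op names are illustrative only."""
    dialect: str
    name: str
    args: tuple = ()


def cancel_double_transpose(ops):
    """High-level pass: two back-to-back transposes cancel out.

    This rewrite needs to know that an op *is* a transpose, which is
    exactly the information a low-level IR no longer carries.
    """
    out = []
    for op in ops:
        if out and op.name == "transpose" and out[-1].name == "transpose":
            out.pop()  # transpose(transpose(x)) == x, so drop both
        else:
            out.append(op)
    return out


def lower_matmul_to_loops(ops):
    """Lowering pass: replace each high-level matmul with a generic loop nest."""
    out = []
    for op in ops:
        if (op.dialect, op.name) == ("linalg", "matmul"):
            out += [
                Op("loops", "for", ("i",)),
                Op("loops", "for", ("j",)),
                Op("loops", "for", ("k",)),
                Op("arith", "multiply_add"),
            ]
        else:
            out.append(op)
    return out


program = [
    Op("linalg", "transpose"),
    Op("linalg", "transpose"),
    Op("linalg", "matmul"),
]

optimized = cancel_double_transpose(program)  # optimize while still high-level
lowered = lower_matmul_to_loops(optimized)    # then lower one step

print([f"{op.dialect}.{op.name}" for op in lowered])
```

Running the passes in the opposite order would still work here, but in a real compiler the high-level rewrite has to happen before lowering, because after lowering nothing in the IR says "transpose" or "matmul" any more.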

MLIR is the missing link between high-level languages and diverse silicon.

4. Mojo: The User-Friendly Face of MLIR's Power

If MLIR is the powerful, complex engine, Mojo is the sleek, intuitive user interface.

Mojo was created by Chris Lattner, the original architect of LLVM and the Swift language.

It is designed from first principles to be the ideal language for the MLIR era.

In this regard, it is the most technologically advanced language today.

Even Rust is based on LLVM and has all of LLVM’s shortcomings.

Mojo is the only major programming language today based on MLIR.

Mojo's Key Features:

Superset of Python:

Mojo aims to be fully compatible with the existing Python ecosystem.
This is a killer feature!
It allows developers to import and use any Python library, such as NumPy, Pandas, or Matplotlib.
It sidesteps the "cold start" problem that new languages face by tapping into Python's vast ecosystem.

True Systems Programming Features:

Unlike Python, Mojo is a compiled language with strong static typing.
This eliminates entire classes of runtime errors and enables C++-level performance optimizations.
It introduces modern memory-management concepts, ownership and borrowing (from Rust), for memory safety without a garbage collector.

First-Class MLIR Integration:

Mojo exposes the full power of MLIR directly to the developer.
Programmers can write high-level, Pythonic code for their application.
When maximum performance is needed, they can drop down to specific MLIR dialects and write low-level kernels.
All of this can be done in the same language, within the same file.

Mojo elegantly solves the "two-language problem."

Full Code Examples and Analysis

Theory is one thing, practice is another.

The following complete code examples show the profound difference between the two paradigms.

Example 1: Matrix Multiplication

It is the "Hello, World!" of high-performance computing, and it clearly reveals each platform's core philosophy.

The Full CUDA Implementation

This is a complete, compilable CUDA program for matrix multiplication.

(CUDA C++)

// Filename: matmul.cu
// To compile: nvcc matmul.cu -o matmul_cuda

#include <cstdlib>   // rand, RAND_MAX, exit, EXIT_FAILURE
#include <iostream>
#include <vector>
#include <cuda_runtime.h>

// Helper to check for CUDA errors
#define CUDA_CHECK(err) { \
    cudaError_t err_code = err; \
    if (err_code != cudaSuccess) { \
        std::cerr << "CUDA Error: " << cudaGetErrorString(err_code) << " at line " << __LINE__ << std::endl; \
        exit(EXIT_FAILURE); \
    } \
}

// CUDA Kernel for Matrix Multiplication (Device Code)
__global__ void matrixMulKernel(float* C, const float* A, const float* B, int N) {
    // Calculate the global row and column index of the element
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    // Boundary check to avoid accessing out-of-bounds memory
    if (row < N && col < N) {
        float p_value = 0.0f;
        // Each thread computes one element of the result matrix C
        for (int k = 0; k < N; ++k) {
            p_value += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = p_value;
    }
}

// Main function (Host Code)
int main() {
    const int N = 256;
    const int size = N * N * sizeof(float);

    // Step 1. Allocate host memory
    std::vector<float> h_A(N * N);
    std::vector<float> h_B(N * N);
    std::vector<float> h_C(N * N);

    // Initialize host matrices
    for (int i = 0; i < N * N; ++i) {
        h_A[i] = static_cast<float>(rand()) / RAND_MAX;
        h_B[i] = static_cast<float>(rand()) / RAND_MAX;
    }

    // Step 2. Allocate device memory
    float *d_A, *d_B, *d_C;
    CUDA_CHECK(cudaMalloc((void**)&d_A, size));
    CUDA_CHECK(cudaMalloc((void**)&d_B, size));
    CUDA_CHECK(cudaMalloc((void**)&d_C, size));

    // Step 3. Copy matrices from host to device
    std::cout << "Copying data from host to device..." << std::endl;
    CUDA_CHECK(cudaMemcpy(d_A, h_A.data(), size, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_B, h_B.data(), size, cudaMemcpyHostToDevice));

    // Step 4. Define kernel launch configuration
    // Use 16x16 threads per block, a common choice
    dim3 threadsPerBlock(16, 16);
    // Calculate the number of blocks needed in each dimension
    dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x, (N + threadsPerBlock.y - 1) / threadsPerBlock.y);

    // Step 5. Launch the kernel on the device
    std::cout << "Launching kernel..." << std::endl;
    matrixMulKernel<<<numBlocks, threadsPerBlock>>>(d_C, d_A, d_B, N);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize()); // Wait for the kernel to finish

    // Step 6. Copy the result matrix back from device to host
    std::cout << "Copying result from device to host..." << std::endl;
    CUDA_CHECK(cudaMemcpy(h_C.data(), d_C, size, cudaMemcpyDeviceToHost));

    // Step 7. Free device memory
    CUDA_CHECK(cudaFree(d_A));
    CUDA_CHECK(cudaFree(d_B));
    CUDA_CHECK(cudaFree(d_C));

    std::cout << "CUDA Matrix Multiplication finished successfully." << std::endl;
    // (Optional: Add verification step here)

    return 0;
}

Analysis of the CUDA Code:

The code is dominated by boilerplate and low-level management.

Steps 1, 2, 3, 6, and 7 exist solely to manage memory across the CPU/GPU boundary.

This is tedious, error-prone, and obscures the core algorithm.

The __global__ keyword, blockIdx, threadIdx, and the <<<...>>> syntax are CUDA-specific hardware abstractions.

This code is fundamentally and permanently tied to NVIDIA's hardware architecture.
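To see what the launch configuration and the index arithmetic in the kernel actually do, the following Python sketch replays them on the CPU. It mirrors the round-up division from step 4 and the `row`/`col` computation from the kernel; the function names are made up for this illustration, and no GPU is involved.

```python
def blocks_per_dimension(n, threads_per_block=16):
    """Round-up division, as in the host code: (N + t - 1) / t."""
    return (n + threads_per_block - 1) // threads_per_block


def count_threads(n, threads_per_block=16):
    """Enumerate every (block, thread) pair and apply the kernel's boundary check.

    Returns (launched, active): how many threads the grid contains and how
    many of them pass `row < N && col < N` and do real work.
    """
    blocks = blocks_per_dimension(n, threads_per_block)
    launched = 0
    active = 0
    for block_y in range(blocks):
        for block_x in range(blocks):
            for thread_y in range(threads_per_block):
                for thread_x in range(threads_per_block):
                    # Same formula as blockIdx * blockDim + threadIdx
                    row = block_y * threads_per_block + thread_y
                    col = block_x * threads_per_block + thread_x
                    launched += 1
                    if row < n and col < n:
                        active += 1
    return launched, active


print(count_threads(256))  # N divides evenly into 16x16 blocks
print(count_threads(20))   # N does not divide evenly; some threads idle
```

When N is not a multiple of the block size, the grid is rounded up and the surplus threads are filtered out by the boundary check, which is why that check cannot be omitted from the kernel.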

The actual algorithm, three nested loops (two of them mapped onto the thread grid), is a tiny fraction of the total code.

The programmer's mental overhead is spent on hardware management, not on the problem itself.

The Full Mojo Implementation

This Mojo version achieves the same result with striking simplicity and power.

(Mojo)

# Filename: matmul.mojo
# To run: mojo matmul.mojo

from memory import DType, Tensor
from random import rand
from time import now

fn matmul_naive(C: Tensor[DType.float32], A: Tensor[DType.float32], B: Tensor[DType.float32]):
    """A naive, high-level implementation of matrix multiplication."""
    let N = A.dim(0)
    let M = A.dim(1)
    let P = B.dim(1)

    for i in range(N):
        for j in range(P):
            var sum: Float32 = 0.0
            for k in range(M):
                sum += A.load(i, k) * B.load(k, j)
            C.store(i, j, sum)

fn main():
    let N = 256
    
    # 1. Allocate and initialize tensors.
    # Mojo's Tensor handles memory allocation automatically.
    # The compiler will place it in the most appropriate memory space.
    var A = Tensor[DType.float32](N, N)
    var B = Tensor[DType.float32](N, N)
    var C = Tensor[DType.float32](N, N)

    for i in range(N):
        for j in range(N):
            A.store(i, j, rand[DType.float32]())
            B.store(i, j, rand[DType.float32]())

    print("Starting Mojo Matrix Multiplication...")
    
    let start_time = now()
    
    # 2. Call the function.
    # The MLIR-based compiler optimizes this high-level code.
    # It can automatically tile, vectorize, and parallelize this code
    # for the target hardware (CPU, GPU, etc.).
    matmul_naive(C, A, B)

    let end_time = now()
    let duration_ms = (end_time - start_time) / 1_000_000.0

    print("Mojo Matrix Multiplication finished successfully.")
    print("Execution time:", duration_ms, "ms")
    # (Optional: Print a corner of the result matrix to verify)
    print("Result C[0,0]:", C.load(0,0))

And that is all!

The Mojo Approach is Far Superior

Programmability and Focus:

The Mojo code is clean and expresses the algorithm directly.
The programmer focuses on the what (the math), not the how (memory transfers).
There is no manual cudaMalloc, cudaMemcpy, or cudaFree.
That entire class of errors is gone.

Abstraction with Performance:

The simple nested loops are not what actually gets executed.
The MLIR-based compiler performs sophisticated transformations.
These turn the simple code into a highly optimized kernel.
It can apply tiling, vectorization, and parallelization automatically.
The programmer can add hints such as @vectorize or @parallelize to guide the compiler, gaining control without complexity.

Portability (The Ultimate Advantage):

This is the crucial point.
The same matmul.mojo file can be re-compiled to run on an NVIDIA GPU, an AMD GPU, an Intel CPU with AVX512, or a Google TPU.
The logic stays the same; only the compiler backend changes.
The CUDA code would require a complete, costly rewrite for every new hardware target.
Mojo offers "performance portability," breaking vendor lock-in and future-proofing code.

MLIR-based Mojo is undeniably set to replace LLVM-based CUDA, and developers will enjoy the change!

Example 2: Gen AI and the Transformer Attention Mechanism

The "attention" mechanism is the heart of models like GPT-4 and is a major computational bottleneck.

Optimizing it is critical.

The CUDA Implementation (Conceptual FlashAttention)

FlashAttention is a landmark algorithm that manually and expertly orchestrates data movement between the GPU's slow main memory (HBM) and its fast on-chip memory (SRAM) to reduce bottlenecks.

The real code is thousands of lines long and incredibly complex.

Links to components of the full algorithm implementation are given below:

https://github.com/Dao-AILab/flash-attention/blob/main/csrc/flash_attn/src/flash_fwd_kernel.h

https://github.com/Dao-AILab/flash-attention/blob/main/csrc/flash_attn/flash_api.cpp

Together, they are almost 3000 lines long.

The repository contains thousands of files.

The learning curve and the onboarding curve are both steep.

A simplified (AI-generated) version is given below:

(CUDA C++)

// This is a simplified conceptual view of a FlashAttention-style CUDA kernel.
// The actual implementation is far more complex.

template<typename Kernel_traits>
__global__ void flash_attention_fwd_kernel(Flash_fwd_params params) {

    // 1. Incredibly complex setup code
    // Calculates dozens of pointers and indices for HBM and shared memory (SRAM)
    const int block_row_idx = blockIdx.x;
    const int head_idx = blockIdx.y;
    // ... many more calculations ...

    // 2. Explicitly allocate shared memory tiles for Q, K, V
    // The developer must manage this limited resource manually.
    extern __shared__ char smem[];
    float* sQ = (float*)smem;
    float* sK = sQ + kTileM * kTileK;
    float* sV = sK + kTileN * kTileK;

    // 3. Main loop over the sequence, manually loading blocks
    for (int k_block_idx = 0; k_block_idx < params.k_num_blocks; ++k_block_idx) {

        // Manually orchestrate asynchronous loads from HBM into SRAM
        // to hide memory latency. This is extremely difficult to get right.
        load_qkv_block_from_hbm(params, ...);
        __syncthreads(); // Hard synchronization barrier

        // Manually perform matrix multiplication in fast SRAM
        compute_sram_matmul(sQ, sK, ...);

        // Recompute softmax "online" to avoid writing the huge intermediate
        // attention score matrix back to slow HBM. This is the core trick.
        compute_online_softmax(...);
        __syncthreads();

        // Update the output block
        update_output_block(sV, ...);
    }

    // 4. Manually write the final output block back to HBM
    store_output_to_hbm(params, ...);
}

Analysis of the CUDA/FlashAttention Approach:

It is a masterpiece of manual, hardware-specific engineering.
It achieves incredible performance by treating the GPU as a manually programmed machine.
This makes the code virtually unreadable, unmaintainable, and unportable.
Only a handful of world-class experts can write or modify such code.
It represents the pinnacle of performance within a closed ecosystem, but also the pinnacle of complexity and rigidity.

The Conceptual Mojo Implementation

The Mojo version expresses the same algorithmic ideas (tiling, online softmax) at a high level, delegating the hardware orchestration to the MLIR compiler.

(Mojo)

from memory import DType, Tensor
from algorithm import parallelize

struct AttentionParams:
    var is_causal: Bool
    # ... other model parameters

# This function is a high-level, portable description of the FlashAttention algorithm.
fn flash_attention[T: DType](Q: Tensor[T], K: Tensor[T], V: Tensor[T], params: AttentionParams) -> Tensor[T]:
    # Define problem dimensions from input tensors
    let num_batches = Q.dim(0)
    let num_heads = Q.dim(2)
    let seqlen_q = Q.dim(1)
    let seqlen_k = K.dim(1)
    
    # Define tunable tiling parameters. The compiler can use these as hints.
    alias BLOCK_M: Int = 128
    alias BLOCK_N: Int = 64

    # The output tensor
    var O = Tensor[T](Q.dims)

    # The @parallelize decorator tells the compiler to map this function
    # over the available hardware parallelism (e.g., CUDA thread blocks or CPU cores).
    @parallelize(num_batches * num_heads)
    fn compute_head(batch_idx: Int, head_idx: Int):
        
        # Define per-worker accumulators. The compiler will map these
        # to the fastest available memory (e.g., registers or SRAM).
        var o_i = Tensor[T](seqlen_q, V.dim(3))
        var l_i = Tensor[T](seqlen_q) # Stores the denominator of the softmax
        var m_i = Tensor[T](seqlen_q) # Stores the max of each row for stable softmax
        o_i.zero()
        l_i.fill(0.0)
        m_i.fill(-50000.0) # Large negative value standing in for negative infinity

        # Loop over blocks of the Key/Value sequence
        for j in range(0, seqlen_k, BLOCK_N):
            # 1. Load tiles of K and V.
            # The compiler is responsible for generating the optimal code
            # to move this data from main memory to fast memory.
            let k_j = K.load_tile[BLOCK_N](batch_idx, j, head_idx)
            let v_j = V.load_tile[BLOCK_N](batch_idx, j, head_idx)
            
            # Loop over blocks of the Query sequence
            for i in range(0, seqlen_q, BLOCK_M):
                # 2. Load tile of Q.
                let q_i = Q.load_tile[BLOCK_M](batch_idx, i, head_idx)
                
                # 3. Compute attention scores for the tile. This is a simple matmul.
                let s_ij = q_i @ k_j.transpose()
                
                # Causal masking for decoder models like GPT
                if params.is_causal:
                    # Algorithmic logic, no hardware specifics
                    apply_causal_mask(s_ij, i, j)

                # 4. Perform the "online softmax" update.
                # This is pure mathematical logic, not memory management.
                let m_ij = row_max(s_ij)
                let p_ij = exp(s_ij - m_ij)
                let l_ij = row_sum(p_ij)
                
                let m_new = max(m_i, m_ij)
                let l_new = exp(m_i - m_new) * l_i + exp(m_ij - m_new) * l_ij

                # Update output tile
                o_i = (l_i / l_new * exp(m_i - m_new)) * o_i + (exp(m_ij - m_new) / l_new) * (p_ij @ v_j)

                # Update softmax stats
                l_i = l_new
                m_i = m_new

        # 5. Store the final output. The compiler manages the write-back.
        O.store_tile(batch_idx, head_idx, o_i)
    
    compute_head()
    return O

A single file.

Fewer than 100 lines of code.

No brain-racking dependencies.

Of course, this is just the algorithm, but in the repository, the same algorithm took nearly 3000 LOC with CUDA!
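The "online softmax" update in step 4 of the listing is the part that is easiest to get wrong, so it is worth checking numerically. The sketch below is plain Python on scalars rather than tiles, with function names invented for this illustration. It applies the same rescaling formula for the running maximum, running denominator, and running output, and compares the result against an ordinary softmax computed over all scores at once.

```python
import math


def softmax_weighted_sum(scores, values):
    """Reference: softmax over all scores at once, then a weighted sum of values."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e * v for e, v in zip(exps, values)) / total


def online_softmax_weighted_sum(scores, values, block=2):
    """Same quantity, computed one block at a time with running statistics."""
    m_i = float("-inf")  # running maximum of the scores seen so far
    l_i = 0.0            # running softmax denominator
    o_i = 0.0            # running output, already normalized by l_i
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]

        m_blk = max(s_blk)
        p_blk = [math.exp(s - m_blk) for s in s_blk]
        l_blk = sum(p_blk)
        pv_blk = sum(p * v for p, v in zip(p_blk, v_blk))

        # Rescale the old statistics and the new block to a common maximum.
        m_new = max(m_i, m_blk)
        l_new = math.exp(m_i - m_new) * l_i + math.exp(m_blk - m_new) * l_blk
        o_i = (l_i * math.exp(m_i - m_new) * o_i
               + math.exp(m_blk - m_new) * pv_blk) / l_new

        l_i, m_i = l_new, m_new
    return o_i


scores = [0.5, 2.0, -1.0, 3.0, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
reference = softmax_weighted_sum(scores, values)
streamed = online_softmax_weighted_sum(scores, values, block=2)
print(abs(reference - streamed) < 1e-9)
```

Because only `m_i`, `l_i`, and `o_i` are carried between blocks, the full row of attention scores never has to be stored, which is the property FlashAttention exploits to avoid writing the intermediate matrix back to HBM.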

So now you understand the difference:

Mojo is Game-Changing for AI:

Separation of Concerns:

The Mojo code describes the algorithm.
The CUDA code describes the manual hardware orchestration.
This is a profound difference.
The Mojo programmer can focus on improving the algorithm,
while the MLIR compiler focuses on mapping it to silicon.

Research Velocity and Maintainability:

An AI researcher can easily understand this Mojo code and modify it to test a new idea.
Modifying the CUDA code would be a large, time-consuming engineering project requiring rare expertise.
This dramatically accelerates the research and development cycle.

Hardware Freedom (the most important point):

This Mojo code is not tied to NVIDIA.
It can be compiled to run on:
- AMD GPUs
- Google TPUs
- Intel Gaudi
- Custom AI chips.
- Any architecture there is!
MLIR's dialects can be extended to support any new hardware, making the Mojo code truly future-proof.

This breaks the NVIDIA monopoly on high-performance AI and will drive down costs.

Specialized Hardware and Future Domains

The limitations of the CUDA model become even more apparent when we look beyond traditional dense workloads to the future of computing.

MLIR/Mojo is designed for this future.

Blockchain, Mining, and ASICs

Proof-of-Work blockchains like Bitcoin require enormous hashing power.

The goal is to find a "nonce" that, when hashed together with other data, produces a result below a certain target.

This is a brute-force search, perfect for parallel hardware.
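The nonce search is simple enough to sketch in a few lines of Python using the standard `hashlib` module. The header bytes, the 4-byte nonce encoding, and the target below are made-up values for illustration and do not follow Bitcoin's actual block format; only the double SHA-256 step mirrors what Bitcoin does.

```python
import hashlib


def hash_with_nonce(header: bytes, nonce: int) -> int:
    """Double SHA-256 of header + nonce, interpreted as a 256-bit integer."""
    data = header + nonce.to_bytes(4, "little")
    digest = hashlib.sha256(hashlib.sha256(data).digest()).digest()
    return int.from_bytes(digest, "big")


def mine(header: bytes, target: int, max_nonce: int = 1_000_000):
    """Try nonces in order until the hash falls below the target."""
    for nonce in range(max_nonce):
        if hash_with_nonce(header, nonce) < target:
            return nonce
    return None  # no qualifying nonce in the searched range


# An easy target: the top 12 bits must be zero, so about 1 in 4096 hashes qualifies.
target = 1 << (256 - 12)
header = b"example block header"  # hypothetical stand-in for real block data

nonce = mine(header, target)
print("found nonce:", nonce)
```

Every iteration is independent of the others, which is why the search maps so well onto thousands of GPU threads or onto fixed-function hardware: each worker simply takes a different slice of the nonce range.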

Initially, miners used CPUs, then GPUs for their superior parallelism.

CUDA code for a SHA-256 miner is low-level, focused on bitwise and integer operations.

However, for a stable, unchanging algorithm like SHA-256, the ultimate hardware is an ASIC.

An ASIC (Application-Specific Integrated Circuit) is a chip designed for a single purpose: implementing one algorithm directly in hardware.

An SHA-256 ASIC has the hashing logic literally baked into the silicon.

It is thousands of times more power-efficient than a GPU for that one task.

This is where the CUDA story ends, but the MLIR/Mojo story gets even more interesting.

One way to design such chips is High-Level Synthesis (HLS).

HLS tools convert a high-level description of an algorithm into a low-level hardware description language (such as Verilog or VHDL) that is used to manufacture the chip.

MLIR, through projects like CIRCT (Circuit IR for Compilers and Tools), is designed to be the backbone of next-generation HLS.

A developer could write a hashing algorithm in Mojo.
For GPU mining, they would compile it using the GPU backend.
To create an ASIC, they could compile the exact same Mojo code using an HLS backend.
The MLIR infrastructure would lower the high-level Mojo logic to Verilog.

This unifies the entire stack, from high-level software to custom silicon design.

It enables rapid prototyping of new algorithms and their deployment onto the most efficient hardware.

CUDA has no answer to this.

It is a software-only solution for a single vendor's programmable hardware.

Neuromorphic Computing and Sparse Data

NVIDIA GPUs are masters of SIMT: Single Instruction, Multiple Thread.

This makes them extremely efficient when thousands of threads all execute the same instruction on different data (for example, a matrix multiplication).

However, they are very inefficient on workloads with heavy branching or irregular data access.

The reason is "thread divergence."

If threads in a group (a "warp") take different branches of an if/else statement, the hardware must execute both paths serially, with threads in the inactive path simply turned off.

This kills performance for many important problems.
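A rough cost model makes the effect of divergence visible. The Python sketch below is a deliberately simplified model of a single warp, not a description of any particular GPU: it assumes a warp of 32 threads, assumes each branch has a fixed cost in cycles, and charges the warp for every branch that at least one of its threads takes.

```python
WARP_SIZE = 32


def warp_cost(takes_then_branch, then_cost, else_cost):
    """Cycles one warp spends on an if/else under a simple SIMT model.

    takes_then_branch: one boolean per thread in the warp.
    A branch costs its full price if any thread takes it; threads on the
    other path are masked off and wait.
    """
    cost = 0
    if any(takes_then_branch):
        cost += then_cost
    if not all(takes_then_branch):
        cost += else_cost
    return cost


uniform = [True] * WARP_SIZE                                # all threads agree
divergent = [lane % 2 == 0 for lane in range(WARP_SIZE)]    # threads alternate

print("uniform warp:  ", warp_cost(uniform, then_cost=10, else_cost=10))
print("divergent warp:", warp_cost(divergent, then_cost=10, else_cost=10))
```

Under this model even a single thread taking the other branch forces the whole warp to pay for both paths, so data-dependent branching can halve throughput or worse when branches are nested.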

Neuromorphic Computing:

This is a brain-inspired computing paradigm.

Neuromorphic chips, like Intel's Loihi, are not based on clocks and dense matrix math.

They are event-driven.

"Neurons" fire a "spike" only when their input potential crosses a threshold.

These spikes travel to other "synapses," which may then cause other neurons to fire.

This is an extremely sparse, efficient, and asynchronous process.
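A minimal event-driven simulation shows why this style of computation looks nothing like dense matrix math. The Python sketch below is a toy model with invented neuron names, weights, and threshold; it is not how Loihi is programmed. Work happens only when a spike arrives, and neurons that receive no spikes cost nothing.

```python
from collections import deque


def run_spikes(synapses, threshold, initial_spikes, max_events=1000):
    """Propagate spikes through a network and return the firing order.

    synapses: dict mapping a source neuron to a list of (target, weight).
    A target fires when its accumulated potential reaches the threshold,
    after which its potential resets to zero.
    """
    potential = {}
    fired = []
    queue = deque(initial_spikes)
    while queue and len(fired) < max_events:
        source = queue.popleft()
        fired.append(source)
        for target, weight in synapses.get(source, []):
            potential[target] = potential.get(target, 0.0) + weight
            if potential[target] >= threshold:
                potential[target] = 0.0
                queue.append(target)
    return fired


# Hypothetical three-neuron network: "a" excites "b" weakly and "c" strongly.
synapses = {
    "a": [("b", 0.6), ("c", 1.0)],
    "c": [("b", 0.5)],
    "b": [],
}

order = run_spikes(synapses, threshold=1.0, initial_spikes=["a"])
print(order)
```

The control flow here depends entirely on the data: which neuron is processed next is decided by which spikes happened to arrive, which is the opposite of thousands of threads marching through the same instructions.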

Trying to simulate this on a GPU is horrifically inefficient due to constant thread divergence.

MLIR is the ideal solution for this.

A "neuromorphic dialect" can be created within MLIR.
This dialect would have first-class operations for Spike, Synapse, and NeuronUpdate.
A developer could write a neuromorphic algorithm in Mojo using these high-level concepts.
The MLIR compiler, with a backend for a specific neuromorphic chip like Loihi, would translate these concepts into the chip's native, event-driven instructions.

This allows a portable, high-level programming model for a completely non-traditional form of computing.

The CUDA model is not relevant in this domain.

Sparse and Graph Data:

Many real-world problems involve sparse data: social networks, recommendation engines, and scientific simulations.

Representing these as dense matrices is wasteful.

Processing them on GPUs leads to irregular memory access patterns, which defeat the GPU's memory-coalescing optimizations and cripple performance.

Again, MLIR provides the answer.

A "graph dialect" or "sparse tensor dialect" can represent these data structures natively.
The compiler can then apply specialized optimizations for handling sparsity.
For example, it can reorder nodes to improve memory locality or use compressed storage formats.
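As an example of a compressed storage format, the sketch below implements the widely used compressed sparse row (CSR) layout in plain Python, along with a matrix-vector product that touches only the non-zero entries. The function names and the sample matrix are chosen for this illustration; this is the kind of representation a sparse-aware compiler could select, not code generated by one.

```python
def dense_to_csr(matrix):
    """Convert a dense matrix (list of rows) to CSR arrays.

    values:  the non-zero entries, row by row
    col_idx: the column of each stored entry
    row_ptr: row r occupies values[row_ptr[r]:row_ptr[r + 1]]
    """
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for col, entry in enumerate(row):
            if entry != 0:
                values.append(entry)
                col_idx.append(col)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Multiply a CSR matrix by a dense vector, skipping all zeros."""
    result = []
    for r in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        result.append(acc)
    return result


A = [
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 2],
    [0, 4, 0, 0],
]
x = [1, 2, 3, 4]

values, col_idx, row_ptr = dense_to_csr(A)
y = csr_matvec(values, col_idx, row_ptr, x)
print(values, col_idx, row_ptr)
print(y)
```

The lookup `x[col_idx[k]]` is the irregular memory access mentioned above: which element of `x` is read depends on the data, so neighbouring threads on a GPU would rarely read neighbouring addresses.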

This allows a high-level algorithm written in Mojo to be efficiently compiled for sparse data on any hardware.

Today, that is extremely difficult.

Quantum Computing Simulation

Simulating a quantum computer on a classical computer is essential for developing and testing quantum algorithms.

The most common method is state vector simulation.

The state of an N-qubit quantum system is represented by a vector of 2^N complex numbers.

For just 50 qubits, this vector has 2^50 (over a quadrillion) elements, requiring petabytes of memory.
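
The arithmetic behind that figure is easy to check. The sketch below assumes each amplitude is stored as a double-precision complex number (16 bytes); that storage choice is an assumption of this example, and the helper names are invented for it.

```python
def state_vector_bytes(num_qubits: int, bytes_per_amplitude: int = 16) -> int:
    """Memory needed to hold all 2^N amplitudes (16 bytes = one complex128, assumed)."""
    return (2 ** num_qubits) * bytes_per_amplitude


def human_readable(num_bytes: int) -> str:
    """Format a byte count using binary units."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]
    size = float(num_bytes)
    for unit in units:
        if size < 1024 or unit == units[-1]:
            return f"{size:.0f} {unit}"
        size /= 1024
    return f"{size:.0f} {units[-1]}"  # not reached; keeps type checkers happy


for n in (30, 40, 50):
    print(n, "qubits:", human_readable(state_vector_bytes(n)))
# 30 qubits: 16 GiB
# 40 qubits: 16 TiB
# 50 qubits: 16 PiB
```

Every additional qubit doubles the requirement, which is why state vector simulators hit a wall somewhere around 50 qubits no matter how fast the hardware is.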

A quantum algorithm is a sequence of "gates".

Applying each gate is equivalent to multiplying the massive state vector by a very large, very sparse matrix.

This is a compute-intensive and memory-bandwidth-bound workload.
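
A small sketch shows why the workload is bandwidth-bound. Applying a single-qubit gate never requires building the 2^N x 2^N matrix: the 2x2 gate is applied to pairs of amplitudes whose indices differ only in the target bit. The function name and the bit-ordering convention (qubit 0 is the least significant bit) are assumptions of this example, not taken from cuQuantum or any other library.

```python
import math
from typing import List, Sequence

INV_SQRT2 = 1 / math.sqrt(2)
HADAMARD = ((INV_SQRT2, INV_SQRT2), (INV_SQRT2, -INV_SQRT2))


def apply_single_qubit_gate(state: List[complex],
                            gate: Sequence[Sequence[complex]],
                            target: int) -> List[complex]:
    """Apply a 2x2 gate to qubit `target` of a state vector (qubit 0 = lowest bit)."""
    new_state = list(state)
    stride = 1 << target
    for base in range(len(state)):
        if base & stride:
            continue  # each pair is handled once, from its "target bit = 0" member
        partner = base | stride
        a0, a1 = state[base], state[partner]
        new_state[base] = gate[0][0] * a0 + gate[0][1] * a1
        new_state[partner] = gate[1][0] * a0 + gate[1][1] * a1
    return new_state


# Two qubits starting in |00>; put qubit 0 into superposition.
state = [1 + 0j, 0j, 0j, 0j]
superposed = apply_single_qubit_gate(state, HADAMARD, target=0)
print([round(abs(a) ** 2, 3) for a in superposed])  # [0.5, 0.5, 0.0, 0.0]
```

Each gate reads and writes every amplitude once while doing only a handful of multiply-adds per pair, so for large N the memory system, not the arithmetic units, sets the pace.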

NVIDIA has invested heavily here with its cuQuantum library.

cuQuantum is very fast on NVIDIA GPUs, but it has the traditional CUDA limitations:

Vendor Lock-In: Your quantum simulation code is tied to NVIDIA hardware.
Low-Level Optimization: The compiler sees only matrix-vector multiplications.
No Domain Advantage: It has no optimizations specific to quantum mechanics, because it is based on LLVM (the semantic gap).

The MLIR/Mojo Advantage for Quantum Simulation:

The MLIR approach gives the compiler a much higher level of intelligence.

A "quantum dialect" can be defined in MLIR.
This dialect would not represent gates as matrices; it would represent them as the quantum objects they are: Hadamard, CNOT, Toffoli.
A developer would write their quantum circuit in Mojo using these high-level objects.
The MLIR compiler can then perform quantum-specific optimizations before any matrices are even generated.

For example, the compiler would know that applying a Hadamard gate (H) twice in a row is the identity operation and can be eliminated entirely.

It would also know that certain sequences of gates can be "fused" into a single, more efficient gate.
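
The cancellation rule can be sketched as a tiny peephole pass over a gate list. This is plain Python operating on tuples, not an MLIR rewrite pattern; the circuit representation, the `SELF_INVERSE` set, and the function name are assumptions chosen for illustration.

```python
from typing import List, Tuple

# A gate is (name, qubits it acts on). Hypothetical representation for this sketch.
Gate = Tuple[str, Tuple[int, ...]]

# Gates that undo themselves when applied twice to the same qubits.
SELF_INVERSE = {"H", "X", "Z", "CNOT"}


def cancel_adjacent_inverses(circuit: List[Gate]) -> List[Gate]:
    """Drop back-to-back identical self-inverse gates acting on the same qubits."""
    optimized: List[Gate] = []
    for gate in circuit:
        name, _qubits = gate
        if optimized and optimized[-1] == gate and name in SELF_INVERSE:
            optimized.pop()  # G followed by G is the identity: remove both
        else:
            optimized.append(gate)
    return optimized


circuit = [("H", (0,)), ("H", (0,)), ("CNOT", (0, 1))]
print(cancel_adjacent_inverses(circuit))  # [('CNOT', (0, 1))]
```

The pass is deliberately conservative: it only looks at gates that are adjacent in the list, so two H gates separated by an unrelated gate on another qubit are left alone. The point is that it works on gate names, information that no longer exists once the circuit has been lowered to generic matrix multiplications.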

This is an entire class of optimization that is invisible to the CUDA compiler, which, working at the LLVM level, sees only generic matrices.

After performing these high-level algebraic simplifications, the MLIR compiler then lowers the simplified circuit into an optimized sequence of sparse matrix operations for the target hardware.

Because this is all built on MLIR, the same high-level quantum circuit written in Mojo could be compiled to run on an NVIDIA GPU, an AMD GPU, or a CPU cluster.

This delivers both higher performance (thanks to smarter optimization) and complete hardware freedom.

Nvidia is investing heavily in quantum simulation hardware and the software stack.

But its CUDA-Q platform is still based on LLVM.

MLIR-based Mojo offers not just more advanced optimization but also simpler programming.

Final Verdict: Today vs. The Inevitable Future

The Verdict Today (2025):

CUDA is the king of the hill.
Its mature ecosystem, extensive libraries, and massive community are powerful assets.
For a team that is already invested in NVIDIA hardware and needs to ship a product immediately, CUDA is the pragmatic choice.
The inertia of a decade of dominance is a powerful force.
Mojo is still young.
Its ecosystem is growing at an incredible pace, but it cannot yet match the sheer breadth of CUDA's battle-tested libraries.

The Verdict for the Long Run:

The future is heterogeneous.
This is not a guess; it is a fact.
As custom AI silicon and renewed competition from AMD and Intel grow, vendor lock-in becomes an unacceptable business and technical risk.
The problems of the future - sparse data, neuromorphic AI, blockchain mining, and quantum computing - do not fit neatly into the rigid SIMT model of today's GPUs.
MLIR is the only existing, industry-backed architecture designed to solve this problem.
Its adoption by Google, Apple, Intel, AMD, and ARM is a clear signal of its central role in the future of compilers.
Mojo is the only language (so far) designed to harness this power.

Mojo:

Solves the two-language problem
Combines productivity with performance
Provides a gateway to the entire MLIR ecosystem

The transition from CUDA to an MLIR-based world will be gradual, but it is inevitable.

It is a fundamental shift from a closed, hardware-centric model to an open, software-defined future.

The Disadvantages of Mojo

Mojo is still under development.
It does not even have classes yet.
It has few third-party libraries, though their number is growing at an incredible pace.
It has applications everywhere Python is used - but it needs to evolve along with Python.
The whole language is not yet open source, although experts say that will change soon.
It does not support Windows (yet).
It still needs to be ported to Android, iOS, and Edge IoT systems.

But will it be the winner in the long run?

I believe it will, and developers will be happier with Mojo than CUDA.

Conclusion

CUDA built the impressive palace of today's high-performance computing.

But it is a cage.

MLIR and Mojo are handing every developer the key to unlock it and build the future on any foundation they choose.

That foundation is bound to be MLIR and Mojo.

The simplest reason - the budget.

So, unless Nvidia pivots, and very soon:

This will be the end of the dominance of Nvidia - unless they embrace MLIR as well!

References

Official Project Pages

MLIR (Multi-Level Intermediate Representation)
- The official homepage for the MLIR project, hosted by LLVM. This is the canonical source for documentation, talks, and the project's overall mission statement.
- https://mlir.llvm.org/
Mojo Programming Language
- The official documentation for the Mojo programming language from Modular, the company that created it. This is the primary resource for learning the language.
- https://docs.modular.com/mojo/
NVIDIA CUDA Toolkit
- The official portal from NVIDIA for downloading the CUDA Toolkit, which includes the compilers, libraries, and tools necessary for CUDA development.
- https://developer.nvidia.com/cuda-toolkit
LLVM Compiler Infrastructure Project
- The main homepage for the LLVM project, which provides an overview of the entire ecosystem, including Clang, LLDB, and other sub-projects. MLIR is a part of this larger project.
- https://llvm.org/
Chris Lattner's Homepage
- The personal homepage of Chris Lattner, the creator of LLVM, Clang, Swift, MLIR, and Mojo. It provides his work history and links to his talks and papers, offering direct insight into the creation of these technologies.
- https://nondot.org/sabre/

AI and Attention Mechanism (FlashAttention)

FlashAttention Original Paper (arXiv)
- The original scientific paper, "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness," which introduced the algorithm. This is the primary source for understanding the technical details and performance benefits.
- https://arxiv.org/abs/2205.14135
FlashAttention-2 Paper (arXiv)
- The follow-up paper describing FlashAttention-2, which details further optimizations for parallelism and work partitioning to achieve even greater speedups on modern GPUs.
- https://arxiv.org/abs/2307.08691
FlashAttention GitHub Repository
- The official GitHub repository containing the source code for the FlashAttention and FlashAttention-2 CUDA kernels.
- https://github.com/Dao-AILab/flash-attention

Quantum Computing Simulation

NVIDIA cuQuantum Official Page
- NVIDIA's official product page for the cuQuantum SDK, outlining its features for accelerating quantum computing simulations on GPUs.
- https://developer.nvidia.com/cuquantum
NVIDIA cuQuantum Documentation
- The detailed technical documentation for the cuQuantum SDK, providing a high-level overview and API references for the libraries.
- https://docs.nvidia.com/cuda/cuquantum/index.html

Specialized Hardware (Neuromorphic & ASICs)

Intel Neuromorphic Computing Overview
- Intel's official overview of their neuromorphic computing research, which discusses the goals of the program and the Loihi research chips.
- https://www.intel.com/content/www/us/en/research/neuromorphic-computing.html
CIRCT (Circuit IR Compilers and Tools) Project
- The official homepage for the CIRCT project, an LLVM/MLIR incubator looking to apply compiler technology to hardware design, including High-Level Synthesis (HLS) for FPGAs and ASICs.
- https://circt.llvm.org/
CIRCT GitHub Repository
- The official GitHub repository for the CIRCT project, containing the source code, dialects, and tools for hardware compiler design.
- https://github.com/llvm/circt

Google AI Studio was used for this article. You can find it here:

https://aistudio.google.com/

All images were generated for free with NightCafe Studio, available at the link below:

https://creator.nightcafe.studio/
