Step 01 of 09 · Prerequisites
What You Need Before You Begin
cuda-oxide has specific version requirements for each dependency. Before installing anything, verify your system meets all of them. The project is currently Linux-only (tested on Ubuntu 24.04).
Linux (Ubuntu 24.04)
Rust nightly
CUDA Toolkit 12.x+
LLVM 21+
Clang 21 / libclang-common-21-dev
Git
ⓘ Why LLVM 21?
Simple kernels may work on LLVM 20, but anything targeting Hopper or Blackwell (TMA, tcgen05, WGMMA) requires llc from LLVM 21 or later. This is a hard requirement, not a recommendation.
Check your current CUDA version to confirm compatibility:
nvcc --version
Step 02 of 09 · Install Rust Nightly
Set Up the Rust Nightly Toolchain
cuda-oxide requires Rust nightly with two additional components: rust-src and rustc-dev. The toolchain is pinned to nightly-2026-04-03 via rust-toolchain.toml in the repository; it will be installed automatically the first time you run a build inside the repo.
If you need to install it manually:
# Install the pinned nightly toolchain
rustup toolchain install nightly-2026-04-03
# Add the required components
rustup component add rust-src rustc-dev \
  --toolchain nightly-2026-04-03
# Confirm the toolchain is active
rustup show
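For reference, a pin like the one described above would be expressed in rust-toolchain.toml roughly as follows. This is an illustrative sketch based on the channel and components named in this guide; the repository's own file is authoritative:

```toml
# rust-toolchain.toml (illustrative; check the repo's actual file)
[toolchain]
channel = "nightly-2026-04-03"
components = ["rust-src", "rustc-dev"]
```

With this file at the workspace root, rustup selects the pinned toolchain automatically for every cargo invocation inside the repo.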
ⓘ Why these components?
rustc-dev exposes the internal compiler APIs that the custom codegen backend hooks into. rust-src is required so the compiler can find and compile its own standard library sources for the device target.
Step 03 of 09 · Install LLVM 21
Install LLVM 21 with the NVPTX Backend
The cuda-oxide pipeline emits textual LLVM IR (.ll files) and hands them to the external llc binary to produce PTX. You need LLVM 21 or later with the NVPTX backend enabled.
# Ubuntu/Debian
sudo apt install llvm-21
# Verify the NVPTX backend is present
llc-21 --version | grep nvptx
The pipeline auto-discovers llc-22 and llc-21 on your PATH, in that order. To pin a specific binary, set the environment variable:
# Pin to a specific llc binary
export CUDA_OXIDE_LLC=/usr/bin/llc-21
⚠ Common Failure
If nvptx does not appear in the output of llc-21 --version, your LLVM build was compiled without the NVPTX target. Install from the official LLVM apt repository rather than your distro's default packages, which may omit GPU backends.
Step 04 of 09 · Install Clang
Install Clang 21 for the cuda-bindings Crate
The cuda-bindings crate uses bindgen to generate FFI bindings to cuda.h at build time. bindgen needs libclang, and specifically it needs Clang's own resource directory (which contains stddef.h). A bare libclang1-* runtime package is not enough.
# Install the full clang-21 package (includes resource headers)
sudo apt install clang-21
# Alternatively, the -dev header package also works
sudo apt install libclang-common-21-dev
⚠ Symptom of Missing Clang
If you install only the runtime but not the headers, the host build will fail with a cryptic 'stddef.h' file not found error during bindgen. Run cargo oxide doctor in the next step to catch this before attempting a build.
Step 05 of 09 · Install cargo-oxide
Clone the Repo and Install cargo-oxide
cargo-oxide is a Cargo subcommand that drives the entire build pipeline: cargo oxide build, cargo oxide run, cargo oxide debug, and cargo oxide pipeline.
Inside the repo (for trying the examples):
git clone https://github.com/NVlabs/cuda-oxide.git
cd cuda-oxide
# cargo oxide works out of the box via a workspace alias
cargo oxide run vecadd
Outside the repo (for your own projects):
# Install globally from the git source
cargo install \
  --git https://github.com/NVlabs/cuda-oxide.git \
  cargo-oxide
# On first run, cargo-oxide fetches and builds the codegen backend
Then verify that all prerequisites are in place with the built-in health check:
cargo oxide doctor
ⓘ What doctor checks
It validates your Rust toolchain (nightly, rust-src, rustc-dev), CUDA Toolkit, LLVM version and NVPTX support, Clang/libclang headers, and the codegen backend binary. Fix any red items before proceeding.
Step 06 of 09 · Run Your First Kernel
Build and Run the vecadd Example
The canonical first example is vecadd: a vector-addition kernel that adds two arrays of 1,024 f32 values on the GPU and verifies the result on the host.
# Build and run end-to-end
cargo oxide run vecadd
If everything is configured correctly, you will see:
✓ SUCCESS: All 1024 elements correct!
To see the full compilation pipeline, from Rust MIR through each Pliron dialect down to PTX, run:
# Print the full Rust MIR → dialect-mir → mem2reg → dialect-llvm → LLVM IR → PTX trace
cargo oxide pipeline vecadd
To debug with cuda-gdb:
cargo oxide debug vecadd --tui
ⓘ Output artifacts
A successful build produces two files: target/debug/vecadd (the host binary) and target/debug/vecadd.ptx (the device code). The host binary loads the PTX file via the CUDA driver at runtime.
Step 07 of 09 · Write a Kernel
Writing Your Own #[kernel] Function
A kernel function is annotated with #[kernel]. Use DisjointSlice for mutable outputs and &[T] for read-only inputs. Access the thread's unique hardware index with thread::index_1d().
use cuda_device::{kernel, thread, DisjointSlice};

// Tier 1 safety: race-free by construction, no `unsafe` needed.
// DisjointSlice::get_mut() only accepts a ThreadIndex,
// a hardware-derived opaque type guaranteeing unique writes per thread.
#[kernel]
pub fn scale(input: &[f32], factor: f32, mut out: DisjointSlice<f32>) {
    let idx = thread::index_1d();
    if let Some(elem) = out.get_mut(idx) {
        *elem = input[idx.get()] * factor;
    }
}
ⓘ Tier 1 Safety: how it works
ThreadIndex is an opaque newtype around usize that can only be created from hardware built-in registers (threadIdx, blockIdx, blockDim). Since each thread gets a unique value, and DisjointSlice::get_mut() only accepts a ThreadIndex, writes are race-free by construction, with no unsafe anywhere in the kernel.
Step 08 of 09 · Launch from Host
Launching the Kernel from Host Code
Host and device code live in the same .rs file. The host side uses CudaContext, DeviceBuffer, and the cuda_launch! macro to manage GPU memory and dispatch.
use cuda_core::{CudaContext, DeviceBuffer, LaunchConfig};
use cuda_host::{cuda_launch, load_kernel_module};

fn main() {
    // Initialize the GPU context on device 0
    let ctx = CudaContext::new(0).unwrap();
    let stream = ctx.default_stream();
    let module = load_kernel_module(&ctx, "scale_example").unwrap();

    // Upload input data to GPU memory
    let data: Vec<f32> = (0..1024).map(|i| i as f32).collect();
    let input = DeviceBuffer::from_host(&stream, &data).unwrap();
    let mut output = DeviceBuffer::<f32>::zeroed(&stream, 1024).unwrap();

    // Dispatch the kernel; LaunchConfig auto-sizes blocks/grids
    cuda_launch! {
        kernel: scale,
        stream: stream,
        module: module,
        config: LaunchConfig::for_num_elems(1024),
        args: [slice(input), 2.5f32, slice_mut(output)]
    }.unwrap();

    // Copy the result back to the host
    let result = output.to_host_vec(&stream).unwrap();
    assert!((result[1] - 2.5).abs() < 1e-5);
    println!("✓ Kernel ran successfully!");
}
ⓘ What cuda_launch! does
It scalarizes the argument list (flattening slices, scalars, and captured closures) into PTX kernel parameters and dispatches the kernel on the given stream. No manual argument marshalling is required.
Step 09 of 09 · Next Steps
What to Explore Next
You now have a working cuda-oxide setup. Here are the high-value paths forward, ordered by complexity:
- Generic kernels with monomorphization: try the generic example (cargo oxide run generic) to see how fn scale compiles to separate PTX kernels per type.
- Closures with captures: the host_closure example shows how a move |x: f32| x * factor closure is scalarized and passed as PTX kernel parameters automatically.
- Async GPU execution: cuda_launch_async! returns a lazy DeviceOperation that executes on .sync() or .await. See the async_mlp and async_vecadd examples.
- Shared memory and warp intrinsics: these require scoped unsafe blocks with documented safety contracts. See Tier 2 in the safety-model documentation.
- GEMM at Speed-of-Light: the gemm_sol example achieves 868 TFLOPS on B200 (58% of cuBLAS SoL) using cta_group::2, CLC, and a 4-stage pipeline.
- Blackwell tensor cores: the tcgen05 example targets sm_100a with TMEM, MMA, and cta_group::2. Requires LLVM 21+.
ⓘ Known Limitation in v0.1.0
index_2d(stride) is documented as currently unsound: if threads in the same kernel use different stride values, two threads can obtain &mut T to the same element with no unsafe in sight. Until the fix lands (lifting stride into a type parameter), bind stride to a single let binding and reuse it at every call site.
Full documentation: nvlabs.github.io/cuda-oxide · Supply: github.com/NVlabs/cuda-oxide

