Building Galacto — a GPU N-body galaxy in Rust and WebGPU

01The stack, and the shape of the loop

The whole thing is one Rust crate compiled to WebAssembly with wasm-pack build --target web, talking to the GPU through wgpu (the Rust implementation of the WebGPU standard) with shaders written in WGSL. The dependency list is deliberately small; this is the core of it:

tomlCargo.toml — abridged

[dependencies]
wgpu = "29.0"              # WebGPU: compute + render
wasm-bindgen = "0.2"       # the JS <-> Rust boundary
wasm-bindgen-futures = "0.4"
web-sys = { version = "0.3", features = [
    "HtmlCanvasElement",
    "AudioContext",
    # DOM, input events, and the Web Audio nodes the app uses.
] }
bytemuck = { version = "1.0", features = ["derive"] }  # zero-copy GPU uploads
cgmath = "0.18"             # camera math
rand = { version = "0.10", default-features = false, features = ["std", "std_rng"] }

The architecture rests on one rule: body state lives only on the GPU. Positions and velocities sit in GPU storage buffers and are never copied back to the CPU each frame (the one exception is a tiny, throttled, asynchronous reduction that drives the audio). A single requestAnimationFrame callback updates input and audio, then records the GPU work for that frame. The compute passes and the draw read and write the same GPU-resident particle buffer:

rustsrc/lib.rs — the per-frame driver

fn animation_frame(time: f32) {
    let app_state = APP_STATE.with(|cell| cell.borrow().clone());
    if let Some(app_state) = app_state {
        let mut app = app_state.borrow_mut();
        app.update(time);        // input + fixed-step accumulator
        app.update_audio();      // consume the latest async core readback
        if let Err(e) = app.render() {
            console_log!("Render error: {:?}", e);
        }
    }
    request_animation_frame();  // schedule the next frame
}

02A body, laid out for the GPU

Each body is two vec4s: position + mass, and velocity + a colour tint. Packing as vec4 rather than vec3 is not laziness — it is the single most common WebGPU footgun. WGSL rounds a vec3<f32> up to a 16-byte stride in storage, so a vec3-based struct silently disagrees with its Rust mirror. Using vec4 makes the layout unambiguous and identical on both sides:

rustsrc/simulation.rs

#[repr(C)]
#[derive(Copy, Clone, Debug, Pod, Zeroable)]
pub struct Particle {
    pub pos_mass: [f32; 4],  // xyz = position, w = mass
    pub vel: [f32; 4],       // xyz = velocity, w = colour tint
}

wgslsrc/shaders/update.wgsl — the exact mirror

struct Particle {
    pos_mass: vec4<f32>,  // xyz = position, w = mass
    vel: vec4<f32>,       // xyz = velocity, w = colour tint
}

#[repr(C)] fixes the field order and bytemuck's Pod/Zeroable let the struct be cast to bytes with zero copying for upload. And rather than hope the two layouts match, unit tests assert the buffer contract. A future field change has to update both sides, or it fails before it reaches a browser:

rustsrc/simulation.rs — a test that guards the layout

#[test]
fn buffer_layout_matches_wgsl() {
    // Mirrored byte-for-byte by the structs in update.wgsl.
    assert_eq!(std::mem::size_of::<Particle>(), 32);
    assert_eq!(std::mem::size_of::<SimulationParams>(), 32);
}

03The heart: all-pairs gravity, tiled

The expensive kernel is the O(N²) gravity sum: every body accumulates the pull of every other. Done naïvely, each of the N threads reads all N bodies from global memory — N² global reads, and the kernel is starved on memory bandwidth, not arithmetic.

The fix is the classic GPU N-body tiling pattern. The workgroup walks the body array in tiles: all 256 threads cooperatively stage one tile into fast var<workgroup> shared memory, hit a barrier, then every thread accumulates that whole tile's pull on its own body from shared memory before moving to the next tile. Global memory is touched once per workgroup per tile, not once per body per thread:

wgslsrc/shaders/update.wgsl — compute_accel (core)

// Tile size == workgroup size; particle_count is clamped to a multiple of it,
// so the tile loop never needs a tail guard.
const TILE: u32 = 256u;
var<workgroup> shared_pm: array<vec4<f32>, 256>;

@compute @workgroup_size(256)
fn compute_accel(
    @builtin(global_invocation_id) gid: vec3<u32>,
    @builtin(local_invocation_index) lidx: u32,
) {
    let i = gid.x;
    let pi = particles[i].pos_mass.xyz;
    let soft2 = params.softening * params.softening;
    let n = params.particle_count;

    var a = vec3<f32>(0.0);
    var base = 0u;
    loop {
        if base >= n { break; }
        // Stage one tile into workgroup memory, in parallel...
        shared_pm[lidx] = particles[base + lidx].pos_mass;
        workgroupBarrier();
        // ...then sum that tile's pull from fast shared memory.
        for (var j = 0u; j < TILE; j = j + 1u) {
            let pmj = shared_pm[j];
            let d = pmj.xyz - pi;
            let r2 = dot(d, d) + soft2;       // Plummer softening
            let inv = inverseSqrt(r2);
            // Self term (d == 0) contributes nothing — no branch needed.
            a += params.g * pmj.w * d * (inv * inv * inv);
        }
        workgroupBarrier();
        base = base + TILE;
    }
    // Static halo force + Chandrasekhar dynamical friction are added here.
    accel[i] = vec4<f32>(a, 0.0);
}

Why it's quick to read, fast to run

The softened denominator means the self-term (where d == 0) is finite and adds zero, so the inner loop needs no if i != j branch. Two workgroupBarrier()s bracket each tile: one so every thread sees the staged tile, one so no thread overwrites it while others still read.

04A symplectic leapfrog, as three compute passes

The integrator is split across three separate compute passes per fixed step. That separation isn't stylistic — it's how you avoid a read/write hazard. The gravity pass must read positions that the drift pass has finished writing; making them distinct passes lets the GPU order all of pass n's writes before pass n+1's reads:

rustsrc/simulation.rs — one drift–kick–drift step

pub fn compute_pass(&self, encoder: &mut wgpu::CommandEncoder) {
    self.dispatch(encoder, &self.drift_pipeline, "Drift Pass");  // x += v·dt/2
    self.dispatch(encoder, &self.accel_pipeline, "Accel Pass");  // gravity @ midpoint
    self.dispatch(encoder, &self.kick_pipeline,  "Kick Pass");   // v += a·dt; x += v·dt/2
}

fn dispatch(&self, enc: &mut wgpu::CommandEncoder, pipe: &wgpu::ComputePipeline, label: &str) {
    let mut pass = enc.begin_compute_pass(&wgpu::ComputePassDescriptor {
        label: Some(label), timestamp_writes: None,
    });
    pass.set_pipeline(pipe);
    pass.set_bind_group(0, &self.compute_bind_group, &[]);
    pass.dispatch_workgroups(self.count / WORKGROUP_SIZE, 1, 1);
}

The drift and kick kernels themselves are tiny — and note the drift carries the colour tint through untouched:

wgslsrc/shaders/update.wgsl — the leapfrog halves

@compute @workgroup_size(256)
fn drift_half(@builtin(global_invocation_id) gid: vec3<u32>) {
    let i = gid.x;
    if i >= params.particle_count { return; }
    let p = particles[i];
    let x = p.pos_mass.xyz + p.vel.xyz * (0.5 * params.dt);
    particles[i] = Particle(vec4<f32>(x, p.pos_mass.w), p.vel);
}

@compute @workgroup_size(256)
fn kick_drift_half(@builtin(global_invocation_id) gid: vec3<u32>) {
    let i = gid.x;
    if i >= params.particle_count { return; }
    let p = particles[i];
    let v = p.vel.xyz + accel[i].xyz * params.dt;        // full kick
    let x = p.pos_mass.xyz + v * (0.5 * params.dt);     // second half-drift
    particles[i] = Particle(vec4<f32>(x, p.pos_mass.w), vec4<f32>(v, p.vel.w));
}

Drifting first is a nice trick: the very first kernel each step reads no acceleration, so a freshly-seeded galaxy needs no primed accel buffer — the next pass simply evaluates gravity at the drifted midpoint.

05Frame-rate-independent time

Physics steps at a fixed dt (1/60), decoupled from the display via an accumulator: bank the real elapsed time, scaled by the speed slider, then spend it in whole fixed steps. Refresh rate changes how many frames you see, not the size of a physics step.

rustsrc/lib.rs — the accumulator

// Bank real elapsed time (clamped against a stall), scaled by the speed knob.
self.accumulator += frame_dt.clamp(0.0, MAX_FRAME_DT) * self.speed;

// Spend it in whole FIXED_DT steps. Cap catch-up after a stall because each
// substep is a full O(N²) gravity pass.
let mut steps = (self.accumulator / simulation::FIXED_DT) as u32;
if steps > MAX_SUBSTEPS {
    steps = MAX_SUBSTEPS;
    self.accumulator = 0.0;
} else {
    self.accumulator -= steps as f32 * simulation::FIXED_DT;
}
self.steps_this_frame = steps;
self.sim_time += steps as f32 * simulation::FIXED_DT;

The seed is deterministic, but the long-run trajectory is not a cross-GPU bit-for-bit promise: the all-pairs sum is massively parallel floating point, and tiny ordering differences eventually matter in a chaotic system.

06Rendering: instanced billboards + bloom

Each star is a camera-facing quad, drawn instanced — 4 vertices × N instances — with the vertex shader reading the same particle buffer the compute passes just wrote. The trick for a constant on-screen size regardless of depth is to offset in clip space, scaling by clip.w and dividing x by the aspect ratio to keep quads square:

wgslsrc/shaders/render.wgsl — billboard vertex (core)

let particle = particles[instance_index];
var clip = camera.transform * vec4<f32>(particle.pos_mass.xyz, 1.0);

// Offset in clip space so the billboard is a constant size on screen,
// independent of depth; divide x by aspect to keep it square.
clip.x += corner.x * camera.size * clip.w / camera.aspect;
clip.y += corner.y * camera.size * clip.w;

The fragment shader draws a soft radial falloff, and additive blending sums overlapping stars so dense regions build into a bright core — then a bright-pass / separable-blur / tonemap pass adds the bloom that gives the galaxy its glow.

07The wasm-bindgen boundary

The page owns a single AppState behind a thread_local cell. WebAssembly in the browser is single-threaded, so a RefCell<Option<Rc<...>>> is enough: no locks, no Send bounds to satisfy. Every UI control is just a small #[wasm_bindgen] free function the JavaScript calls:

rustsrc/lib.rs — exported controls

#[wasm_bindgen]
pub fn set_halo(halo_v0: f32) {
    let halo_v0 = halo_v0.clamp(0.0, 400.0);
    APP_STATE.with(|cell| {
        if let Some(app) = cell.borrow().as_ref() {
            app.borrow_mut().set_halo(halo_v0);  // rewrites the params uniform — live, no re-seed
        }
    });
}

Live knobs (gravity, halo strength) just rewrite a small uniform buffer; seed-time knobs (body count, Toomre Q, gas, bulge) regenerate the initial conditions and re-upload the particle buffer.

08Patterns that paid off

Mind WGSL alignment. vec3 takes a 16-byte slot; pack with vec4 or pad uniforms, and assert the layout in a test (§02). Silent stride mismatches are miserable to debug.
Keep state on the GPU. A per-frame readback stalls the pipeline waiting for the device. The only readback here is asynchronous and single-in-flight: a flag skips it while a previous map is pending, which naturally throttles it to the GPU round-trip rate and keeps it off the hot path.
Separate passes for read/write hazards. Gravity reads what drift wrote — distinct compute passes give you the ordering guarantee for free (§04).
Make the count a multiple of the workgroup size. Then the inner tile loop needs no bounds check, and every dispatch divides evenly.
Fixed timestep + accumulator. The display rate never changes integration accuracy (§05).
Deterministic RNG. A fixed-seed StdRng with getrandom disabled gives repeatable initial conditions and no OS-entropy dependency to configure for wasm.
Cast, don't serialise. bytemuck turns #[repr(C)] structs into byte slices for buffer writes with zero copies.

A WebGPU reality check

WebGPU is broadly available now, but exact support still depends on browser, OS, GPU and driver. Feature-detect navigator.gpu and show a friendly message rather than a blank canvas when it is missing.

09Build & ship

The build is two steps: compile the crate to wasm and copy the static shell alongside it.

bashthe build

# Rust + shaders -> WebAssembly + JS glue, into pkg/
wasm-pack build --target web --release --out-dir pkg --out-name galacto

# Copy the HTML/CSS shell next to it, then stamp ?v=<git-sha> URLs
cp -r static/* pkg/ && node scripts/cache-bust.mjs

The output in pkg/ is a plain static site — no server, no backend — so it deploys to any static host (Galacto runs on Cloudflare Pages). The build stamps the JS and CSS URLs with the git SHA and gives the service worker a per-deploy cache name, so repeat visits are quick, offline launch works, and the next navigation can still see a fresh deploy.

← The physics behind it