01The stack, and the shape of the loop
The whole thing is one Rust crate compiled to WebAssembly with
wasm-pack build --target web, talking to the GPU through
wgpu (the Rust implementation of the WebGPU standard) with shaders
written in WGSL. The dependency list is deliberately small; this is the core of it:
[dependencies]
wgpu = "29.0" # WebGPU: compute + render
wasm-bindgen = "0.2" # the JS <-> Rust boundary
wasm-bindgen-futures = "0.4"
web-sys = { version = "0.3", features = [
"HtmlCanvasElement",
"AudioContext",
# DOM, input events, and the Web Audio nodes the app uses.
] }
bytemuck = { version = "1.0", features = ["derive"] } # zero-copy GPU uploads
cgmath = "0.18" # camera math
rand = { version = "0.10", default-features = false, features = ["std", "std_rng"] }
The architecture rests on one rule: body state lives only on the GPU. Positions
and velocities sit in GPU storage buffers and are never copied back to the CPU each frame (the one
exception is a tiny, throttled, asynchronous reduction that drives the audio). A single
requestAnimationFrame callback updates input and audio, then records the GPU work for
that frame. The compute passes and the draw read and write the same GPU-resident particle buffer:
fn animation_frame(time: f32) {
let app_state = APP_STATE.with(|cell| cell.borrow().clone());
if let Some(app_state) = app_state {
let mut app = app_state.borrow_mut();
app.update(time); // input + fixed-step accumulator
app.update_audio(); // consume the latest async core readback
if let Err(e) = app.render() {
console_log!("Render error: {:?}", e);
}
}
request_animation_frame(); // schedule the next frame
}
02A body, laid out for the GPU
Each body is two vec4s: position + mass, and velocity + a colour tint. Packing as
vec4 rather than vec3 is not laziness — it is the single most common
WebGPU footgun. WGSL rounds a vec3<f32> up to a 16-byte stride in
storage, so a vec3-based struct silently disagrees with its Rust mirror. Using
vec4 makes the layout unambiguous and identical on both sides:
#[repr(C)]
#[derive(Copy, Clone, Debug, Pod, Zeroable)]
pub struct Particle {
pub pos_mass: [f32; 4], // xyz = position, w = mass
pub vel: [f32; 4], // xyz = velocity, w = colour tint
}
struct Particle {
pos_mass: vec4<f32>, // xyz = position, w = mass
vel: vec4<f32>, // xyz = velocity, w = colour tint
}
#[repr(C)] fixes the field order and bytemuck's Pod/Zeroable
let the struct be cast to bytes with zero copying for upload. And rather than hope the two
layouts match, unit tests assert the buffer contract. A future field change has to update both
sides, or it fails before it reaches a browser:
#[test]
fn buffer_layout_matches_wgsl() {
// Mirrored byte-for-byte by the structs in update.wgsl.
assert_eq!(std::mem::size_of::<Particle>(), 32);
assert_eq!(std::mem::size_of::<SimulationParams>(), 32);
}
03The heart: all-pairs gravity, tiled
The expensive kernel is the O(N²) gravity sum: every body accumulates the pull of every other. Done naïvely, each of the N threads reads all N bodies from global memory — N² global reads, and the kernel is starved on memory bandwidth, not arithmetic.
The fix is the classic GPU N-body tiling pattern. The workgroup walks the body array in
tiles: all 256 threads cooperatively stage one tile into fast
var<workgroup> shared memory, hit a barrier, then every thread accumulates that
whole tile's pull on its own body from shared memory before moving to the next tile. Global memory
is touched once per workgroup per tile, not once per body per thread:
// Tile size == workgroup size; particle_count is clamped to a multiple of it,
// so the tile loop never needs a tail guard.
const TILE: u32 = 256u;
var<workgroup> shared_pm: array<vec4<f32>, 256>;
@compute @workgroup_size(256)
fn compute_accel(
@builtin(global_invocation_id) gid: vec3<u32>,
@builtin(local_invocation_index) lidx: u32,
) {
let i = gid.x;
let pi = particles[i].pos_mass.xyz;
let soft2 = params.softening * params.softening;
let n = params.particle_count;
var a = vec3<f32>(0.0);
var base = 0u;
loop {
if base >= n { break; }
// Stage one tile into workgroup memory, in parallel...
shared_pm[lidx] = particles[base + lidx].pos_mass;
workgroupBarrier();
// ...then sum that tile's pull from fast shared memory.
for (var j = 0u; j < TILE; j = j + 1u) {
let pmj = shared_pm[j];
let d = pmj.xyz - pi;
let r2 = dot(d, d) + soft2; // Plummer softening
let inv = inverseSqrt(r2);
// Self term (d == 0) contributes nothing — no branch needed.
a += params.g * pmj.w * d * (inv * inv * inv);
}
workgroupBarrier();
base = base + TILE;
}
// Static halo force + Chandrasekhar dynamical friction are added here.
accel[i] = vec4<f32>(a, 0.0);
}
The softened denominator means the self-term (where d == 0) is finite and adds zero,
so the inner loop needs no if i != j branch.
Two workgroupBarrier()s bracket each tile: one so every thread sees the staged tile,
one so no thread overwrites it while others still read.
04A symplectic leapfrog, as three compute passes
The integrator is split across three separate compute passes per fixed step. That separation isn't stylistic — it's how you avoid a read/write hazard. The gravity pass must read positions that the drift pass has finished writing; making them distinct passes lets the GPU order all of pass n's writes before pass n+1's reads:
pub fn compute_pass(&self, encoder: &mut wgpu::CommandEncoder) {
self.dispatch(encoder, &self.drift_pipeline, "Drift Pass"); // x += v·dt/2
self.dispatch(encoder, &self.accel_pipeline, "Accel Pass"); // gravity @ midpoint
self.dispatch(encoder, &self.kick_pipeline, "Kick Pass"); // v += a·dt; x += v·dt/2
}
fn dispatch(&self, enc: &mut wgpu::CommandEncoder, pipe: &wgpu::ComputePipeline, label: &str) {
let mut pass = enc.begin_compute_pass(&wgpu::ComputePassDescriptor {
label: Some(label), timestamp_writes: None,
});
pass.set_pipeline(pipe);
pass.set_bind_group(0, &self.compute_bind_group, &[]);
pass.dispatch_workgroups(self.count / WORKGROUP_SIZE, 1, 1);
}
The drift and kick kernels themselves are tiny — and note the drift carries the colour tint through untouched:
@compute @workgroup_size(256)
fn drift_half(@builtin(global_invocation_id) gid: vec3<u32>) {
let i = gid.x;
if i >= params.particle_count { return; }
let p = particles[i];
let x = p.pos_mass.xyz + p.vel.xyz * (0.5 * params.dt);
particles[i] = Particle(vec4<f32>(x, p.pos_mass.w), p.vel);
}
@compute @workgroup_size(256)
fn kick_drift_half(@builtin(global_invocation_id) gid: vec3<u32>) {
let i = gid.x;
if i >= params.particle_count { return; }
let p = particles[i];
let v = p.vel.xyz + accel[i].xyz * params.dt; // full kick
let x = p.pos_mass.xyz + v * (0.5 * params.dt); // second half-drift
particles[i] = Particle(vec4<f32>(x, p.pos_mass.w), vec4<f32>(v, p.vel.w));
}
Drifting first is a nice trick: the very first kernel each step reads no acceleration, so a
freshly-seeded galaxy needs no primed accel buffer — the next pass simply evaluates
gravity at the drifted midpoint.
05Frame-rate-independent time
Physics steps at a fixed dt (1/60), decoupled from the display via an
accumulator: bank the real elapsed time, scaled by the speed slider, then spend it
in whole fixed steps. Refresh rate changes how many frames you see, not the size of a physics step.
// Bank real elapsed time (clamped against a stall), scaled by the speed knob.
self.accumulator += frame_dt.clamp(0.0, MAX_FRAME_DT) * self.speed;
// Spend it in whole FIXED_DT steps. Cap catch-up after a stall because each
// substep is a full O(N²) gravity pass.
let mut steps = (self.accumulator / simulation::FIXED_DT) as u32;
if steps > MAX_SUBSTEPS {
steps = MAX_SUBSTEPS;
self.accumulator = 0.0;
} else {
self.accumulator -= steps as f32 * simulation::FIXED_DT;
}
self.steps_this_frame = steps;
self.sim_time += steps as f32 * simulation::FIXED_DT;
The seed is deterministic, but the long-run trajectory is not a cross-GPU bit-for-bit promise: the all-pairs sum is massively parallel floating point, and tiny ordering differences eventually matter in a chaotic system.
06Rendering: instanced billboards + bloom
Each star is a camera-facing quad, drawn instanced — 4 vertices × N instances — with the
vertex shader reading the same particle buffer the compute passes just wrote. The trick for a
constant on-screen size regardless of depth is to offset in clip space, scaling by
clip.w and dividing x by the aspect ratio to keep quads square:
let particle = particles[instance_index];
var clip = camera.transform * vec4<f32>(particle.pos_mass.xyz, 1.0);
// Offset in clip space so the billboard is a constant size on screen,
// independent of depth; divide x by aspect to keep it square.
clip.x += corner.x * camera.size * clip.w / camera.aspect;
clip.y += corner.y * camera.size * clip.w;
The fragment shader draws a soft radial falloff, and additive blending sums overlapping stars so dense regions build into a bright core — then a bright-pass / separable-blur / tonemap pass adds the bloom that gives the galaxy its glow.
07The wasm-bindgen boundary
The page owns a single AppState behind a thread_local cell. WebAssembly in
the browser is single-threaded, so a RefCell<Option<Rc<...>>> is enough:
no locks, no Send bounds to satisfy. Every UI control is just a small
#[wasm_bindgen] free function the JavaScript calls:
#[wasm_bindgen]
pub fn set_halo(halo_v0: f32) {
let halo_v0 = halo_v0.clamp(0.0, 400.0);
APP_STATE.with(|cell| {
if let Some(app) = cell.borrow().as_ref() {
app.borrow_mut().set_halo(halo_v0); // rewrites the params uniform — live, no re-seed
}
});
}
Live knobs (gravity, halo strength) just rewrite a small uniform buffer; seed-time knobs (body count, Toomre Q, gas, bulge) regenerate the initial conditions and re-upload the particle buffer.
08Patterns that paid off
- Mind WGSL alignment.
vec3takes a 16-byte slot; pack withvec4or pad uniforms, and assert the layout in a test (§02). Silent stride mismatches are miserable to debug. - Keep state on the GPU. A per-frame readback stalls the pipeline waiting for the device. The only readback here is asynchronous and single-in-flight: a flag skips it while a previous map is pending, which naturally throttles it to the GPU round-trip rate and keeps it off the hot path.
- Separate passes for read/write hazards. Gravity reads what drift wrote — distinct compute passes give you the ordering guarantee for free (§04).
- Make the count a multiple of the workgroup size. Then the inner tile loop needs no bounds check, and every dispatch divides evenly.
- Fixed timestep + accumulator. The display rate never changes integration accuracy (§05).
- Deterministic RNG. A fixed-seed
StdRngwithgetrandomdisabled gives repeatable initial conditions and no OS-entropy dependency to configure for wasm. - Cast, don't serialise.
bytemuckturns#[repr(C)]structs into byte slices for buffer writes with zero copies.
WebGPU is broadly available now, but exact support still depends on browser, OS, GPU and driver.
Feature-detect navigator.gpu and show a friendly message rather than a blank canvas
when it is missing.
09Build & ship
The build is two steps: compile the crate to wasm and copy the static shell alongside it.
# Rust + shaders -> WebAssembly + JS glue, into pkg/
wasm-pack build --target web --release --out-dir pkg --out-name galacto
# Copy the HTML/CSS shell next to it, then stamp ?v=<git-sha> URLs
cp -r static/* pkg/ && node scripts/cache-bust.mjs
The output in pkg/ is a plain static site — no server, no backend — so it deploys to any
static host (Galacto runs on Cloudflare Pages). The build stamps the JS and CSS URLs with the git
SHA and gives the service worker a per-deploy cache name, so repeat visits are quick, offline launch
works, and the next navigation can still see a fresh deploy.