GPU Validation (torch / warp-lang / nvalchemiops coupling)¶
The CPU CI matrix verifies torch-core API and CPU numerics across PyTorch 2.8–2.12, but it cannot validate three GPU-only concerns: that warp-lang and nvalchemiops install coherently with each torch version, that the Warp custom op kernels execute, and that full-model energies/forces stay reproducible across the range. This manual tool covers them on a CUDA machine.
Running it¶
make gpu-validate
This loops over PyTorch 2.8, 2.9, 2.10, 2.11, 2.12. For each it creates a fresh virtualenv, installs a resolver-coherent environment (the matching CUDA torch wheel and its triton, plus aimnet + warp-lang + nvalchemiops), runs pytest -m gpu, and computes deterministic energies/forces for a fixed set of systems (water, methane, caffeine, a periodic spiro crystal). A same-run torch 2.9 baseline is then used to diff every other version.
Inspect the commands without running anything:
DRY_RUN=1 bash scripts/gpu_validate.sh
Configuration¶
| Variable | Default | Notes |
|---|---|---|
CUDA_INDEX |
https://download.pytorch.org/whl/cu126 |
Pick the channel matching your driver. torch 2.12 defaults to CUDA 13; if a cu126 wheel is unavailable for a version, that leg reports INSTALL-FAIL — re-run that version with the appropriate --index-url. |
PYTHON |
python3.12 |
aimnet itself supports Python ≥ 3.11, but validation defaults to 3.12 because nvalchemiops's torch extra activates only at Python ≥ 3.12 — i.e. 3.12+ is where the coupling under test actually engages. |
TORCH_VERSIONS |
2.8 2.9 2.10 2.11 2.12 |
Space-separated minor versions. |
BASELINE |
2.9 |
torch version used as the energy/force reference. |
ENERGY_ATOL |
1e-5 |
Hartree. Matches the repo's ENERGY_ATOL. |
FORCE_ATOL |
1e-4 |
Hartree/Å. Looser than FORCE_ATOL=1e-5 to absorb legitimate cross-version GPU reduction-order variation; tighten if your hardware is stable. |
Reading the matrix¶
Each row is one torch version with install, gpu-suite, maxDE, maxDF, and a verdict:
PASS— installs, suite green, energies/forces within tolerance of baseline.BASELINE— the reference version.INSTALL-FAIL— the coherent env did not install (often the CUDA channel; sometimes a real warp/nvalchemiops-vs-torch break — the signal this tool exists to find).SUITE-FAIL—pytest -m gpufailed.DRIFT— energies or forces diverged beyond tolerance.NO-RESULTS— installed but the observables dump did not produce output.
A nonzero exit code means at least one leg failed. Paste the matrix into the release or PR note as the GPU-side evidence for the supported range. The real aimnet2 model downloads from GCS on first use, so the box needs network (or a pre-populated ~/.cache/aimnet).