Batch Processing: Processing Molecular Datasets
What You'll Learn
- How to construct mol_idx tensors for batching molecules of different sizes
- How to process multi-frame XYZ files efficiently with AIMNet2
- How to manage GPU memory when processing large datasets
- How to batch molecules with different atom counts using the flat coordinate format
Prerequisites
- AIMNet2 installed (pip install aimnet)
- Familiarity with PyTorch tensors and basic AIMNet2 usage (see First Calculation)
- A multi-structure dataset (XYZ file, SDF, or similar)
Step 1: Understanding the Flat Coordinate Format
When processing multiple molecules together, AIMNet2 uses a flat coordinate format. Instead of padding all molecules to the same size, you concatenate all atomic coordinates into a single (N_total, 3) tensor and use mol_idx to indicate which molecule each atom belongs to:
import torch
from aimnet.calculators import AIMNet2Calculator
calc = AIMNet2Calculator("aimnet2")
# Water (3 atoms) + Methane (5 atoms) + Ammonia (4 atoms)
coords = torch.tensor([
# Water (molecule 0)
[0.000, 0.000, 0.117], # O
[0.000, 0.757, -0.469], # H
[0.000, -0.757, -0.469], # H
# Methane (molecule 1)
[0.000, 0.000, 0.000], # C
[0.629, 0.629, 0.629], # H
[0.629, -0.629, -0.629], # H
[-0.629, 0.629, -0.629], # H
[-0.629, -0.629, 0.629], # H
# Ammonia (molecule 2)
[0.000, 0.000, 0.116], # N
[0.000, 0.939, -0.271], # H
[0.813, -0.470, -0.271], # H
[-0.813, -0.470, -0.271], # H
])
numbers = torch.tensor([8, 1, 1, 6, 1, 1, 1, 1, 7, 1, 1, 1])
mol_idx = torch.tensor([0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2])
charge = torch.tensor([0.0, 0.0, 0.0]) # One charge per molecule
result = calc(
{"coord": coords, "numbers": numbers, "charge": charge, "mol_idx": mol_idx},
forces=True,
)
# Energy has shape (3,) -- one value per molecule
print(f"Energies: {result['energy']}")
# Forces has shape (12, 3) -- one force vector per atom
print(f"Forces shape: {result['forces'].shape}")
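When the atom count of each molecule is known, mol_idx does not have to be typed by hand. The following is a minimal sketch that assumes only a list of per-molecule atom counts; `build_mol_idx` is a hypothetical helper for illustration, not part of AIMNet2.

```python
import torch

def build_mol_idx(atom_counts):
    """Return a sorted mol_idx tensor: index i repeated atom_counts[i] times.

    Hypothetical helper, not an AIMNet2 function.
    """
    counts = torch.as_tensor(atom_counts, dtype=torch.long)
    return torch.repeat_interleave(torch.arange(len(counts)), counts)

# Water (3 atoms), methane (5 atoms), ammonia (4 atoms)
mol_idx = build_mol_idx([3, 5, 4])
print(mol_idx.tolist())
```

Because the indices are generated molecule by molecule, the result is sorted in non-decreasing order by construction.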
!!! warning "mol_idx must be sorted in non-decreasing order"
    The mol_idx tensor must be sorted: all atoms for molecule 0 come first, then all atoms for molecule 1, and so on. Values like [0, 1, 0, 1] will produce incorrect results. Always use [0, 0, 1, 1].
!!! warning "charge shape must match number of molecules"
    When using mol_idx, the charge tensor must have shape (num_molecules,) with one charge value per molecule. A scalar charge only works for single-molecule calculations.
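Both conditions can be checked before calling the calculator. The sketch below is one possible way to do it: `check_batch` and `sort_by_molecule` are hypothetical helpers (not AIMNet2 functions), and it assumes molecule indices run from 0 to num_molecules - 1 without gaps.

```python
import torch

def check_batch(mol_idx, charge):
    """Raise ValueError if mol_idx is unsorted or charge has the wrong shape.

    Hypothetical helper; assumes indices are 0..num_molecules-1 with no gaps.
    """
    if mol_idx.numel() > 1 and bool((mol_idx[1:] < mol_idx[:-1]).any()):
        raise ValueError("mol_idx must be sorted in non-decreasing order")
    n_mol = int(mol_idx.max()) + 1
    if tuple(charge.shape) != (n_mol,):
        raise ValueError(f"charge must have shape ({n_mol},), got {tuple(charge.shape)}")

def sort_by_molecule(coords, numbers, mol_idx):
    """Reorder atoms so each molecule is contiguous.

    A stable sort keeps the original atom order within each molecule.
    """
    sorted_idx, perm = torch.sort(mol_idx, stable=True)
    return coords[perm], numbers[perm], sorted_idx

# Two interleaved three-atom molecules (placeholder coordinates)
coords = torch.arange(18, dtype=torch.float32).reshape(6, 3)
numbers = torch.tensor([8, 8, 1, 1, 1, 1])
mol_idx = torch.tensor([0, 1, 0, 1, 0, 1])

coords_s, numbers_s, mol_idx_s = sort_by_molecule(coords, numbers, mol_idx)
check_batch(mol_idx_s, torch.zeros(2))  # passes after sorting
```

If atoms are reordered this way, per-atom outputs such as forces come back in the sorted order, so keep the permutation if the original order matters.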
Step 2: Processing a Multi-Frame XYZ File
A common workflow is reading conformers or trajectory frames from an XYZ file and processing them in batches. Here is how to do it with ASE for reading and the low-level calculator for batched inference:
import torch
from ase.io import read
from aimnet.calculators import AIMNet2Calculator
calc = AIMNet2Calculator("aimnet2")
# Read all frames from a multi-frame XYZ file
frames = read("conformers.xyz", index=":")
print(f"Loaded {len(frames)} conformers")
# All conformers of the same molecule have identical atom count
n_atoms = len(frames[0])
n_frames = len(frames)
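A multi-frame file is not guaranteed to contain a single molecule, so it can be worth checking before relying on frames[0]. The sketch below groups frames by their atomic-number sequence; `group_by_composition` is a hypothetical helper, and plain lists stand in for the per-frame atomic numbers (with ASE these could be collected as `[f.numbers.tolist() for f in frames]`).

```python
from collections import defaultdict

def group_by_composition(numbers_per_frame):
    """Map each distinct atomic-number sequence to the frame indices sharing it.

    Hypothetical helper; frames in the same group have the same atoms in the
    same order and can be stacked into one (B, N, 3) batch.
    """
    groups = defaultdict(list)
    for i, numbers in enumerate(numbers_per_frame):
        groups[tuple(numbers)].append(i)
    return dict(groups)

# Stand-in data: two water frames and one methane frame
numbers_per_frame = [[8, 1, 1], [6, 1, 1, 1, 1], [8, 1, 1]]
groups = group_by_composition(numbers_per_frame)
for key, idxs in groups.items():
    print(f"{len(key)} atoms: frames {idxs}")
```

A single group means every frame shares the same atoms and ordering; several groups mean the frames need either the flat format or one 3D batch per group.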
Same-Size Molecules (3D Batched Input)
When all molecules have the same number of atoms (e.g., conformers of one molecule), you can use the simpler 3D batched format:
# Stack into (B, N, 3) tensor
coords = torch.tensor([f.positions for f in frames], dtype=torch.float32)
numbers = torch.tensor(frames[0].numbers, dtype=torch.long) # Same for all conformers
numbers = numbers.unsqueeze(0).expand(n_frames, -1) # (B, N)
charge = torch.zeros(n_frames) # (B,)
result = calc(
{"coord": coords, "numbers": numbers, "charge": charge},
forces=True,
)
# result["energy"] has shape (B,)
# result["forces"] has shape (B, N, 3)
energies = result["energy"]
relative = energies - energies.min()
print(f"Relative energies (eV): {relative}")
!!! tip
    For conformers of the same molecule, the 3D batched format (B, N, 3) is more convenient than flat coordinates with mol_idx. The calculator automatically decides between dense mode (small molecules on GPU) and sparse mode (large molecules or CPU).
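The two input layouts describe the same data. As an illustration of how they correspond, this sketch reshapes a (B, N, 3) batch into the flat layout with a matching mol_idx. The tensors are random placeholders, and no claim is made here about whether the calculator treats both layouts identically internally.

```python
import torch

B, N = 4, 3  # 4 conformers of a 3-atom molecule (placeholder sizes)
coords_3d = torch.randn(B, N, 3)
numbers_2d = torch.tensor([8, 1, 1]).unsqueeze(0).expand(B, -1)

# Merge the batch dimension into the atom dimension
coords_flat = coords_3d.reshape(B * N, 3)        # (B*N, 3)
numbers_flat = numbers_2d.reshape(B * N)         # (B*N,)
mol_idx = torch.arange(B).repeat_interleave(N)   # conformer index per atom

print(coords_flat.shape, numbers_flat.shape, mol_idx.tolist())
```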
Different-Size Molecules (Flat Format with mol_idx)
When molecules have different atom counts, use the flat format:
# Example: processing a dataset with varied molecules
molecules = [
{"symbols": [8, 1, 1], "coords": [[0,0,0.12], [0,0.76,-0.47], [0,-0.76,-0.47]], "charge": 0.0},
{"symbols": [6, 1, 1, 1, 1], "coords": [[0,0,0], [.63,.63,.63], [.63,-.63,-.63], [-.63,.63,-.63], [-.63,-.63,.63]], "charge": 0.0},
]
all_coords = []
all_numbers = []
all_mol_idx = []
all_charges = []
for i, mol in enumerate(molecules):
    n = len(mol["symbols"])
    all_coords.extend(mol["coords"])
    all_numbers.extend(mol["symbols"])
    all_mol_idx.extend([i] * n)  # n copies of the molecule index keeps mol_idx sorted
    all_charges.append(mol["charge"])
result = calc({
"coord": torch.tensor(all_coords, dtype=torch.float32),
"numbers": torch.tensor(all_numbers, dtype=torch.long),
"mol_idx": torch.tensor(all_mol_idx, dtype=torch.long),
"charge": torch.tensor(all_charges, dtype=torch.float32),
}, forces=True)
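In the flat format the forces come back as one (N_total, 3) tensor. Because mol_idx is sorted, each molecule occupies a contiguous slice, so the tensor can be cut back into per-molecule pieces. A minimal sketch, using a dummy tensor in place of result["forces"]:

```python
import torch

mol_idx = torch.tensor([0, 0, 0, 1, 1, 1, 1, 1])
# Dummy stand-in for result["forces"], shape (N_total, 3)
forces = torch.arange(24, dtype=torch.float32).reshape(8, 3)

counts = torch.bincount(mol_idx)                        # atoms per molecule
per_mol_forces = torch.split(forces, counts.tolist())   # tuple of (n_i, 3) tensors

for i, f in enumerate(per_mol_forces):
    print(f"Molecule {i}: forces shape {tuple(f.shape)}")
```

This relies on the sorted-mol_idx requirement; with an unsorted index the slices would mix atoms from different molecules.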
Step 3: Chunked Processing for Large Datasets
For datasets with thousands of structures, processing everything at once may exceed GPU memory. Break the dataset into chunks:
import torch
from ase.io import read
from aimnet.calculators import AIMNet2Calculator
calc = AIMNet2Calculator("aimnet2")
frames = read("large_dataset.xyz", index=":")
n_atoms = len(frames[0])
batch_size = 64 # Adjust based on available GPU memory
all_energies = []
all_forces = []
for start in range(0, len(frames), batch_size):
    batch = frames[start:start + batch_size]
    B = len(batch)  # the last batch may be smaller than batch_size
    coords = torch.tensor([f.positions for f in batch], dtype=torch.float32)
    # Reuses the first frame's atomic numbers: assumes every frame has the same atoms
    numbers = torch.tensor(batch[0].numbers).unsqueeze(0).expand(B, -1)
    charge = torch.zeros(B)
    result = calc(
        {"coord": coords, "numbers": numbers, "charge": charge},
        forces=True,
    )
    # Move results to CPU immediately to free GPU memory
    all_energies.append(result["energy"].cpu())
    all_forces.append(result["forces"].cpu())
    # Free GPU cache periodically
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
energies = torch.cat(all_energies)
forces = torch.cat(all_forces)
print(f"Processed {len(energies)} structures")
!!! tip "Memory management"
    Calling .cpu() on result tensors copies them to host memory, so the GPU copies can be released when the next batch overwrites result. Combined with torch.cuda.empty_cache(), which returns PyTorch's cached but unused blocks to the driver, this prevents GPU memory from growing unboundedly across batches. This is especially important when processing thousands of structures.
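A fixed batch_size suits datasets where every structure has the same size. For mixed sizes, one option is to cap the total number of atoms per chunk instead. The sketch below assumes memory use grows roughly with total atom count, which is a simplification; `chunk_by_atom_budget` is a hypothetical helper, not an AIMNet2 function.

```python
def chunk_by_atom_budget(atom_counts, max_atoms):
    """Greedily group consecutive structures so each chunk has at most max_atoms atoms.

    A single structure larger than max_atoms gets a chunk of its own.
    Returns a list of (start, stop) index ranges into the dataset.
    """
    chunks = []
    start, total = 0, 0
    for i, n in enumerate(atom_counts):
        # Close the current chunk if adding structure i would exceed the budget
        if i > start and total + n > max_atoms:
            chunks.append((start, i))
            start, total = i, 0
        total += n
    if start < len(atom_counts):
        chunks.append((start, len(atom_counts)))
    return chunks

sizes = [3, 5, 4, 12, 2, 2]
chunks = chunk_by_atom_budget(sizes, max_atoms=10)
print(chunks)
```

Each (start, stop) range can then be assembled into one flat-format call with its own mol_idx and charge tensors.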
Step 4: Worked Example -- Conformer Ranking
Here is a complete example that reads conformers, computes energies in batches, and ranks them by relative energy:
import torch
from ase.io import read
from aimnet.calculators import AIMNet2Calculator
# --- 1. Load conformers ---
frames = read("taxol_conformers.xyz", index=":")
n_conformers = len(frames)
n_atoms = len(frames[0])
print(f"Loaded {n_conformers} conformers of {n_atoms}-atom molecule")
# --- 2. Set up calculator ---
calc = AIMNet2Calculator("aimnet2")
# --- 3. Batch-process conformers ---
batch_size = 32
all_energies = []
for start in range(0, n_conformers, batch_size):
    batch = frames[start:start + batch_size]
    B = len(batch)
    coords = torch.tensor([f.positions for f in batch], dtype=torch.float32)
    numbers = torch.tensor(batch[0].numbers).unsqueeze(0).expand(B, -1)
    charge = torch.zeros(B)
    # Energies only: forces are not needed for ranking
    result = calc({"coord": coords, "numbers": numbers, "charge": charge})
    all_energies.append(result["energy"].cpu())
energies = torch.cat(all_energies)
# --- 4. Rank conformers ---
relative = energies - energies.min()
# Convert to kcal/mol for chemical interpretation
EV_TO_KCAL = 23.0605  # 1 eV = 96.4853 kJ/mol = 23.0605 kcal/mol
relative_kcal = relative * EV_TO_KCAL
# Sort by energy
order = relative_kcal.argsort()
print("\nConformer ranking:")
print(f"{'Rank':<6} {'Index':<8} {'Rel. Energy (kcal/mol)':<24}")
for rank, idx in enumerate(order[:10]):
    print(f"{rank+1:<6} {idx.item():<8} {relative_kcal[idx].item():<24.2f}")
# --- 5. Boltzmann populations at 298 K ---
kT = 0.001987 * 298 # kcal/mol
weights = torch.exp(-relative_kcal / kT)
populations = weights / weights.sum()
print(f"\nTop conformer population: {populations[order[0]].item():.1%}")
print(f"Conformers within 2 kcal/mol: {(relative_kcal < 2.0).sum().item()}")
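The population step is a Boltzmann weighting: each conformer gets weight exp(-ΔE / kT), normalized by the sum of all weights. To make the arithmetic concrete, here is a small standalone sketch with made-up relative energies (not AIMNet2 output), using only the standard library:

```python
import math

R_KCAL = 0.001987   # gas constant in kcal/(mol*K)
T = 298.0           # temperature in K
kT = R_KCAL * T     # about 0.592 kcal/mol

# Made-up relative energies in kcal/mol, for illustration only
rel_kcal = [0.0, 0.5, 2.0]

weights = [math.exp(-e / kT) for e in rel_kcal]
Z = sum(weights)                       # partition function over these conformers
populations = [w / Z for w in weights]

for e, p in zip(rel_kcal, populations):
    print(f"dE = {e:.1f} kcal/mol -> population {p:.1%}")
```

With these inputs the populations come out at roughly 68%, 29%, and 2%, which shows how quickly conformers a couple of kcal/mol above the minimum stop contributing at room temperature.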
Step 5: Processing Mixed Datasets with Different Charges
When your dataset contains molecules with different charges, build mol_idx and charge together so that entry i of charge is the net charge of the molecule labeled i in mol_idx:
import torch
from aimnet.calculators import AIMNet2Calculator
calc = AIMNet2Calculator("aimnet2")
# Dataset: neutral water, ammonium cation, hydroxide anion
dataset = [
{"z": [8, 1, 1], "pos": [[0,0,0.12], [0,0.76,-0.47], [0,-0.76,-0.47]], "q": 0.0},
{"z": [7, 1, 1, 1, 1], "pos": [[0,0,0], [.59,.59,.59], [.59,-.59,-.59], [-.59,.59,-.59], [-.59,-.59,.59]], "q": 1.0},
{"z": [8, 1], "pos": [[0,0,0], [0.96,0,0]], "q": -1.0},
]
coords, numbers, mol_idx, charges = [], [], [], []
for i, mol in enumerate(dataset):
    n = len(mol["z"])
    coords.extend(mol["pos"])
    numbers.extend(mol["z"])
    mol_idx.extend([i] * n)
    charges.append(mol["q"])  # charges[i] belongs to the molecule labeled i
result = calc({
"coord": torch.tensor(coords, dtype=torch.float32),
"numbers": torch.tensor(numbers, dtype=torch.long),
"mol_idx": torch.tensor(mol_idx, dtype=torch.long),
"charge": torch.tensor(charges, dtype=torch.float32),
}, forces=True)
# Per-molecule energies
for i, mol in enumerate(dataset):
    print(f"Molecule {i} (charge={mol['q']:+.0f}): {result['energy'][i].item():.4f} eV")
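A mistyped charge is easy to miss in a mixed dataset. One cheap sanity check, sketched below, is the electron count: a closed-shell species has an even number of electrons, so an odd count usually points to a wrong charge. This is only a heuristic (radicals legitimately have odd counts), and `electron_count` is a hypothetical helper.

```python
def electron_count(atomic_numbers, charge):
    """Total electrons = sum of atomic numbers minus the net charge."""
    return sum(atomic_numbers) - int(charge)

# Same species as in the dataset above, coordinates omitted
dataset = [
    {"z": [8, 1, 1], "q": 0.0},         # water
    {"z": [7, 1, 1, 1, 1], "q": 1.0},   # ammonium
    {"z": [8, 1], "q": -1.0},           # hydroxide
]

odd = [i for i, mol in enumerate(dataset) if electron_count(mol["z"], mol["q"]) % 2]
if odd:
    print(f"Check charges for molecules {odd}: odd electron count")
else:
    print("All molecules have an even electron count")
```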
Common Pitfalls
Unsorted mol_idx
```python
# WRONG -- mol_idx is not sorted
mol_idx = torch.tensor([0, 1, 0, 1])
# CORRECT -- atoms grouped by molecule
mol_idx = torch.tensor([0, 0, 1, 1])
```
Scalar charge with batched input
```python
# WRONG -- scalar charge with multiple molecules
charge = 0.0
# CORRECT -- one charge per molecule
charge = torch.tensor([0.0, 0.0, 0.0])
```
Hessian is single-molecule only
Hessian calculation (hessian=True) is not supported for batched inputs. If you need Hessians, compute them one molecule at a time. Hessian computation also requires O(N^2) memory, so it is impractical for large molecules.
What's Next
- Performance Tuning -- Optimize batch size, compilation, and GPU memory
- Geometry Optimization -- Optimize structures before ranking
- Conformer Search -- Full conformational sampling workflow