Parallel Computing and MPI with Bamboost
Bamboost is designed to scale seamlessly from interactive single-process scripts on a laptop to highly parallelized simulation workflows running on High-Performance Computing (HPC) clusters.
To achieve this, Bamboost integrates native Message Passing Interface (MPI) capabilities, allowing multiple processes to collaborate when managing, indexing, and writing scientific datasets in HDF5 format.
This guide provides a comprehensive overview of how parallel execution works in Bamboost, how to write MPI-safe code, and the underlying architectural patterns of the codebase.
MPI Auto-Detection & The Serial Mock
The Proxy Pattern: bamboost.mpi.MPI
To avoid unnecessary overhead and simplify local development, Bamboost uses a proxy pattern to abstract the underlying MPI engine.
Always import MPI from Bamboost instead of mpi4py directly:
from bamboost.mpi import MPI

Upon import, Bamboost dynamically detects the execution context and selects the appropriate backend:
- Real MPI: If MPI detection succeeds and mpi4py is installed, Bamboost loads the real mpi4py.MPI module and sets MPI.enabled = True.
- Serial Fallback (Mock MPI): If no MPI environment is detected, Bamboost falls back to its serial mock submodule (bamboost.mpi.serial) and sets MPI.enabled = False.
Activating MPI Mode
By default, config.options.mpi is False. You can enable MPI detection using one of
these three methods:
| Method | Configuration | Description |
|---|---|---|
| Environment Override (Highest Priority) | export BAMBOOST_MPI=1 | Forces MPI detection and initialization. Use =0 to explicitly disable. |
| Config File | [options] mpi = true in bamboost.toml | Enables MPI detection; Bamboost then checks for standard launcher environment variables. |
| Runtime Scripting | config.options.mpi = True | Sets the flag dynamically before the MPI module is used. |
When config.options.mpi is enabled, Bamboost checks for standard launcher environment
variables (e.g. OMPI_COMM_WORLD_SIZE, SLURM_PROCID, PMI_SIZE, I_MPI_RANK, etc.).
If any are present, MPI is initialized. If none are found, Bamboost logs a warning and
falls back to the serial mock.
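The decision described above can be sketched in plain Python. This is only an illustration of the logic, not Bamboost's code: the helper names `detect_launcher` and `select_backend` are hypothetical, and the variable list contains only the four launcher variables named in the text (the real list may be longer).

```python
import os

# Launcher variables named in the text above; the real list may be longer.
LAUNCHER_VARS = ("OMPI_COMM_WORLD_SIZE", "SLURM_PROCID", "PMI_SIZE", "I_MPI_RANK")


def detect_launcher(env=None):
    """Return the launcher variables that are set in `env` (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in LAUNCHER_VARS if name in env]


def select_backend(mpi_option, env=None):
    """Pick "mpi" or "serial" following the decision described above."""
    if not mpi_option:
        return "serial"  # MPI detection is disabled by default
    if detect_launcher(env):
        return "mpi"
    # MPI was requested but no launcher is visible: warn and fall back
    print("warning: MPI requested but no launcher detected; using serial mock")
    return "serial"
```

For example, `select_backend(True, {"PMI_SIZE": "4"})` selects the MPI backend, while `select_backend(True, {})` prints the warning and selects the serial one.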
Parallel HDF5 requires that your h5py installation was compiled with parallel MPI
support. You can verify this with h5py.get_config().mpi. If you enable MPI in Bamboost
but h5py lacks MPI support, Bamboost raises a RuntimeError on startup to prevent
silent data corruption.
The Serial Mock Module (serial.py)
To ensure your code remains completely portable, bamboost.mpi.serial exposes the same
communicator interfaces as mpi4py.MPI:
- Communicators: COMM_WORLD and COMM_SELF are distinct SerialComm instances with rank = 0 and size = 1. They are kept as separate objects so that code paths which distinguish between the two (e.g. inside comm_self) behave correctly in serial mode.
- Identity Collective Operations: bcast(), scatter(), gather(), allreduce(), and reduce() act as identity functions (returning input unchanged). allgather() returns a single-element list containing the input. barrier() is a no-op.
- Reduction Operators: SUM and MAX are provided as plain functions compatible with the operator argument of allreduce/reduce.
- Fail-Fast COMM_NULL: COMM_NULL is a NullComm instance. Any attribute access on it raises a RuntimeError immediately, making programming errors visible during serial tests.
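To make the behaviour listed above concrete, here is a minimal stand-in written from that description alone. It is a sketch, not the contents of bamboost.mpi.serial: the class bodies are assumptions, and only a few of the collectives are shown.

```python
class SerialComm:
    """Single-process stand-in for an MPI communicator (illustrative only)."""

    rank = 0
    size = 1

    def bcast(self, obj, root=0):
        return obj  # identity: there is nobody else to send to

    def allreduce(self, obj, op=None):
        return obj  # reducing over one rank leaves the value unchanged

    def allgather(self, obj):
        return [obj]  # one entry per rank, and there is one rank

    def barrier(self):
        return None  # no-op


class NullComm:
    """Fails fast: any attribute access raises RuntimeError."""

    def __getattr__(self, name):
        raise RuntimeError(f"COMM_NULL has no usable attribute {name!r}")


COMM_WORLD = SerialComm()
COMM_SELF = SerialComm()  # a distinct object, as described above
COMM_NULL = NullComm()
```

Code written against this interface, such as `total = comm.allreduce(local_sum)`, runs unchanged whether `comm` is a mock or a real communicator.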
Synchronization in UIDs and Names
In parallel execution, generating randomized names or unique identifiers independently on different ranks leads to desynchronization. Bamboost coordinates these operations natively.
CollectionUID and SimulationName
Both classes are MPI-aware. When initialized with a communicator, they automatically coordinate across ranks by letting Rank 0 generate the value and broadcasting it:
# Inside CollectionUID.generate_uid / SimulationName.generate_name:
uid = uuid.uuid4().hex[:length]
if comm is not None:
    uid = comm.bcast(uid, root=0)

By delegating generation to Rank 0 and broadcasting the result, Bamboost guarantees that every process in the communicator operates on identical simulation folders, files, and metadata records.
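The effect of the broadcast can be reproduced without MPI. The sketch below uses a toy `FakeComm` (a hypothetical class, not part of Bamboost or mpi4py) in which the "ranks" run one after another in a single process and share a mailbox, with rank 0 running first. The function `generate_uid` mirrors the snippet above but is a standalone illustration.

```python
import uuid


class FakeComm:
    """Toy communicator: ranks run one after another and share a mailbox."""

    def __init__(self, rank, mailbox):
        self.rank = rank
        self._mailbox = mailbox

    def bcast(self, obj, root=0):
        if self.rank == root:
            self._mailbox["value"] = obj  # root publishes its value
        return self._mailbox["value"]  # every rank returns root's value


def generate_uid(length=10, comm=None):
    uid = uuid.uuid4().hex[:length]  # each rank draws its own candidate
    if comm is not None:
        uid = comm.bcast(uid, root=0)  # only rank 0's candidate survives
    return uid


mailbox = {}
uids = [generate_uid(comm=FakeComm(rank, mailbox)) for rank in range(4)]
```

All four entries of `uids` are identical, even though every simulated rank drew a different random candidate.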
Communicator Propagation & Descriptors
Managing MPI communicators across nested hierarchies (e.g. from Collections to
Simulations, and down to underlying HDF5 files and writers) is handled by a custom
Communicator descriptor and the ReuseComm marker class.
Custom Communicators
You can pass a custom or split sub-communicator when instantiating a Collection:
from bamboost import Collection
from bamboost.mpi import MPI
comm = MPI.COMM_WORLD
# Split the communicator to run concurrent simulation batches
color = comm.rank % 2
sub_comm = comm.Split(color, comm.rank)
# Pass the custom sub-communicator during Collection initialization
collection = Collection("/path/to/my/data", comm=sub_comm)

Auto-Propagation Mechanism
When a communicator is specified on a parent object, all nested objects inherit it
automatically through the ReuseComm marker. Child objects register themselves via
self._comm = ReuseComm(parent), and the Communicator descriptor traverses the
parent chain at access time to resolve the active communicator.
graph TD
Collection["Collection (Custom Comm)"] -->|ReuseComm| Index["Index Instance"]
Collection -->|ReuseComm| Simulation["Simulation Object"]
Simulation -->|ReuseComm| HDF5File["HDF5File Wrapper"]
Simulation -->|ReuseComm| XDMFWriter["XDMFWriter Instance"]
HDF5File -->|ReuseComm| Group["HDF5 Group / Dataset"]

The Communicator descriptor maintains two weak-reference lookup tables
(_child_to_parent_map and _instance_comms). When a child object accesses its _comm
attribute, Bamboost traverses the weak-reference parent chain to locate and reuse the
active communicator instance. If no parent in the chain has an explicitly set communicator,
MPI.COMM_WORLD is returned as the default.
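The lookup described above can be illustrated with a small descriptor. This is a simplified sketch under stated assumptions rather than Bamboost's implementation: communicators are represented by strings, `Node` is a hypothetical class, and only the keys of the two tables are weakly referenced.

```python
import weakref

DEFAULT_COMM = "COMM_WORLD"  # placeholder for MPI.COMM_WORLD


class ReuseComm:
    """Marker meaning: use whatever communicator `parent` resolves to."""

    def __init__(self, parent):
        self.parent = parent


class Communicator:
    """Descriptor that resolves `_comm` by walking up the parent chain."""

    def __init__(self):
        self._child_to_parent = weakref.WeakKeyDictionary()
        self._instance_comms = weakref.WeakKeyDictionary()

    def __set__(self, instance, value):
        if isinstance(value, ReuseComm):
            self._child_to_parent[instance] = value.parent
        else:
            self._instance_comms[instance] = value

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        node = instance
        while node is not None:
            if node in self._instance_comms:
                return self._instance_comms[node]  # explicit communicator found
            node = self._child_to_parent.get(node)  # otherwise climb one level
        return DEFAULT_COMM


class Node:
    _comm = Communicator()

    def __init__(self, parent=None, comm=None):
        if comm is not None:
            self._comm = comm
        elif parent is not None:
            self._comm = ReuseComm(parent)


collection = Node(comm="sub_comm")
sim = Node(parent=collection)
h5file = Node(parent=sim)
orphan = Node()
```

Here `h5file._comm` resolves to `"sub_comm"` through two levels of parents, and `orphan._comm` falls back to the default. Because resolution happens at access time, changing the communicator on `collection` later is immediately visible to its descendants.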
Writing Large Datasets in Parallel (MPIO)
For raw numerical datasets, Bamboost leverages HDF5's parallel MPIO driver
(driver="mpio"), enabling all MPI processes to write their chunks to the same file
concurrently.
These writes are collective when a real communicator is present, but the functions below are also safe to call in serial mode, because the mock communicator's collectives reduce to identity operations.
Unified Entry Point (write_distributed_array)
The primary method for parallel writing is write_distributed_array. It dispatches to
contiguous or scattered writing based on whether indices is provided:
# Contiguous write (indices=None, default)
group.write_distributed_array("positions", local_data)
# Scattered write (non-contiguous global positions)
group.write_distributed_array("dof_data", local_data, indices=global_indices)

Series fields
StepWriter.add_field(...) uses write_distributed_array under the hood, so
these parallel writing capabilities are also available when writing to series
fields.
add_field defaults to indices=None, so it calls the contiguous variant.
Pass indices to write through a local-to-global dof_map, for example in a finite element (FEM) context.
Keep in mind that non-contiguous writes are considerably slower than contiguous ones.
A. Contiguous Distributed Arrays (write_distributed_contiguous_array)
Use this when each rank holds a portion of a larger contiguous array. The ranks
collectively run allgather to determine the global shape and calculate
non-overlapping slice offsets automatically:
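As a plain-Python illustration of that offset calculation, assume each rank contributes its local row count and that an allgather has already collected the counts into a list ordered by rank. The helper name `contiguous_layout` is hypothetical, and the sketch shows only the arithmetic, not Bamboost's code.

```python
from itertools import accumulate


def contiguous_layout(local_counts):
    """Given the row counts gathered from all ranks, return (global_rows, offsets)."""
    offsets = [0] + list(accumulate(local_counts))[:-1]  # exclusive prefix sum
    return sum(local_counts), offsets


# Pretend allgather returned these row counts for ranks 0..3
counts = [50, 50, 30, 20]
global_rows, offsets = contiguous_layout(counts)

# Each rank writes its rows into its own non-overlapping slice
slices = [slice(start, start + n) for start, n in zip(offsets, counts)]
```

Rank 2, for instance, ends up with rows 100 to 129 of a 150-row dataset. With Bamboost itself none of this bookkeeping is written by hand, and the collective call looks like this: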
import numpy as np
from bamboost import Collection
from bamboost.mpi import MPI
comm = MPI.COMM_WORLD
collection = Collection("./data", comm=comm)
# Simulation creation is a collective action!
sim = collection.add("parallel_contiguous_sim", override=True)
# Generate local data chunk on each rank
local_size = 50
local_data = np.random.rand(local_size, 3)
# Write collectively inside edit context
with sim.edit() as writer:
    # All ranks MUST enter this block and call this collectively
    writer.root.write_distributed_contiguous_array("positions", local_data)

B. Scattered Distributed Arrays (write_distributed_scattered_array)
Use this when processes write to non-contiguous positions in a global array (e.g. finite
element Degree of Freedom maps). Bamboost sorts the local indices and permutes the
corresponding data on each rank to satisfy HDF5's strictly increasing index constraint.
The global dataset size is determined via allreduce(MPI.MAX) on the maximum index across
all processes:
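The sort-and-permute step and the size calculation can be sketched in pure Python before looking at the real call. The helpers `prepare_scattered` and `global_size` are hypothetical names, the two "ranks" are simulated in one process, and the global size is assumed here to be the largest index plus one.

```python
def prepare_scattered(indices, values):
    """Sort one rank's indices ascending and permute its values to match."""
    order = sorted(range(len(indices)), key=indices.__getitem__)
    return [indices[i] for i in order], [values[i] for i in order]


def global_size(max_index_per_rank):
    """Stand-in for allreduce(MAX): largest index across ranks, plus one (assumed)."""
    return max(max_index_per_rank) + 1


# Two pretend ranks writing interleaved positions of one global array
rank_data = {
    0: ([4, 0, 2], [40.0, 0.0, 20.0]),
    1: ([5, 1, 3], [50.0, 10.0, 30.0]),
}
prepared = {rank: prepare_scattered(idx, val) for rank, (idx, val) in rank_data.items()}
size = global_size([max(idx) for idx, _ in rank_data.values()])

# Assemble the global array the way the file would end up
global_array = [None] * size
for idx, val in prepared.values():
    for i, v in zip(idx, val):
        global_array[i] = v
```

After sorting, each rank's indices are strictly increasing and every value still lands at its intended global position. In Bamboost, the corresponding collective call is: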
with sim.edit() as writer:
    # Each rank specifies its target global index mapping and values
    global_indices = np.array([comm.rank, comm.rank + comm.size], dtype=np.int64)
    local_data = np.array([float(comm.rank), float(comm.rank * 10)], dtype=np.float64)
    # All ranks execute this together
    writer.root.write_distributed_scattered_array(
        "scattered_sensor_data",
        indices=global_indices,
        vector=local_data,
    )

Collective Metaprogramming (RootProcessMeta)
Certain operations, such as writing to the SQLite index database, updating structural
XML (XDMFWriter), or modifying HDF5 metadata attributes, must be executed by a single
process only, to avoid race conditions and file-lock contention.
Bamboost solves this with RootProcessMeta (bamboost.mpi.utilities), a metaclass
that automatically makes classes MPI-safe.
Automatic Method Wrapping
When a class uses RootProcessMeta, all its callable methods are automatically wrapped
during class creation. The only exceptions are:
- Methods in __exclude__ (__init__, __new__)
- staticmethod and classmethod definitions
- Methods explicitly decorated with @RootProcessMeta.exclude
Every other method is wrapped with @RootProcessMeta.bcast_result:
@staticmethod
def bcast_result(func):
    @wraps(func)
    def wrapper(self, *args, **kwargs):
        status = True
        result = None
        exc = None
        if self._comm.rank == 0:
            try:
                # Rank 0 temporarily switches comm to COMM_SELF to avoid nested deadlocks
                with comm_self(self):
                    result = func(self, *args, **kwargs)
            except Exception as e:
                status = False
                exc = e
        # Synchronize status, result, and exceptions collectively across all ranks
        broadcast_data = self._comm.bcast((status, result, exc), root=0)
        # If Rank 0 encountered an exception, raise it collectively on all processes
        if not broadcast_data[0]:
            raise broadcast_data[2]
        return broadcast_data[1]

    return wrapper

The @RootProcessMeta.exclude decorator marks a method to run independently on all
ranks without any MPI synchronization (useful for context managers or collective
operations that are already safe).
Key Benefits of this Pattern:
- Safety: Single-process writes are guaranteed because the actual logic runs solely on Rank 0.
- Crash Resilience: If Rank 0 encounters an exception, it is broadcast and raised collectively on all processes. Non-root ranks never hang indefinitely on subsequent collective barriers.
- Transparency: Non-root ranks automatically receive and return the serialized result of the method call as if they had executed it locally.
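To show how a metaclass can apply such a wrapper at class-creation time, here is a reduced, self-contained sketch. It is not Bamboost's RootProcessMeta: the names `RootOnlyMeta`, `exclude`, `Index`, and the marker attributes `_skip_wrap` and `_wrapped_by_meta` are hypothetical, the communicator is a one-rank mock, and the comm_self handling is left out.

```python
from functools import wraps


class SerialComm:
    rank = 0

    def bcast(self, obj, root=0):
        return obj


def exclude(func):
    func._skip_wrap = True  # hypothetical marker attribute
    return func


def bcast_result(func):
    @wraps(func)
    def wrapper(self, *args, **kwargs):
        status, result, exc = True, None, None
        if self._comm.rank == 0:  # only the root runs the real method
            try:
                result = func(self, *args, **kwargs)
            except Exception as e:
                status, exc = False, e
        status, result, exc = self._comm.bcast((status, result, exc), root=0)
        if not status:
            raise exc  # every rank raises the root's exception
        return result

    wrapper._wrapped_by_meta = True
    return wrapper


class RootOnlyMeta(type):
    """Simplified stand-in for RootProcessMeta (no comm_self handling)."""

    __exclude__ = {"__init__", "__new__"}

    def __new__(mcls, name, bases, namespace):
        for attr, value in list(namespace.items()):
            if attr in mcls.__exclude__ or getattr(value, "_skip_wrap", False):
                continue
            if callable(value) and not isinstance(value, (staticmethod, classmethod)):
                namespace[attr] = bcast_result(value)
        return super().__new__(mcls, name, bases, namespace)


class Index(metaclass=RootOnlyMeta):
    def __init__(self):
        self._comm = SerialComm()
        self.rows = []

    def insert(self, row):
        self.rows.append(row)
        return len(self.rows)

    @exclude
    def local_only(self):
        return "ran without synchronization"
```

In this sketch `Index.insert` is wrapped automatically, while `__init__` and `local_only` are left untouched, which matches the exclusion rules listed earlier.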
Deferred Attribute Writing (SingleProcessQueue)
Modifying HDF5 file metadata (like adding dataset attributes) in the middle of a parallel MPIO session can cause driver lockups.
Bamboost handles this with a hybrid execution queue:
- Inside write_distributed_contiguous_array (and the scattered variant), raw arrays are written concurrently using MPIO.
- Any metadata modifications (e.g. updating dataset attrs) are instead deferred via post_write_instruction(...) into a SingleProcessQueue. SingleProcessQueue itself uses RootProcessMeta, so all of its methods run only on Rank 0.
- When the outermost HDF5File context closes on all ranks, close() triggers single_process_queue.apply(). This opens the file in standard append mode (no MPIO driver) and flushes the queue sequentially, executing all deferred metadata writes cleanly.
If the file is not open under the MPIO driver (or is already closed), post_write_instruction
executes the instruction immediately rather than deferring it. This avoids unnecessary
queuing for serial or single-process usage.
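The defer-or-run-now behaviour can be illustrated with a toy queue. The classes `DeferredQueue` and `FakeFile` and the flag `mpio_open` are hypothetical stand-ins; the sketch mirrors only the control flow described above and performs no real HDF5 or MPI operations.

```python
class DeferredQueue:
    """Toy version of the deferral logic; not Bamboost's SingleProcessQueue."""

    def __init__(self):
        self._pending = []

    def add(self, instruction):
        self._pending.append(instruction)

    def apply(self):
        # Run queued instructions in insertion order, then leave the queue empty
        while self._pending:
            self._pending.pop(0)()


class FakeFile:
    """Stands in for a file handle; `mpio_open` mimics an open MPIO session."""

    def __init__(self):
        self.mpio_open = False
        self.attrs = {}
        self.queue = DeferredQueue()

    def post_write_instruction(self, instruction):
        if self.mpio_open:
            self.queue.add(instruction)  # defer until the parallel session ends
        else:
            instruction()  # serial case: run right away

    def close(self):
        self.mpio_open = False
        self.queue.apply()  # flush deferred metadata writes


f = FakeFile()
f.mpio_open = True
f.post_write_instruction(lambda: f.attrs.update(units="m"))
deferred_state = dict(f.attrs)  # still empty: the write was queued
f.close()
f.post_write_instruction(lambda: f.attrs.update(step=3))  # runs immediately
```

While the simulated MPIO session is open, the attribute write is held back; it appears only after `close()`, whereas the second instruction takes effect as soon as it is posted.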
Preventing Deadlocks with comm_self
A nested collective deadlock is a classic MPI hazard. It occurs when a method wrapped by
RootProcessMeta (running only on Rank 0) internally calls another wrapped method. Rank 0
blocks waiting for an inner bcast, while non-root ranks are blocked waiting for the
outer bcast.
Bamboost avoids this using the comm_self context manager:
from bamboost.mpi.utilities import comm_self

Inside a with comm_self(obj): block, the target object's active communicator is
temporarily swapped to MPI.COMM_SELF. Since COMM_SELF has only one rank, all inner
collective operations and broadcasts resolve instantly without waiting for other processes.
RootProcessMeta.bcast_result uses comm_self automatically when calling the wrapped
function on Rank 0, which is why nested RootProcessMeta calls do not deadlock.
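The swap-and-restore idea behind such a context manager can be sketched with the standard library. This is an assumption-laden illustration, not the source of comm_self: `swap_to_self` and `Sim` are hypothetical names, and communicators are represented by plain strings.

```python
from contextlib import contextmanager

COMM_WORLD = "COMM_WORLD"  # placeholders for real communicator objects
COMM_SELF = "COMM_SELF"


@contextmanager
def swap_to_self(obj):
    """Temporarily point obj._comm at COMM_SELF, restoring it afterwards."""
    previous = obj._comm
    obj._comm = COMM_SELF
    try:
        yield obj
    finally:
        obj._comm = previous  # restored even if the body raises


class Sim:
    def __init__(self):
        self._comm = COMM_WORLD


sim = Sim()
with swap_to_self(sim):
    inside = sim._comm  # collectives issued here would involve one rank only
after = sim._comm
```

The `finally` clause matters: if the root-only code raises, the object must not be left pointing at the single-rank communicator, otherwise later collective calls would silently stop involving the other ranks.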
When should you use comm_self?
If you are writing a custom routine that runs only on Rank 0 (for example, inside
an if rank == 0: block) and need to call Bamboost operations, you must wrap them
in a comm_self context to bypass the global communicator:
if comm.rank == 0:
    # Run some custom single-process analytical step
    with comm_self(sim):
        # Without comm_self this would deadlock: other ranks are not participating
        sim.data.root.add_dataset("single_process_metadata", [1.0, 2.0, 3.0])

comm_self correctly handles both serial mock communicators and real mpi4py
communicators: it swaps the mock COMM_SELF for serial runs and the real
mpi4py.MPI.COMM_SELF for genuine MPI runs, avoiding import overhead in serial mode.
Best Practices and Checklist
- Import MPI from bamboost.mpi: Never use from mpi4py import MPI directly inside your Bamboost workflows. This allows you to write code that is portable between serial and parallel execution without modification.
- Pass comm during Collection initialization: This ensures all nested objects and index instances inherit the correct communicator automatically.
- Collective Simulation Creation: Always create or add simulations collectively on all processes to ensure unique IDs and names remain synchronized:
    # Run collectively across all ranks!
    sim = collection.add("my_mpi_simulation", override=True)
- Dataset writes must be collective: Never place write_distributed_contiguous_array, write_distributed_scattered_array, or write_distributed_array inside rank-conditional blocks (e.g. if rank == 0:). These are collective operations and require participation from all ranks.
- Wrap Rank 0 conditional blocks: If you perform any non-collective Bamboost operations inside an if rank == 0: block, always wrap them in with comm_self(obj): to avoid deadlocks.
- Install parallel HDF5: Verify that h5py is compiled against your system's MPI implementation (h5py.get_config().mpi must be True).