bamboost
A python data framework for managing scientific simulation data.
Introduction
bamboost
is a python data framework designed for managing scientific simulation data.
It provides an organized model for storing, indexing, and retrieving
simulation results, making it easier to work with large-scale computational
studies.
In its core, it is a filesystem storage model, providing directories for
simulations, bundled in collections.
However, we enrich the experience with local sqlite databases.
bamboost
knows two entities; Collection and Simulation. Self-similar simulations are bundled in collections.
Principles
- Independence: Any dataset must be complete and understandable on it's own. You can copy or extract any of your data and distribute it without external dependencies.
- Path redundancy: Data must be referencable without knowledge of it's path. This serves several purposes: You can share your data easily ( supplementary material for papers), and renaming directories, moving files, switching computer, etc. will not break your data referencing.
This leads to the following requirements:
- Simulation parameters must be stored locally, inside the simulation directory. Crucially, not exclusively in a global database of any kind.
- Collections must have unique identifiers that are independent of its path.
- Simulations must have unique identifiers that are independent of its path.
Concept
We organize simulations in collections within structured directories. Let's consider the following directory:
This is a valid bamboost
collection at the path ./test_data
. It contains an
identifier file giving this collection a unique identifier. In this case, it is
ABCD1234
.
This file defines the unique ID of the collection.
It contains two entries; simulation_1
and simulation_2
.
As you can see, each simulation owns a directory inside a collection.
The directory names are simultaneously used as their name as well as their ID.
The unique identifier for a single simulation becomes the combination of the collection ID that it belongs to and the simulation ID.
That means, the full identifier of simulation_1
is ABCD1234:simulation_1
.
Each simulation contains a central HDF5 file named data.h5
. This file is used to store the parameters, as well as generated data.
The simulation API of bamboost
provides extensive functionality to store and retrieve data from this file. However, users are not limited to this file, or using python
in general.
The reason why simulations are directories instead of just a single HDF file is that you can dump any file that belongs to this simulation into its path. This can be output from 3rd party software (think LAMMPS), additional input files such as images, and also scripts to reproduce the generated data.
Indexing
Given the above described model, querying data is feasible but expensive.
If a user requires all simulations in collection ABCD1234
where param1 = 73
(just an example parameter), the following must happen:
- Search the file system for the identifier file
.bamboost-collection-ABCD1234
- Iterate through all subdirectories of the collection and gather the simulation's parameters from
data.h5
- Filter the simulations with
param1 = 73
To improve the experience, we cache collections, simulations, and their parameters in a global sqlite
database.
Crucially, this is an important but nonetheless only a convenience feature! Corruption, deletion or absence of the database has no consequences on the integrity of the data.
In fact, the cache will be automatically rebuilt accordingly. I mean, yep it's a cache.
Features
The core
functionality of bamboost
can be split into two parts:
- Structured file-based data model with a database-like experience
- A python hdf5 interface to easily store and retrieve data of simulations
Addtionally, we strive to offer the following:
- Manage simulation workflow, i.e. the creation and submission of simulations on HPC clusters
- A user interface (TUI) to view (and manage) all your data.
- Extensibility, e.g. software specific writers (e.g. FEniCS)
- ...
Installation
We recommended to use
uv
for python projects. Runuv add bamboost
to add the dependency to your project.
To install Bamboost, use pip
:
pip install bamboost
To install the latest development version from GitHub:
pip install git+git@cmbm-ethz.gitlab.com/bamboost
To install the latest version of the bamboost TUI
pip install git+git@github.com/zrlf/bamboost-tui
Dependencies
System dependencies
- HDF5 (optionally compiled with MPI)
- (optional) an MPI implementation
- SQLite (probably part of any os nowadays)
Python environment
python > 3.10
h5py
sqlalchemy
rich
- (optional)
mpi4py
andh5py
compiled against an mpi capable version of libhdf5
Next Steps
- Read the Getting Started guide.
- Learn about Managing Collections.
- Explore Simulation Data Handling.
- Understand HDF5 Data Storage.
- Use Command Line Tools.
- Enable MPI Parallel Computing.