E4S AI & Machine Learning Guide

Construct your prompt from the instructions below then use the E4S Guide Bot

Introduction

Artificial Intelligence (AI) and Machine Learning (ML) are rapidly evolving areas that increasingly intersect with high-performance computing (HPC). Within the E4S ecosystem, AI/ML tools are selected and supported to provide scalable, portable, and sustainable foundations for scientific discovery. These tools range from industry-standard frameworks such as TensorFlow and PyTorch, to specialized scientific and workflow-oriented environments like DeepHyper, LBANN, and SmartSim.

Selecting the right AI/ML library or tool depends on understanding both the characteristics of your problem and the environment in which you will develop and run it. E4S provides a curated set of interoperable, performance-tuned AI/ML products, making it easier for researchers to combine familiar AI workflows with HPC architectures.

Example: I am training a deep neural network to emulate a climate simulation model.
My data are multi-dimensional arrays stored in HDF5 format, generated from HPC simulations.
The training will run on a large GPU-based supercomputer that uses NVIDIA A100 devices and MPI for distributed communication.
I want to use mixed-precision training for better performance but need high numerical accuracy during validation.
The model should be exportable to ONNX for inference on different systems.
I also need to perform hyperparameter optimization across hundreds of nodes using the batch scheduler.
Please suggest which AI/ML libraries or tools in E4S are best suited for this task, and explain why.

The following tables outline attributes that can help a newcomer — or an automated assistant — reason about which AI/ML tools best fit a given use case. These attributes are divided into broadly meaningful attributes and those specific to certain situations.

Broadly Meaningful Attributes

Attribute	Description
Primary goal	The main purpose of the AI/ML task, such as training, inference, surrogate modeling, or reinforcement learning.
Data modality	The type of data used, such as image, text, tabular, time series, graph, or multi-modal combinations.
Computational scale	The size and complexity of the workload, ranging from single-node prototyping to large-scale distributed training across supercomputers.
Hardware targets	The intended hardware platform(s), such as CPU, NVIDIA GPU, AMD GPU, Intel GPU, or other accelerators.
Precision requirements	The numeric precision(s) used during training or inference (e.g., FP64, FP32, BF16, FP8) and support for mixed or adaptive precision.
Framework interoperability	Compatibility with major frameworks such as PyTorch, TensorFlow, JAX, or ONNX.
HPC integration	Availability of MPI, NCCL, RCCL, oneCCL, or other communication libraries for distributed computation.
Portability	The ability to run effectively on different architectures and compilers through abstractions like Kokkos or SYCL.
Licensing and support model	Type of license (e.g., open-source, permissive, copyleft) and level of community or vendor support.
Maturity and adoption	Stability, user base, and long-term support within the E4S or broader scientific community.
Ease of use	The learning curve and availability of documentation, examples, and APIs.
Extensibility	The ability to integrate custom operators, solvers, or domain-specific modules.
Workflow integration	Compatibility with workflow tools (e.g., SmartSim, DeepHyper, or MLFlow) and data pipelines in HPC environments.

Situation-Specific Attributes

For Training Deep Neural Networks

Attribute	Description
Parallelism model	Supported training parallelism types: data, model, pipeline, or hybrid.
Gradient synchronization	Methods used for distributed optimization (e.g., AllReduce, parameter server, decentralized).
Checkpointing	Capabilities for saving and restoring training state efficiently at scale.
Data loading	Support for streaming or parallel I/O with HPC file systems.
Mixed precision optimization	Automatic handling of reduced-precision arithmetic for speed and memory efficiency.

For Inference and Deployment

Attribute	Description
Latency sensitivity	Acceptable inference delay (e.g., real-time, batch, or offline processing).
Model format	Supported model export and import standards (e.g., ONNX, SavedModel, TorchScript).
Accelerator compatibility	Ability to deploy on specialized inference hardware (e.g., TensorRT, Habana, Intel Gaudi).
Scaling method	Mechanism for parallel inference, replication, or sharding across compute nodes.
Resource management	Integration with schedulers and container runtimes such as Slurm, Kubernetes, or Singularity.

For Scientific Surrogate Modeling or Emulation

Attribute	Description
Physics-informed capability	Ability to incorporate physical constraints or governing equations (e.g., PINNs).
Uncertainty quantification	Support for probabilistic modeling or Bayesian inference.
Integration with simulation data	Native support for HDF5, ADIOS2, or custom data formats common in HPC.
Surrogate training scalability	Ability to scale to large training datasets from simulation output.
Coupling with simulation codes	APIs for embedding inference directly within simulation workflows.

For Hyperparameter Optimization and Workflow Automation

Attribute	Description
Search strategies	Types of hyperparameter search supported (e.g., random, Bayesian, evolutionary).
Scheduler awareness	Integration with HPC schedulers for parallel job launches.
Experiment tracking	Built-in tools for tracking experiments, configurations, and results.
Automation framework	Compatibility with tools like DeepHyper, Ray Tune, or MLFlow.
Reproducibility	Mechanisms to ensure deterministic experiments and versioned configurations.

For Edge or Hybrid HPC-AI Environments

Attribute	Description
Resource heterogeneity	Support for distributed execution across mixed CPU-GPU or edge-cloud systems.
Model compression	Ability to quantize or prune models for lightweight deployment.
Data streaming	Support for continuous data ingestion and inference pipelines.
Connectivity requirements	Handling of intermittent network connections or federated learning setups.
Security and privacy	Support for encrypted models, federated updates, or privacy-preserving training.