E4S Programming Systems & Tools

The Extreme-scale Scientific Software Stack (E4S) provides a curated collection of interoperable, performance-portable software packages for building scalable, high-performance applications across diverse computing architectures. Within E4S, portable parallel programming systems play a critical role in achieving both productivity and performance on heterogeneous systems, including multicore CPUs, GPUs, and emerging accelerators from vendors such as NVIDIA, AMD, and Intel.

E4S includes a set of portable programming models, frameworks, and runtime systems that allow application developers to write code once and execute it efficiently across multiple platforms. These systems cover both intra-node (shared-memory and accelerator) parallelism and inter-node (distributed-memory) parallelism, while also drawing on the parallel features built into modern Fortran and C++ as implemented by LLVM-based compilers.

The Value of Portable Programming Layers

Portable programming layers abstract away hardware-specific details while maintaining high performance. This enables developers to target a wide range of architectures without rewriting large portions of their applications.
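
As a rough illustration of this idea, the sketch below separates what a loop computes from how the loop is executed. The names SerialPolicy, OpenMPPolicy, for_each_index, and axpy are hypothetical and are not the API of Kokkos, RAJA, or any other E4S package. Production layers also manage memory placement and GPU backends, which this sketch leaves out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical policy tags: each one names a way to execute a loop.
struct SerialPolicy {};
struct OpenMPPolicy {};

// Serial backend: call the body once per index, in order.
template <typename Body>
void for_each_index(SerialPolicy, std::size_t n, Body body) {
    for (std::size_t i = 0; i < n; ++i) body(i);
}

// OpenMP backend: the same loop, annotated for multithreading.
// If the compiler is not run with OpenMP enabled, the pragma is ignored
// and the loop runs serially with the same result.
template <typename Body>
void for_each_index(OpenMPPolicy, std::size_t n, Body body) {
    // A signed loop index keeps the loop valid for older OpenMP versions.
    const std::ptrdiff_t count = static_cast<std::ptrdiff_t>(n);
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < count; ++i) {
        body(static_cast<std::size_t>(i));
    }
}

// Application kernel written once against the policy parameter:
// y <- a * x + y. Each index writes a distinct element, so iterations
// are independent and safe to run in parallel.
template <typename Policy>
void axpy(Policy policy, double a, const std::vector<double>& x,
          std::vector<double>& y) {
    assert(x.size() == y.size());
    for_each_index(policy, x.size(),
                   [&x, &y, a](std::size_t i) { y[i] += a * x[i]; });
}
```

Calling axpy(SerialPolicy{}, ...) or axpy(OpenMPPolicy{}, ...) changes only the execution strategy; the kernel body stays the same, which is the property a portable layer is meant to provide.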

MPI: The Backbone of Distributed Scientific Computing

MPI (Message Passing Interface) remains the foundational model for distributed-memory programming. It provides a standardized API for exchanging messages among processes, whether they share a node or run on different nodes of a cluster. E4S supports MPICH, Open MPI, and vendor-tuned variants (e.g., Cray MPICH), providing broad portability and performance across DOE supercomputers. MPI’s stability, mature tooling, and ecosystem-wide interoperability make it indispensable for scalable applications.

Kokkos and the Rise of Performance Portability

Kokkos, originating from Sandia National Laboratories, provides a C++ programming model for writing performance-portable code targeting multiple backends such as CUDA, HIP, SYCL, and OpenMP. Kokkos is tightly integrated with Trilinos, PETSc, and other E4S math libraries, allowing applications to exploit GPUs and CPUs through a single, modern C++ abstraction layer. Its companion ecosystem includes Kokkos Kernels and Kokkos Tools, providing optimized linear algebra and performance instrumentation capabilities.

OpenMP, OpenACC, and LLVM Compiler Advancements

Directive-based parallel programming models such as OpenMP and OpenACC enable incremental parallelization of legacy codes: developers annotate existing loops with pragma directives instead of restructuring them. Recent advances in LLVM-based compilers, including Clang for C and C++ and Flang for Fortran, have strengthened support for GPU offloading, unified memory, and mixed-language interoperability. These tools complement E4S programming systems by lowering barriers to performance portability in both research and production settings.
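
A minimal sketch of what incremental, directive-based parallelization can look like for a dot product, using standard OpenMP directives. The function names dot_host and dot_offload are illustrative only. The code assumes nothing about the target machine: without OpenMP enabled the pragmas are ignored, and with OpenMP enabled but no offload device available the target region is expected to fall back to the host.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host version: a single directive turns the existing loop into a
// multithreaded reduction. The loop body itself is unchanged.
double dot_host(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(x.size());
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}

// Offload version: the same loop with a target directive. The map clauses
// describe which data is copied to the device and which comes back.
double dot_offload(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(x.size());
    const double* xp = x.data();  // raw pointers, so array sections can be mapped
    const double* yp = y.data();
    double sum = 0.0;
    #pragma omp target teams distribute parallel for reduction(+ : sum) \
        map(to : xp[0:n], yp[0:n]) map(tofrom : sum)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        sum += xp[i] * yp[i];
    }
    return sum;
}
```

Both functions keep the original serial loop intact, so a code base can adopt the directives one loop at a time. Note that a parallel reduction may add terms in a different order than the serial loop, so floating-point results can differ in the last bits for general inputs.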

Intra-Node Programming Systems

These systems focus on exploiting shared-memory and accelerator-level parallelism within a compute node. They provide APIs and abstractions for multi-core CPUs, GPUs, and other accelerators.

System	Description	Key Features	Supported Architectures
Kokkos	C++ performance-portable programming model	Multi-backend (CUDA, HIP, SYCL, OpenMP), hierarchical parallelism, memory spaces	NVIDIA, AMD, Intel GPUs; CPUs
RAJA	Loop-based C++ abstraction layer developed by LLNL	Portable loop execution and memory management	CPUs; NVIDIA and AMD GPUs (CUDA, HIP, OpenMP, and TBB backends)
OpenMP (LLVM/Clang)	Directive-based parallelism model integrated in LLVM	Tasking, SIMD, GPU offload support	CPUs and GPUs across vendors
OpenACC	Directive-based accelerator programming model	High-level GPU offload without deep hardware knowledge	NVIDIA GPUs, multicore CPUs
Legion	Task-based parallel runtime for data-centric applications	Logical regions, dynamic scheduling	Multicore and distributed hybrid systems
UPC++	Asynchronous PGAS (Partitioned Global Address Space) library for C++	Remote memory access, futures, distributed objects	Shared and distributed memory systems

Inter-Node Programming Systems

These systems enable scalable distributed-memory programming, coordinating computation and communication across multiple nodes.

System	Description	Key Features	Supported Architectures
MPI (MPICH, Open MPI)	Standard message-passing library for distributed-memory systems	Point-to-point, collective, one-sided, persistent communication	CPU/GPU clusters
Charm++	Message-driven, migratable objects for parallel applications	Adaptive load balancing, asynchronous execution	Clusters, exascale systems
HPX	Asynchronous C++ runtime system	Futures, active messages, fine-grained parallelism	Multicore and distributed nodes
PaRSEC	DAG-based runtime for distributed task scheduling	Dynamic dependency tracking	Heterogeneous clusters
Legion Realm Runtime	Distributed execution layer of Legion	Scalable data distribution and communication	Hybrid systems
GASNet-EX	Communication layer for PGAS languages (UPC++, Chapel)	High-performance low-level messaging	Clusters, leadership-class systems
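
Several of the runtimes in this table list futures and asynchronous execution as key features. The general pattern can be sketched with the C++ standard library alone: launch independent tasks, receive a future for each, and combine the results when they are ready. This uses std::async and std::future, not the HPX or UPC++ APIs, and the function name chunked_sum is hypothetical. It runs on a single node, whereas the runtimes above extend the same idea across nodes.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Sum a vector by splitting it into `chunks` contiguous pieces and
// summing each piece in its own asynchronous task.
double chunked_sum(const std::vector<double>& data, std::size_t chunks) {
    assert(chunks > 0);
    const std::size_t n = data.size();
    std::vector<std::future<double>> partials;
    partials.reserve(chunks);

    for (std::size_t c = 0; c < chunks; ++c) {
        // Chunk c covers the half-open index range [begin, end).
        const auto begin = static_cast<std::ptrdiff_t>(n * c / chunks);
        const auto end = static_cast<std::ptrdiff_t>(n * (c + 1) / chunks);
        // std::launch::async requests that the task start on its own thread;
        // the returned future is a handle to the eventual partial sum.
        partials.push_back(std::async(std::launch::async, [&data, begin, end] {
            return std::accumulate(data.begin() + begin, data.begin() + end, 0.0);
        }));
    }

    // get() blocks only until that particular task has finished.
    double total = 0.0;
    for (auto& partial : partials) total += partial.get();
    return total;
}
```

On many toolchains, code that uses std::async must be built with thread support enabled (for example, a -pthread style flag).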

Language-Supported Parallelism via LLVM Compilers

Modern Fortran and C++ compilers built on LLVM infrastructure (including Flang, Clang, and vendor derivatives) now provide direct language and directive-based access to parallelism, significantly narrowing the gap between compilers and specialized libraries.

Language / Compiler	Parallelism Model	E4S Relevance	Key Features
Fortran (Flang / LLVM Flang)	Coarrays, DO CONCURRENT, OpenMP	Interoperates with MPI and math libraries	Standard parallel features with GPU offloading
C++ (Clang / LLVM)	C++17/20 parallel STL, OpenMP, SYCL	Enables integration with Kokkos, RAJA, HPX	Unified parallel execution policies
SYCL (via DPC++, AdaptiveCpp (formerly hipSYCL))	Single-source heterogeneous programming model	Bridges HPC and AI kernels in E4S	Cross-vendor GPU support, C++ templates
LLVM/OpenACC	Directive-based offload	Complements Kokkos and OpenMP for incremental migration	Portable performance on accelerators

Summary

E4S programming systems enable developers to write portable, high-performance applications that run efficiently across diverse architectures. By integrating low-level communication systems like MPI, node-level abstractions like Kokkos and RAJA, and modern compiler technologies supporting OpenMP, OpenACC, and SYCL, E4S helps scientific applications evolve with the hardware and remain sustainable over the long term.

Together, these tools form the foundation of the DOE’s exascale software ecosystem, unifying productivity, portability, and performance across the world’s fastest computing platforms.