Authors: Ada Sedova (Oak Ridge National Laboratory (ORNL)), Venkatram Vishwanath (Argonne National Laboratory (ANL)), Wesley Brewer (Oak Ridge National Laboratory (ORNL)), Duncan Riach (NVIDIA Corporation), J. Austin Ellis (Advanced Micro Devices (AMD) Inc), Andrew Shao (Hewlett Packard Enterprise (HPE)), Riccardo Balin (Argonne National Laboratory (ANL)), Steve Poole (Los Alamos National Laboratory (LANL)), Nick Hengartner (Los Alamos National Laboratory (LANL)), Daniel O'Malley (Los Alamos National Laboratory (LANL)), Oscar Hernandez (Oak Ridge National Laboratory (ORNL))
Abstract: This Birds of a Feather session, “Two Worlds Collide: Forging Sustainable Coupled HPC Simulation/Deep Learning Applications from Hardware to Algorithm,” continues a series started in 2021 with a theme of discussing and brainstorming solutions for a new paradigm in HPC: the coupling of simulation with machine learning for state-of-the-art research. In this installment, we focus on sustainability and assurance for coupled simulation and deep learning. We discuss the current state and needs for enabling integration of HPC simulation with modern deep learning stacks to provide transformative scientific discoveries while delivering productivity, portability, and correctness for safety and mission critical applications.
Long Description: This is the second in an ongoing series of Birds of a Feather (BoF) sessions, “Two Worlds Collide.” The series highlights the current experiences, challenges, and future opportunities of several laboratories as they grapple with the need to leverage machine learning to advance state-of-the-art research. The goal of this installment, subtitled “Forging Sustainable Coupled HPC Simulation/Deep Learning Applications from Hardware to Algorithm,” is to help support a sustainable and assured integration between established HPC simulation and the rapidly developing deep learning (DL) ecosystem.
Exciting new possibilities in simulation have materialized from the combination of DL and HPC. A new type of programming environment has emerged: one that must support the seamless integration of simulation applications with deep learning frameworks using methods such as in-memory coupling and inference serving. However, the use of industry-developed DL frameworks and their incorporation into the HPC programming environment bring a slew of challenges for sustainability, productivity, and assurance. HPC and DL follow different standards and philosophies in their software development and sustainability cycles: a focus on specialization and performance versus rapid prototyping and mass deployment, and differing practices for correctness testing, build systems, optimization of critical workloads, and choice of programming languages. For the coupled HPC simulation/DL ecosystem, strategies for reproducibility, verification and validation (V&V), and productivity/portability within this new integrated environment have not been established. Addressing this is imperative to demonstrate readiness for scientific use. A critical question is: how can these two communities work together to develop sustainable, integrated programming environments that are trustworthy, vetted, and portable? Additional considerations relate to data, and how its importance shapes the software ecosystem: HPC simulation has traditionally depended less on input data for success, while choice of and access to the right data is essential for training high-accuracy models.
The laboratories and supercomputing centers represented in this BoF include Argonne National Laboratory (ALCF), Los Alamos National Laboratory, Lawrence Livermore National Laboratory, and Oak Ridge National Laboratory (OLCF). Industry will be represented by Hewlett Packard Enterprise (HPE), carrying forward the research on simulation and AI begun at Cray, as well as AMD and NVIDIA. Participants will demonstrate the work currently being conducted at their centers to set the context for further discussion.
Several outcomes are expected of this BoF: 1) we will develop an understanding of the current capabilities that researchers can leverage today, allowing participants new to the field to see what is possible with today’s technology; 2) we will enumerate the capabilities that participants feel would most advance current state-of-the-art research if they were available; 3) we will capture the delta between these first two exercises as a baseline for gauging progress over the next several years; and 4) we will summarize the challenges and pain points of the current ecosystem and develop action items and strategies to address them. This BoF seeks to cultivate a burgeoning community interested in the intersection of HPC simulation and the AI/ML stack.