SC23 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Birds of a Feather

Advanced Architecture "Playgrounds" - Past Lessons and Future Accesses of Testbeds


Authors: Jeffrey Young (Georgia Tech Research Institute), Jens Domke (RIKEN Center for Computational Science (R-CCS), RIKEN), Oscar Hernandez (Oak Ridge National Laboratory (ORNL)), Filippo Spiga (NVIDIA Corporation), Ross Miller (Oak Ridge National Laboratory (ORNL)), Nick Brown (Edinburgh Parallel Computing Centre), Honggao Liu (Texas A&M University), Dhruva Chakravorty (Texas A&M University), Jeff Vetter (Oak Ridge National Laboratory), Murali Emani (Argonne National Laboratory), Stephen Poole (Los Alamos National Laboratory)

Abstract: Testbeds play a vital role in assessing the readiness of novel architectures for upcoming exascale and post-exascale supercomputers. These testbeds also act as co-design hubs, enabling the collection of application operational requirements while identifying critical gaps that must be addressed before an architecture becomes viable for HPC. Various research centers are actively deploying testbeds, and our aim is to build a community that facilitates the sharing of information, encouraging collaboration and understanding of the available evaluation resources. This BoF will facilitate the exchange of best practices, including testbed design, benchmarking, system evaluation, and availability.

Long Description: The supercomputing community is in the midst of a period of unprecedented architectural innovation. This explosion in architectural diversity creates a number of challenges, including understanding the potential performance impact of new architectural technologies on workloads of interest and deriving architectural design guidance from application and algorithm features.

To address these challenges, a variety of architectural testbed efforts have been established by leading HPC centres and national laboratories worldwide. Examples include CENATE (Pacific Northwest National Laboratory), HAAPS (Sandia National Laboratories), the Rogues Gallery (Georgia Tech), OLCF and ExCL (Oak Ridge National Laboratory), the ALCF AI testbed (Argonne National Laboratory), RIKEN's “virtual Fugaku” HPC cloud on AWS, and SmartNICs testbeds (Los Alamos National Laboratory). Foremost among the trends these testbeds address is architectural diversity in processors, memory, and networks, which has resulted from architectural designers grappling with increased demands on performance and energy efficiency across the processing, memory, storage, and interconnect space.

A novel topic for this BoF will be the evaluation of AI workloads and security practices. Generative AI and large language models (LLMs) have proven transformational in tackling real-world problems such as health analytics for vaccine candidate research and other critical health topics. The massive, fast-paced adoption of these tools has put extra pressure on HPC centers and national laboratories to understand both their hardware and software sides, which have similarities but also substantial differences compared to classic HPC workloads in computational science and engineering. Moreover, the explosion of AI accelerators poses performance portability challenges in terms of both the accuracy and the performance of AI models. Since deploying any large-scale infrastructure specialized for a specific set of workloads is a huge investment, testbeds represent a viable way to de-risk adoption ahead of time by understanding the hardware technology and tracking the evolution of the software ecosystem. The security of data processing on these devices is critical to preserving user privacy, but it is increasingly challenging because of the diverse system designs that exist among emerging architectures.

This BoF brings together researchers and practitioners involved in these programmes to share lessons learned from evaluating diverse architectures, testbed design principles, reproducible benchmarking methodologies, overall system evaluation, and experience with system bring-up and availability. The audience is encouraged to participate actively in the discussion, with topics ranging from application evaluation and programming-language suitability to software-stack maturity, resilience, and security.

Building on the success of our previous BoF sessions in 2019 (50 attendees) and 2021 (48 attendees), we aim to build a vibrant community and foster a collaborative environment to discuss current and future challenges. Attendees will not only hear lessons learned from a group of invited speakers, but will also learn how to gain access to our testbeds and be able to offer theirs in exchange.


Website: https://caatb.github.io/aatb-bofs/
