IPDRM 2023

Sixth Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware

Sunday, November 12th, 2023

Denver Convention Center, Room 505

Denver, Colorado, USA.

Held in conjunction with the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC23), November 12-17, 2023, Denver, Colorado, USA.

Submission deadline: August 11th, 2023 (AOE)

Overview

The role of runtime systems and middleware has evolved over the past several years as we enter the exascale era. For leadership-class machines, advanced runtime technology not only plays an important role in task scheduling and management but has also gained prominence in providing consistent memory across accelerator architectures, intelligent network routing, and performance portability, among other capabilities. With diminishing returns from hardware fabrication technology, clusters are beginning to include more specialized accelerators such as FPGAs, CGRAs, and custom ASICs. For popular domains such as machine learning, we have observed the return of the stand-alone appliance, i.e., highly specialized, co-designed software/hardware products sold as black-box units that efficiently solve a popular (although narrow) problem space. These platforms highlight middleware challenges such as task and data management while adding new opportunities to exploit application-specific engines. Further, advances in fields such as AI/ML provide new and exciting opportunities for guiding and exploiting the hardware/software substrate. This workshop aims to attract the international research community to share new and bold ideas that address the challenges of designing, implementing, and evaluating future runtime systems and middleware.

This year, we will place special emphasis on middleware that connects AI/ML to HPC systems. Supporting complex AI workflows that leverage heterogeneous compute and memory resources raises several challenges. For example, the interaction between Python-based distributed deep learning frameworks and the underlying hardware poses many open questions, with implications for memory and communication management.

Topics

This workshop will emphasize novel, disruptive research ideas over incremental advances. We solicit papers on topics including, but not limited to, the following areas:

Runtime System/Middleware Techniques, Design, and Evaluation

  • Runtime/Middleware for exascale/large-scale computing
  • Runtime/Middleware for accelerators or appliances
  • Network and I/O middleware technology
  • Modeling and performance analysis of runtime systems
  • Interactions between runtime and middleware
  • Runtime-architecture co-design
  • Tuning and optimization studies
  • Workflow/application-centric challenges and solutions for runtime systems

Constraints and Issues for Runtime Systems and Middleware

  • Energy- and power-aware schemes
  • Fault tolerance and reliability
  • Heterogeneous resource management
  • Data movement
  • Memory models
  • Scalability

Design Principles and Programming Support

  • High-level programming models (e.g., thread- and task-based models, data-parallel models, and stream programming) and domain-specific languages
  • Programming frameworks, parallel programming, and design methodologies
  • Methodologies and tools for runtime and middleware design, implementation, verification, and evaluation
  • Wild and crazy ideas on future runtime systems and middleware

Submissions

  • Extended Paper Submission: August 11th, 2023
  • Paper Notification: September 7th, 2023
  • Final Paper Due: September 27th, 2023

Submission Guidelines:

Full submissions may be up to 8 pages long, using the same format as the SC23 conference (i.e., the ACM conference template). This limit includes all materials (figures, bibliography, appendices, etc.). All submitted papers will undergo a rigorous review process, and each will receive at least three reviews from members of the program committee. Papers will be accepted based on their technical contributions. "Crazy and wild ideas" are welcome. Accepted papers will be given quick lightning presentations on the workshop day to spark conversation and discussion. Papers can be submitted at the SC Submission site.

Organizing Committees

General Chairs

  • Barbara Chapman, HPE, USA
  • Shirley Moore, University of Texas at El Paso, USA
  • Eun Jun Park, Qualcomm, USA
  • Joseph Manzano, Pacific Northwest National Laboratory, USA

Program Chairs

  • Joshua Suetterlein, Pacific Northwest National Laboratory, USA

Publicity Chair

  • Jose Monsalve Diaz, University of Delaware, Argonne National Laboratory, USA

Publication Chair

  • Oceane Bel, Pacific Northwest National Laboratory, USA

Diversity Chair

  • Cimone Wright-Hamor, Pacific Northwest National Laboratory, USA

Program Committee

  • Kevin J. Barker, Pacific Northwest National Laboratory, USA
  • Mehmet E Belviranli, Colorado School of Mines, USA
  • Nicolas Bohm Agostini, Pacific Northwest National Laboratory, Northeastern University, USA
  • Vincent Cave, Intel Corporation, USA
  • Serena Curzel, Polytechnic University of Milan, Italy
  • Ivy Peng, KTH Royal Institute of Technology, Sweden
  • Bin Ren, College of William & Mary, USA
  • Omer Subasi, Pacific Northwest National Laboratory, USA
  • Shubbhi Taneja, Worcester Polytechnic Institute, USA
  • Li Tang, Los Alamos National Laboratory, USA
  • Zhijia Zhao, University of California, Riverside, USA
  • Christopher Zimmer, Oak Ridge National Laboratory, USA
  • Stephane Zuckerman, CY Cergy Paris University; Laboratoire ETIS, France

Distinguished Speaker

We are proud to announce that Dr. Shuaiwen Leon Song will be our distinguished speaker this year.

Bio

Shuaiwen Leon Song is a senior principal scientist and manager at Microsoft. He leads the DeepSpeed4Science initiative, which creates broad engagement between Microsoft, Microsoft Research, DoE labs, academia, and industry partners to enable sophisticated system technology research and development supporting training and inference for large-scale AI-driven scientific models. At DeepSpeed, he also drives or co-drives several pathfinding projects and releases (e.g., ZeRO inference, scalable dialogue system design, and DeepSpeed Chat) and co-manages the Brainwave team. Prior to Microsoft, he was the SOAR associate professor at the University of Sydney and an adjunct professor at the University of Washington. His past work in HPC has received several best paper nominations and has been featured in U.S. DoE research highlights and other media outlets. He has received several awards, including the IEEE early-career award for HPC, the IEEE mid-career award for scalable computing, a Facebook faculty award, a Google Brain faculty award, the Australian Most Innovative Engineer award, and an AIR global faculty award. He is also an ACM distinguished speaker.

Title

DeepSpeed4Science: Enabling Future Large-Scale Scientific Discovery through Sophisticated AI System Technologies

Invited Talk (Memory and System Software)

As part of our invited talks on the rise of disaggregated memory, we are proud to welcome Patrick Estep from Micron to share his perspective.

Bio

Patrick Estep is an SMTS in the Scalable Memory Systems Pathfinding Group (SMS) at Micron Technology. Mr. Estep received a Master of Science degree in Computer Science from Southern Methodist University, Dallas, TX. He is a member of the Association for Computing Machinery (ACM) and the inventor of US patents 10,042,682, 11,720,475, 11,740,800, 11,790,790, and 11,802,957.

Title

HPC Software Scaling for ML Using CXL 3.0 GFAM

Abstract

Traditional HPC systems rely on balanced soft scaling, which adjusts the compute-to-memory ratio according to the workload. However, this approach is challenged by Machine Learning applications, especially Large Language Model (LLM) workloads, which demand much more memory than compute. This leads to wasted compute resources and excessive data movement in the system. To address this issue, we propose to use CXL 3.0 Global Fabric Attached Memory (GFAM), which enables independent scaling of compute and memory and reduces data movement. In this talk, we will explore how GFAM architectures require changes in memory and compute placement, as well as software stacks, to optimize performance for LLM workloads.

Invited Talk (Machine Learning)

We are also proud to announce an invited talk on machine learning by Dr. Dong Li of the University of California, Merced.

Bio

Dong Li is an associate professor in EECS at the University of California, Merced. Previously, he was a research scientist at Oak Ridge National Laboratory (ORNL). Dong earned his PhD in computer science from Virginia Tech. His research focuses on high performance computing (HPC) and maintains strong relevance to computer systems. The core theme of his research is how to enable scalable and efficient execution of enterprise and scientific applications (including large-scale AI models) on increasingly complex parallel systems. Dong received an ORNL/CSMD Distinguished Contributor Award in 2013, a CAREER Award from the National Science Foundation in 2016, a Berkeley Lab University Faculty Fellowship in 2016, a Facebook research award in 2021, and an Oracle research award in 2022. His SC14 paper was nominated for the best student paper award, and his ASPLOS '21 paper won the distinguished artifact award. He was also the lead PI for the NVIDIA CUDA Research Center at UC Merced, and he is an associate editor for IEEE Transactions on Parallel and Distributed Systems (TPDS).

Title

Enabling Large Dynamic Neural Network Training with Learning-Based Runtime Memory Management

Abstract

Dynamic neural networks (DyNNs) enable high computational efficiency and strong representation capability. However, training a DyNN can run into memory capacity problems because of increasing model sizes and limited GPU memory. Managing tensors to save GPU memory is challenging because of the dynamic structure of DyNNs. We introduce DyNN-Offload, a memory-management runtime system for training DyNNs. DyNN-Offload uses a learned approach (a neural network called the pilot model) to increase the predictability of tensor accesses and facilitate memory management. The key to DyNN-Offload is enabling fast inference of the pilot model, keeping its performance overhead low while providing high prediction accuracy. DyNN-Offload reduces the input feature space and model complexity of the pilot model based on a new representation of DyNNs. DyNN-Offload enables 8x larger DyNN training on a single GPU compared with using PyTorch alone (unprecedented with any existing solution). Evaluating with AlphaFold (a production-level, large-scale DyNN), we show that DyNN-Offload outperforms unified virtual memory (UVM) and dynamic tensor rematerialization (DTR), the most advanced solutions for saving GPU memory for DyNNs, by 3x and 2.1x respectively in terms of maximum batch size.
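
For readers unfamiliar with prediction-guided offloading, the toy Python/PyTorch sketch below illustrates the general pattern the abstract describes: a predictor estimates when each tensor will next be used, and tensors not needed soon are evicted to host memory. The function names, the stand-in predictor, and the byte-budget policy are hypothetical illustrations only, not the DyNN-Offload implementation.

```python
# Toy illustration of prediction-guided tensor offloading (hypothetical sketch;
# not the DyNN-Offload implementation). A "pilot"-style predictor estimates how
# soon each tensor will be needed; tensors predicted to be used soonest stay on
# the GPU within a byte budget, and the rest are evicted to host memory.
import torch

def offload_step(tensors, predict_steps_until_use, gpu_budget_bytes):
    """Keep the soonest-needed tensors on the GPU within a byte budget."""
    # Sort tensor names by predicted steps until their next use (soonest first).
    order = sorted(tensors, key=predict_steps_until_use)
    used = 0
    for name in order:
        t = tensors[name]
        nbytes = t.element_size() * t.nelement()
        if used + nbytes <= gpu_budget_bytes:
            tensors[name] = t.to("cuda", non_blocking=True)  # keep/prefetch on GPU
            used += nbytes
        else:
            tensors[name] = t.to("cpu", non_blocking=True)   # evict to host memory
    return tensors

# Example usage with a trivial stand-in for the learned predictor.
if __name__ == "__main__" and torch.cuda.is_available():
    tensors = {f"act{i}": torch.randn(1024, 1024) for i in range(8)}
    next_use = {name: i for i, name in enumerate(tensors)}   # dummy schedule
    offload_step(tensors, lambda n: next_use[n], gpu_budget_bytes=16 * 2**20)
```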

Program

Time | Paper / Session Title | Authors
14:00-14:02 | Welcome | Barbara Chapman (HPE) [PDF]
14:00-15:00 | Distinguished Speaker: DeepSpeed4Science: Enabling Future Large-Scale Scientific Discovery through Sophisticated AI System Technologies | Shuaiwen Leon Song, Microsoft
15:00-15:30 | Break |
15:30-16:00 | Invited Talk: HPC Software Scaling for ML Using CXL 3.0 GFAM | Patrick Estep, Micron [PDF]
16:00-16:20 | Dask-Extended External Tasks for HPC/ML In Transit Workflows | Amal Gueroudji (ANL), Julien Bigot (CEA), Bruno Raffin (INRIA), Robert Ross (ANL) [PDF]
16:20-16:50 | Invited Talk: Enabling Large Dynamic Neural Network Training with Learning-Based Runtime Memory Management | Dong Li, University of California Merced
16:50-17:10 | MPI-xCCL: A Portable MPI Library over Collective Communication Libraries for Various Accelerators | Chen-Chun Chen, Kawthar Shafie Khorassani, Pouya Kousha, Qinghua Zhou, Jinghan Yao, Hari Subramoni, Dhabaleswar K. Panda (OSU) [PDF]
17:10-17:29 | A gem5 Implementation of the Sequential Codelet Model: Reducing Overhead and Expanding the Software Memory Interface | Dawson Fox (UDEL), Jose Monsalve Diaz (ANL), Xiaoming Li (UDEL) [PDF]
17:29-17:30 | Closing | Oceane Bel (PNNL)