ETD

Digital archive of theses defended at the University of Pisa

Thesis etd-06262024-173238


Thesis type
Master's thesis
Author
MARKIN, JONATHAN ATO
URN
etd-06262024-173238
Title
Understanding And Modelling The Impact Of AI Frameworks On The Message Passing Interface (MPI) and The Omni-Path Express (OPX)
Department
COMPUTER SCIENCE
Degree programme
COMPUTER SCIENCE
Supervisors
supervisor Prof. Dazzi, Patrizio
supervisor Prof. Bacciu, Davide
supervisor De Caro, Valerio
Keywords
  • Distributed Training
  • Fabric Interconnect
  • High-Performance Computing
  • HPC
  • Message Passing Interface
  • MPI
  • Omni-Path Express
Defense session start date
12/07/2024
Availability
Thesis not available for consultation
Abstract
With the rapid pace of technological advancement and the growing demand for highly scalable and efficient applications, distributed and high-performance computing have become essential for handling complex computation tasks. Distributing workloads across multiple compute units or nodes is a highly effective approach to meeting these demands.
Artificial Intelligence (AI) and its applications are now central to technology and daily life. AI researchers have adopted distributed training to accelerate model training across clusters of compute nodes. Although this approach is powerful and shortens training time, the communication it requires between nodes often introduces latency and remains a common concern.
Cornelis Networks, a leader in high-performance fabrics, has introduced Omni-Path Express, a high-speed interconnect designed for low latency and high message rates. Testing Omni-Path Express on AI model training using various frameworks is essential to assess its impact on distributed training, particularly using MPI (Message Passing Interface) as the communication backend.
This research focuses on AI model training and inference tasks on an HPC cluster that uses Omni-Path Express as the fabric interconnect and MPI for communication between nodes. It evaluates the impact of distributed training on such a cluster and analyzes its effects on Omni-Path Express and the MPI layer.