A Review on Fault Tolerance Techniques for High Performance Computing


Received: 24 April 2021
Revised: 26 May 2021
Accepted: 27 June 2021

Ahmad Fadaei Tehrani, Faramarz Safi

Abstract

Cloud computing represents the next generation of computing. By using a large number of virtual machines, it brings new capacity and flexibility to computationally intensive High Performance Computing (HPC) applications. Today’s HPC systems are typically managed and operated privately by individual organizations. A cloud-based Infrastructure-as-a-Service (IaaS) approach to HPC applications promises cost savings and greater flexibility. HPC systems may fail, however, because of heavy workloads and the large number of servers involved. Fault tolerance techniques allow HPC systems in the cloud to execute computationally intensive applications across multiple nodes and help tasks sustain their performance in the presence of hardware and software faults, although most failures are hardware-based. System availability is also very important, and fault tolerance techniques are used to detect and predict faults. This paper gives an overview of the most popular fault tolerance techniques, prediction models, and tools used in HPC.

Keywords: High Performance Computing, Reactive Fault Tolerance, Proactive Fault Tolerance, Prediction models, Artificial Intelligent Computing, Time series models.

