I am currently investigating the feasibility of truly online failure prediction methods on Petascale machines. My research directions can be divided into three related parts:
Online failure prediction
DoE grant “Exascale Software Center” from January 2011
This project focuses on building a fault propagation model by inspecting the events generated by large-scale systems. Our method is based on the observation that the notifications given by different components in the system form a Fault-Errors-Failure propagation graph. A fault generates a number of errors that are observable at the system or application level, either as notifications in the log files or as changes in performance metrics. The propagation chain ends with the failure, which is observed at the application level and usually takes the form of an application interruption or extreme performance degradation.
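As a rough illustration of the idea of a propagation chain, the sketch below estimates how often one event type is followed by another within a fixed time window. It is only a simplified stand-in, not the model built in this project: the function name `follow_confidence`, the window parameter, and the event names in the usage example are all hypothetical.

```python
from collections import defaultdict


def follow_confidence(events, window):
    """Estimate how often event type `a` is followed by event type `b`.

    `events` is a list of (timestamp, event_type) tuples sorted by timestamp.
    Returns a dict mapping (a, b) to the fraction of occurrences of `a`
    that were followed by at least one `b` within `window` time units.
    """
    occurrences = defaultdict(int)
    followed = defaultdict(int)
    for i, (t_a, a) in enumerate(events):
        occurrences[a] += 1
        seen = set()  # count each follower type once per occurrence of `a`
        for t_b, b in events[i + 1:]:
            if t_b - t_a > window:
                break  # events are sorted, so later ones are out of range too
            if b != a and b not in seen:
                seen.add(b)
                followed[(a, b)] += 1
    return {pair: n / occurrences[pair[0]] for pair, n in followed.items()}


# Hypothetical event stream: (seconds, event type)
events = [
    (0, "ecc_error"), (5, "node_down"),
    (100, "ecc_error"), (104, "node_down"),
    (200, "ecc_error"),
]
confidence = follow_confidence(events, window=10)
```

Pairs with a high fraction are candidate edges of a propagation graph; a real model would also need to account for chance co-occurrence and for the delay between events.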
Fault tolerance framework for the Blue Waters system
Integrated Systems Console project at NCSA
The aim of this project is to build a framework for the Blue Waters system into which different fault tolerance modules can be inserted. The modules will be able to work in parallel on each service node and will interact with each other through the framework. The main focus of this project is fault detection and prediction by characterizing the state of the system at each moment of interest. With 1 Petaflop of sustained performance, the Blue Waters system puts extra stress on the analysis modules and presents challenges not encountered in smaller systems.
Preventive fault tolerance techniques
Work in collaboration with INRIA
This project focuses on the study of different production HPC systems and investigates the best way to model failure distributions for different components. Based on this model, we analyze different preventive fault tolerance solutions with a focus on a combination of preventive and proactive checkpointing.
HELO (Hierarchical Event Log Organizer)
HELO is a tool for automatically analyzing log files generated by HPC systems and extracting all existing event formats. HELO has been integrated into the Blue Waters system and communicates with SyslogNG.
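To give a feel for what an event format (template) is, here is a minimal sketch that replaces number-like tokens in a log message with a wildcard and groups messages by the resulting pattern. This is not HELO's algorithm; the regular expression and the function names `to_template` and `extract_templates` are assumptions made for illustration only.

```python
import re
from collections import Counter

# Hypothetical notion of a "variable" token: hex values, integers,
# and dotted or colon-separated numbers such as IP addresses or times.
_VARIABLE = re.compile(r"^(0x[0-9a-fA-F]+|\d+([.:]\d+)*)$")


def to_template(message):
    """Replace number-like tokens with '*' to obtain a coarse template."""
    return " ".join(
        "*" if _VARIABLE.match(token) else token for token in message.split()
    )


def extract_templates(messages):
    """Count how many messages fall under each coarse template."""
    return Counter(to_template(m) for m in messages)


# Hypothetical log messages
templates = extract_templates([
    "cpu 3 temperature high",
    "cpu 7 temperature high",
    "link down on port 2",
])
```

Here the two `cpu` messages collapse into the single template `cpu * temperature high`.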
HELO has two components:
The templates generated by HELO can be used to monitor the behavior of a system (by monitoring, for example, the rate of events of specific types).
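A minimal sketch of such rate monitoring, assuming the timestamps of all events matching one template are already available: events are counted in fixed-size windows, and windows whose count is far from the mean are flagged. The helper names, the window size, and the threshold `k` are hypothetical choices, not part of HELO.

```python
from statistics import mean, pstdev


def window_counts(timestamps, window, start, end):
    """Count events in consecutive windows of length `window` over [start, end)."""
    n = int((end - start) // window)
    counts = [0] * n
    for t in timestamps:
        idx = int((t - start) // window)
        if 0 <= idx < n:  # ignore events outside the observed interval
            counts[idx] += 1
    return counts


def abnormal_windows(counts, k=3.0):
    """Return indices of windows whose count is more than k standard
    deviations away from the mean count."""
    mu = mean(counts)
    sigma = pstdev(counts)
    if sigma == 0:
        return []  # perfectly constant rate: nothing to flag
    return [i for i, c in enumerate(counts) if abs(c - mu) > k * sigma]


# Hypothetical template: one event per 10 s window, then a burst of 20
timestamps = [10 * i for i in range(9)] + [90 + 0.1 * j for j in range(20)]
counts = window_counts(timestamps, window=10, start=0, end=100)
flagged = abnormal_windows(counts, k=2.0)
```

A mean and standard deviation threshold is only one simple choice; it is sensitive to the outliers it tries to detect, so a median-based rule may be preferable on real logs.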
HELO is open source under BSD license and can be downloaded from the following links:
HELO offline v1.2
HELO online v1.2 - not yet available (coming soon)
If you use HELO for your research, please reference this paper:
Event log mining tool for large scale HPC systems – Ana Gainaru, Franck Cappello, Stefan Trausan-Matu, Bill Kramer – Euro-Par conference 2011, pages 52-64, Bordeaux, France, 2011.
ELSA (Event Log Signal Analyzer)
ELSA is a signal analysis tool that treats the repetition of system events as signals. The tool uses HELO's output and, for each template, decides whether the event signal is periodic, partially periodic, non-stationary or noise. For each type of signal we use different outlier detection tools to characterize the normal behavior of events and to isolate the anomalous moments. ELSA also has modules for correlating the events (both by looking at the abnormal behavior and by looking at all occurrences). ELSA is composed of three components: