Research Assistant Professor
Vanderbilt University

PhD Research
I am currently investigating the feasibility of truly online failure prediction methods on Petascale machines. My research directions can be divided into three related parts:

Online failure prediction
DoE Grant “Exascale Software Center” from January 2011

This project focuses on building a fault propagation model by inspecting the events generated by large-scale systems. Our method is based on the observation that the notifications emitted by different components of the system are linked by a fault-error-failure propagation graph. A fault generates a number of errors that may be observable at the system or application level, either as notifications in the log files or as changes in performance metrics. The propagation chain ends with the failure, which is observed at the application level and usually takes the form of an application interruption or extreme performance degradation.

Related publications:
  • Fault prediction under the microscope: A closer look into HPC systems - Ana Gainaru, Franck Cappello, Marc Snir, William Kramer - SC 2012
  • Taming of the Shrew: Modeling the Normal and Faulty Behavior of Large-scale HPC Systems - Ana Gainaru, Franck Cappello, William Kramer - IPDPS 2012
  • Event log mining tool for large scale HPC systems - Ana Gainaru, Franck Cappello, Stefan Trausan-Matu, Bill Kramer - Euro-Par 2011

Fault tolerance framework for the Blue Waters system
Integrated Systems Console project at NCSA

The aim of this project is to build a framework for the Blue Waters system into which different fault tolerance modules can be inserted. The modules will run in parallel on each service node and will interact with each other through the framework. The main focus of the project is fault detection and prediction, achieved by characterizing the state of the system at each moment of interest. With 1 petaflop of sustained performance, the Blue Waters system puts extra stress on the analysis modules and presents challenges not encountered in smaller systems.

Related publications:
  • Real Time Analysis and Event Prediction Engine - Joshi Fullop, Ana Gainaru, Joel Plutchak - CUG 2012

Preventive fault tolerance techniques
Work in collaboration with INRIA

This project focuses on the study of different production HPC systems and investigates the best way to model failure distributions for different components. Based on this model, we analyze different preventive fault tolerance solutions with a focus on a combination of preventive and proactive checkpointing.

Related publications:
  • Improving the computing efficiency of HPC systems using a combination of proactive and preventive checkpointing - Mohamed Slim Bouguerra, Ana Gainaru, Franck Cappello, Leonardo Bautista Gomez, Naoya Maruyama, Satoshi Matsuoka - IPDPS 2013
  • Modeling and Tolerating Heterogeneous Failures in Large Parallel Systems - Eric Heien, Derrick Kondo, Ana Gainaru, Dan LaPine, Bill Kramer, Franck Cappello - SC 2011

HELO (Hierarchical Event Log Organizer)

HELO is a tool that automatically analyzes log files generated by HPC systems and extracts all existing event formats. HELO has been integrated into the Blue Waters system, where it communicates with syslog-ng.

HELO has two components:
  • An offline template extraction module that inspects historic log files and extracts message patterns that describe the different event types (called templates). The templates are similar to regular expressions and use three types of wildcards: * for an arbitrary word, d+ for a numeric value and n+ for a series of one or more different words. Example template: Accepted publickey for * from * port d+ ssh2 n+

  • An online classification module that processes an incoming stream of events. For each message it looks for the group that best fits the new data. If a matching template is found, HELO updates its description to include the new message. If no template is found, HELO creates a new template that contains only the new message.
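
The wildcard notation and the online classification step can be illustrated with a minimal Python sketch. This is not HELO's code: the function names are hypothetical, the sketch assumes that * stands for exactly one word, and it returns the first matching template rather than the best-fitting one. The step in which HELO updates an existing template is omitted.

```python
import re

def template_to_regex(template):
    """Translate a HELO-style template into a compiled regular expression.

    Assumed wildcard meanings (illustrative, not taken from HELO's source):
      *   one arbitrary word
      d+  one numeric value
      n+  one or more words
    """
    parts = []
    for token in template.split():
        if token == "*":
            parts.append(r"\S+")
        elif token == "d+":
            parts.append(r"\d+")
        elif token == "n+":
            parts.append(r"\S+(?:\s+\S+)*")
        else:
            parts.append(re.escape(token))  # literal word
    return re.compile(r"\s+".join(parts))

def classify(message, templates):
    """Return the first template that matches the message.

    If none matches, the message itself is appended to `templates` as a new
    template that contains only this message, and is returned.
    """
    message = message.strip()
    for template in templates:
        if template_to_regex(template).fullmatch(message):
            return template
    templates.append(message)
    return message
```

For example, with the template shown above, the hypothetical message "Accepted publickey for alice from 10.0.0.1 port 22 ssh2 RSA key" is assigned to that template, while an unrelated message starts a new group.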

The templates generated by HELO can be used to monitor the behavior of a system (by monitoring, for example, the rate of events of specific types).
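
As a rough illustration of rate monitoring, the sketch below counts how many events of each template fall into fixed time windows. The function name, the (timestamp, template id) input format and the 60-second default window are assumptions made for this example, not part of HELO.

```python
from collections import Counter, defaultdict

def events_per_window(events, window_seconds=60):
    """Count events per template in fixed-size time windows.

    `events` is an iterable of (timestamp_in_seconds, template_id) pairs.
    Returns a mapping: template_id -> Counter {window_index: event_count}.
    """
    rates = defaultdict(Counter)
    for timestamp, template_id in events:
        window_index = int(timestamp // window_seconds)
        rates[template_id][window_index] += 1
    return rates
```

A sudden jump or drop in the count for one template from one window to the next is the kind of change such monitoring is meant to surface.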

HELO is open source under BSD license and can be downloaded from the following links:
HELO offline v1.2
HELO online v1.2 - not yet available (coming soon)

If you use HELO for your research, please reference this paper:
Event log mining tool for large scale HPC systems – Ana Gainaru, Franck Cappello, Stefan Trausan-Matu, Bill Kramer – Euro-Par conference 2011, pages 52-64, Bordeaux, France, 2011.

ELSA (Event Log Signal Analyzer)

ELSA is a signal analysis tool that treats the repeated occurrences of system events as signals. The tool uses HELO's output and, for each template, decides whether the event signal is periodic, partially periodic, non-stationary or noise. For each type of signal we use a different outlier detection tool to characterize the normal behavior of the events and to isolate the anomalous moments. ELSA also has modules for correlating events (both by looking only at the abnormal behavior and by looking at all occurrences).

ELSA has three components:
  • An offline module that works on historic logs and extracts the possible correlations between events generated by the system. The output of this module is a file with correlation chains.

  • An online prediction module that monitors an incoming stream of events and uses the correlations extracted in the offline phase to make predictions. These predictions are verified and the results are written in the output file.

  • Modules that are integrated with FTI, a multi-level checkpointing strategy. Each node has a dedicated core that runs FTI and ELSA. ELSA is responsible for monitoring the current events, detecting anomalies, making predictions and informing FTI when to take preventive checkpoints.

ELSA is open source under BSD license. It is not available for download yet, but it will be added here soon.
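
The online prediction and verification loop described above can be sketched as follows. This is a simplified illustration, not ELSA's implementation: the Correlation class, its fields, and the reduction of a correlation chain to a single precursor/consequence pair with a maximum delay are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Correlation:
    """Hypothetical two-event correlation produced by an offline phase."""
    precursor: str     # template id that triggers a prediction
    consequence: str   # template id expected to follow
    max_delay: float   # seconds within which the consequence is expected

def predict_and_verify(events, correlations):
    """Replay a time-sorted stream of (timestamp, template_id) events.

    Each precursor event opens a prediction. A prediction is verified if its
    consequence arrives before the deadline, and marked as missed otherwise.
    Returns a list of (consequence, predicted_at, verified) tuples.
    """
    by_precursor = {}
    for corr in correlations:
        by_precursor.setdefault(corr.precursor, []).append(corr)

    pending = []   # open predictions: (consequence, predicted_at, deadline)
    results = []
    for timestamp, template_id in events:
        still_open = []
        for consequence, predicted_at, deadline in pending:
            if timestamp > deadline:
                results.append((consequence, predicted_at, False))
            elif template_id == consequence:
                results.append((consequence, predicted_at, True))
            else:
                still_open.append((consequence, predicted_at, deadline))
        pending = still_open
        for corr in by_precursor.get(template_id, []):
            pending.append((corr.consequence, timestamp,
                            timestamp + corr.max_delay))
    # Predictions still open when the stream ends were never confirmed.
    results.extend((c, at, False) for c, at, _ in pending)
    return results
```

In a setting like the FTI integration, the point at which a prediction is opened is where a preventive checkpoint could be requested. That call is left out here because FTI's interface is not described on this page.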
