Fault Prediction in Distributed Systems Gone Wild

Marco Canini, Dejan Novaković, Vojin Jovanović, Dejan Kostić

July 2010

PDF

Abstract

We consider the problem of predicting faults in deployed, large-scale distributed systems that are heterogeneous and federated. Motivated by the importance of ensuring reliability of the services these systems provide, we argue that the key step in making these systems reliable is the need to automatically predict faults. For example, doing so is vital for avoiding Internet-wide outages that occur due to programming errors or misconfigurations.

Type

Conference paper

Publication

Proceedings of the 4th ACM SIGOPS/SIGACT Workshop on Large Scale Distributed Systems and Middleware (LADIS'10)

Marco Canini

Professor of Computer Science

My current interest is in designing better systems support for AI/ML and provide practical implementations deployable in the real-world.