Tardis: A Fault-Tolerant Design for Network Control Planes

Abstract

Guaranteeing high availability of networks virtually hinges on the ability to handle and recover from bugs and failures. Yet, despite the advances in verification, testing, and debugging, production networks remain susceptible to large-scale failures — often due to deterministic bugs.

This paper explores the use of input transformations as a viable method for recovering from such deterministic bugs. In particular, we introduce an online system, Tardis, for overcoming deterministic faults by using a blend of program analysis and runtime program data to systematically determine the fault-triggering input events and using domain-specific models to automatically generate transformations of the fault-triggering inputs that are both safe and semantically equivalent. We evaluated Tardis on several production network control plane applications (CPAs), including six SDN CPAs and several popular BGP CPAs using 71 realistic bugs. We observe that Tardis improves recovery time by 7.44%, introduces a 25% CPU and 0.5% memory overhead, and recovers from 77.26% of the injected realistic and representative bugs, more than twice that of existing solutions.

Publication
Proceedings of SOSR'21