Carnegie Mellon University

18-749 (Fault-Tolerant Distributed Systems)

12 units

The course provides an in-depth and hands-on overview of designing and developing fault-tolerant distributed systems. The course covers both the fundamental and advanced concepts of dependability, including replication, atomic multicast, group communication, consistency, checkpointing, transaction processing and fault injection, along with industrial standards and real-world practices for achieving high availability and fault-tolerance. Additional topics include the practical trade-offs and inter-relationships between fault-tolerance and other properties, such as real-time and performance. The lecture concepts are complemented through a semester-long hands-on project that involves the design, implementation and empirical evaluation of a distributed fault-tolerant, high-performance distributed system. To introduce students to the state-of-the-art technologies, the project emphasizes the use of object-oriented middleware, such as CORBA and EJB.

3 hrs. lec., 9 hrs. lab.

Prerequisites: Experience in programming and senior or graduate standing.

Prerequisite for: 18-849