Fault Tolerance
./~course/cs501/2013
Zhi Yang
School of EECS, Peking University
3/28/2013

Example: Costs
• Replication as a scaling technique may not always be applicable.
• Process P accesses the replica N times per second; the replica is updated M times per second.
• What if N << M? Then most updates are propagated to a replica that P never reads before the next update arrives, so the communication spent on them is wasted.

"Failure is not an option. It comes bundled with your software." (unknown)
"You know you have [a distributed system] when the crash of a computer you've never heard of stops you from getting any work done." (Leslie Lamport)

Some real-world datapoints
Sources:
• Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?, Bianca Schroeder and Garth A. Gibson (FAST '07)
• Failure Trends in a Large Disk Drive Population, Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso (FAST '07)

Contents
01: Introduction
02: Architectures
03: Processes
04: Communication
05: Naming
06: Synchronization
07: Consistency & Replication
08: Fault Tolerance
09: Security
10: Distributed Object-Based Systems
11: Distributed File Systems
12: Distributed Web-Based Systems
13: Distributed Coordination-Based Systems

Outline
• Basic concepts
• Process resilience
• Reliable client-server communication (++)
• Reliable group communication
• Distributed commit (++)
• Recovery

Fault handling approaches
• Fault prevention: prevent the occurrence of a fault
• Fault tolerance: build a component in such a way that it can meet its specifications in the presence of faults (i.e., mask the presence of faults)
• Fault removal: reduce the presence, number, and seriousness of faults
• Fault forecasting: estimate the present number, future incidence, and consequences of faults

Design Goal (with regard to fault tolerance): Design a (distributed) system that can recover from partial failures without affecting correctness or significantly impacting overall performance.

Starting point for distributed system design
• A process P may depend on services provided by other processes on different computers. If P loses contact with those processes because of an error or fault, P cannot run normally.
• The remote computer may have crashed, the network may be disconnected, or the other side may be too heavily loaded and temporarily unable to …
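The last point can be made concrete with a small sketch. The model below is an illustration only, not part of the lecture: the `Cause` enum, the delay values, and the 1-second timeout are all hypothetical. It shows that a process P that waits for a reply with a timeout sees the same thing whether the remote computer has crashed, the network is disconnected, or the remote side is merely overloaded.

```python
from enum import Enum
from typing import Optional


class Cause(Enum):
    """Hypothetical states of the remote process that P depends on."""
    HEALTHY = "healthy"
    CRASHED = "crashed"
    DISCONNECTED = "network disconnected"
    OVERLOADED = "overloaded"


def reply_delay(cause: Cause) -> Optional[float]:
    """Assumed model: seconds until a reply reaches P, or None if none ever arrives."""
    if cause is Cause.HEALTHY:
        return 0.05
    if cause is Cause.OVERLOADED:
        return 5.0  # the reply does arrive, but far too late
    return None  # crashed or disconnected: no reply at all


def observe(cause: Cause, timeout: float) -> str:
    """What P can actually see: a reply within the timeout, or nothing."""
    delay = reply_delay(cause)
    if delay is not None and delay <= timeout:
        return "reply"
    return "timeout"


# P uses a 1-second timeout (an arbitrary value chosen for this sketch).
observations = {cause: observe(cause, timeout=1.0) for cause in Cause}

for cause, seen in observations.items():
    print(f"{cause.value:22s} -> P observes: {seen}")
```

In this model, three different causes collapse into the single observation "timeout". This is one reason why a timeout alone cannot tell P whether to wait longer, retry, or treat the other process as failed.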