Vivo: Systematic Approaches to High Availability in Cluster-Based Services

Network services, and in particular popular Internet services, frequently rely on large clusters of commodity computers as their supporting infrastructure. These services must exhibit several important characteristics, including high performance, scalability, and availability. Their inherent complexity also presents a tremendous challenge to their human operators and likely explains the large number of outages attributed to human mistakes. While the performance and scalability of cluster-based servers have been studied extensively in the literature, designs for availability, behavior during component failures, and the relationship between performance and availability in these servers have received much less attention.

The primary goals of our work are: (1) to develop a methodology and metrics for evaluating and quantifying the expected behavior of cluster-based servers under realistic deployment conditions, where faults are an unavoidable fact of life; (2) to thoroughly understand the behavior of cluster-based servers in the presence of component failures and mistakes by human operators; (3) to define a server design and evaluation methodology that can produce extremely available and high-performing cluster-based servers; and (4) to develop novel techniques to attack critical sources of unavailability within these services.

Our longer-term goal is to enable the development of Internet services that are as available as the services we take for granted, such as the wired telephone system.


To date, we have

  1. Designed and implemented a comprehensive, cluster-based fault-injection and network emulation infrastructure called Mendosus, to help service designers and testers to evaluate live systems' responses to faults.

  2. Developed a methodology that combines fault injection, measurement of live system performance, and analytic modeling to quantitatively assess and predict services' performability (performance + availability). This methodology has been applied in multiple case studies of cluster-based services.

  3. Conducted a large number of experiments with human operators and characterized the mistakes they made while performing common tasks within a multi-tier cluster-based Internet service environment. All results from this study are available here.

  4. Developed three validation techniques and a supporting framework to reduce the number of operator mistakes that are exposed to the service interface and hence to users.

Vivo is partially supported by:

Operator-Proof Systems Management.  NSF (2009-2012).

Guiding and Validating Operator Mistakes in Internet Services.  NSF (2005-2008).

System and Compiler Support for Component-Based Construction of Scalable Internet Services. NSF (2001-2004).

Service Differentiation, Efficiency, and Scalability in Distributed E-Commerce Systems. NSF (2001-2004).

Download

  • Results from the operator study published in our paper, "Understanding and Dealing with Operator Mistakes in Internet Services", can be found here.
  • The source distribution of Mendosus, our fault-injection and network emulation tool, is available.

Projects

Mendosus: A Fault-Injection Test-Bed for the Construction of Highly Available Network Services


People

Students

Faculty

Former Students


Publications

Barricade: Defending Systems Against Operator Mistakes.  Fábio Oliveira, Andrew Tjang, Ricardo Bianchini, Richard P. Martin, Thu D. Nguyen.  Proceedings of EuroSys 2010, April, 2010.
(PDF)

Model-Based Validation for Internet Services.  Andrew Tjang, Fábio Oliveira, Ricardo Bianchini, Richard P. Martin, Thu D. Nguyen.  Proceedings of the 28th IEEE Symposium on Reliable Distributed Systems (SRDS), September 2009.
(PDF)

JustRunIt: Experiment-Based Management of Virtualized Data Centers. Wei Zheng, Ricardo Bianchini, G. J. Janakiraman, J. R. Santos, Y. Turner.  Proceedings of the USENIX Annual Technical Conference, June 2009.
(PDF)

Quantifying and Improving the Reliability of Distributed Storage Systems.  Rekha Bachwani, L. Gryz, Ricardo Bianchini, and C. Dubnicki.  Proceedings of the 27th IEEE Symposium on Reliable Distributed Systems (SRDS), October 2008.
(PDF)

HAL: Towards Operator-Proof Systems Management. Fábio Oliveira, Ricardo Bianchini, Richard P. Martin, Thu D. Nguyen.  Abstract for poster at the 21st Symposium on Operating Systems Principles (SOSP '07), October 2007.
(PDF)

Automatic Configuration of Internet Services. Wei Zheng, Ricardo Bianchini, Thu D. Nguyen.  In Proceedings of EuroSys 2007, March 2007.
(PDF)

A: An Assertion Language for Distributed Systems. Andrew Tjang, Fábio Oliveira, Richard P. Martin, Thu D. Nguyen.  In Proceedings of the Workshop on Linguistic Support for Modern Operating Systems (PLOS '06) - (co-located with ASPLOS XII), October 2006.
(PDF)

Understanding and Validating Database System Administration. Fábio Oliveira, Kiran Nagaraja, Rekha Bachwani, Ricardo Bianchini, Richard P. Martin, Thu D. Nguyen.  In Proceedings of the USENIX Annual Technical Conference, June 2006.
(Postscript)

Model-Based Validation for Internet Services.  Fábio Oliveira, Andrew Tjang, Kiran Nagaraja, Ricardo Bianchini, Richard P. Martin, Thu D. Nguyen.  Technical Report DCS-TR-601, Department of Computer Science, Rutgers University, May 2006 (Revised October 2008).
(PDF)

Model-Based Validation for Dealing with Operator Mistakes.  Kiran Nagaraja, Andrew Tjang, Fábio Oliveira, Ricardo Bianchini, Richard P. Martin, Thu D. Nguyen.  Abstract for poster at the 20th Symposium on Operating Systems Principles (SOSP '05), Oct 2005.
(PDF)

Human-Aware Computer System Design.  Ricardo Bianchini, Richard P. Martin, Kiran Nagaraja, Thu D. Nguyen, Fábio Oliveira.  In Proceedings of the 10th Workshop on Hot Topics in Operating Systems (HotOS), June 2005.
(PDF)

Quantifying the Performability of Cluster-Based Services.  K. Nagaraja, G. Gama, R. Bianchini, R. P. Martin, W. Meira Jr., and T. D. Nguyen.  IEEE Transactions on Parallel and Distributed Systems, Vol. 16, No. 5, May 2005.
(PDF)

Understanding and Dealing with Operator Mistakes in Internet Services. Kiran Nagaraja, Fábio Oliveira, Ricardo Bianchini, Richard P. Martin, Thu D. Nguyen. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI '04), San Francisco, CA, Dec, 2004.
(PDF)

State Maintenance and Its Impact on the Performability of Multi-tiered Internet Services. G. Gama, K. Nagaraja, R. Bianchini, R. P. Martin, W. Meira Jr., and T. D. Nguyen. In Proceedings of the 23rd Symposium on Reliable Distributed Systems (SRDS), Florianopolis, Brazil, October 2004.
(PDF)(Postscript)

Testing of Java Web Services for Robustness. Chen Fu, Barbara Ryder, Ana Milanova, David Wonnacott. In Proceedings of the International Symposium on Software Testing and Analysis (ISSTA 2004), Boston, MA, July, 2004. 
(Postscript)

Quantifying and Improving the Availability of High-Performance Cluster-Based Internet Services. Kiran Nagaraja, Neeraj Krishnan, Ricardo Bianchini, Richard P. Martin, Thu D. Nguyen. In Proceedings of SC-2003, Phoenix, AZ,  November, 2003. 
(PDF)
  (Postscript)

Compiler Directed Program-fault Coverage for Highly Available Java Internet Services. Chen Fu, Richard P. Martin, Kiran Nagaraja, Thu D. Nguyen, Barbara G. Ryder, David G. Wonnacott. In Proceedings of the International Conference on Dependable Systems and Networks (DSN, IPDS track), San Francisco, CA, June 2003. 
(PDF)

Using Fault Model Enforcement to Improve Availability. Kiran Nagaraja, Ricardo Bianchini, Richard P. Martin, Thu D. Nguyen. In Proceedings of the Second Workshop on Evaluating and Architecting System dependabilitY (EASY '02), held in conjunction with ASPLOS-X, San Jose, CA, October, 2002.
(PDF)
  (Postscript)

Evaluating the Impact of Communication Architecture on the Performability of Cluster-Based Services. Kiran Nagaraja, Neeraj Krishnan, Ricardo Bianchini, Richard P. Martin, Thu D. Nguyen. In Proceedings of the Ninth International Symposium on High Performance Computer Architecture (HPCA '03), Anaheim, CA, February 2003.
(PDF)
  (Postscript)

Results of fault-injection experiments on the PRESS web server. The graphs show the performance of the web server under a wide set of faults injected using Mendosus.

Using Fault Injection and Modeling to Evaluate the Performability of Cluster-Based Services. Kiran Nagaraja, Xiaoyan Li, Bin Zhang, Ricardo Bianchini, Richard P. Martin, Thu D. Nguyen. Proceedings of the 4th USENIX Symposium on Internet Technologies and Systems (USITS '03), Seattle, WA, March 2003.
(PDF)
  (Postscript)

Mendosus: A SAN-based Fault-Injection Test-Bed for Construction of Highly Available Network Services. Xiaoyan Li, Richard P. Martin, Kiran Nagaraja, Thu D. Nguyen, and Bin Zhang. In SAN-1 - Novel Uses of System Area Networks - held in conjunction with HPCA'02, Cambridge, MA, Feb 2002.
(PDF)
  (Postscript)

Improving Cluster Availability Using Workstation Validation. Taliver Heath, Richard P. Martin, Thu D. Nguyen. In SIGMETRICS 2002, Marina Del Rey, CA, June 2002.
(PDF)

Performability Modeling and Analysis of Fault Tolerance Support in Communication Protocol. S. Kaur, R. Martin, and T. D. Nguyen. DCS-TR-426, Department of Computer Science, Rutgers University, Technical Report (MS thesis). 
(.ps.Z)

Using Distributed Data Structures for Constructing Cluster-Based Services. Richard P. Martin, Kiran Nagaraja, T. D. Nguyen. In Proceedings of the First Workshop on Evaluating and Architecting System dependabilitY (EASY), Gothenburg, Sweden, July 2001.
(Postscript)

Talks

Understanding and Dealing with Operator Mistakes in Internet Services
@ OSDI '04 - Operating Systems Design and Implementation, San Francisco, CA,  December 2004.
(PowerPoint)

Using Fault Injection and Modeling to Evaluate the Performability of Cluster-Based Services
@ USITS'03 - 4th USENIX Symposium on Internet Technologies and Systems, Seattle, WA, March 2003.
(PowerPoint)

Evaluating the Impact of Communication Architecture on the Performability of Cluster-Based Services
@ HPCA-9 - Ninth International Symposium on High Performance Computer Architecture. Anaheim, California, February 8-12, 2003
(PowerPoint)

Using Fault Model Enforcement to Improve Availability
@ EASY 2002 - Second Workshop on Evaluating and Architecting System dependabilitY - held in conjunction with ASPLOS-X, San Jose, CA, October 2002
(PowerPoint)

Mendosus: A SAN-based Fault-Injection Test-Bed for Construction of Highly Available Network Services
@ SAN-1 - Novel Uses of System Area Networks - held in conjunction with HPCA'02, Cambridge, MA, Feb 2002
(PowerPoint)