Vivo: Systematic Approaches to
High Availability in Cluster-Based Services
Network services, and popular Internet services in particular, frequently
rely on large clusters of commodity computers as their supporting
infrastructure. These services must exhibit several important characteristics, including high performance, scalability, and availability.
Their inherent complexity also presents a tremendous challenge to their human
operators, and likely explains the large number of outages attributed to human
mistakes in these services. While the performance and scalability of
cluster-based servers have been studied extensively in the literature, designs for availability, behavior during
component failures, and the relationship between performance and
availability of these servers have received much less attention.
The primary goals of our work are: (1) to develop a methodology and
metrics for evaluating and quantifying the expected behaviors of
cluster-based servers under realistic deployment conditions, where
faults are an unavoidable fact of life, (2) to thoroughly understand the
behavior of cluster-based servers in the presence of component
failures and mistakes by human operators, (3) to define a server design and evaluation methodology that can
produce extremely available and high-performing cluster-based servers, and
(4) to develop novel techniques to attack critical sources of unavailability within
these services.
Our longer term goal is to enable the development of Internet services
that achieve the same level of availability as services that are so
available that we take them for granted, such as the wired telephone
system.
To date, we have:
- Designed and implemented a comprehensive,
cluster-based fault-injection and network emulation infrastructure
called Mendosus, to help service designers and testers evaluate
live systems' responses to faults.
- Developed a methodology
that combines fault injection, measurement of live system performance,
and analytic modeling to quantitatively assess and predict services'
performability (performance + availability). This methodology has been tested
in multiple case studies of cluster-based services.
- Conducted a large number
of human operator experiments and characterized the mistakes operators made
while performing common tasks within a multi-tier cluster-based
Internet service environment. All results from this study are available
here.
- Developed three validation
techniques and a supporting framework to reduce the number of operator
mistakes that are exposed to the service interface and hence to
users.
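The idea of combining measured behavior under faults with a simple analytic model can be illustrated with a minimal sketch. It is not the model from our publications: it assumes that fault classes are independent, that faults never overlap, and that each class is summarized by three hypothetical parameters (mean time to failure, mean time spent degraded per fault, and mean throughput while degraded). All names in the code are illustrative.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class FaultClass:
    """One class of fault, summarized by hypothetical measured parameters."""
    name: str
    mttf_hours: float           # mean time between occurrences of this fault
    degraded_hours: float       # mean time spent degraded per occurrence
    degraded_throughput: float  # mean throughput (requests/s) while degraded


def expected_throughput(normal_throughput: float,
                        faults: Sequence[FaultClass]) -> float:
    """Time-weighted average throughput, assuming faults never overlap."""
    degraded_fraction = 0.0
    weighted_throughput = 0.0
    for fault in faults:
        # Fraction of time the service spends degraded by this fault class.
        fraction = fault.degraded_hours / fault.mttf_hours
        degraded_fraction += fraction
        weighted_throughput += fraction * fault.degraded_throughput
    if degraded_fraction > 1.0:
        raise ValueError("degraded time exceeds total time; "
                         "the non-overlap assumption does not hold")
    return (1.0 - degraded_fraction) * normal_throughput + weighted_throughput


def relative_performance(normal_throughput: float,
                         faults: Sequence[FaultClass]) -> float:
    """Expected throughput as a fraction of fault-free throughput
    (an illustrative stand-in for an availability-style metric)."""
    return expected_throughput(normal_throughput, faults) / normal_throughput
```

For example, with a fault-free throughput of 1000 requests/s and a single fault class that occurs every 100 hours, lasts 1 hour, and halves throughput, `expected_throughput(1000.0, [FaultClass("node crash", 100.0, 1.0, 500.0)])` weights the two regimes by 0.99 and 0.01 respectively. In a real study, the degraded durations and throughputs would come from fault-injection measurements rather than assumed values.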
Vivo is partially supported by:
Operator-Proof Systems Management. NSF (2009-2012).
Guiding and Validating Operator Mistakes in Internet Services. NSF (2005-2008).
System and Compiler Support for Component-Based Construction of Scalable
Internet Services. NSF (2001-2004).
Service Differentiation, Efficiency, and Scalability in Distributed
E-Commerce Systems. NSF (2001-2004).
Download
- Results from the operator study published in our paper, "Understanding
and Dealing with Operator Mistakes in Internet Services", can be
found here.
- Source distribution of Mendosus, our
fault-injection and network emulation tool, is available.
Projects
Mendosus: A Fault-Injection
Test-Bed for the Construction of Highly Available Network Services
People
Students
Faculty
Former Students
Publications
Barricade: Defending Systems Against Operator Mistakes. Fábio Oliveira, Andrew Tjang,
Ricardo Bianchini, Richard P. Martin, Thu D. Nguyen. Proceedings of EuroSys 2010, April, 2010.
(PDF)
Model-Based Validation for Internet Services. Andrew Tjang, Fábio Oliveira,
Ricardo Bianchini, Richard P. Martin, Thu D. Nguyen. Proceedings of the 28th IEEE Symposium on Reliable Distributed Systems (SRDS), September 2009.
(PDF)
JustRunIt: Experiment-Based Management of Virtualized Data Centers.
Wei Zheng, Ricardo Bianchini, G. J. Janakiraman, J. R. Santos, Y. Turner. Proceedings of
the USENIX Annual Technical Conference, June 2009.
(PDF)
Quantifying and Improving the Reliability of Distributed Storage Systems. Rekha Bachwani, L. Gryz, Ricardo Bianchini, and C. Dubnicki. Proceedings of the 27th IEEE Symposium on Reliable Distributed Systems (SRDS), October 2008.
(PDF)
HAL: Towards Operator-Proof Systems Management.
Fábio Oliveira, Ricardo Bianchini, Richard P. Martin, Thu D. Nguyen. Abstract for poster at the 21st Symposium on Operating Systems Principles (SOSP '07), October 2007.
(PDF)
Automatic Configuration of Internet Services.
Wei Zheng, Ricardo Bianchini, Thu D. Nguyen. In Proceedings of EuroSys 2007, March 2007.
(PDF)
A: An Assertion Language for Distributed Systems.
Andrew Tjang, Fábio Oliveira, Richard P. Martin, Thu D. Nguyen. In Proceedings of
the Workshop on Linguistic Support for Modern Operating Systems (PLOS '06) - (co-located with ASPLOS XII), October 2006.
(PDF)
Understanding and Validating Database System Administration.
Fábio Oliveira, Kiran Nagaraja, Rekha Bachwani,
Ricardo Bianchini, Richard P. Martin, Thu D. Nguyen. In Proceedings of
the USENIX Annual Technical Conference, June 2006.
(Postscript)
Model-Based Validation for Internet Services. Fábio Oliveira, Andrew Tjang, Kiran Nagaraja,
Ricardo Bianchini, Richard P. Martin, Thu D. Nguyen. Technical Report DCS-TR-601, Department of Computer Science, Rutgers University, May 2006 (Revised October 2008).
(PDF)
Model-Based Validation for Dealing
with Operator Mistakes. Kiran Nagaraja, Andrew Tjang, Fábio Oliveira,
Ricardo Bianchini, Richard P. Martin, Thu D. Nguyen. Abstract for
poster at the 20th Symposium on Operating Systems Principles (SOSP '05), Oct
2005.
(PDF)
Human-Aware Computer System Design.
Ricardo Bianchini, Richard P. Martin, Kiran Nagaraja, Thu D.
Nguyen, Fábio Oliveira. In Proceedings of the 10th Workshop on
Hot Topics in Operating Systems (HotOS), June 2005.
(PDF)
Quantifying the Performability of
Cluster-Based Services. K. Nagaraja, G. Gama,
R. Bianchini, R. P. Martin, W. Meira Jr., and T. D. Nguyen.
IEEE Transactions on Parallel and Distributed Systems, Vol. 16, No.
5, May 2005.
(PDF)
Understanding and Dealing with
Operator Mistakes in Internet Services. Kiran Nagaraja, Fábio Oliveira, Ricardo Bianchini, Richard P. Martin, Thu D.
Nguyen. In Proceedings of the 6th Symposium on Operating Systems Design
and Implementation (OSDI '04),
San Francisco, CA, Dec, 2004.
(PDF)
State Maintenance and Its Impact on the Performability of Multi-tiered
Internet Services. G. Gama, K. Nagaraja, R. Bianchini, R. P. Martin, W. Meira Jr., and T. D.
Nguyen. In Proceedings
of the 23rd Symposium on Reliable Distributed Systems (SRDS),
Florianopolis, Brazil, October 2004.
(PDF) (Postscript)
Testing of Java Web Services for Robustness.
Chen Fu, Barbara Ryder, Ana Milanova, David Wonnacott. In Proceedings of the International Symposium on Software Testing and Analysis (ISSTA
2004), Boston, MA, July, 2004.
(Postscript)
Quantifying and Improving the Availability of High-Performance Cluster-Based Internet Services.
Kiran Nagaraja, Neeraj Krishnan, Ricardo Bianchini, Richard P. Martin, Thu D. Nguyen.
In Proceedings of
SC-2003, Phoenix, AZ, November, 2003.
(PDF) (Postscript)
Compiler Directed Program-fault Coverage for Highly Available Java
Internet Services. Chen Fu, Richard P. Martin, Kiran Nagaraja, Thu D. Nguyen,
Barbara G. Ryder, David G. Wonnacott. In Proceedings of the International Conference on Dependable
Systems and Networks (DSN, IPDS track), San Francisco, CA, June
2003.
(PDF)
Using Fault Model Enforcement to Improve Availability.
Kiran Nagaraja, Ricardo Bianchini, Richard P. Martin, Thu D. Nguyen. In Proceedings of the Second
Workshop on Evaluating and Architecting System dependabilitY (EASY
'02), held
in conjunction with ASPLOS-X, San Jose, CA, October, 2002.
(PDF) (Postscript)
Evaluating the Impact of Communication Architecture on the
Performability of Cluster-Based Services. Kiran Nagaraja, Neeraj Krishnan, Ricardo Bianchini, Richard P. Martin,
Thu D. Nguyen. In Proceedings of the Ninth
International Symposium on High Performance Computer Architecture (HPCA
'03), Anaheim, CA, February
2003.
(PDF) (Postscript)
Results of fault-injection experiments on the PRESS web server. The graphs show the performance of the web server under a wide set of faults injected using Mendosus.
Using Fault Injection and Modeling to Evaluate the Performability of
Cluster-Based Services. Kiran Nagaraja, Xiaoyan Li, Bin Zhang, Ricardo Bianchini, Richard P.
Martin, Thu D. Nguyen. Proceedings
of the 4th USENIX Symposium on Internet Technologies and Systems (USITS
'03),
Seattle, WA, March 2003.
(PDF) (Postscript)
Mendosus: A SAN-based Fault-Injection Test-Bed for
Construction of Highly Available Network Services. Xiaoyan Li, Richard P. Martin, Kiran Nagaraja, Thu D. Nguyen and Bin
Zhang. In SAN-1 - Novel Uses
of System Area Networks - held in conjunction with HPCA'02,
Cambridge, MA, Feb 2002
(PDF) (Postscript)
Improving Cluster Availability Using Workstation Validation.
Taliver Heath, Richard P. Martin, Thu D. Nguyen. In SIGMETRICS 2002, Marina Del
Rey, CA, June 2002
(PDF)
Performability Modeling and Analysis of Fault Tolerance
Support in Communication Protocol. S. Kaur, R. Martin, and T. D. Nguyen.
DCS-TR-426, Department of Computer Science, Rutgers University,
Technical Report (MS thesis).
(.ps.Z)
Using Distributed Data Structures for Constructing
Cluster-Based Services. Richard P. Martin, Kiran Nagaraja, T. D. Nguyen. In Proceedings of the First Workshop on Evaluating and Architecting
System dependabilitY (EASY), Gothenburg, Sweden, July 2001
(Postscript)
Talks
Understanding and Dealing with Operator Mistakes
in Internet Services
@ OSDI '04 -
Operating Systems Design and Implementation, San Francisco, CA,
December 2004.
(PowerPoint)
Using Fault Injection and Modeling to Evaluate the Performability of
Cluster-Based Services
@ USITS'03 - 4th USENIX Symposium on Internet Technologies and Systems,
Seattle, WA, March 2003.
(PowerPoint)
Evaluating
the Impact of Communication Architecture on the Performability of
Cluster-Based Services
@ HPCA-9 - Ninth
International Symposium on High Performance Computer Architecture.
Anaheim, California, February 8-12, 2003
(PowerPoint)
Using Fault Model Enforcement to Improve
Availability
@ EASY 2002
- Second Workshop on Evaluating and Architecting System dependabilitY -
held in conjunction with ASPLOS-X, San Jose, October, 2002
(PowerPoint)
Mendosus:
A SAN-based Fault-Injection Test-Bed for Construction of Highly
Available Network Services
@ SAN-1 -
Novel Uses of System Area Networks - held in conjunction with HPCA'02,
Cambridge, MA, Feb 2002
(PowerPoint)