Abstract:
The search for fundamental
principles of fault tolerance in human-engineered complex dynamic systems is
very new. The physics of an individual component failure cannot
adequately explain the pathological behaviors observed in the aggregated
system. Fault propagation in incidents such as the failure of the electric
power infrastructure in the western United States in the summer of 1996
[CNN 96, PBS 96] remains unpredictable; analysis is largely limited to a
search for the individual triggering errors.
Complex macroscopic behaviors emerge as a consequence of the nonlinear
dynamics of interactions between linked components. System behavior may
range from strict order to chaos with great sensitivity to initial
conditions embedded in the physics of individual failures. This sensitivity
is known as the butterfly effect [Mainzer 93].
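The sensitivity to initial conditions described above can be illustrated
with a minimal sketch. It uses the logistic map, a standard toy model of
nonlinear dynamics that is not part of the proposal itself; the parameter
values (r = 2.5 for the ordered regime, r = 4.0 for the chaotic regime) and
the function names are assumptions chosen for illustration only.

```python
def logistic_trajectory(r, x0, steps):
    """Iterate the logistic map x -> r * x * (1 - x) and return all states."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def max_separation(r, x0, delta, steps):
    """Largest gap between two trajectories whose starts differ by delta."""
    a = logistic_trajectory(r, x0, steps)
    b = logistic_trajectory(r, x0 + delta, steps)
    return max(abs(p - q) for p, q in zip(a, b))


# Same tiny perturbation (1e-10) in two regimes of the same system.
ordered = max_separation(2.5, 0.2, 1e-10, 100)  # converges to a fixed point
chaotic = max_separation(4.0, 0.2, 1e-10, 100)  # perturbation is amplified
```

In the ordered regime the perturbation stays negligible, while in the
chaotic regime it grows until the two trajectories are effectively
unrelated, which is the behavior the butterfly effect refers to.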
Many biological and chemical systems exist where microscopic flux and
chaos are offset by macroscopic order. Based on this concept, we formulate
analytical notions of pervasive fault tolerance in human-engineered
complex dynamic systems. These systems, although architecturally similar to
physical systems, may structurally be quite different. For example,
component subsystems may not be microscopic particles following the laws of
classical mechanics where causality is deterministic. At the lowest level of
decomposition, however, the macroscopic effect is triggered by a single
fault manifestation: an emerging physical defect in hardware, an erroneous
software state, or a human operator error. We propose to develop methods for
determining regions of stability by deriving and finding critical values of
physical parameters where the subsequent behavior of the macroscopic system
changes abruptly. Both theoretical and experimental analyses are essential.
For theoretical analyses, we model complex dynamic systems as hybrid
interacting automata whose continuously varying dynamics capture the
physical process at the lowest level of abstraction. Discrete event models
at the higher levels capture the cognitive response of the system to
observed emerging physical phenomena. We have applied this concept, using
the dynamic structural behavior of materials to formulate damage-mitigating
control algorithms at the system level that extend the life of critical
mechanical components [Ray 94]. Our broader aim is to formulate analytical
models of the higher-level dynamics of component interactions triggered by
all types of
individual failures to (i) predict emerging pathological system behavior
from time-series observations of events and their dynamic interactions, and
(ii) formulate adaptive mechanisms to circumvent or mitigate the effects of
pathological behavior.
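The hybrid-automaton view described above, with continuous dynamics at the
lowest level and a discrete-event response above it, can be sketched as
follows. This is a toy illustration under stated assumptions, not the
model of [Ray 94]: the power-law damage rate, the constants k and m, the
two modes, their load levels, and the switching threshold are all
hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HybridState:
    mode: str      # discrete state: "NORMAL" or "DERATED"
    damage: float  # continuous state: accumulated damage (arbitrary units)


# Hypothetical load level applied in each discrete mode.
LOAD = {"NORMAL": 1.0, "DERATED": 0.6}


def damage_rate(load, k=0.01, m=3.0):
    """Assumed continuous dynamics: damage grows as k * load**m."""
    return k * load ** m


def step(state, dt, threshold):
    """One Euler step of the continuous flow, then the discrete guard."""
    damage = state.damage + damage_rate(LOAD[state.mode]) * dt
    mode = state.mode
    # Discrete event: the supervisor derates once damage crosses the guard.
    if mode == "NORMAL" and damage >= threshold:
        mode = "DERATED"
    return HybridState(mode, damage)


def simulate(steps, dt=1.0, threshold=0.3):
    """Return the full trace of hybrid states, starting undamaged."""
    state = HybridState("NORMAL", 0.0)
    trace = [state]
    for _ in range(steps):
        state = step(state, dt, threshold)
        trace.append(state)
    return trace


supervised = simulate(60)                            # guard active
unsupervised = simulate(60, threshold=float("inf"))  # guard never fires
```

Comparing the final damage of the two traces shows the intended effect of
a damage-mitigating supervisor: the discrete mode change slows the
continuous damage accumulation at the cost of reduced load.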
With the present state of knowledge of macroscopic behaviors in
engineered complex dynamical systems, experimental analysis is essential for
understanding and characterizing pathologies. Guiding principles from
physical systems [Haken 93] have not been verified in human-engineered
systems such as the Internet [Grossglauser 99]. We propose to undertake a
comprehensive characterization of
pathological behaviors, both syntactic and operational, through extensive
experimentation. This will be achieved by analyzing spatio-temporal patterns
in databases of event/action dynamics. Starting with Kott's catalog of
general complex system pathologies [Kott 99], we will use
information-theoretic and modeling approaches to iteratively induce a
classification and refine the characteristics of pathologies as new data
from our laboratory experiments are obtained.
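One simple information-theoretic feature that could be computed over such
event data is the Shannon entropy of event types in a sliding window. The
sketch below is an assumption about what such an analysis might look like,
not the method proposed here; the event labels and the window size are
hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(events):
    """Shannon entropy, in bits, of the empirical event-type distribution."""
    counts = Counter(events)
    total = len(events)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def windowed_entropy(events, window):
    """Entropy of each length-`window` slice, advancing one event at a time."""
    return [
        shannon_entropy(events[i:i + window])
        for i in range(len(events) - window + 1)
    ]


# Hypothetical event log: steady operation followed by a mixed fault burst.
log = ["ok"] * 8 + ["ok", "warn", "fail", "timeout"] * 2
profile = windowed_entropy(log, window=8)
```

A rise in windowed entropy marks a transition from a regular event stream
to a disordered one, which is one candidate signal for flagging the onset
of a pathology in time-series observations.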
Validation of observed pathological patterns through scientifically
designed, realistic, medium-complexity simulation experiments is essential.
We therefore propose a Failure Simulation Network for collaborative
experiments among all participants. High-fidelity physics-based models of
components, hardware-in-the-loop setups, and experimental data will be
generated
and maintained here. An open invitation to industry and academia to join
will extend the network to a collaboratory for the scientific community's
interactions with real operational issues and development of objective
criteria for evaluating failures in complex systems.
The proposed research has the broader potential of providing a scientific
basis for engineering dependability in military operations [Trivedi 98].
It envisions a fundamentally new approach to engineering and
operation of complex informational systems for pervasive fault tolerance.
Instead of specifying parameters for worst-case design of components, we
postulate designing these systems by specifying a scalable set of resources
(components) that interact to support evolving operational needs of multiple
defense applications in a dynamic and uncertain environment. Dependability
of operations will be achieved by identifying and mitigating the origins of
disorder through dynamic coordination and control of available system
resources.