Scientific Workflows:
Scientific Computing Meets Transactional Workflows
Munindar P. Singh and Mladen A. Vouk
Department of Computer Science
North Carolina State University
Raleigh, NC 27695-7534
+1.919.515.5677 (voice)
+1.919.515.7925 (fax)
singh@ncsu.edu
+1.919.515.7886 (voice)
+1.919.515.7925 (fax)
vouk@adm.csc.ncsu.edu
Abstract
We introduce the idea of Scientific Workflows as an
amalgamation of scientific problem-solving and traditional workflow
techniques. Scientific workflows share many features of business
workflows, but also go beyond them. Many known workflow results and
techniques can be leveraged in scientific settings, and many
additional features of scientific applications can be usefully
deployed in business settings. Scientific workflows promise to become
an important area of research within workflow and process automation,
and will lead to the development of the next generation of
problem-solving and decision-support environments. In the spirit of
this NSF workshop, we focus on the conceptual aspects.
1. Introduction
Workflows have drawn an enormous amount of attention in the
databases and information systems research and development communities
[Elm92], [Geo95], [Hsu93]. Over 100 workflow products of various
shapes and sizes now exist. Much of the recent research interest in
workflows has been focused on workflows as they arise in business
environments, e.g., [Ell79]. The products too
are geared to enterprise computing, e.g., [Ley94]. Although business workflows deserve the
attention they are receiving, another class of workflows emerges
naturally in sophisticated scientific problem-solving
environments. We believe this class of workflows, which we dub
scientific workflows, will become ever more important
as computing expands into the routine activities of scientists.
Indeed, there are compelling reasons why scientific workflows should
be of significance to the research community:
- Although business applications are important, some of the heaviest
users of computing are in the sciences.
- The sciences are becoming increasingly computation-intensive. It
is no longer possible for scientists to carry out their day-to-day
activities without heavy use of computing. This holds in fields and
problem areas as diverse as computational biology, chemistry,
genetics, electrical utility management, and reasoning about the
environment.
- Scientific workflows, as we understand them, are crucial to the
success of major initiatives in high-performance computing. As
parallel computing expands, systems such as PVM and standards such as
MPI encourage scientists to construct complex distributed solutions
that span networks [Bal94] and, through web-based interfaces, invite
incorporation into still more complex systems that may include
interactions with economic and business flows. Workflows represent
the logical culmination of this trend. They provide the necessary
abstractions that enable effective usage of computational resources,
and development of robust problem-solving environments that marshal
high-performance computing resources.
- Scientific techniques can be generalized and marshaled for
business workflows. These include process simulation techniques [Elm95].
Section 2 defines scientific workflows and discusses their
similarities with, and differences from, business workflows. Section
3 highlights some of the key research challenges in scientific
workflows. Section 4 describes two of our recent prototype systems
that incorporate scientific workflow concepts, and shows how they
might be synthesized into a powerful theory and tools for scientific
workflow management.
We emphasize that the principal aim of this document
is to identify the key issues, and some promising ways of thinking
about them, rather than to present complete solutions.
2. What are Scientific Workflows?
We use the term scientific workflows as a blanket term to
describe series of structured activities and computations that arise
in scientific problem-solving. In many science and engineering areas,
the use of computation is not only heavy, but also complex and
structured with intricate dependencies. Graph-based notations, e.g.,
generalized activity networks (GAN), are a natural way of representing
numerical and human processing [Den96, Elm95, Elm66]. These
structured activities are often termed studies or
experiments. However, they bear the following similarities
to what the databases research community calls workflows.
- Scientific problem-solving usually involves the invocation of a
number and variety of analysis tools. However, these are typically
invoked in a routine manner. For example, the computations involve
much detail (e.g., sequences of format translations that ensure that
the tools can process each other's outputs), and often routine
verification and validation of the data and the outputs. As
scientific data sets are consumed and generated by the pre- and
post-processors and simulation programs, the intermediate results are
checked for consistency and validated to ensure that the computation
as a whole remains on track.
- Semantic mismatches among the databases and the analysis tools
must be handled. Some of the tools are designed for performing
simulations under different circumstances or assumptions, which must
be accommodated to prevent spurious results. Heterogeneous databases
are extensively accessed; they also provide repositories for
intermediate results. When the computation runs into trouble,
semantic rollforward must be attempted; just as for business
workflows, rollback is often not an option.
- Many large-scale scientific computations of interest are
long-term, easily lasting weeks if not months. They can also involve
much human intervention. This is especially so during the early
stages of process (workflow) design. However, as they are debugged,
the exceptions that arise are handled automatically. Thus, in the
end, the production runs frequently require no more than semiskilled
human support. The roles of the participating humans involved must be
explicitly represented to enable effective intervention by the right
person.
- The computing environments are heterogeneous. They include
supercomputers as well as networks of workstations and supercomputers.
This puts additional stress on the run-time support and management.
Also, users typically want some predictability of the time
it would take for a given computation to complete. Making estimates
of this kind is extremely complex and requires performance modeling
of both computational units and interconnecting networks.
Consequently, it is appropriate to view these coarse-granularity,
long-lived, complex, heterogeneous, scientific computations as
workflows. Although we introduce the term scientific
workflows here, we emphasize that the activities this
term covers are to a large extent already carried out by
practitioners in scientific computing. However, by describing these
activities as workflows, we hope to bring to bear on them the advanced
techniques being developed in workflows research. These include
sophisticated notions of workflow specification and of toolkits and
environments for describing and managing workflows. In this way,
scientific workflows will be to problem-solving environments what
business workflows are to enterprise integration. Further, by making
the connections explicit, we also hope to draw upon research in
software process modeling and engineering [Cur92], which has no obvious correlate at an
appropriately high level in scientific computing.
3. Challenges
Scientific workflows go beyond business workflows. Therefore, it
stands to reason that existing tools, which are inadequate for
business settings, would also be inadequate for scientific workflows.
Certain research challenges must be surmounted in order for scientific
workflow management to become practicable.
We identify two classes of such challenges. The first category
applies to workflows in general. This category includes the usual
issues, such as i) handling exceptions, ii) handling the roles of
different participants and allowing the role bindings to change, iii)
declarative specification of control and data flow, and iv) automatic
execution and monitoring of workflows to meet stated specifications.
Existing work on these issues can be leveraged for scientific
computing when workflow techniques are applied there.
The second category is specific to scientific workflows and
includes the features required for scientific computations, but which
may not be adequately addressed in traditional workflows research.
This category includes issues such as i) the ability to handle a vast
number and variety of analysis tools, not just database systems, ii)
interfacing to a diverse array of computational environments including
supercomputers, and iii) the ability to handle activity mixes that are
different from typical business profiles. These would be extensions
to current workflow research, but would pay off ultimately in future
business applications.
Scientific workflows often begin as research workflows
and end up as production workflows. Early in the lifecycle,
they require considerable human intervention and collaboration; later
they begin to be executed increasingly automatically. Thus in the
production mode, there is typically less room for collaboration at the
scientific level and the computations are more long-lived. This
happens partly because of limitations of the available technology. We
speculate that if true workflow technology were available to manage
scientific computations, there would be a reduced push to automate
everything and the quality of the solutions obtained could be improved
by involving the right humans at the appropriate places.
Be that as it may, during the research phase, scientific workflows
need to be enacted and animated far more intensively than business
workflows. In this phase, which is more extensive than the
corresponding phase for business workflows, the emphasis is on
execution with a view to design, and thus naturally includes iterative
execution. The corresponding activity can be viewed as a correlate of
business process engineering. For this reason, the approaches for
constructing, managing, and coordinating process models will find
useful application in scientific settings, if only the main problems
are cast appropriately. Also, the techniques for animation and
enactment can be fed into business process design. Thus ideas from
process modeling can be incorporated, but because of the intensity of
the tasks and the stress on enactment, those ideas must be realized
using general workflow techniques. In the production phase, there is
still need for human intervention, more than present scientific
environments can support. True extensions will be attained by
extending scientific environments to use workflow techniques, rather
than by restricting them to fully automatic distributed systems.
Some of the features that scientific workflows need and that can be
imported from classical workflow paradigms are i) succinct and
natural, declarative specification of the workflows themselves, ii)
high-level views of the computations, iii) elegant incorporation of
human decisions into the process, and iv) coordination and
synchronization with other scientific and business workflows.
On the other hand, characteristics that go beyond business
workflows are also important. These include i) preponderance of
analysis tools relative to databases, ii) relative uniqueness of each
workflow, particularly during the research phase when there is less
opportunity to use canned or "normal" solutions, iii) explicit
representation of knowledge needed at different stages, and iv)
auditability of the computations when their results are used to make
decisions that carry regulatory or legislative implications.
Considerable progress has been made both in i) the implementation
of complex systems of scientific computations, and ii) workflow
specification and scheduling. Despite this, there is currently no
unified theory or system that formalizes scientific workflows as
defined above.
4. Prototypes and Research Directions
This has motivated us to attempt to fuse our experiences from the
scientific problem-solving and computational community with those from
the workflow community. Following are descriptions of two recent
independent efforts in these areas that we have been engaged in. One
system is used for specifying and scheduling arbitrary workflows, the
other for enacting and managing scientific computations. We show
how their limitations with respect to scientific workflows may be
addressed by unifying them into a cohesive approach that also
accommodates the insights of colleagues at other institutions. This
promises rich rewards in building workflow management systems that can
handle scientific as well as business workflows.
4.1. The MCNC Environmental Decision Support System
Although computations that can be appropriately studied as scientific
workflows arise in a number of areas, we give one specific example so
as to ground our discussion in our experience. We consider the case
of study management for environmental applications.
The Environmental Decision Support System (EDSS) is an experimental
system being constructed by the MCNC North Carolina Supercomputing
Center (NCSC) in collaboration with NC State [Amb95, Vou95]. It
involves a study planner, a scheduler, a visualization subsystem,
and an object-oriented repository interconnected by a lightweight
"software bus" [Bal96]. The system provides access to heterogeneous
database systems that, in the near future, will include a GIS. A
study is modeled as a partial order
of program invocations and (possibly) human interventions. The
partial order describes the flow of data from one program to the next.
Each program performs some useful function, such as a simulation,
visualization, or data reduction, and consumes and produces scientific
and other data sets. Although an EDSS prototype is in the
alpha-testing phase, the full system will require integration of
scientific and appropriate economic and business models and workflows
so that regulatory decisions can be fully qualified. The original EDSS
design plans call for rule-based process flow-control [Coa93].
However, the current graph-based
implementation stops short of that, mostly for lack of an adequate
formal semantic and structural specification framework. We hope that
the theory of scientific workflows will provide this framework.
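To make the notion of a study as a partial order concrete, the following is a minimal sketch in Python. It is not EDSS code: the step names, the `study` mapping, and both helper functions are hypothetical and serve only to show how data-flow dependencies among program invocations and a human intervention induce an execution order.

```python
from graphlib import TopologicalSorter

# Hypothetical study: each step maps to the set of steps whose output it consumes.
study = {
    "preprocess_emissions": set(),
    "preprocess_meteorology": set(),
    "run_simulation": {"preprocess_emissions", "preprocess_meteorology"},
    "analyst_review": {"run_simulation"},  # a human intervention, not a program
    "visualize": {"analyst_review"},
}


def execution_order(steps):
    """Return one linear order that is consistent with the partial order."""
    return list(TopologicalSorter(steps).static_order())


def ready_batches(steps):
    """Yield groups of steps whose inputs are all available.

    Steps in the same group are not ordered with respect to each other,
    so a scheduler would be free to run them concurrently.
    """
    sorter = TopologicalSorter(steps)
    sorter.prepare()
    while sorter.is_active():
        batch = sorted(sorter.get_ready())
        yield batch
        sorter.done(*batch)


batches = list(ready_batches(study))
```

Here the two preprocessing steps fall into the same batch because the partial order leaves them unordered, while the review step must wait for the simulation that feeds it.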
4.2. Workflow Specification and Scheduling a la Carnot
One of the authors has been involved in the design and implementation
of three workflow specification and execution systems in industry.
The first system was based on an expert system shell [Sin94]; the
second was based on temporal logic [Att93]; the third on temporal
logic and process algebra [Sin96a, Sin96b]. All three systems were
implemented, the last as a fully distributed one.
What this research provides is a generic facility through which
computations could be structured in terms of selected events of the
constituent tasks. These events are those significant for the
purposes of coordinating the various tasks. Although this research
makes substantial progress in terms of understanding the distributed
events that underlie workflows and gives formal semantics and
scheduling algorithms, it remains at a somewhat low level of
abstraction. It does not provide a view of computations from the
perspective of the user. This visual aspect of the flows is well
covered by the EDSS "Planner", HeNCE [Beg93, Beg94], VPE [Don95], and
a variety of similar facilities. On the other hand, we expect that the
extension of this research into the realm of scientific workflows will
provide the process control specification and enactment framework that
EDSS and similar scientific systems need for full implementation of
their decision support functionalities.
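As a rough illustration of coordinating tasks through selected events, the sketch below checks a recorded trace of events against two kinds of intertask dependencies. The two dependency forms (an ordering constraint and an existence constraint), the event names, and the helper functions are assumptions chosen for illustration. They are not the formal semantics or the scheduling algorithms of the systems cited above, and the sketch assumes that each event occurs at most once in a trace.

```python
def satisfies_order(trace, first, second):
    """If both events occur, `first` must occur before `second`."""
    if first in trace and second in trace:
        return trace.index(first) < trace.index(second)
    return True


def satisfies_existence(trace, trigger, required):
    """If `trigger` occurs, `required` must also occur somewhere in the trace."""
    return trigger not in trace or required in trace


def violated(trace, dependencies):
    """Return the dependencies that a complete trace fails to satisfy."""
    failures = []
    for kind, a, b in dependencies:
        if kind == "order":
            ok = satisfies_order(trace, a, b)
        else:  # "existence"
            ok = satisfies_existence(trace, a, b)
        if not ok:
            failures.append((kind, a, b))
    return failures


# Hypothetical dependencies between significant events of three tasks.
dependencies = [
    ("order", "simulate.commit", "visualize.start"),
    ("existence", "simulate.abort", "cleanup.start"),
]

good_trace = ["simulate.start", "simulate.commit", "visualize.start"]
bad_trace = ["simulate.start", "simulate.abort"]
```

Only the events named in the dependencies matter for coordination; the remaining internal behavior of each task stays opaque to the checker.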
4.3. Proposed Approach
Although the above examples by no means provide an exhaustive survey
of the state of the art in scientific workflows, they are suggestive
of the kinds of techniques and technologies that are available and the
major challenges that remain to be overcome. Our proposed approach
builds on our previous systems by attempting to unify their
complementary strengths. Our objectives can be summarized succinctly as
follows:
- The search for abstractions that characterize scientific
computations in problem-solving environments,
- The identification of the key primitives that underlie them,
- The formalization of these primitives, and the implementation of
the formalizations in a semantically honest and general manner on top
of a heterogeneous execution environment.
Our approach involves a notion of partial order of computations,
as is usual. However, unlike current approaches, we allow the partial
order or digraph of computations to be specified dynamically. Thus
all possibilities do not need to be anticipated, but can instead be
encoded more compactly in a rule-based manner and automatically
invoked when necessary. The graph representation, however, enables
analysis and optimization. This can be ignored in many business
settings, but is particularly important in the computation-intensive
applications executed on supercomputers and large networks.
Eventually it will also enable metareasoning about the control
structures with a view to estimating resources required and for
producing optimal execution plans.
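One way to picture this combination of dynamic, rule-based specification with graph-based analysis is sketched below. Everything in it is hypothetical: the step names, the run-time estimates, the single rule, and the helper functions are illustrative assumptions rather than part of any existing system. A rule adds steps to the graph only when its triggering outcome arises, and a longest-path computation over the current graph gives a simple resource estimate.

```python
from graphlib import TopologicalSorter

# Hypothetical study graph: step -> set of prerequisite steps.
graph = {"simulate": set(), "validate": {"simulate"}, "report": {"validate"}}

# Hypothetical run-time estimates, in hours.
hours = {"simulate": 40.0, "validate": 1.0, "report": 0.5,
         "regrid": 3.0, "resimulate": 40.0}

# Each rule maps a (step, outcome) pair to the (before, after) edges it adds.
rules = {
    ("validate", "grid_mismatch"): [
        ("validate", "regrid"),
        ("regrid", "resimulate"),
        ("resimulate", "report"),
    ],
}


def apply_rule(graph, step, outcome):
    """Extend the graph in place if a rule matches; return True if one fired."""
    edges = rules.get((step, outcome))
    if not edges:
        return False
    for before, after in edges:
        graph.setdefault(before, set())
        graph.setdefault(after, set()).add(before)
    return True


def critical_path_hours(graph, hours):
    """Length of the longest path through the graph.

    This is the shortest possible elapsed time if unordered steps can run
    in parallel without resource limits.
    """
    finish = {}
    for step in TopologicalSorter(graph).static_order():
        start = max((finish[p] for p in graph[step]), default=0.0)
        finish[step] = start + hours[step]
    return max(finish.values(), default=0.0)


estimate_before = critical_path_hours(graph, hours)
fired = apply_rule(graph, "validate", "grid_mismatch")
estimate_after = critical_path_hours(graph, hours)
```

The exceptional branch does not have to be drawn in advance. It enters the graph only when validation reports the mismatch, at which point the same analysis yields a revised estimate.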
5. Conclusions
We have argued that scientific computations are no less structured
or complex than their business or enterprise integration cousins.
Further, scientific workflows are an interesting research concept from
the perspective of databases. One, scientific computations in
problem-solving environments, which are of great importance, have the
key features of workflows and provide a rich testbed on which to apply
workflow ideas. Two, scientific workflows are sufficiently different
from business workflows to merit separate study and will lead to a
number of interesting research problems that have not come up in
traditional business environments.
Computations more similar to scientific than to business workflows
also arise in other applications, e.g., conducting marketing analyses,
producing legal briefs, or performing decision-support analyses in
general. Consequently, many of the research advances made with
scientific workflows will also have ramifications in the broader
segments of decision-making applications. We conjecture that existing
problem-solving and workflow computing environments will be merged
into powerful decision-support environments that will find widespread
use wherever computing is prevalent today.
References
- [Amb95] J. Ambrosiano, R. Balay, C. Coats, A. Eyth, S. Fine,
D. Hils, T. Smith, S. Thorpe, T. Turner, and M. Vouk, "The
Environmental Decision Support System: Air Quality Modeling and
Beyond," Proceedings of the U.S. EPA Next Generation Environmental
Modeling Computational Methods (NGEMCOM) Workshop, Bay City,
Michigan, August 7-9, 1995.
- [Att93] P.C. Attie, M.P. Singh, A.P. Sheth, and M. Rusinkiewicz,
"Specifying and Enforcing Intertask Dependencies," Proceedings of the
19th Very Large Databases Conference (VLDB), August 1993.
- [Bal94] R. Balay and V. Wall, "Use of File Transport Wrappers for a
HeNCE/PVM Implementation of the Urban Airshed Model," PVM Users' Group
Meeting, Oak Ridge, Tennessee, May 19-20, 1994.
- [Bal96] R. Balay and M.A. Vouk, "A Lightweight Software Bus for
Prototyping Problem Solving Environments," accepted for the Special
Session on Networks and Distributed Systems in the Eleventh
International Conference on Systems Engineering, Las Vegas, 1996.
- [Beg93] A. Beguelin, J. Dongarra, A. Geist, R. Manchek, K. Moore,
and V. Sunderam, "PVM and HeNCE: Tools for Heterogeneous Network
Computing," Environments and Tools for Parallel Scientific Computing,
edited by J. Dongarra and B. Tourancheau, Advances in Parallel
Computing, Volume 6, North-Holland, 1993.
- [Beg94] A. Beguelin, J. Dongarra, A. Geist, and R. Manchek, "HeNCE:
A Heterogeneous Network Computing Environment," Scientific
Programming, Vol. 3, No. 1, pp. 49-60.
- [Coa93] C. Coats, "Classes for the Models-3 System," Requirements
Documentation for Models3 Project, EPA, January 1993.
- [Cur92] B. Curtis, M.I. Kellner, and J. Over, "Process Modeling,"
Communications of the ACM, Vol. 35(9), pp. 75-90, September 1992.
- [Den96] R.L. Dennis, D.W. Byun, J.H. Novak, K.J. Galluppi,
C.C. Coats, and M.A. Vouk, "The Next Generation of Integrated Air
Quality Modeling: EPA's Models-3," Atmospheric Environment, accepted,
in print, expected 1996.
- [Don95] J. Dongarra and P. Newton, "Overview of VPE: A Visual
Environment for Message-Passing Parallel Programming," Heterogeneous
Computing Workshop '95, Proceedings of the 4th Heterogeneous Computing
Workshop, Santa Barbara, CA, April 25, 1995.
- [Ell79] C.A. Ellis, "Information Control Nets: A Mathematical Model
of Office Information Flow," Proceedings of the Conference on
Simulation, Measurement and Modeling of Computer Systems, 1979.
- [Hsu93] M. Hsu (ed.), "Special Issue on Workflow and Extended
Transaction Systems," IEEE Data Engineering, Vol. 16(2), June 1993.
- [Elm66] S.E. Elmaghraby, "On generalized activity networks,"
J. Ind. Eng., Vol. 17, pp. 621-631, 1966.
- [Elm92] A.K. Elmagarmid, "Database Transaction Models for Advanced
Applications," Morgan Kaufmann, 1992.
- [Elm95] S.E. Elmaghraby, E.I. Baxter, and M.A. Vouk, "An Approach to
the Modeling and Analysis of Software Production Processes,"
Intl. Trans. Operational Res., Vol. 2(1), pp. 117-135, 1995.
- [Geo95] D. Georgakopoulos, M. Hornick, and A. Sheth, "An Overview of
Workflow Management: From Process Modeling to Workflow Automation
Infrastructure," Distributed and Parallel Databases, Vol. 3(2),
April 1995.
- [Ley94] F. Leymann and W. Altenhuber, "Managing Business Processes
as an Information Resource," IBM Systems Journal, Vol. 33(2),
pp. 326-348, 1994.
- [Sin94] M.P. Singh and M.N. Huhns, "Automating Workflows for Service
Provisioning: Integrating AI and Database Technologies," IEEE Expert,
Vol. 9(1), October 1994.
- [Sin96a] M.P. Singh, "Formal Semantics for Workflow Computations,"
January 1996. Extends "Semantical Considerations on Workflows:
Algebraically Specifying and Scheduling Intertask Dependencies,"
Proceedings of the 5th International Workshop on Database Programming
Languages (DBPL), September 1995.
- [Sin96b] M.P. Singh, "Synthesizing Distributed Constrained Events
from Transactional Workflow Specifications," Proceedings of the 12th
International Conference on Data Engineering (ICDE), March 1996.
- [Vou95] M.A. Vouk, R. Balay, and J. Ambrosiano, "EDSS - An
Environment for Large-Scale Numerical Computing and Decision Making,"
International IFIP/WG 2.5 Workshop on Current Directions in Numerical
Software and High Performance Computing, Kyoto, Japan, October 16-17,
1995.
© Singh & Vouk. All rights reserved. Permission to copy is
granted for research and academic purposes provided this notice is
included intact. Contact address:
singh@ncsu.edu