Workflow Standards for e-Science

By Sarah Higgins, Aberystwyth University 

Published: February 2008

1. Digital Curation and e-Science Workflows

e-Science research relies on the ability to undertake collaborative scientific experiments using large data sets, which are often distributed across different computer networks or data grids. Automated processes, undertaken to obtain successful experiment and simulation results, encompass a number of data actions, including: acquisition, transformation, analysis, annotation, resource discovery, linking and visualisation. These processes need to be logical, structured, reliable, repeatable and verifiable, with the ability to audit results for legal requirements, research evaluation and funding applications.

Workflow techniques ensure that automated procedures are undertaken in the correct sequence, according to a defined set of rules, to achieve an overall goal. Using workflow techniques to model and implement automated experiments can ensure that the experimental objective is achieved and the results have integrity. Using standards to define and execute workflows makes it possible to share them across the e-Science community, and makes them easier to curate. Workflow repositories, both within and across organisations, store, publish and curate authored workflows, enabling research advantages through reuse, annotation, adaptation, and ultimately reversioning.
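
The idea of running automated procedures in the correct sequence according to defined rules can be sketched in a few lines of Python. This is an illustration only, not taken from any of the standards discussed here: the step names and the `DEPENDENCIES` table are hypothetical, and the standard library module `graphlib` (Python 3.9 or later) is assumed to be available.

```python
from graphlib import TopologicalSorter

# Hypothetical experiment: each step maps to the steps that must finish first.
DEPENDENCIES = {
    "acquire": set(),
    "transform": {"acquire"},
    "analyse": {"transform"},
    "annotate": {"analyse"},
    "visualise": {"analyse"},
}


def execution_order(dependencies):
    """Return one valid order in which every step follows its prerequisites."""
    # static_order() raises graphlib.CycleError if the rules are contradictory.
    return list(TopologicalSorter(dependencies).static_order())


def run(dependencies, actions):
    """Call each step's action in a valid order and return the sequence run."""
    log = []
    for step in execution_order(dependencies):
        actions[step]()   # perform the (hypothetical) data action
        log.append(step)  # keep the sequence so the run can be audited later
    return log
```

Because the ordering rules are held as data rather than buried in code, the same definition could in principle be stored, shared and rerun, which is the property the standards described below aim to provide.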

2. Standards For e-Science Workflow

Workflow representation standards originate in the business process modelling domain and solutions have been developed by a number of commercial organisations such as IBM and Microsoft. Open Standards are being developed by independent consortia including W3C (The World Wide Web Consortium), OASIS (Organization for the Advancement of Structured Information Standards), the Workflow Management Coalition (WfMC) and the OMG (Object Management Group). Some bodies are focusing effort on the development of suites of complementary standards, while others are developing individual multipurpose standards. There is, as yet, no consensus as to which standards are the most pertinent to e-Science applications and no established framework of standards for e-Science architecture.

Scientific workflow systems are often characterised by describing processes in terms of data flow, in contrast to the control flow orientation of business workflow. A number of projects are comparing the applicability of a variety of workflow standards to e-Science applications. Some advocate the development of standards specifically designed for e-Science, which take account of the computational and data transfer requirements of large data sets. There is an argument that separating representation methods from their enactment engines, and achieving interoperability across standards, are more important than the actual standards used.
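
The data flow orientation can be made concrete with a small Python sketch. It assumes a toy model in which each step names the data items it consumes and fires as soon as they exist; no explicit ordering is written down. The function `run_dataflow` and the example steps are hypothetical and do not correspond to any particular workflow system.

```python
def run_dataflow(steps, initial):
    """Fire each step as soon as all of its named inputs are available.

    steps:   dict mapping an output name to (function, list of input names)
    initial: dict of data items available at the start
    """
    data = dict(initial)
    pending = dict(steps)
    while pending:
        # A step is ready when every input it names has been produced.
        ready = [out for out, (_, inputs) in pending.items()
                 if all(name in data for name in inputs)]
        if not ready:
            raise ValueError(f"unsatisfied inputs for: {sorted(pending)}")
        for out in ready:
            func, inputs = pending.pop(out)
            data[out] = func(*(data[name] for name in inputs))
    return data


# Hypothetical steps, deliberately listed out of order: the availability of
# data, not the position in the listing, decides when each one runs.
EXAMPLE_STEPS = {
    "mean": (lambda cleaned: sum(cleaned) / len(cleaned), ["cleaned"]),
    "count": (len, ["cleaned"]),
    "cleaned": (lambda raw: [x for x in raw if x is not None], ["raw"]),
}

if __name__ == "__main__":
    print(run_dataflow(EXAMPLE_STEPS, {"raw": [1.0, None, 3.0]}))
```

A control flow description of the same experiment would instead state the sequence of steps explicitly and leave the movement of data implicit.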

3. Functionality

Successful workflow architecture depends on the use of a framework of standards, each ensuring that different processes are planned and executed effectively:

  1. Workflow Design
    • The Workflow Reference Model provides a high level framework to define a workflow management system. The model develops standardised workflow terminology and identifies the required characteristics, components and functions of a workflow system, with the primary goal of achieving interoperability across implementations. It identifies areas where standards can be developed and applied, and how they interact to enable processes to be executed successfully. Published by the WfMC in 1995, primarily to support business processes, the Workflow Reference Model could also be used to model e-Science implementations.
  2. Process Design and Definition
    • Workflow processes need to be visualised, planned and defined before implementation. Notation languages are characterised by the ability to define abstract models using graphical notation. They allow the description of tasks and activities, and definition of the order in which they are to be executed.
    • UML (Unified Modeling Language) is a general purpose modelling language, which is maintained by the OMG. UML Activity Diagrams can be used to model both the organisational and computational aspects of workflow. Extensions being developed to the standard may strengthen applicability to e-Science workflow by improving semantic and syntactical precision and the types of synchronisation possible.
    • BPMN (Business Process Modeling Notation), also maintained by the OMG, is defined for the business domain. It provides standardised graphical notation and aims to be readily understandable to all business stakeholders — analysts, developers and managers.
    • XPDL 2.0 (XML Process Definition Language) maintained by the WfMC enables the graphics and the semantics of workflow models to be expressed in XML to enable exchange between tools. It can be used with a number of modelling languages, and has been designed to interoperate with BPMN.
    • BPDM (Business Process Definition Metamodel) is another process interchange format, in the final phases of adoption by the OMG. BPDM is semantically well defined, complements UML and interoperates with BPMN.
    • Scufl (Simple Conceptual Unified Flow Language) is a combined notation and execution language which has been developed by the e-Science domain. The Taverna Workbench, an Open Source authoring and execution environment, allows graphical creation of Scufl workflows prior to enactment.
  3. Process Execution
    • WS-BPEL (Web Services Business Process Execution Language) can be regarded as the industry standard for process execution. The standard is supported by a number of software application vendors, including the Open Source applications NetBeans, Eclipse/IBM and ActiveEndpoints. It has been mapped to BPMN for graphical notation, although differences between the languages can limit success. The OMII-BPEL Project, undertaken by UCL (University College London), established the suitability of the standard for developing and executing scientific workflows in a Grid computing environment.
    • WS-BPEL is a vocabulary-rich, XML-based orchestration language, which is maintained by OASIS. It allows both abstract processes (partially specified descriptions of a process's externally visible behaviour, not intended for execution) and executable processes (fully specified processes that an engine can run) to be defined. It has good support for multiple namespaces and use of expressions. WS-BPEL was originally known as BPEL4WS and supersedes IBM's WSFL (Web Services Flow Language) and Microsoft's XLANG.
    • WS-BPEL processes interact with external entities through Web Service operations. These simple, self-contained applications enable standardised automated requests and business functions using Internet protocols and standards such as SOAP (Simple Object Access Protocol), UDDI (Universal Description Discovery and Integration) and XML. Web services within WS-BPEL are defined using the XML metadata language WSDL (Web Services Description Language).
    • Scufl has been successfully adopted for sharing workflows across the e-Science community. Part of the myGrid Project, the language is scripted using the Taverna Workbench, which also provides two enactors, and is expressed in XML using the XScufl schema. It has the advantage of accessible and straightforward Open Source tools, making it easy to author, and good support for creating nested workflows. Enacted in Taverna, it is extensible to support custom plug-ins and local Java applications.
    • YAWL (Yet Another Workflow Language) was developed by Eindhoven University of Technology and Queensland University of Technology. It was developed using Petri nets as the starting point, with the aim of using formal semantics to support all Workflow Patterns identified by the Workflow Patterns Initiative. This Open Source, concise, expressive, graphical language handles complex data, transformations and integration of Web Service standards, and can be authored using the YAWL editor. Like Scufl, YAWL originates in the higher education sector, is not maintained by a standards organisation and does not have IT industry support.
  4. Process Choreography
    • Process choreography is the specification of the ordering of potential interactions between two or more business entities in peer-to-peer collaboration. It is distinct from orchestration, which assumes process execution to have a single perspective and endpoint. Choreography can be achieved using different Web Services XML specifications maintained by the W3C. WS-CDL (Web Services Choreography Description Language) enables interoperability across peer-to-peer implementations. WSCL (Web Services Conversation Language), originally developed by Hewlett-Packard, defines the minimum set of concepts necessary to describe conversation tasks, that is, the abstract interfaces required by Web services. WSCI (Web Services Choreography Interface) can be used for choreography interface description.
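
The notion of choreography as an agreed ordering of potential interactions between peers can be sketched in Python. This is a toy model under stated assumptions, not WS-CDL, WSCL or WSCI syntax: the two parties ("lab" and "archive"), the message names and the `ALLOWED_NEXT` table are all hypothetical.

```python
# Each interaction is (sender, receiver, message). The table records which
# interactions may follow which; None marks the start of a conversation.
ALLOWED_NEXT = {
    None: {("lab", "archive", "submit_dataset")},
    ("lab", "archive", "submit_dataset"): {
        ("archive", "lab", "accept"),
        ("archive", "lab", "reject"),
    },
    ("archive", "lab", "accept"): {("lab", "archive", "send_metadata")},
    ("archive", "lab", "reject"): set(),
    ("lab", "archive", "send_metadata"): set(),
}


def conforms(conversation, allowed_next=ALLOWED_NEXT):
    """Return True if each interaction is permitted to follow the previous one."""
    previous = None
    for interaction in conversation:
        if interaction not in allowed_next.get(previous, set()):
            return False
        previous = interaction
    return True
```

No participant controls the exchange here: each peer can check an observed conversation against the shared table independently, whereas an orchestration would place the sequence inside a single coordinating process. The sketch checks ordering only and does not test whether a conversation has run to completion.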

4. Implementations

  • The LEAD Portal: The LEAD (Linked Environments for Atmospheric Discovery) Project is a collaboration between nine US institutions, based at the University of Oklahoma. The project aims to improve severe weather forecasting by making disparate meteorological data, forecast models, and analysis and visualisation tools available for interactive exploration of the weather. It uses a grid-enabled web services framework for workflow orchestration and data management. Workflow activity is logged, assigned metadata and stored for reuse. The application uses a modified form of WS-BPEL, which allows dynamic and adaptive implementation, to orchestrate complex interactions.
  • myExperiment: The myExperiment virtual research environment is a social networking site for scientists, which enables workflows and related information to be shared, curated and executed. Scientists are also able to install their own myExperiment servers and federate them. This federated repository allows the encapsulation of scientific objects, so that all digital material relating to a scientific experiment can be retained as a collection. myExperiment currently supports sharing and execution of Scufl workflows. Support for other scripting environments, and for workflow systems such as Triana and Kepler, is being developed.
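
The practice of logging workflow activity and assigning metadata so that a run can be stored and audited can be sketched as follows. This is a generic illustration and does not describe how LEAD or myExperiment is implemented: the function `run_logged`, the record fields and the choice of a SHA-256 checksum over a JSON serialisation are all assumptions made for the example.

```python
import hashlib
import json
from datetime import datetime, timezone


def run_logged(steps, data, workflow_id):
    """Run (name, function) steps in order; return the result and a log.

    Each log record carries hypothetical metadata fields: the workflow
    identifier, the step name, a UTC timestamp and a checksum of the output.
    """
    records = []
    for name, func in steps:
        data = func(data)
        # Serialise deterministically so the same output gives the same hash.
        serialised = json.dumps(data, sort_keys=True).encode("utf-8")
        records.append({
            "workflow": workflow_id,
            "step": name,
            "finished": datetime.now(timezone.utc).isoformat(),
            "output_sha256": hashlib.sha256(serialised).hexdigest(),
        })
    return data, records
```

Storing such records alongside the workflow definition is one way a later reader could check that a rerun produced the same intermediate outputs.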

5. Additional Resources

Standards Organisations

Standards and Specifications Documentation

Further Reading

6. Related DCC Resources

  • Goble, Carole (2007). "Curating Services and Workflows: the Good, the Bad and the Downright Ugly". Keynote address at the 3rd International Digital Curation Conference, 2007. Slides are available from the Conference Programme page.
