A scientific workflow system is a specialized form of a workflow management system designed specifically to compose and execute a series of computational or data manipulation steps, or a workflow, in a scientific application.


Applications

Distributed scientists can collaborate on conducting large-scale scientific experiments and knowledge discovery applications using distributed systems of computing resources, data sets, and devices. Scientific workflow systems play an important role in enabling this vision. More specialized scientific workflow systems provide a visual programming front end that lets users construct their applications as a visual graph by connecting nodes together, and tools have also been developed to build such applications in a platform-independent manner. Each directed edge in the graph of a workflow typically represents a connection from the output of one application to the input of the next. A sequence of such edges may be called a pipeline.

A bioinformatics workflow management system is a specialized scientific workflow system focused on bioinformatics.
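The graph view described above can be illustrated with a minimal sketch. It assumes Python 3.9 or later (for the standard-library `graphlib` module); the step names `load`, `clean`, and `summarize` and the helper `run_workflow` are hypothetical and do not come from any particular workflow system.

```python
from graphlib import TopologicalSorter

# Hypothetical steps: each receives a dict of its upstream outputs.
def load(_inputs):
    return [3, 1, 2]

def clean(inputs):
    return sorted(inputs["load"])

def summarize(inputs):
    data = inputs["clean"]
    return {"n": len(data), "max": max(data)}

STEPS = {"load": load, "clean": clean, "summarize": summarize}

# Directed edges of the workflow graph: each step maps to the steps
# whose outputs it consumes.
EDGES = {"load": set(), "clean": {"load"}, "summarize": {"clean"}}

def run_workflow(steps, edges):
    """Run every step after its upstream steps and return all outputs."""
    outputs = {}
    for name in TopologicalSorter(edges).static_order():
        upstream = {dep: outputs[dep] for dep in edges[name]}
        outputs[name] = steps[name](upstream)
    return outputs

results = run_workflow(STEPS, EDGES)
```

Here the chain `load` to `clean` to `summarize` is a pipeline in the sense used above. Real systems add scheduling, data staging, and error handling on top of this basic ordering.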


Scientific workflows

The simplest computerized scientific workflows are scripts that call in data, programs, and other inputs and produce outputs that might include visualizations and analytical results. These may be implemented in programs such as R or MATLAB, using a scripting language such as Python with a command-line interface, or more recently using open-source web applications such as Jupyter Notebook.

There are many motives for differentiating scientific workflows from traditional business process workflows. These include:
* providing an easy-to-use environment for individual application scientists to create their own workflows
* providing interactive tools that let scientists execute their workflows and view their results in real time
* simplifying the process of sharing and reusing workflows between scientists
* enabling scientists to track the provenance of the workflow execution results and the workflow creation steps

By focusing on the scientists, the design of a scientific workflow system shifts away from the workflow scheduling activities typically considered by grid computing environments for optimizing the execution of complex computations on predefined resources, and toward a domain-specific view of what data types, tools and distributed resources should be made available to the scientists, and how to make them easily accessible with specific quality-of-service requirements.

Scientific workflows are now recognized as a crucial element of the cyberinfrastructure, facilitating e-Science. Typically sitting on top of a middleware layer, scientific workflows are a means by which scientists can model, design, execute, debug, re-configure, and re-run their analysis and visualization pipelines. Part of the established scientific method is to create a record of the origins of a result, how it was obtained, the experimental methods used, machine calibrations and parameters, etc. It is the same in e-Science, except that provenance data are a record of the workflow activities invoked, services and databases accessed, data sets used, and so forth. Such information is useful for a scientist to interpret their workflow results and for other scientists to establish trust in the experimental result.
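As a rough illustration of what a provenance record might contain, the sketch below wraps a single workflow step and logs its parameters together with digests of its input and output. The helper names (`fingerprint`, `run_with_provenance`), the `scale` step, and the record fields are all assumptions made for this example; actual workflow systems define their own provenance models.

```python
import hashlib
import json
from datetime import datetime, timezone

def fingerprint(value):
    """Return a short SHA-256 digest of a JSON-serializable value."""
    payload = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]

def run_with_provenance(step_name, func, data, params, log):
    """Run one step and append a provenance entry describing the call."""
    result = func(data, **params)
    log.append({
        "step": step_name,
        "params": params,
        "input_digest": fingerprint(data),
        "output_digest": fingerprint(result),
        "finished_at": datetime.now(timezone.utc).isoformat(),
    })
    return result

# Hypothetical analysis step used only for illustration.
def scale(values, factor):
    return [v * factor for v in values]

provenance_log = []
scaled = run_with_provenance(
    "scale", scale, [1, 2, 3], {"factor": 10}, provenance_log
)
```

Another scientist could later recompute the digests from the same inputs to check that a reported result really came from the recorded step and parameters.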


Sharing workflows

Social networking communities such as myExperiment have been developed to facilitate sharing and collaborative development of scientific workflows.
Galaxy provides collaborative mechanisms for editing and publishing workflow definitions and workflow results directly on the Galaxy installation.


Analysis

A key assumption underlying all scientific workflow systems is that the scientists themselves will be able to use a workflow system to develop their applications based on visual flowcharting, logic diagramming, or, as a last resort, writing code to describe the workflow logic. Powerful workflow systems make it easy for non-programmers to first sketch out workflow steps using simple flowcharting tools, and then hook in various data acquisition, analysis, and reporting tools. For maximum productivity, details of the underlying programming code should normally be hidden.

Workflow analysis techniques can be used to verify certain properties of such workflows before executing them. An example of a theoretical formal analysis framework for the verification and profiling of the control-flow and data-flow aspects of scientific workflows for the Discovery Net system is described in the paper "The design and implementation of a workflow analysis tool" by Curcin et al.

The authors note that introducing program analysis and verification into the workflow world requires a detailed understanding of the execution semantics of the workflow language, including execution properties of nodes and arcs in the workflow graph, understanding functional equivalencies between workflow patterns, and many other issues. Doing such analysis is difficult, and addressing these issues requires building on formal methods used in computer science research (e.g. Petri nets) to develop user-level tools for reasoning about the properties of both workflows and workflow systems. The lack of such tools in the past stopped automated workflow management solutions from maturing from nice-to-have academic toys to production-level tools used outside the narrow circle of early adopters and workflow enthusiasts.
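To make the idea of checking a workflow before running it concrete, here is a deliberately simple sketch. It is not the Petri-net-based analysis discussed above, and not the Discovery Net tool; it only checks two structural properties, that every dependency refers to a defined step and that the graph has no cycles. The function `check_workflow` and the example graphs are hypothetical, and Python 3.9 or later is assumed for `graphlib`.

```python
from graphlib import CycleError, TopologicalSorter

def check_workflow(edges):
    """Return a list of problems found in a workflow graph, without running it.

    `edges` maps each step name to the set of steps it depends on.
    """
    problems = []
    # Every dependency must itself be a defined step.
    for step, deps in edges.items():
        for dep in sorted(deps):
            if dep not in edges:
                problems.append(f"{step}: depends on undefined step '{dep}'")
    # The graph must be acyclic so that an execution order exists.
    try:
        TopologicalSorter(edges).prepare()
    except CycleError as err:
        problems.append("cycle: " + " -> ".join(err.args[1]))
    return problems

valid = {"load": set(), "clean": {"load"}, "report": {"clean"}}
cyclic = {"a": {"b"}, "b": {"a"}}
dangling = {"report": {"missing"}}
```

Calling `check_workflow(valid)` returns an empty list, while the other two graphs each yield one reported problem. Verifying semantic properties such as data-type compatibility or functional equivalence of patterns needs the heavier formal machinery the text describes.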


Notable systems

Notable scientific workflow systems include:
* Anduril, bioinformatics and image analysis
* Apache Airavata, a general purpose workflow management system
* Apache Airflow, a general purpose workflow management system
* Apache Taverna, widely used in bioinformatics, astronomy, biodiversity
* BioBIKE, a cloud-based bioinformatics platform
* Bioclipse, a graphical workbench with a scripting environment that lets you perform complex actions as a kind of workflow
* Collective Knowledge, a Python-based general workflow and experiment crowdsourcing framework with JSON API and cross-platform package manager
* Common Workflow Language, a community-developed YAML-based workflow language, supported by multiple engine implementations
* Cuneiform, a functional workflow language
* Discovery Net, one of the earliest examples of a scientific workflow system
* Galaxy, initially targeted at genomics
* GenePattern, a powerful scientific workflow system that provides access to hundreds of genomic analysis tools
* Kepler, a scientific workflow management system
* KNIME, an open-source data analytics platform
* Pegasus, an open-source scientific workflow management system
* OnlineHPC, online scientific workflow designer and high performance computing toolkit
* Orange, open source data visualization and analysis
* Pipeline Pilot, graphical programming with many tools to address cheminformatics workflows
* Swift parallel scripting language, a scripting language with many of the capabilities of scientific workflow systems built in
* VisTrails, a scientific workflow system developed in Python

More than 280 computational data analysis workflow systems have been identified, although the distinction between "data analysis workflows" and "scientific workflows" is fluid, as not all analysis workflow systems are used for scientific purposes.


See also

* Bioinformatics workflow management systems
* e-Science
* Grid computing
* Workflow engine


External links

* Yu, Jia; Buyya, Rajkumar (2005). "A taxonomy of scientific workflow systems for grid computing". ACM SIGMOD Record. 34 (3): 44. doi:10.1145/1084805.1084814. CiteSeerX 10.1.1.63.3176. S2CID 538714.
* "Scientific workflow systems - can one size fit all?", a paper in CIBEC'08 comparing the features of multiple scientific workflow systems
* List of software tools related to scientific workflows on the DataONE website