Overview

Qorus workflows should always target comprehensive error recoverability: a workflow should be able to handle any recoverable error by design. This means that, assuming the input data are correct and that end systems and network transports are available (or become available within the retry periods defined by the workflow and server settings), the workflow will complete successfully even if errors occur along the way.

The following sections describe the design and implementation constraints for a workflow to meet these requirements, the common errors that must be dealt with, and how to deal with them.

Design and Implementation Constraints

The following are some examples of conditions that must be addressed for a workflow to meet comprehensive error recoverability requirements:

Requirement 1: Steps Perform One Atomic Action: Each step must be designed to perform a single atomic action (this can be several actions that occur in sequence, provided that they are only ever executed together, even in the case of errors). Otherwise, the workflow cannot be recovered properly when errors occur.

If a step performs more than one action, and one of the actions fails, then the Qorus system cannot ensure that the workflow will be restarted at the point of failure when recovering, because the step is the lowest restartable element in a workflow. In this case, adjust your design by splitting each atomic action into a separate step.
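The effect of keeping one atomic action per step can be illustrated with a minimal Python sketch. This is not Qorus code: the runner, the step names, the status strings, and the two in-memory "end systems" are all hypothetical stand-ins chosen for illustration. It shows that when each action is its own step, a recovery run resumes at the failed step and does not repeat the action that already succeeded.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[dict], None]]

def run_workflow(steps: List[Step], order: dict, status: Dict[str, str]) -> bool:
    """Run steps in order, skipping steps that are already complete.

    Returns True when every step is complete, False as soon as a step fails.
    """
    for name, action in steps:
        if status.get(name) == "COMPLETE":
            continue  # recovery resumes at the first incomplete step
        try:
            action(order)
        except Exception:
            status[name] = "RETRY"  # the step is the smallest restartable unit
            return False
        status[name] = "COMPLETE"
    return True

# Hypothetical end systems, modeled as lists of received order ids.
billing: List[str] = []
shipping: List[str] = []
shipping_state = {"up": False}

def create_invoice(order: dict) -> None:
    billing.append(order["id"])

def book_shipment(order: dict) -> None:
    if not shipping_state["up"]:
        raise ConnectionError("shipping system unavailable")
    shipping.append(order["id"])

steps: List[Step] = [("create_invoice", create_invoice),
                     ("book_shipment", book_shipment)]
order = {"id": "A-1"}
status: Dict[str, str] = {}

first_run = run_workflow(steps, order, status)   # fails in book_shipment
shipping_state["up"] = True                      # end system comes back
second_run = run_workflow(steps, order, status)  # only book_shipment runs
```

Had both actions been placed in a single step, the recovery run would have had to repeat the invoice creation as well, producing a duplicate invoice.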

Requirement 2: End Systems Must Remain Consistent: It must not be possible for conditions out of Qorus's control to cause inconsistencies in end systems such that no further action can be taken on the workflow's data and the workflow stalls.

For example, non-repeatable functions in end systems called by Qorus workflows must be implemented so that they either succeed or are rolled back completely; actions must be atomic. If a condition out of Qorus's control (such as a power failure on a server hosting an application while the application is executing a function called by Qorus) can cause an application to reach an inconsistent state where further actions cannot be taken on the data in question, then that state will have to be corrected before the workflow can continue. To avoid this, all actions in end systems must be atomic.

Requirement 3: Non-Repeatable Steps Must Include Validation: Always include validation code for steps that cannot be repeated. A transport layer failure, an application failure, or a Qorus failure could cause the step to be recovered even though the action was in fact completed successfully in the end system.

For steps that use network communication to trigger an action in a remote system, where that action cannot be repeated for the same input data, a missing answer from the end system does not prove that nothing happened: the action may have been executed and the response message lost (for example, due to network problems). In this case, the step should implement validation code that checks the end system to determine whether the action succeeded.
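The lost-response case can be sketched in Python as follows. This is an illustrative model only, not the Qorus API: the `EndSystem` class, the function names, and the plain status strings are assumptions standing in for a real end system, a step's primary and validation code, and the corresponding Qorus statuses.

```python
class ResponseLost(Exception):
    """Raised when the request was sent but no reply came back."""

class EndSystem:
    """Hypothetical end system whose create call cannot be repeated."""

    def __init__(self, lose_first_reply: bool) -> None:
        self.accounts = set()
        self.lose_reply = lose_first_reply

    def create_account(self, account_id: str) -> None:
        if account_id in self.accounts:
            raise ValueError("duplicate account")
        self.accounts.add(account_id)      # the action is executed...
        if self.lose_reply:
            self.lose_reply = False
            raise ResponseLost("no reply received")  # ...but the reply is lost

    def account_exists(self, account_id: str) -> bool:
        return account_id in self.accounts

def primary_step(system: EndSystem, account_id: str) -> str:
    """Primary step code: performs the action and only flags errors."""
    try:
        system.create_account(account_id)
    except ResponseLost:
        return "RETRY"
    return "COMPLETE"

def validate_step(system: EndSystem, account_id: str) -> str:
    """Validation code: asks the end system whether the action happened."""
    return "COMPLETE" if system.account_exists(account_id) else "RETRY"

def run_step(system: EndSystem, account_id: str, max_retries: int = 3) -> str:
    status = primary_step(system, account_id)
    retries = 0
    while status == "RETRY" and retries < max_retries:
        retries += 1
        # A real system would wait for a recovery delay before validating.
        status = validate_step(system, account_id)
        if status == "RETRY":
            status = primary_step(system, account_id)
    return status

system = EndSystem(lose_first_reply=True)
result = run_step(system, "acct-1")
```

Without the validation function, a blind retry of `create_account` would fail with a duplicate error even though the account was created on the first attempt.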

Requirement 4: Check the Validity of Input Data: Always check the validity/consistency of input data if inconsistent data is a possibility, and inconsistent data can cause problems in end systems or the proper execution of the workflow.

Inconsistent input data can lead to a situation where a workflow stalls in the middle of execution and can never be recovered.

For example, suppose that inconsistent data are only detected in step five of a ten-step workflow, after changes have already been made in four applications. In the worst case, data must be cleaned up manually in the first four systems to back out the workflow's actions; in the best case, resources (disk space, etc.) have probably been wasted.

To avoid this, all necessary measures must be taken to ensure the validity of the data before starting the workflow's logic that writes the data to end systems.

This can be done in the attach logic (an attribute of the workflow object), for example, or, if the validity of the data does not depend on changing states in other applications, in the first step of the workflow.
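A minimal Python sketch of validating input data before any end system is touched is given below. The field names, the rules, and the `start_workflow` function are hypothetical; in a real workflow the equivalent check would live in the attach logic or the first step, as described above.

```python
from typing import List

# Hypothetical required fields for an order; adjust to the real data model.
REQUIRED_FIELDS = ("order_id", "customer_id", "quantity")

def validate_order(order: dict) -> List[str]:
    """Return a list of problems; an empty list means the data is usable."""
    problems = [f"missing field: {name}"
                for name in REQUIRED_FIELDS if name not in order]
    quantity = order.get("quantity")
    if quantity is not None and (not isinstance(quantity, int) or quantity <= 0):
        problems.append("quantity must be a positive integer")
    return problems

def start_workflow(order: dict, end_system_writes: List[str]) -> str:
    """Reject bad input before the first write to any end system."""
    if validate_order(order):
        return "ERROR"  # nothing has been written, so nothing to back out
    end_system_writes.append(order["order_id"])
    return "COMPLETE"

writes: List[str] = []
bad_result = start_workflow({"order_id": "A-1", "quantity": 0}, writes)
good_result = start_workflow(
    {"order_id": "A-2", "customer_id": "C-9", "quantity": 3}, writes)
```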

Requirement 5: All Errors Must Be Recognized and Flagged: All responses from end systems must be checked for all possible error conditions.

This point is common sense: all errors must be recognized and flagged to avoid the situation where a workflow has in fact failed but Qorus reports OMQ::StatComplete. Generally, Qorus workflows should handle errors as intelligently as possible. Every error that could occur in a workflow and that requires an automatic retry must be defined in advance in the error function, and the system behavior should be carefully considered for each error.
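One way to picture this requirement is a table of known errors consulted for every response, with anything unrecognized treated as a failure and never as success. The Python sketch below is an assumption-laden illustration: the error codes, the response shape, and the status strings are hypothetical and do not reflect the actual format of a Qorus error function.

```python
# Hypothetical error definitions: error code -> status to assign.
ERROR_DEFINITIONS = {
    "SYSTEM-TIMEOUT": "RETRY",  # temporary condition, retry automatically
    "SYSTEM-BUSY": "RETRY",
    "INVALID-DATA": "ERROR",    # retrying cannot help
}

def classify_response(response: dict) -> str:
    """Map an end-system response to a step status.

    Only an explicit success counts as complete; unknown error codes and
    malformed responses are flagged as errors instead of being ignored.
    """
    code = response.get("error")
    if code is None and response.get("ok") is True:
        return "COMPLETE"
    return ERROR_DEFINITIONS.get(code, "ERROR")

statuses = [
    classify_response({"ok": True}),
    classify_response({"error": "SYSTEM-BUSY"}),
    classify_response({"error": "INVALID-DATA"}),
    classify_response({"error": "SOMETHING-NEW"}),
    classify_response({}),
]
```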

Recovery from Complex Error Conditions

This section describes some possible error conditions necessitating the requirements in the previous section.

- Unavailability of Transport Layer (e.g., a network problem)
  - in the case of outgoing messages where the message is not received
  - in the case of reply messages where the reply from the end system is never received
- Unavailability of End System(s) (e.g., an unplanned application or server restart)
  - in the case of an end-application failure when no Qorus-initiated action is taking place
  - in the case of an end-application failure during a Qorus-initiated action
- Catastrophic Failure of Qorus Server (e.g., a power outage on the server)

By designing and implementing your workflows to the requirements in the previous section, the error conditions above can be covered with no data loss. The following conditions apply principally to Requirement 3 above.

Unavailability of Transport Layer or End Systems

These cases can be recognized by a communications failure (normally a Qore exception) or a message timeout.

In either case, if the workflow's logic cannot determine whether the message was processed by the end system before the failure, and the action can only be performed once for the input data in the end system in question, then the error defined by the workflow should have an OMQ::StatRetry status, and validation code must be defined that checks the end system to see whether the action was carried out.

This can happen, for example, if an HTTP message (or other network message) is sent and a timeout occurs. The timeout could have happened because the message was never received; because the message was received and processed but the response message was lost; or because the message was received and an error occurred during processing that prevented the response from being sent.

Because this information is critical to the further processing of the workflow/order data, the programmer must define validation code for the step object that verifies the status of the action in the end system before continuing when the step is recovered by Qorus.

Validation code should be used instead of handling the error in the step function itself, because the problem that caused the error (for example, a temporary network outage or an end-application restart) could also prevent an immediate status check from succeeding. As the validation code is run after the recovery delay (see system options qorus.recover_delay and qorus.async_delay), the chances of successfully determining the status are higher than when trying to handle the error in the step function itself.

Note: Error Handling Belongs in Validation Code, not Primary Step Code: When implementing steps, errors should be flagged (but not handled) in the primary step code; error handling should be implemented in the validation code.

Catastrophic Failure of Database, Qorus Server

Although such failures are rare, Qorus has been designed so that a system crash (power outage, database outage, etc.) is recoverable as long as the database remains consistent (the Oracle schema must be recovered to a consistent state).

The Qorus database should always be in a clustered or high-availability configuration in order to ensure database consistency. Qorus's internal design is such that catastrophic failures, such as a power failure on the UNIX system hosting the Qorus application, can always be recovered to a consistent state and will allow properly designed workflows to be recovered.

When Qorus recovers a crashed application session, all steps that were OMQ::StatInProgress are set to OMQ::StatRetry. When these steps are retried, the workflow can always recover from a Qorus system failure at any point, provided that each step that requires validation code (for example, a non-repeatable step in an end system) has it.
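The recovery sequence described above can be modeled with a short Python sketch. It is a simplified illustration under stated assumptions, not Qorus's implementation: the status strings stand in for OMQ::StatInProgress, OMQ::StatRetry, and OMQ::StatComplete, and the step names, validators, and `ledger` end system are hypothetical.

```python
from typing import Callable, Dict, List

def recover_session(step_status: Dict[str, str]) -> None:
    """After a crash, every step that was in progress is marked for retry."""
    for name in list(step_status):
        if step_status[name] == "IN-PROGRESS":
            step_status[name] = "RETRY"

def retry_steps(step_status: Dict[str, str],
                validators: Dict[str, Callable[[], bool]],
                actions: Dict[str, Callable[[], None]]) -> None:
    """Retry marked steps, consulting validation code first when present."""
    for name in list(step_status):
        if step_status[name] != "RETRY":
            continue
        validator = validators.get(name)
        already_done = validator() if validator is not None else False
        if not already_done:
            actions[name]()  # only repeat the action if it never happened
        step_status[name] = "COMPLETE"

# Hypothetical scenario: the payment reached the end system just before
# the server crashed, so the step was still recorded as in progress.
ledger: List[str] = ["pay-1"]
step_status = {"reserve_stock": "COMPLETE", "charge_card": "IN-PROGRESS"}
validators = {"charge_card": lambda: "pay-1" in ledger}
actions = {"charge_card": lambda: ledger.append("pay-1")}

recover_session(step_status)
status_after_recovery = dict(step_status)
retry_steps(step_status, validators, actions)
```

Because the validator finds the payment in the ledger, the card is not charged a second time; without it, the retry would have repeated a non-repeatable action.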

See also: Session Recovery for details on Qorus application session recovery
