[Article] Design for Failure - Integration Error Handling Part 2
By CHATHURA KULASINGHE
- 17 May, 2016
Table of contents
- Scenario 2 - Sender does not expect a response with data
- Error possibility 1 - Physical breakdown, after sending acknowledgement
- Error possibility 2 - mediation flow errors
- Error possibility 3 - receiver/outbound endpoint errors
Integration error handling is a vital aspect that any organization today has to deal with. Yet the concepts and general practices related to integration error handling are not widely discussed, and the relevant content that does exist is buried in web pages that describe the features or capabilities of particular integration platform vendors or products. This article is the second in a series intended to identify and address the challenges in this space. If you haven't read the earlier article, it is recommended that you go through Design for Failure - Integration Error Handling Part 1 first, since it covers the basics of this study.
In the first part, we discussed an integration scenario, which we referred to as Scenario 1, where a particular system sends a request to an integration platform expecting a response with some business related information. In that case, the sender system must receive a response with business data immediately to continue the business operation. In the article, we discussed the basic patterns of handling integration errors that could possibly occur in such cases. This iteration of the article series focuses on a scenario where the sender system in a particular integration sends a request to the integration platform without expecting a response with any business related information. We referred to this as Scenario 2 in the earlier article.
From a business point-of-view, this type of invocation can be considered as one-way communication between the sender-system and the integration platform; in technical terms this can be called a fire-and-forget scenario from the sender-system's point-of-view.
However, even in such scenarios, the sender should at least receive an acknowledgement from the integration platform to confirm that the request was accepted for processing. Otherwise, the sender-system would have no knowledge of messages lost in transit.
Figure 1 above depicts how the integration platform receives a request and sends an acknowledgement to the sender system and then how it starts processing the message.
Even if the integration platform is conceptually depicted using a single rectangle as a unit, it typically is a collection of clusters of many middleware products. In the WSO2 context, the WSO2 Enterprise Service Bus (WSO2 ESB) acts as the heart of WSO2 integration platform and the nodes of an ESB cluster communicate with external systems leveraging transport related capabilities. The above-mentioned request receiving and acknowledgement is done by a particular node of such an ESB cluster.
But, what would happen if that particular ESB node crashes after sending the acknowledgement to the sender?
The sender-system wouldn't worry about the request message anymore since the integration platform officially took the responsibility by sending an acknowledgement. Therefore, the integration engineers need to make sure that this is handled properly at the integration platform's end.
To handle this:
- The integration platform should persist the message temporarily. The sender will not send the same request again in case of failure, so if something goes wrong while processing the message, a copy of the original message needs to be available within the integration platform.
- This persistent storage has to be shared across the ESB cluster, so that if the processing ESB node crashes after sending the acknowledgement, any of the remaining nodes can pick up the message for processing.
In figure 2 above this persistence unit has been illustrated as an external entity for the purpose of giving the idea that the persistence unit has to be independent of the main processing unit of the solution. Yet, conceptually this persistence unit also needs to be considered as a part of the integration platform in general.
In the WSO2 context, there are two approaches to implementing this mechanism, each using a different combination of integration utilities.
Approach 1
- Proxy-service/REST API/Inbound-endpoint to accept the request
- External storage, such as a JMS message queue, to persist messages
- Inbound-endpoint to read messages and start processing (after Ack is sent)
Approach 2
- Proxy-service/REST API/Inbound-endpoint to accept the request
- Message-store to persist messages
- Message-processor to read messages and start processing (after Ack is sent)
We will choose approach 2 for our example, since it also creates an opportunity to discuss two features that we did not discuss earlier in this series. Like approach 1, this too ensures ordered delivery of messages.
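For comparison, the reading side of approach 1 might look roughly like the JMS inbound endpoint below. This is only a sketch: the broker (Apache ActiveMQ at tcp://localhost:61616), the queue name orders_queue, and the sequence names jms_orders_process_sequence and jms_orders_fault_sequence are all assumptions made for illustration, and the parameter names should be checked against the documentation of the ESB version in use.

```xml
<!-- Sketch only: polls a JMS queue and injects each message into a sequence. -->
<!-- Broker URL, queue name and sequence names are hypothetical. -->
<inboundEndpoint name="orders_jms_inbound"
                 protocol="jms"
                 sequence="jms_orders_process_sequence"
                 onError="jms_orders_fault_sequence"
                 suspend="false"
                 xmlns="http://ws.apache.org/ns/synapse">
    <parameters>
        <!-- Polling interval in milliseconds -->
        <parameter name="interval">1000</parameter>
        <!-- Process polled messages one after another -->
        <parameter name="sequential">true</parameter>
        <parameter name="transport.jms.Destination">orders_queue</parameter>
        <parameter name="transport.jms.ConnectionFactoryType">queue</parameter>
        <parameter name="transport.jms.ConnectionFactoryJNDIName">QueueConnectionFactory</parameter>
        <parameter name="java.naming.factory.initial">org.apache.activemq.jndi.ActiveMQInitialContextFactory</parameter>
        <parameter name="java.naming.provider.url">tcp://localhost:61616</parameter>
        <parameter name="transport.jms.SessionAcknowledgement">AUTO_ACKNOWLEDGE</parameter>
    </parameters>
</inboundEndpoint>
```

In this arrangement the queue itself plays the role of the shared persistence unit, so any node of the cluster that runs the inbound endpoint can pick up a message that was acknowledged by a node that has since crashed.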
The implementation of this mechanism is mainly based on the system management enterprise integration pattern called the Message-Store. Additionally, we will also be following two other such integration patterns called the Wire-tap and Message-History for debugging and tracing purposes in integration error handling.
The WSO2 Integration Platform is capable of accepting requests/messages in different message formats through different transport channels (HTTP, JMS, Files, etc). Based on the transport protocol that the sender-system is compatible with, one of the following can be used within the integration platform to establish a communication channel between the two.
- Proxy Service - HTTP/SOAP
- REST API - HTTP/REST
- Inbound Endpoint - JMS, Files
To narrow the scope for this particular sample, we may consider a situation where a sender-system communicates with the integration platform over HTTP. Therefore, in this sample, we may create a mediation flow that
- Accepts requests over HTTP
- Persists requests temporarily in a message-store
- Returns 202 Accepted status message (http acknowledgement) on success, or
- Returns 500 internal server error message on failure
WSO2 Developer Studio is an Eclipse-based IDE that has been customized specifically for WSO2-related development. This sample scenario was developed using that IDE, and the relevant integration project is attached to the article so that the integration artifacts and the source code can be found there.
Let's start with the fault sequence to be executed when something goes wrong while trying to persist the message.
<sequence name="order_persist_error_sequence" xmlns="http://ws.apache.org/ns/synapse">
    <!-- Build a SOAP 1.1 fault to send back to the sender-system -->
    <makefault version="soap11">
        <code value="soap11Env:Server" xmlns:soap11Env="http://schemas.xmlsoap.org/soap/envelope/"/>
        <reason value="Not Accepted"/>
        <role/>
        <detail>Error occurred while accepting the Request.</detail>
    </makefault>
    <!-- Respond with the fault and stop mediating this message -->
    <respond/>
</sequence>
If you have followed the previous article of this series, you will already be familiar with the <makefault> mediator and what it does in this case. Note the <respond/> mediator, placed right after the <makefault> mediator, which performs a special task. When the <respond/> mediator is reached in a mediation flow, it stops further processing of the current message and responds to the client (the sender-system in this case) with whatever message is currently available in the context (here, the error message created by the <makefault> mediator). Any mediation logic placed after this point will not be invoked.
To model message-store pattern related implementations, the WSO2 Integration Platform is equipped with a logical component called the message-store. The underlying persistence provider of a particular message-store can be an in-memory store, a JMS-based message broker cluster, an RDBMS, or a custom message store implementation. However, from the perspective of the integration platform, the message-store provides an abstraction over the persistence task that it has to perform in this case.
<messageStore name="orders_message_store" xmlns="http://ws.apache.org/ns/synapse"/>
To handle message persistence related activities, let's create a separate sequence as shown below.
<sequence name="order_persist_sequence" onError="order_persist_error_sequence" trace="disable" xmlns="http://ws.apache.org/ns/synapse">
    <store messageStore="orders_message_store"/>
</sequence>
The already created message-store has been referenced inside the sequence, so that all the incoming messages would be persisted within the message-store. Similarly, the fault sequence that was created earlier has been referenced with the 'onError' attribute of this sequence, so that it would exclusively be invoked when something goes wrong within this sequence.
Now it's time to attach this segment of mediation flow to a proxy service, which would act as the trigger point of the entire execution.
<proxy name="orders_proxy_service" startOnLoad="true" trace="disable" transports="http https" xmlns="http://ws.apache.org/ns/synapse">
    <target>
        <inSequence>
            <sequence key="order_persist_sequence"/>
            <property name="FORCE_SC_ACCEPTED" scope="axis2" type="STRING" value="true"/>
        </inSequence>
    </target>
</proxy>
Inside the inSequence of the above mentioned proxy service, previously created “order_persist_sequence” has been referenced so that it would be invoked once a message is passed across this proxy service mediation flow. If “order_persist_sequence” was executed without errors, the flow will return to the inSequence and the next code line will be executed.
The FORCE_SC_ACCEPTED property is what we use to instruct the internal messaging framework of the integration-platform to respond to the client/sender-system with the 202 accepted status message.
Now, with the mediation logic we have composed:
- The proxy service inSequence invokes “order_persist_sequence”.
- “order_persist_sequence” persists requests/messages in “orders_message_store”.
- If something goes wrong, “order_persist_sequence” invokes "order_persist_error_sequence".
- "order_persist_error_sequence" terminates the mediation flow by responding to the client/sender with an error message.
- If nothing goes wrong, the mediation flow returns from “order_persist_sequence” to the main flow (inSequence).
- The mediation engine instructs the messaging framework of the platform to respond to the sender-system with the 202 Accepted status message.
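If the sender-system speaks plain HTTP/REST rather than SOAP, the same steps could presumably be exposed through a REST API instead of a proxy service. The sketch below reuses “order_persist_sequence” from above; the API name orders_api and the context /orders are hypothetical, and the fault sequence would probably need to build a non-SOAP error payload for REST clients.

```xml
<!-- Sketch only: REST API variant of the accept-persist-acknowledge flow. -->
<!-- The API name and context path are hypothetical. -->
<api name="orders_api" context="/orders" xmlns="http://ws.apache.org/ns/synapse">
    <resource methods="POST">
        <inSequence>
            <!-- Persist the request; errors are handled by the sequence's onError -->
            <sequence key="order_persist_sequence"/>
            <!-- Ask the transport to acknowledge with 202 Accepted -->
            <property name="FORCE_SC_ACCEPTED" scope="axis2" type="STRING" value="true"/>
        </inSequence>
    </resource>
</api>
```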
From this point onward, we can ignore the sender-system/client because we have made sure that the sender-system will receive a proper response in all possible cases. We have persisted the message internally within the integration platform using a message-store, and the integration platform can pick the messages from the message-store and start processing.
Note: In this sample, we have used an in-memory message-store. In a typical production environment a JMS message broker cluster can be used instead to make this setup more fault tolerant.
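As a rough sketch of the note above, a JMS-backed version of the same message-store could be declared as follows. The broker (Apache ActiveMQ at tcp://localhost:61616) and the queue name orders_queue are assumptions made for illustration; the class and parameter names follow the JMS message store as documented for WSO2 ESB, but should be verified against the version in use.

```xml
<!-- Sketch only: assumes an ActiveMQ broker reachable at tcp://localhost:61616 -->
<messageStore name="orders_message_store"
              class="org.apache.synapse.message.store.impl.jms.JmsStore"
              xmlns="http://ws.apache.org/ns/synapse">
    <parameter name="java.naming.factory.initial">org.apache.activemq.jndi.ActiveMQInitialContextFactory</parameter>
    <parameter name="java.naming.provider.url">tcp://localhost:61616</parameter>
    <!-- Hypothetical queue name used to hold the persisted requests -->
    <parameter name="store.jms.destination">orders_queue</parameter>
    <parameter name="store.jms.JMSSpecVersion">1.1</parameter>
</messageStore>
```

Because the store mediator refers to the message-store only by name, the rest of the mediation flow should not need to change when the underlying persistence provider is swapped.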
To pick the persisted messages from the message-store, and push those into another functional sequence (for mediation and to deliver to the endpoint), a component called the message-processor can be associated with the already created message-store.
There are 4 different message-processor types available with the platform out-of-the-box. For our purpose of picking and delivering the messages to a functional sequence, we will be using a message-processor type called “message-sampling-processor”.
We need to create another sequence containing the rest of the work/mediation logic, so that it can be invoked by the message-processor that we create.
<sequence name="order_process_sequence" xmlns="http://ws.apache.org/ns/synapse">
    <log level="full">
        <property name="Located" value="order_process_sequence"/>
    </log>
    <drop/>
</sequence>
In this sample case, we haven't added many actions to be executed within this sequence, other than logging the entire message with Log mediator and terminating the mediation flow with Drop mediator (however, it needs to be understood that any mediation and delivery of messages would be handled within this sequence in a typical scenario).
Note the 'level' attribute of the log mediator, with the value 'full'. This code segment demonstrates how the WSO2 Integration Platform supports the Wire-tap enterprise integration pattern by providing integration engineers with the capability of logging any message that passes through.
Now, let's create a message-processor and associate that with this sequence and the previously created message-store. Once it's created using WSO2 Developer Studio or the WSO2 ESB management console UI, the relevant code will be created as below.
<messageProcessor class="org.apache.synapse.message.processor.impl.sampler.SamplingProcessor" messageStore="orders_message_store" name="orders_message_processor" xmlns="http://ws.apache.org/ns/synapse">
    <parameter name="interval">1000</parameter>
    <parameter name="sequence">order_process_sequence</parameter>
    <parameter name="concurrency">1</parameter>
    <parameter name="is.active">true</parameter>
</messageProcessor>
This picks messages from “orders_message_store” and publishes them into “order_process_sequence”, which we have created already.
This order_process_sequence with the log and drop mediators is a sample sequence that represents all or any possible mediation logic. If something goes wrong with the mediation flow at this point, the integration platform needs to hold the entire responsibility and handle it accordingly. Even in cases where the incoming message is corrupted or invalid, there is no way to complain anymore (by sending an error message to the sender), since the integration platform has already issued an acknowledgement terminating the communication channel. Therefore, in the next section, we may look into the patterns we could follow to handle such situations.
A particular mediation flow consists of a series of actions to be executed over a message that passes through. However, some of these messages may turn out to be invalid, for reasons such as:
- being corrupted
- carrying some malicious contents inside
- following a non-processable format or structure
- or some other similar reasons
In such cases, the mediators within a particular mediation flow may try to execute different actions over such messages and throw errors, disturbing the continuous delivery of service. Hence, such invalid messages should immediately be removed from mediation, and directed to some other channel seeking human intervention.
Invalid-message-channel is an enterprise integration pattern that can be followed in such scenarios.
The proper way of identifying invalid messages and dumping those into an invalid message channel requires performing message validation before processing starts.
The WSO2 Integration Platform comes with a validate mediator that can be used to validate incoming messages against a given schema definition. Otherwise, normal XPath/JSONPath expressions can be used to extract values from the message and perform some validation manually to identify invalid messages.
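As a minimal sketch of the manual alternative, a filter mediator with an XPath expression could reject messages that lack mandatory values. The sequence name, the element names person, id and lastname, and the message-store invalid_person_records_store (assumed to be defined elsewhere) are illustrative assumptions, and the check shown is deliberately simple.

```xml
<!-- Sketch only: manual validation with XPath instead of a schema. -->
<sequence name="manual_validation_sequence" xmlns="http://ws.apache.org/ns/synapse">
    <!-- Valid only if both id and lastname are present and non-empty -->
    <filter xpath="string-length(normalize-space(//person/id)) &gt; 0 and string-length(normalize-space(//person/lastname)) &gt; 0">
        <then>
            <log level="custom">
                <property name="Validation" value="is successful"/>
            </log>
        </then>
        <else>
            <log level="custom">
                <property name="Detected" value="Invalid Message"/>
            </log>
            <!-- Park the message for human intervention and stop mediating it -->
            <store messageStore="invalid_person_records_store"/>
            <drop/>
        </else>
    </filter>
</sequence>
```

This kind of check is cheaper than full schema validation but only catches the conditions that are explicitly written into the expression.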
To implement this invalid-message-channel, the same message-store feature of the integration platform can be used. If required, this message-store (or the invalid message channel) can be associated with a message-processor to pick such messages from this channel and deliver those to some other system or a replaying/sanitizing/reprocessing channel created on the integration platform itself.
To demonstrate this scenario with the WSO2 Integration Platform, we may choose a simple XML payload and its relevant schema definition as follows.
<person>
    <id>0001</id>
    <title>Prof</title>
    <firstname>Walter</firstname>
    <lastname>White</lastname>
</person>
Schema definition (person-schema.xsd)
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="person">
        <xs:complexType>
            <xs:sequence>
                <xs:element type="xs:byte" name="id"/>
                <xs:element type="xs:string" name="title"/>
                <xs:element type="xs:string" name="firstname"/>
                <xs:element type="xs:string" name="lastname"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
</xs:schema>
Now create a local-entry on the ESB, with the above-mentioned schema definition:
<localEntry key="person-schema">
    <xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <xs:element name="person">
            <xs:complexType>
                <xs:sequence>
                    <xs:element name="id" type="xs:byte"/>
                    <xs:element name="title" type="xs:string"/>
                    <xs:element name="firstname" type="xs:string"/>
                    <xs:element name="lastname" type="xs:string"/>
                </xs:sequence>
            </xs:complexType>
        </xs:element>
    </xs:schema>
</localEntry>
Then create a sequence for mediation, and refer to the above mentioned local-entry with the validate mediator within this sequence.
<sequence name="invalid_message_filter_sequence" trace="disable">
    <validate>
        <schema key="person-schema" />
        <on-fail>
            <log level="custom">
                <property name="Detected" value="Invalid Message" />
            </log>
            <store messageStore="invalid_person_records_store" />
            <drop />
        </on-fail>
    </validate>
    <log level="custom">
        <property name="Validation" value="is successful" />
    </log>
    <drop />
</sequence>
Note the on-fail scope of the validate mediator and see how the same store mediator has been used to publish invalid messages to the message-store (or invalid-message-channel) called invalid_person_records_store.
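The sequence above refers to a message-store that is not shown in the listings. A minimal sketch of that store, together with an optional sampling message-processor that hands the parked messages to a review sequence, might look like the following. The processor name, the sequence name invalid_person_review_sequence, and the 5000 ms interval are hypothetical; each artifact would normally live in its own file.

```xml
<!-- Sketch only: in-memory store acting as the invalid-message-channel -->
<messageStore name="invalid_person_records_store" xmlns="http://ws.apache.org/ns/synapse"/>

<!-- Optional: periodically hand parked messages to a review sequence -->
<messageProcessor class="org.apache.synapse.message.processor.impl.sampler.SamplingProcessor"
                  messageStore="invalid_person_records_store"
                  name="invalid_person_records_processor"
                  xmlns="http://ws.apache.org/ns/synapse">
    <parameter name="interval">5000</parameter>
    <parameter name="sequence">invalid_person_review_sequence</parameter>
    <parameter name="concurrency">1</parameter>
    <parameter name="is.active">true</parameter>
</messageProcessor>

<!-- Hypothetical review sequence; real logic might notify an operator or sanitize the message -->
<sequence name="invalid_person_review_sequence" xmlns="http://ws.apache.org/ns/synapse">
    <log level="full">
        <property name="Located" value="invalid_person_review_sequence"/>
    </log>
    <drop/>
</sequence>
```

If the invalid messages are meant to wait for purely manual inspection, the message-processor can simply be left out so that the messages remain in the store.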
In some cases, such a validation mechanism may be seen as an overhead based on specific throughput requirements. Then it is possible to include message-store based persistence logic inside a fault sequence and attach that to the main mediation sequence with its onError attribute to be executed once an error occurs.
This provides a more generalized approach by handling all the messages that fail due to any type of error. The messages persisted within such a channel are not necessarily all invalid messages; hence, a more suitable name such as 'failed-message-channel' could be given to such a channel.
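A minimal sketch of such a failed-message-channel is given below. All artifact names (person_process_sequence, person_process_error_sequence, failed_message_store) are hypothetical, and the main sequence only contains a placeholder for the real mediation logic.

```xml
<!-- Sketch only: no up-front validation; any error diverts the message to a store. -->
<sequence name="person_process_sequence" onError="person_process_error_sequence" xmlns="http://ws.apache.org/ns/synapse">
    <!-- Placeholder for the real mediation logic -->
    <log level="custom">
        <property name="Located" value="person_process_sequence"/>
    </log>
    <drop/>
</sequence>

<!-- Fault sequence: record why the message failed, then park it -->
<sequence name="person_process_error_sequence" xmlns="http://ws.apache.org/ns/synapse">
    <log level="custom">
        <property name="Detected" value="Failed Message"/>
        <property name="error code" expression="get-property('ERROR_CODE')"/>
        <property name="message" expression="get-property('ERROR_MESSAGE')"/>
    </log>
    <store messageStore="failed_message_store"/>
</sequence>

<!-- In-memory store acting as the failed-message-channel -->
<messageStore name="failed_message_store" xmlns="http://ws.apache.org/ns/synapse"/>
```

One trade-off to keep in mind is that the stored payload is whatever the message context holds when the error occurs, so it may already have been partly transformed by earlier mediators.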
In a typical integration scenario, once a message is processed successfully, it is sent to another connecting system, which is generally called the 'Receiver'. If this receiver system becomes unreachable by the integration platform for any reason, the delivery attempt results in an error and the flow fails.
When designing a particular integration, this is another main error possibility scenario that needs to be considered and handled properly.
The dead-letter-channel is an enterprise integration pattern that can be followed to handle this type of situation. Put simply, it is a mechanism that first tries to deliver a message to a receiver system and, if the delivery fails, routes the same message to another channel. All undelivered messages accumulate and remain within this channel until the receiver system becomes available again. Re-delivery of the messages can either be performed as a manual operation or be automated, depending on the support provided by the integration platform of your choice.
Similar to the earlier cases, we will be using the WSO2 Integration Platform as an example to demonstrate the implementation of this.
With previous error handling scenarios, we learnt about fault sequences and the basic error handling approach. Then we discussed a few patterns, such as message-store and invalid-message-channel, while learning how to use tools that the WSO2 Integration Platform is equipped with to support such an implementation.
Implementing the dead-letter-channel is largely a matter of applying the techniques that we have already learnt, but at different locations in the mediation flow. At a very basic level, the integration platform tries to send a message to a particular receiver and, when that fails, the already processed message is passed to a fault sequence. Once the message is available in the fault sequence, the integration platform can publish it to a different channel; this is very similar to what we did with the invalid-message-channel implementation.
Note that at this stage we have a successfully processed message unlike the previously discussed error scenarios. Therefore, re-delivery of the same message is what we need to consider in this case, but not re-processing.
The implementation of the above basically consists of two sequences and essentially two mediators. The send or call mediator in the first sequence tries to deliver messages to the receiver system.
<sequence name="dead_letter_demo_sequence" onError="dead_letter_demo_error_sequence" trace="disable" xmlns="http://ws.apache.org/ns/synapse">
    <call>
        <endpoint key="stockquote_endpoint"/>
    </call>
</sequence>

<endpoint name="stockquote_endpoint" xmlns="http://ws.apache.org/ns/synapse">
    <address format="soap11" trace="disable" uri="https://xxxx:9000/services/SimpleStockQuoteService"/>
</endpoint>
Once this fails, the message currently being processed is passed to the fault sequence that is associated with the previously mentioned sequence. Inside the fault sequence, such messages are stored in a message-store, very similar to the cases we discussed earlier.
<sequence name="dead_letter_demo_error_sequence" trace="disable" xmlns="http://ws.apache.org/ns/synapse">
    <store messageStore="dead_letter_store"/>
</sequence>
The message-store, in this particular case, acts as the channel for messages that failed in delivery or rather as the channel for dead-letters. Re-delivery of messages can either be automated or the messages could be left in this channel until a manual fixing process is executed.
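For the automated option, one possibility is to attach a scheduled message forwarding processor to the dead-letter store so that stored messages are periodically re-sent to the same endpoint. The sketch below assumes an ESB version that supports the targetEndpoint attribute on the message-processor; the processor name and all interval and attempt values are hypothetical, and the exact behaviour once the attempts are exhausted should be confirmed in the product documentation.

```xml
<!-- Sketch only: in-memory store acting as the dead-letter-channel -->
<messageStore name="dead_letter_store" xmlns="http://ws.apache.org/ns/synapse"/>

<!-- Periodically takes a message from the store and forwards it to the endpoint -->
<messageProcessor class="org.apache.synapse.message.processor.impl.forwarder.ScheduledMessageForwardingProcessor"
                  name="dead_letter_redelivery_processor"
                  messageStore="dead_letter_store"
                  targetEndpoint="stockquote_endpoint"
                  xmlns="http://ws.apache.org/ns/synapse">
    <!-- How often the store is polled, in milliseconds -->
    <parameter name="interval">10000</parameter>
    <!-- Wait between redelivery attempts for the same message, in milliseconds -->
    <parameter name="client.retry.interval">1000</parameter>
    <!-- Number of attempts before the processor gives up on the message -->
    <parameter name="max.delivery.attempts">4</parameter>
    <parameter name="is.active">true</parameter>
</messageProcessor>
```

Since the message in the store has already been processed, the processor only re-delivers it and no mediation logic is executed again.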
What has not been discussed so far in relation to the code segments presented above is the artifact type called the endpoint, which is one of the key components of a mediation flow. Since endpoints are a comparatively large topic, we will discuss them separately in the next section.
The receiver or the destination of the message could be a web service, a message broker, or some other type of destination, such as an FTP server. Such a destination, or the location of a particular receiver system, is logically represented in the mediation context of the WSO2 Integration Platform by an artifact type called 'endpoint'. Similar to how a mediator is added to a sequence, an endpoint can be added at the end of a mediation flow by associating it with a send or call mediator.
Once an endpoint with necessary parameters is added to a mediation flow, the integration engineer's job can be considered as 'done' as the actual delivery of the message at the transports and network level is handled by the integration platform internally.
The operation of sending a message to an external system involves many components of the integration platform at a very technical level. Right after mediation is performed over a particular message payload, the same payload is formatted into a standard message that is suitable for sending to an external system. This message is then immediately handed over to the transport handlers of the integration platform, which take care of the delivery by establishing a transport-level communication channel between the integration platform and the intended receiver. To get a better understanding of how this works, you can refer to the component architecture section of the official WSO2 ESB documentation.
Transport-level elements of the integration platform are preinstalled and preconfigured; thus, they remain out of reach for integration engineers, who typically design or modify integration logic. However, when errors occur at this level, the transport handlers generate four property values (ERROR_CODE, ERROR_MESSAGE, ERROR_DETAIL, ERROR_EXCEPTION) and push them into the mediation context.
Mediation context is the scope that is accessible by integration engineers. Therefore, within a certain mediation flow, it is possible to read these error properties within the fault sequence and handle them accordingly if the delivery of a particular message fails.
For example, if a log mediator like the following is added to the fault sequence, the values carried by these properties could be observed.
<log level="custom">
    <property name="text" value="Error occurred while sending message"/>
    <property name="error code" expression="get-property('ERROR_CODE')"/>
    <property name="message" expression="get-property('ERROR_MESSAGE')"/>
</log>
The complete error codes list with the relevant descriptions is available on WSO2 ESB official documentation. This error code property can be used effectively when a selected set of errors need to be treated or handled in different ways. Especially the possible endpoint related failure situations could be handled in a much better and well-controlled manner by utilizing these error codes as inputs.
<endpoint name="stockquote_endpoint" statistics="enable">
    <address uri="https://xxxx:9000/services/SimpleStockQuoteService" statistics="enable" trace="disable">
        <timeout>
            <duration>60000</duration>
        </timeout>
        <markForSuspension>
            <errorCodes>101504, 101505</errorCodes>
            <retriesBeforeSuspension>3</retriesBeforeSuspension>
            <retryDelay>1</retryDelay>
        </markForSuspension>
        <suspendOnFailure>
            <errorCodes>101500, 101501, 101506, 101507, 101508</errorCodes>
            <initialDuration>1000</initialDuration>
            <progressionFactor>2</progressionFactor>
            <maximumDuration>60000</maximumDuration>
        </suspendOnFailure>
    </address>
</endpoint>
In this code segment, we have added 3 main extra elements to the endpoint that we discussed in the previous example. Based on the nature of errors and depending on how those need to be handled, the error codes should be added to the markForSuspension and suspendOnFailure sections.
When an error code is listed in the markForSuspension section, the endpoint is moved into the Timeout state, where it keeps retrying before suspension (at the given retryDelay, for the number of times given under retriesBeforeSuspension). The error codes defined under suspendOnFailure put the endpoint directly into the Suspended state once these errors occur. Relevant technical details of endpoint error handling are available in the official WSO2 ESB documentation.
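As a worked illustration, and assuming the geometric back-off described in the WSO2 ESB endpoint documentation (each suspension lasts the previous duration multiplied by progressionFactor, capped at maximumDuration), the suspendOnFailure values in the endpoint above would give the following suspension durations for consecutive failures. The symbol d_n, the duration of the n-th consecutive suspension, is introduced here only for this illustration.

```latex
\begin{align*}
d_1 &= \texttt{initialDuration} = 1000\ \text{ms} \\
d_{n+1} &= \min\left(d_n \times \texttt{progressionFactor},\ \texttt{maximumDuration}\right)
         = \min\left(2\,d_n,\ 60000\ \text{ms}\right) \\
d_1, d_2, d_3, \dots &= 1000,\ 2000,\ 4000,\ 8000,\ 16000,\ 32000,\ 60000,\ 60000,\ \dots\ \text{ms}
\end{align*}
```

Under this assumption the seventh suspension would be 64000 ms without the cap, so from that point on the endpoint is retried once every 60 seconds until it recovers.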
This capability can be used effectively along with the dead-letter-channel implementation (which we discussed earlier) by modeling the solution to publish messages to the dead-letter-channel for specific error codes such as the error codes added in the suspendOnFailure section. For this, something like the Filter mediator or Switch mediator could be used with previously mentioned get-property('ERROR_CODE') expression.
<filter source="get-property('ERROR_CODE')" regex="101500">
    <then>
        <store messageStore="dead_letter_message_store"/>
    </then>
    <else>
        <drop/>
    </else>
</filter>

<switch source="get-property('ERROR_CODE')">
    <case regex="101500">
        <store messageStore="dead_letter_auto_retry_channel"/>
    </case>
    <case regex="101507|101508">
        <store messageStore="dead_letter_human_intervene_channel"/>
    </case>
    <default>
        <drop/>
    </default>
</switch>
In the two integration code segments given above, we have done nothing really new; rather, we have used different combinations of the basic techniques that we have studied so far. Different integration solutions have different integration requirements, so their error handling requirements and concerns also differ considerably. This is therefore the approach to follow when designing failure or error handling: apply the same set of techniques in different combinations to address each case on its own terms, or follow a commonly applicable pattern where one fits.
In this article, we mainly focussed on one-way mediation flow scenarios, where the integration platform holds the entire responsibility for delivering messages to the intended endpoint or handling any possible error accordingly. In this case, the client or sender application only makes sure that a particular message was accepted by the integration platform and considers its responsibility to end there. As a result, we need to implement different patterns and use many techniques within the integration platform to make sure the message is properly handled in all possible error scenarios. These possibilities were generalized into three main categories in this article, and the techniques to be used in each were discussed in detail. Utilizing these techniques and patterns in different combinations, you will be able to design integration error handling mechanisms when implementing an integration platform.