2009/11/27

WSO2 Enterprise Service Bus - Endpoint Error Handling

  • Supun Kamburugamuva
  • Technical Lead - WSO2

Terminology

Service Provider Endpoint: WSO2 Enterprise Service Bus acts as a central distribution point that distributes messages received from clients to the relevant service provider endpoints.

WSO2 Enterprise Service Bus Endpoint: A representation of a service provider endpoint that lives inside the WSO2 Enterprise Service Bus configuration.

Introduction

Enterprises are inherently complex, comprising hundreds of applications with completely different semantics. Some of these applications are custom built, some are acquired from third parties, and some are a combination of both. Most of them run in different system environments, which adds to the complexity. For these reasons, integrating these heterogeneous applications is vital.

WSO2 Enterprise Service Bus provides the core technology for integrating these applications, following the SOA-based approach to Enterprise Application Integration (EAI). It can perform tasks such as message transformation, routing and transport switching before sending a message to the service provider.

The main configuration building blocks of WSO2 Enterprise Service Bus are as follows:

  1. Mediators
  2. Proxy services
  3. Endpoints
  4. Tasks

Mediators are the functional components inside WSO2 Enterprise Service Bus. They perform tasks such as logging, XSLT transformation, sending messages out and XPath-based filtering, so they are at the core of message processing. The last step of message processing inside WSO2 Enterprise Service Bus is to send the message out to a service provider, that is, to a listening service endpoint. The message sent from WSO2 Enterprise Service Bus to the service can be very different from the incoming message. For example, the incoming message may be a SOAP 1.1 message over HTTP, while WSO2 Enterprise Service Bus sends it to the service as a SOAP 1.2 message over JMS. In this case, even though the endpoint that WSO2 Enterprise Service Bus exposes to the client is SOAP 1.1 over HTTP, the actual service endpoint is SOAP 1.2 over JMS.

An Endpoint is an abstraction of the service provider. When a mediator sends a message out, it needs to know what the service provider endpoint is, and the Endpoint provides this information. When several identical service endpoints serve requests of the same type, WSO2 ESB can be used as a load balancer: since all these endpoints offer the same functionality, it is natural to view them as a single unit.

WSO2 Enterprise Service Bus uses Endpoints to represent service provider endpoints. The following Endpoint types are built into WSO2 Enterprise Service Bus.

  1. Address Endpoint
  2. WSDL Endpoint
  3. Default Endpoint
  4. Load Balancing Endpoint
  5. Fail-Over Endpoint

Of these, the most widely used are the Address and WSDL Endpoints. Each Endpoint has its own XML configuration in the Synapse language, the XML language used for configuring the Enterprise Service Bus.

Since Endpoints send the message out, they can encounter various transport errors. For example, a connection may time out, or the actual service may close the connection.

Why Is Error Handling Important?

Since WSO2 Enterprise Service Bus is a long-running application, transient failures can occur from time to time. Retrying on transient failures enhances the fault tolerance of a system. For WSO2 Enterprise Service Bus, transient failures include communication failures between service providers and WSO2 Enterprise Service Bus, database operation failures and so on. Communication failures are the most frequent.

Endpoint error handling is therefore a key part of any successful Enterprise Service Bus deployment. Since most communication runs over TCP, people may wonder how errors can occur at all. After all, TCP is very reliable, isn't it? Not in real life. Messages can fail or get lost for various reasons in a real TCP network. These errors may be very rare, but they do occur. To understand the importance of error handling, consider the following scenario.

Usually, when an error occurs and WSO2 Enterprise Service Bus is not configured to tolerate that error, it marks the Endpoint as failed, which leads to message failures. By default, the Endpoint stays marked as failed for quite a long time. The error may be an intermittent issue that occurs once a week, but because of this single error, subsequent messages may get lost. You can configure WSO2 Enterprise Service Bus to handle this kind of situation, and this article gives you an in-depth understanding of how Endpoints work so that you can configure them optimally.

Concepts

Address, Default and WSDL Endpoints are called Leaf Endpoints. They do the actual sending of the message. A Load Balance or Fail-Over Endpoint uses one or more of these Leaf Endpoints to send the message, so it is a logical grouping of Leaf Endpoints. A Load Balance or Fail-Over Endpoint never sends the message directly. Instead, it delegates the sending to its Leaf Endpoints, depending on their configuration and state.

An Endpoint is an abstraction of the remote server to which WSO2 Enterprise Service Bus sends messages. It can specify the message properties used for sending. For example, it can specify the message format (such as SOAP 1.1) or the WS-Security policy for the outbound message.

A WSO2 Enterprise Service Bus Endpoint also has configuration that specifies its behavior on error conditions that may occur between WSO2 Enterprise Service Bus and the actual service endpoint.

An Endpoint has a state. Before going on to the Endpoint configurations we will look at the states and how the transition happens.

Endpoint States

At any given time, the state of an Endpoint is Active, Timeout, Suspended or OFF. Endpoint state transitions normally happen on a per-message basis. An endpoint can only be put into the OFF state through JMX, so that state does not appear in the state transition diagram. Let us now look at the different endpoint states in detail:

State | Description
--- | ---
Active | Endpoint is up and running.
Timeout | Endpoint encountered an error, it is a candidate for suspension. If it continues to encounter errors it will be suspended. It can still send messages.
Suspended | Endpoint encountered errors and is sent to a state where it cannot send requests. It cannot send messages and messages coming to it will result in a fault.
OFF | Endpoint is not active.

Figure: Endpoint States (state transition diagram)

ACTIVE

When WSO2 Enterprise Service Bus boots up, Endpoints are in the Active state and ready to send messages. Unless the user puts the Endpoint into the OFF state, it stays Active until an error occurs.

When an error occurs, the Endpoint can be configured to stay in the Active state or to move to the Timeout or Suspended state. Every error has an error code, and the Endpoint configuration lets you define which errors put the Endpoint into the Timeout and Suspended states. If a particular error is not defined for either state, it is ignored.

So errors are handled in three ways:

  1. Put the Endpoint into SUSPENDED state
  2. Put the Endpoint into TIMEOUT state
  3. Ignore and stay in ACTIVE state

If no error codes are configured explicitly, the Connection timed out (101504) and Connection closed (101505) errors are treated as TIMEOUT errors, and all other errors put the Endpoint into the SUSPENDED state.

When an error occurs, the Endpoint first checks whether it is an error that puts the endpoint into the TIMEOUT state. Only then does it check whether it is an error that puts the endpoint into the SUSPENDED state.
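The order of these checks can be sketched in a few lines of Python. This is only an illustration of the rules described in this article, not the ESB's actual implementation: the names `State`, `classify_error` and `DEFAULT_TIMEOUT_CODES` are hypothetical, and the default codes are taken from the markForSuspension table later in this article.

```python
from enum import Enum


class State(Enum):
    ACTIVE = "ACTIVE"
    TIMEOUT = "TIMEOUT"
    SUSPENDED = "SUSPENDED"


# Assumption: default TIMEOUT codes as listed in the markForSuspension table
# (101504 Connection timed out, 101505 Connection closed).
DEFAULT_TIMEOUT_CODES = frozenset({101504, 101505})


def classify_error(code, timeout_codes=None, suspend_codes=None):
    """Return the state an ACTIVE endpoint moves to for the given error code.

    timeout_codes and suspend_codes model the <errorCodes> lists of
    markForSuspension and suspendOnFailure; None means "not configured".
    """
    if timeout_codes is None:
        timeout_codes = DEFAULT_TIMEOUT_CODES
    # The TIMEOUT list is consulted first.
    if code in timeout_codes:
        return State.TIMEOUT
    # Without an explicit list, every remaining error suspends the endpoint.
    if suspend_codes is None or code in suspend_codes:
        return State.SUSPENDED
    # Listed for neither state: the error is ignored.
    return State.ACTIVE


# With explicit lists for both states, an unlisted code such as 101503 is ignored.
print(classify_error(101503, {101504, 101505}, {101500, 101501}))
```

Checking the TIMEOUT list first means that a code listed for both states is treated as a TIMEOUT error in this sketch.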

TIMEOUT

In this state, the Endpoint can still forward messages, up to a maximum number of continuous failures. If it keeps failing and that maximum is exceeded, the Endpoint is marked as SUSPENDED. If one message succeeds, the Endpoint is marked as ACTIVE.

For example, assume the number of tries is set to 3. When an error occurs and the endpoint is put into this state, three tries remain. If the next three messages sent using this endpoint all encounter an error, the endpoint is put into the SUSPENDED state. If one of the messages succeeds before that happens, the endpoint is marked as ACTIVE.
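The example above can be modelled with a small counter. The sketch below is a simplified, single-threaded reading of the prose; `EndpointState` and its methods are hypothetical names, and the exact off-by-one behavior of the real retry counter is an assumption.

```python
class EndpointState:
    """Tracks ACTIVE/TIMEOUT/SUSPENDED for one endpoint (hypothetical helper)."""

    def __init__(self, retries_before_suspension):
        self.retries_before_suspension = retries_before_suspension
        self.state = "ACTIVE"
        self.remaining_retries = 0

    def on_timeout_error(self):
        """Record a failure whose error code is configured as a TIMEOUT error."""
        if self.state == "ACTIVE":
            # First error: enter TIMEOUT with a fresh retry budget.
            self.state = "TIMEOUT"
            self.remaining_retries = self.retries_before_suspension
        elif self.state == "TIMEOUT":
            # Each further failure consumes one try from the shared budget.
            self.remaining_retries -= 1
            if self.remaining_retries <= 0:
                self.state = "SUSPENDED"

    def on_success(self):
        """A single successful send in TIMEOUT returns the endpoint to ACTIVE."""
        if self.state == "TIMEOUT":
            self.state = "ACTIVE"
            self.remaining_retries = 0


endpoint = EndpointState(retries_before_suspension=3)
endpoint.on_timeout_error()      # first error: ACTIVE -> TIMEOUT
for _ in range(3):               # the next three messages also fail
    endpoint.on_timeout_error()
print(endpoint.state)            # prints "SUSPENDED"
```

Because the budget lives on the endpoint rather than on a message, failures from different messages all draw from the same counter.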

SUSPENDED

A suspended endpoint cannot be used for sending messages. After an endpoint is put into this state, it can be tried again after a configurable time period. Once this period expires, WSO2 Enterprise Service Bus tries to forward a message through the endpoint. If the message succeeds, WSO2 Enterprise Service Bus marks the endpoint as ACTIVE. If it fails, the endpoint is put into the SUSPENDED or TIMEOUT state, depending on the error.

The next suspension period is calculated using the following formula.

Next suspension time period = Min(Initial suspension duration * (progression factor ^ try count), Maximum duration)

All the variables in the above formula are configuration values, except the try count. The try count is the number of tries that have occurred since the endpoint was SUSPENDED. As the try count increases, the next suspension time period also increases, up to the maximum duration.
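As a worked example, the following Python sketch evaluates the suspension period for successive try counts. It reads the formula as capping the period at the maximum duration, which is what the text describes when it says the increase is bound to a maximum duration. The function name, the illustrative values (1000 ms initial duration, factor 2, 64000 ms maximum) and the choice to start the try count at 0 are assumptions made for the example.

```python
def next_suspension_ms(initial_ms, progression_factor, maximum_ms, try_count):
    """Suspension period for the given try count, capped at maximum_ms."""
    return min(initial_ms * progression_factor ** try_count, maximum_ms)


# Illustrative values: the period doubles on every try until it reaches the cap.
for try_count in range(8):
    print(try_count, next_suspension_ms(1000, 2, 64000, try_count))
```

With a progression factor of 1 the period never grows, so the endpoint is retried at a fixed interval equal to the initial duration.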

Leaf Endpoint Configurations

The following is the configuration for the address endpoint. Since we are only interested in the error handling configuration, the same applies to the WSDL endpoint as well. The error handling configurations are as follows:

  1. timeout settings
  2. markForSuspension settings
  3. suspendOnFailure settings

We will look at these settings individually.

<address uri="endpoint address" [format="soap11|soap12|pox|get"]
    [optimize="mtom|swa"] [encoding="charset encoding"]
    [statistics="enable|disable"] [trace="enable|disable"]>
    <enableRM [policy="key"]/>?
    <enableSec [policy="key"]/>?
    <enableAddressing [version="final|submission"] [separateListener="true|false"]/>?

    <timeout>
        <duration>timeout duration in milliseconds</duration>
        <action>discard|fault</action>
    </timeout>?

    <markForSuspension>
        [<errorCodes>xxx,yyy</errorCodes>]
        <retriesBeforeSuspension>m</retriesBeforeSuspension>
        <retryDelay>d</retryDelay>
    </markForSuspension>

    <suspendOnFailure>
        [<errorCodes>xxx,yyy</errorCodes>]
        <initialDuration>n</initialDuration>
        <progressionFactor>r</progressionFactor>
        <maximumDuration>l</maximumDuration>
    </suspendOnFailure>
</address>

timeout

Name | Values | Default | Description
--- | --- | --- | ---
duration | Milliseconds | 60000 | Connection timeout interval. If the remote endpoint doesn't respond within this time, it is treated as a timeout.
action | discard, fault, none | none | When a response arrives for a timed-out request, whether to discard it or invoke the fault handler.

markForSuspension

Name | Values | Default | Description
--- | --- | --- | ---
errorCodes | Comma-separated list of error codes | 101504, 101505 | Errors that send the endpoint into the TIMEOUT state.
retriesBeforeSuspension | Integer | 0 | In the TIMEOUT state, this number of requests minus one can be tried and fail before the endpoint is marked as SUSPENDED. This is a per-endpoint setting, not a per-message setting, so several messages tried in parallel can fail and each failure reduces the remaining retries.
retryDelay | | |

suspendOnFailure

Name | Values | Default | Description
--- | --- | --- | ---
errorCodes | Comma-separated list of error codes | All errors except those specified in markForSuspension | Errors that send the endpoint into the SUSPENDED state.
initialDuration | Milliseconds | 60 x 60 x 1000 | After an endpoint is suspended, it waits for this amount of time before trying to send the messages coming to it. All messages arriving during this period result in fault sequence activation.
progressionFactor | Integer | 1 | The endpoint tries to send messages again after the initialDuration. next duration = Min(initialDuration x progressionFactor ^ retry count, maximumDuration)
maximumDuration | Milliseconds | Long.MAX_VALUE | Upper bound of the retry duration.

Sample Configuration

<endpoint name="Sample_First" statistics="enable" >
    <address uri="https://localhost/myendpoint" statistics="enable" trace="disable">
        <timeout>
            <duration>60000</duration>
        </timeout>
                
        <markForSuspension>
            <errorCodes>101504, 101505</errorCodes>
            <retriesBeforeSuspension>3</retriesBeforeSuspension>
            <retryDelay>1</retryDelay>
        </markForSuspension>

        <suspendOnFailure>
            <errorCodes>101500, 101501, 101506, 101507, 101508</errorCodes>
            <initialDuration>1000</initialDuration>
            <progressionFactor>2</progressionFactor>
            <maximumDuration>64000</maximumDuration>
        </suspendOnFailure>

    </address>
</endpoint>

Here we move the endpoint into the TIMEOUT state for errors 101504 and 101505. After that, 3 requests can fail with one of these errors before the endpoint moves into the SUSPENDED state.

We put the endpoint into suspension for errors 101500, 101501, 101506, 101507 and 101508. As you can see, error 101503 is not listed, so it is ignored: if error 101503 occurs, the endpoint remains in the ACTIVE state.

For more information about error codes, refer to Appendix A.

Failover Endpoint

With leaf endpoints, if an error occurs while a message is being sent, that message is lost. The failed message is not retried. These errors occur very rarely, but message failures can still happen. For some applications these rare message losses are acceptable; where they are not, a Failover endpoint is the ideal solution.

Here is the configuration for failover endpoints. At the configuration level, a failover is a logical grouping of one or more leaf endpoints.

<failover>
       <endpoint .../>+
</failover>

When a message comes to the failover endpoint, it goes through its list of endpoints and picks the first one in the ACTIVE or TIMEOUT state. It then sends the message using that endpoint. If an error occurs while sending the message, the failover goes through the endpoint list again from the beginning and tries to send the message using the first endpoint that is not SUSPENDED.

Some errors put the endpoint into the TIMEOUT state, and some keep it in the ACTIVE state. In these cases, the retry can happen using the same endpoint: if the failure occurs with the first endpoint in the failover group and the error does not put that endpoint into the SUSPENDED state, the retry uses the same endpoint.

Failover gives priority to the first endpoint that is not in the SUSPENDED state. It sends messages through the first endpoint in the failover group as long as that endpoint is not SUSPENDED. When the first endpoint is SUSPENDED, it sends requests using the second endpoint. When the first endpoint becomes ready to send again, the failover tries it again, even though the second endpoint is still active.
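The selection rule can be sketched as a scan over an ordered list. This is a minimal illustration of the priority rule described here, not the ESB's code; `pick_endpoint` and the endpoint names are hypothetical.

```python
def pick_endpoint(endpoints):
    """Return the name of the first endpoint that is not SUSPENDED.

    `endpoints` is an ordered list of (name, state) pairs, in the order the
    endpoints appear in the failover group. Returns None if all are SUSPENDED.
    """
    for name, state in endpoints:
        if state in ("ACTIVE", "TIMEOUT"):
            return name
    return None


# While the first endpoint is SUSPENDED, traffic goes to the second one.
group = [("first", "SUSPENDED"), ("second", "ACTIVE")]
print(pick_endpoint(group))  # prints "second"

# Once the first endpoint is usable again, it regains priority.
group = [("first", "ACTIVE"), ("second", "ACTIVE")]
print(pick_endpoint(group))  # prints "first"
```

Because the scan always starts from the beginning of the list, the order of the endpoints in the group acts as a priority order.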

If there is only one service endpoint and message failure is not tolerable, a failover can be configured with a single endpoint.

Here is a sample failover with one address endpoint.

Sample Fail-Over configuration

<endpoint name="SampleFailover">
    <failover>
        <endpoint name="Sample_First" statistics="enable" >
            <address uri="https://localhost/myendpoint" statistics="enable" trace="disable">
                <timeout>
                    <duration>60000</duration>
                </timeout>
                
                <markForSuspension>
                    <errorCodes>101504, 101505, 101500</errorCodes>
                    <retriesBeforeSuspension>3</retriesBeforeSuspension>
                    <retryDelay>1</retryDelay>
                </markForSuspension>

                <suspendOnFailure>
                    <initialDuration>1000</initialDuration>
                    <progressionFactor>2</progressionFactor>
                    <maximumDuration>64000</maximumDuration>
                </suspendOnFailure>

            </address>
        </endpoint>
    </failover>
</endpoint>

Here the Sample_First endpoint is marked as TIMEOUT on a connection timeout (101504), a connection close (101505) or a sender IO error while sending (101500). For all other errors, it is marked as SUSPENDED. When one of these errors occurs, the failover retries using the first non-SUSPENDED endpoint, which in this case is the same endpoint (Sample_First). It keeps retrying until the retry count reaches 0. The retries happen in parallel: since messages arrive at this endpoint on many threads, the same message may not be retried 3 times, because another message may fail and reduce the retry count. It is important to note that the retry count is a per-endpoint setting, not a per-message setting.
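The retry behavior of a single-endpoint failover can be approximated with the loop below. It is a sequential simplification under stated assumptions: every failure is taken to be one of the TIMEOUT errors, the real ESB handles messages on many threads, and `send_with_failover`, `make_flaky_sender` and the dictionary layout are hypothetical names used only for illustration.

```python
def make_flaky_sender(failures):
    """Return a send function that fails `failures` times, then succeeds."""
    calls = {"count": 0}

    def send(message):
        calls["count"] += 1
        return calls["count"] > failures

    return send


def send_with_failover(endpoint, send, message):
    """Retry `message` on a failover group that contains a single endpoint.

    `endpoint` is a dict with "state", "retries" (the configured
    retriesBeforeSuspension) and "remaining" (the per-endpoint retry budget).
    `send` returns True on success. Returns True if the message was delivered.
    """
    while endpoint["state"] != "SUSPENDED":
        if send(message):
            endpoint["state"] = "ACTIVE"
            return True
        if endpoint["state"] == "ACTIVE":
            # First failure: move to TIMEOUT with a fresh budget.
            endpoint["state"] = "TIMEOUT"
            endpoint["remaining"] = endpoint["retries"]
        else:
            # Further failures consume the budget shared by all messages.
            endpoint["remaining"] -= 1
            if endpoint["remaining"] <= 0:
                endpoint["state"] = "SUSPENDED"
    return False


sample_first = {"state": "ACTIVE", "retries": 3, "remaining": 0}
delivered = send_with_failover(sample_first, make_flaky_sender(2), "order-1")
print(delivered, sample_first["state"])  # prints "True ACTIVE"
```

Keeping the budget on the endpoint dictionary rather than on the message mirrors the point made above: a failure of any message reduces the tries left for all of them.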

In this configuration, we assume that these errors are rare and that, if they happen once in a while, it is OK to retry. If they happen frequently and continuously, the endpoint requires immediate attention to get it back to a normal state.

Conclusion

Handling errors at the endpoint level is crucial to any successful deployment. Errors are bound to surface when running tests, so it is recommended to run a few long-running load tests and fine-tune the endpoint configurations for errors that can occur intermittently.

APPENDIX A

Error Codes

Error code | Description
--- | ---
101000 | Receiver IO error sending
101001 | Receiver IO error receiving
101500 | Sender IO error sending
101501 | Sender IO error receiving
101503 | Connection failed
101504 | Connection timed out
101505 | Connection closed
101506 | HTTP protocol violation
101507 | Connect cancel
101508 | Connect timeout
101509 | Send abort

 

Author

Supun Kamburugamuva, Software Engineer, WSO2

 
