19 Oct, 2017

Benefits of a Multi-regional API Management Solution for a Global Enterprise

  • Lakmal Warusawithana
  • Senior Director - Cloud Architecture - WSO2

Introduction to Data Centers and Regions

A data center (DC) is a physical group of networked computer servers typically used by organizations for the remote storage, processing, or distribution of large amounts of data. A data center often requires extensive redundant or backup power supply systems, cooling systems, redundant networking connections, and policy-based security systems to run the enterprise's core applications.

A region is a geographical location with a collection of physical data centers in that region. Every region is physically isolated from and independent of every other region in terms of location, power, water supply, etc.

A multi-regional deployment is a special case of a multi-data center deployment.

Why a Multi Data Center Deployment?

A multi-data center deployment can provide valuable benefits to organizations and developers dealing with specific issues such as:

  • Machine failures
  • Data center outages
  • Disaster recovery
  • Efforts to increase availability (more 9s)

When you have business-critical applications, their failures will naturally impact your day-to-day business. Failures happen in different ways. For instance, an application can fail due to a bug in the software, which can be addressed by applying fixes. Sometimes the machine that hosts the application can fail; for this, it’s recommended to run the application on at least two machines to provide a minimum level of high availability (HA).

Typically, data centers have redundant or backup power supply systems, cooling systems, redundant networking connections, etc. But past experiences have shown that data centers can face outages due to various reasons.

Therefore, it’s important to have a disaster recovery (DR) site to minimize the impact from such outages.

If you are providing a set of services, your customers want an availability guarantee for those services. To this end, a multi-data center deployment helps you increase the availability of your services.

Why are Multi Data Center or Multi-regional Deployments Complex?

Many believe that replicating a database across data centers is sufficient when dealing with a multi-data center deployment. Even though this might hold true for simple websites, it’s not the case for middleware. Artifacts are sometimes persisted in file systems or databases, or even both. Some distributed components need to coordinate with each other, while some components must maintain state to be able to make decisions. These types of applications have some complexity even when deployed in a single data center.

Given the complexity, configuring and operating a middleware stack consistently across multiple locations is a non-trivial (and expensive) task that potentially introduces more points of failure than a single data center deployment. Latency between data centers will introduce inconsistencies, and distributed decision making will also become more complex.

Maintaining state across multiple data centers is challenging, and depending on the use case, you may also need to perform stateful analytics on the collected data. If you carefully consider all these factors, a multi-data center deployment is not trivial, and you need to weigh its benefits against its complexity.

Recommended Multi Data Center Deployment Patterns

As explained earlier, the main benefits of having a multi-data center deployment are disaster recovery and HA. WSO2 recommends the following three active-passive deployment patterns, which provide these benefits without making your deployment complex and unstable.

  1. Active - Passive all-in-one deployment
  2. Active - Passive separate GW and KM deployment
  3. Active - Passive fully distributed deployment

Pattern 1: Active - Passive All-in-One Deployment

Before going into the details of a multi-data center deployment, let us look at an all-in-one deployment architecture in one (active) data center. Here, we have considered HA within the data center and included WSO2 API Manager (APIM) analytics in the deployment (Figure 1). Placing only the load balancer (LB) in the DMZ prevents opening many ports to the outside.

A shared file system enables syncing of APIs and throttling policies across all nodes; the “repository/deployment/server” directory should be mounted to the shared file system. Hazelcast clustering is required to sync state between the two analytics nodes.

Figure 1

To obtain high availability with a minimum number of nodes, you need at least two APIM (all-in-one) nodes. These APIM nodes should be connected to the same database. For simplicity, we have named the database APIM-DB, but it can be divided into UM-DB (user management), REGISTRY-DB, and APIM-DB.

Whenever you have more than one node in the deployment, you need an LB to distribute the load. API traffic hits the LB and is then routed to the APIM nodes. We recommend placing the LB in the DMZ and the APIM nodes in the MZ (or LAN) to avoid opening a DB connection into the DMZ.

These APIM nodes will run independently (without clustering) and state sharing will be done using the database.

When we create an API, an API artifact file (a Synapse configuration) is created and stored in the file system. To keep the two nodes synchronized, our first recommendation is to configure a shared file system (e.g. NFS). The “repository/deployment/server” directory of all the nodes should be mounted to the shared file system. If we can’t have a shared file system, we can use any file replication mechanism to sync the file-based artifacts; in that case, the “repository/deployment/server” directory should be replicated. Like the API artifacts, advanced throttling policies are created as Siddhi files and stored in the file system under “repository/deployment/server”; the same shared file system can be used to synchronize the Siddhi files across all the nodes.
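If a shared file system is not an option, the file replication mechanism can be as simple as a scheduled job that pushes the artifact directory from the node where artifacts are created to the other nodes. The following is a minimal sketch only, assuming rsync over SSH is available; the host names and installation path are hypothetical, and this is not a WSO2-provided tool.

```python
import subprocess
import time

# Hypothetical values: adjust to your own hosts and APIM installation path.
SOURCE_DIR = "/opt/wso2am/repository/deployment/server/"
TARGET_NODES = ["apim-node-2.example.com", "apim-node-3.example.com"]
SYNC_INTERVAL_SECONDS = 60

def sync_artifacts():
    """Push Synapse API artifacts and Siddhi throttling policies to the other nodes."""
    for node in TARGET_NODES:
        destination = f"{node}:{SOURCE_DIR}"
        # -a preserves permissions/timestamps, -z compresses, --delete removes
        # artifacts that have been undeployed on the source node.
        subprocess.run(["rsync", "-az", "--delete", SOURCE_DIR, destination], check=True)

if __name__ == "__main__":
    while True:
        sync_artifacts()
        time.sleep(SYNC_INTERVAL_SECONDS)
```

In production this would typically be a cron job or a dedicated replication tool, but the idea is the same: the node where artifacts are created acts as the source and pushes copies to the other nodes.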

To achieve minimum HA, you need two APIM analytics nodes. The APIM all-in-one nodes will publish analytics data to both analytics nodes, and both analytics nodes will process and analyze the data. The APIM analytics nodes need to be clustered (via Hazelcast clustering), and the cluster coordinates which node writes the raw data to the Analytics-DB and the processed data to the stats DB.

The active data center carries significant traffic at all times. In the event the active data center fails, traffic should be routed to the passive data center. The passive data center deployment should be identical to the active data center deployment.

Let us look at how you can do state syncing from an active to a passive data center.

Figure 2

As illustrated in Figure 2, there are 3 tasks (A, B, C) that need to be carried out to sync data center 1 (DC1) to data center 2 (DC2).

A - All tables of APIM-DB (including UM-DB, Reg-DB) of DC1 should be replicated to DC2 via primary-secondary database replication

B - All tables of Analytics-DB (raw data) of DC1 should be replicated to DC2 via primary-secondary database replication (a replication health-check sketch follows this list)

C - DC1 to DC2 shared file system should be replicated using RSYNC (or any file system replication mechanism)
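Tasks A and B depend on the primary-secondary database replication staying healthy. As an illustration only (assuming MySQL and the pymysql driver; the host, credentials, and lag threshold are hypothetical), a small check like the following could run against the DC2 secondary to alert when it falls too far behind DC1:

```python
import pymysql

# Hypothetical connection details for the secondary (replica) database in DC2.
REPLICA = {"host": "apimdb-dc2.example.com", "user": "monitor", "password": "secret"}
MAX_LAG_SECONDS = 30

def replication_lag():
    """Return replication lag in seconds, or None if replication is broken."""
    conn = pymysql.connect(**REPLICA, cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cursor:
            cursor.execute("SHOW SLAVE STATUS")  # SHOW REPLICA STATUS on MySQL 8.0.22+
            status = cursor.fetchone()
        if not status or status["Slave_IO_Running"] != "Yes" or status["Slave_SQL_Running"] != "Yes":
            return None
        return status["Seconds_Behind_Master"]
    finally:
        conn.close()

if __name__ == "__main__":
    lag = replication_lag()
    if lag is None or lag > MAX_LAG_SECONDS:
        print(f"ALERT: DC2 replica is unhealthy (lag={lag})")
    else:
        print(f"DC2 replica is healthy, {lag}s behind DC1")
```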

For active/passive operation to work seamlessly, you need a global load balancer (GLB) to control traffic routing. When the active DC fails, the GLB should route traffic to the passive DC.
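Conceptually, the GLB’s routing decision is just a health-check loop. The sketch below only illustrates that decision, using hypothetical health-check URLs; in practice this is handled by a GSLB or DNS failover product rather than custom code.

```python
import urllib.request

# Hypothetical health-check endpoints exposed by each data center's LB.
ACTIVE_DC = "https://dc1-lb.example.com/health"
PASSIVE_DC = "https://dc2-lb.example.com/health"

def is_healthy(url, timeout=3):
    """Treat any HTTP 200 response within the timeout as healthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # covers URLError, HTTPError, and timeouts
        return False

def route_target():
    """Route to DC1 while it is healthy; fail over to DC2 otherwise."""
    return ACTIVE_DC if is_healthy(ACTIVE_DC) else PASSIVE_DC

if __name__ == "__main__":
    print("Routing API traffic to:", route_target())
```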

There are two options to revive the failed DC.

  • Option 1: Database and the filesystem of the failed DC should be fully synced before making it an active DC again.
  • Option 2: The failed active DC can be brought up as passive. Database replication and RSYNC should be reversed.

Pattern 2: Active - Passive Separate GW and KM

When considering a scalable architecture, the recommended pattern is to run separate gateway (GW) and key manager (KM) nodes, with the Admin, Publisher, and Store components running together with a Traffic Manager node.

Let us look at an active data center deployment architecture with a separate GW and KM (Figure 3).

Figure 3

When we create APIs in the Publisher node, it calls a REST endpoint on the Gateway to create an API artifact file (a Synapse configuration), which is then stored in the file system. To keep the gateway nodes synchronized, our first recommendation is to configure a shared file system (e.g. NFS). The “repository/deployment/server” directory of all the gateway nodes should be mounted to the shared file system. Since we have two (or more) gateway nodes, we should have an LB between the Publisher and Gateway nodes. This helps avoid the single point of failure of relying on one gateway manager node.

If we can’t have a shared file system, we have to use a file replication mechanism to sync the Synapse artifacts among all gateway nodes. In that case, the “repository/deployment/server” directory should be replicated. In such a setup, the Publisher should point to one Gateway node’s REST endpoint to create the Synapse file; we can consider this the master Gateway node and replicate its files to the other gateway nodes.

Advanced throttling policies are created using the Admin portal, which calls the Traffic Manager (TM) REST endpoint to create “Siddhi” (throttling policy) artifacts and store them in the local file system. Given that we have two TM nodes for HA purposes, these files can be created via either node, and having a shared file system between the two nodes solves the syncing issue.

The two KM nodes provide HA for the key manager service, so we need an LB in front of them. API authentication/authorization traffic is not session aware; however, if you want SSO in the Store and Publisher, you need to ensure the LB is session aware.

Let us look at the deployment diagram for a multi-data center setup, depicted in Figure 4.

Figure 4

There are 4 tasks (A, B, C, D) that need to be carried out to sync data center 1 (DC1) to data center 2 (DC2).

A - All tables of APIM-DB (including UM-DB, Reg-DB) of DC1 should be replicated to DC2 via primary-secondary database replication

B - All tables of Analytics-DB (raw data) of DC1 should be replicated to DC2 via primary-secondary database replication

C - DC1 to DC2 shared file system (for APIs) should be replicated using RSYNC (or any file system replication mechanism)

D - DC1 to DC2 shared file system (throttling policies) should be replicated using RSYNC (or any file system replication mechanism)

Pattern 3: Active - Passive Fully Distributed Deployment

A fully distributed deployment provides more scalability and isolation among the different API manager profiles (Figure 5).

Figure 5

The difference between a fully distributed deployment and the separate KM and GW pattern is that dedicated Publisher, Store, and Traffic Manager (TM) profiles are run. Since the Publisher runs separately, Publisher-to-TM calls (throttling policy creation) and Publisher-to-GW calls (API creation) should be routed via an LB. This helps avoid single points of failure.

A deployment scenario for a multi-data center setup should look like the one depicted in Figure 6.

Figure 6

A - All tables of APIM-DB (including UM-DB, Reg-DB) of DC1 should be replicated to DC2 via primary-secondary database replication

B - All tables of Analytics-DB (raw data) of DC1 should be replicated to DC2 via primary-secondary database replication

C - DC1 to DC2 shared file system (for APIs) should be replicated using RSYNC (or any file system replication mechanism)

D - DC1 to DC2 shared file system (throttling policies) should be replicated using RSYNC (or any file system replication mechanism)

What About Active-Active Multi Datacenter Deployment?

An Active-Active multi datacenter deployment comes with the following dependencies:

  • Master-master database replication
  • Master-master (two-way) file sharing mechanism
  • Fast network access (low latency)
  • Reliable network

If you can achieve the above dependencies, it is possible to have an active-active multi datacenter deployment. However, given our past experiences, we have seen the following limitations with such a deployment:

  • Throttling will be approximate rather than precise: high latency between the Gateway and TM nodes reduces the accuracy of throttle-out decisions.
  • When database and file system synchronization takes a significant amount of time (due to high latency), the two data centers will be inconsistent for that period, leaving the system in an inconsistent state.

If we consider all these factors and weigh the complexities against the benefits, an active-active multi-data center deployment is not recommended unless you can partition traffic.

Use Cases of Multi Region Deployment

Multi regional deployment can provide valuable benefits to organizations. A few of them are as follows:

  • Resilience to regional data center outages
  • Disaster recovery
  • Increased availability (more 9s)
  • Reduced latency for regional access
  • The ability to partition traffic among regions
  • Alignment with regional regulatory requirements (for application calls)
  • Avoiding cloud vendor lock-in (with a multi-cloud deployment)

The key difference between a multi-regional and a multi-data center deployment is that regional data centers can be fully isolated, with large distances between them, as they are generally in different countries or continents. Therefore, traffic partitioning provides tremendous benefits in a multi-regional deployment. Regional traffic can be routed to the nearest region, giving low latency and fast access to the APIs (provided that the backend services are also deployed in that region). In addition, it helps optimize throttling decisions for regional access to the APIs.
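Traffic partitioning itself is normally implemented at the DNS/GSLB layer, but the routing decision can be pictured as a lookup from the caller’s region to its nearest regional gateway. The mapping and endpoints below are entirely hypothetical (the default APIM gateway HTTPS port 8243 is assumed):

```python
# Hypothetical mapping from the caller's region to its nearest regional gateway.
REGIONAL_GATEWAYS = {
    "eu": "https://gw.eu.example.com:8243",
    "us": "https://gw.us.example.com:8243",
    "apac": "https://gw.apac.example.com:8243",
}
MASTER_REGION = "us"  # the master region handles anything unmapped

def gateway_for(region: str) -> str:
    """Pick the closest regional gateway; fall back to the master region."""
    return REGIONAL_GATEWAYS.get(region, REGIONAL_GATEWAYS[MASTER_REGION])

print(gateway_for("eu"))   # routed to the EU gateway
print(gateway_for("lk"))   # unmapped region falls back to the master region
```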

Sometimes businesses need to comply with regional regulatory enforcement; therefore, in such a scenario, you might need to terminate API traffic within the regions. One way of achieving this is by using multi regional deployment.

Many organizations use public clouds for infrastructure services, and a common concern is vendor lock-in. Regions can also map to different cloud providers, so a multi-regional deployment can be extended to a multi-cloud deployment, which in turn helps avoid lock-in.

You can build a multi-regional deployment around two key use cases:

  • Use Case 1: Deploy all APIs in all regions. API traffic will be partitioned and be routed to the relevant region.
  • Use Case 2: Deploy selected APIs in selected regions. API traffic will be partitioned and routed to the relevant region.

The following sections explain these use cases as separate deployments; however, it’s possible to merge the two into one as well.

Use Case 1: Deploy All APIs in All the Regions

In this scenario, we deploy all the APIs in all the regions. It is the same as discussed in the multi data center use cases, but API traffic will be partitioned region by region.

The deployment can follow any of the patterns above, but here we have used the fully distributed pattern, which helps scale the deployment.

Figure 7

In the second region, we have used separate GW, KM, and TM nodes, but you can combine them too. The only concern when doing so is that the second region should contain only API runtime components, not the management components (Publisher, Store, or Admin apps). Publisher, Store, and Admin traffic should be routed to one (master) region.

Having a separate KM in the second region helps reduce latency for key validation. Along the same lines, having a TM in the second region helps increase the accuracy of throttling. Regionalized backend services increase performance, and dynamic endpoints can be used to automatically select the backend services in each region.

Let us take a detailed look at synchronization across the regions.

A - All tables of APIM-DB (except the IDN_OAUTH2_ACCESS_TOKEN table) of Region 1 should be replicated to Region 2 via primary-secondary database replication. Since API traffic is partitioned, a token will be valid in only one region; if someone switches regions, they need to create a new token to use in the new region (see the token request sketch after this list).

B - Region 1 to Region 2 shared file system (for APIs) should be replicated using RSYNC (or any file system replication mechanism)

C - Region 1 to Region 2 shared file system (throttling policies) should be replicated using RSYNC (or any file system replication mechanism)

D - API analytics traffic should be published to the master region. Analytics is stateful and, therefore, we can’t do local (regional) analysis and merge the results unless there is a partition key in the API analytics data. Other reasons are:

  • Some analytics (fraud detection, anomaly detection, etc.) need all API analytics data in order to analyze it and provide accurate decisions
  • The Publisher, Store, and Admin apps host the presentation layer for these analytics
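As noted in task A, a client that moves to another region has to request a new token from that region’s key manager. A minimal sketch of such a request, assuming the token endpoint is exposed through the regional gateway on the default port 8243 and using hypothetical consumer credentials:

```python
import requests

# Hypothetical regional token endpoint and application credentials.
REGION2_TOKEN_URL = "https://gw.region2.example.com:8243/token"
CONSUMER_KEY = "xxxxxxxx"
CONSUMER_SECRET = "yyyyyyyy"

def get_regional_token():
    """Obtain a fresh OAuth2 access token that is valid in Region 2 only."""
    response = requests.post(
        REGION2_TOKEN_URL,
        data={"grant_type": "client_credentials"},
        auth=(CONSUMER_KEY, CONSUMER_SECRET),
        verify=False,  # only for test setups with self-signed certificates
    )
    response.raise_for_status()
    return response.json()["access_token"]

if __name__ == "__main__":
    print("New Region 2 token:", get_regional_token())
```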

Use Case 2: Deploy Selected APIs in Selected Regions

In some use cases, we need to push selected APIs to particular regions. API management can be done in a master region while access to these APIs is regionalized. In such a scenario, the following multi-regional deployment helps. Here too, we assume that API traffic is partitioned by region (Figure 8).

Figure 8

As discussed earlier, you can use any deployment pattern in Region 1 and depending on the optimization, you can select the Region 2 deployment pattern.

Let us take a detailed look at the synchronization across the regions. One important note is that since we selectively push APIs to the regional GW, there’s no need to sync all API files across the regions.

A - All tables of APIM-DB (except IDN_OAUTH2_ACCESS_TOKEN table) of Region 1 should be replicated to Region 2 via primary-secondary database replication. Since we have API traffic partitioning, one token will only be valid for one region. If someone switches regions, they need to create a new token to work in the new region.

B - Region 1 to Region 2 shared file system (throttling policies) should be replicated using RSYNC (or any file system replication mechanism)

C - API analytics traffic should be published to the master region. Analytics is stateful and, therefore, you can’t do local (regional) analysis and merge the results unless you have a partition key in the API analytics data. Other reasons are:

  • Some analytics (fraud detection, anomaly detection, etc.) need to get all API analytics data to analyze and provide accurate decisions
  • The Publisher, Store, and Admin apps host the presentation layer for these analytics

D - Selected APIs should be pushed from Region 1 (master) to Region 2. When you add an external GW (regional GW) to the Publisher, it will appear in the API publishing window. Then, from the Publisher node (in the master region), an API call is made to the regional gateway (this can go via an LB to avoid a single point of failure) to create the API files in the regional GW file system.

Conclusion

A multi-regional or multi-data center deployment can provide valuable benefits to organizations and developers dealing with issues that are common across many verticals, among them machine failures, data center outages, disaster recovery, and efforts to increase availability. Multi-regional deployments also help organizations meet regulatory requirements and avoid cloud vendor lock-in. While there are many patterns an organization can explore for a multi-data center deployment, we recommend an active-passive deployment, as it minimizes complexity and ensures stability and scalability. For multi-region deployments, traffic partitioning is the recommended approach: Publisher/Store/Admin traffic should be routed to one selected region, and application traffic should be partitioned by region to ensure stability.

About the Author

  • Lakmal Warusawithana
  • Senior Director - Cloud Architecture
  • WSO2 Inc.