Multi-region Deployment for WSO2 Identity Server - Part 1

  • By Johann Nallathamby
  • 2 Apr, 2018

Multi-region deployments are becoming increasingly important for most types of customer facing software applications today. Following are some key reasons for considering a multi-region deployment:

  1. Multi-region high availability (disaster recovery)
  2. Better response times for users and applications
  3. Regulatory compliance - not having to store sensitive information outside the region it belongs to

You cannot fully achieve all three of the above objectives in the same deployment. For example, if you need to partition user information by region in order to comply with regulations, you have to give up some response time. Likewise, you need to settle for trade-offs depending on your priorities.

This article focuses on multi-region deployments for WSO2 Identity Server (IS). As such, we assume that the resources protected by WSO2 IS, such as web applications and APIs, are as close as possible to the user, to WSO2 IS, and to each other, so that they don’t significantly affect the overall response time.

Data Replication

Data replication is a crucial discussion topic when considering multi-region deployments. WSO2 IS data can be broadly categorized into three significant types as related to the subject of this article:

  1. Identity and entitlement data (username, user password, user attributes, roles and user-role assignments, etc.)
  2. Configuration data (service providers, identity providers, XACML policies, etc.)
  3. Operational data (user sessions, OAuth2 access tokens, logs, analytics, etc.)

Replication of data can be handled either at the database level or at the application level. At the database level, we rely on database technologies to handle the replication and there are mainly two types of techniques:

  1. Multi-master (multi-primary) replication across multiple regions
  2. Primary-secondary replication across multiple regions

In multi-master replication across regions, all regions can accept write requests and sync modifications to the rest of the regions, maintaining read and write availability through region failures. In primary-secondary replication, a single region is designated as the primary, which is responsible for write operations. Multiple secondary regions can read the data, and data is always synced from primary to secondary.

The multi-master replication technique has its limitations. Most multi-master systems are only loosely consistent, i.e. lazy and asynchronous, violating ACID properties. Eager replication is complex and increases communication latency. When the number of nodes increases and latency also increases, issues such as conflict resolution become intractable. In contrast, the primary-secondary technique doesn’t suffer from these problems.

Not all database vendors are able to support multi-master replication across multiple regions. For example, Oracle can support it with Oracle GoldenGate software. At the time of writing, Amazon Aurora MySQL multi-region multi-master replication is expected to be generally available by the end of 2018 [1].

On the other hand, there are certain multi-region use cases where primary-secondary replication will not be sufficient. Some organizations have a global user base, but administration of these identities and of the IAM system is handled centrally from one region. Primary-secondary replication can be used in a scenario such as this. Other organizations have a global user base (for example, they may have employees and customers across different regions) and IT administration takes place in each region, including the administration of identities and the IAM system. A scenario such as this requires multi-master replication; primary-secondary replication is not enough.

Another point to note is that with the primary-secondary replication technique, the secondary nodes can’t perform write operations that require consistency across all the regions. They may still be able to perform write operations that do not need to be synced across regions for consistency.

LDAP directories, on the other hand, are not widely known to support multi-master replication across regions.

Considering all of the above discussed facts, a data replication technique independent of the underlying database technology could be your first preference.

Based on the 3 types of data, we can identify 3 types of multi-region deployment use cases for WSO2 IS:

  1. Sync identities, entitlements and configuration across all regions
  2. Sync only configuration across regions but partition identities and entitlements by region
  3. Partition all data by the region they belong to

You can see that we’ve left out replication of operational data from all the scenarios. This is because replicating operational data can be very costly due to very frequent updates, and WSO2 IS doesn’t have an in-built mechanism to sync operational data across regions; it relies solely on the database replication technology to handle this. If the use case requires us to replicate operational data and the relevant database technology can support the requirement with consistency and low latency, then we can use it. One drawback of not replicating operational data across regions is that in the event of failover, the user may be required to re-authenticate or the application may be required to restart the operations currently in progress. Another drawback is that if the routing policy is set up to route to the nearest region, the nearest region could change if a user travels across regions during an active session, which again would result in re-authenticating the user or restarting the operation in progress.

We are going to look at multi-region deployment patterns for use case 1, which is replicating identities, entitlements and configuration across all regions. Specifically, we are going to look at an alternative to the multi-master replication technique for an organization that has IT operations globally. This use case ensures high availability across regions and better response times for users, and it doesn’t partition data by region.

Multi-region Deployment for Use Case 1 - Sync Identities, Entitlements, and Configuration Across Regions

In order to understand this section better, a definition of what is meant by regional identity provider is needed. This phrase refers to the closest identity provider in the region, and has to be specified relative either to the physical user or application instance.

Figure 1: Multi-region highly available deployment of WSO2 Identity Server

In this deployment, since identities, entitlements, and configurations are replicated across all the regions, users and applications could potentially interact with any of the regional identity providers.

Traffic from users and applications will be routed to the regional identity provider using the geolocation routing technique, ensuring minimum response times for users and applications.

At any given time, each region will have an active data center and a passive data center (disaster recovery) mirroring one another. Within the region, traffic will be routed to the active data center using the failover routing technique. If the active data center goes down, the passive data center takes over as the new active data center. If the cost of managing multiple data centers is a concern, one region can act as the disaster recovery site for another region. In other words, a data center will act as the primary data center for its own region and as the secondary data center for another region. In this case there won’t be any passive data centers, because all the data centers will be actively serving users and applications from their own region. If a data center goes down, the entire region is down because there is only one data center in the region, and until that data center comes back up, traffic will be temporarily routed to the nearest region. This will cause an increase in the response times experienced by users and applications in the affected region. However, this could be considered tolerable because disaster recovery sites are not expected to operate for a long time at a stretch.
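The routing behavior described above, where one region acts as the disaster recovery site for another, can be sketched as follows. This is only an illustration of the decision logic: in a real deployment this is normally done by a DNS or traffic management service, and the region names, the preference order, and the example.com endpoints below are all assumptions made for the example.

```python
# Hypothetical preference order: each user region lists regional identity
# providers from nearest to farthest.
NEAREST_REGIONS = {
    "us": ["us", "eu", "apac"],
    "eu": ["eu", "us", "apac"],
    "apac": ["apac", "eu", "us"],
}

# Hypothetical endpoints of the regional identity providers.
ENDPOINTS = {
    "us": "https://us.is.example.com",
    "eu": "https://eu.is.example.com",
    "apac": "https://apac.is.example.com",
}


def route(user_region, healthy_regions):
    """Return the endpoint of the nearest healthy region for a user."""
    for region in NEAREST_REGIONS[user_region]:
        if region in healthy_regions:
            return ENDPOINTS[region]
    raise RuntimeError("no healthy region available")


# Normal operation: the user is served by the regional identity provider.
print(route("eu", {"us", "eu", "apac"}))
# The EU data center is down: traffic temporarily goes to the nearest region.
print(route("eu", {"us", "apac"}))
```

Because the fallback region is farther away, the second call models the temporary increase in response time mentioned above.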

Clustering will only be done within a data center, with no clustering in-between data centers.

Now we will look at two possible network topologies for application-level replication of identities and entitlements across regions.

Fully Connected Mesh Topology for Provisioning Service Providers

Figure 2: Fully connected mesh topology for bi-directional syncing

In this topology, all the regional identity providers also function as provisioning service providers (PSPs) and provisioning service consumers (PSCs). Identity and entitlement data will be synced between the PSPs using provisioning service calls. When identities or entitlements are added, modified, or deleted in a PSP, the change will be persisted in its local user store and also synced to all the other PSPs.

The configuration data of the PSPs will not be synced because the configurations differ slightly from one another in the mesh topology. Moreover, WSO2 IS doesn’t have an in-built mechanism to sync configuration data across regions. Therefore, configuration will be done manually for each region, as the number of regions is manageable.

The fully connected mesh topology is good as long as the number of PSPs is manageable. One of the drawbacks of this topology is that as the number of PSPs increases, the number of connections between the PSPs grows quadratically and becomes difficult to manage. The PSPs become tightly coupled to one another, which means adding, removing, or updating a PSP becomes difficult. Furthermore, there is no single place to govern the syncing process. Monitoring failed requests and manually retrying them has to be done in each PSP. To overcome this, we could redesign the solution to follow a hub-and-spoke topology, which we will discuss next.
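To make the quadratic growth concrete, the short sketch below counts connections for a few deployment sizes. It assumes that every pair of PSPs shares one bi-directional link, that each PSP needs one outbound connector configuration per peer, and that in a hub-and-spoke layout each PSP connects only to a single hub. The function names are made up for this illustration.

```python
def mesh_links(n):
    """Bi-directional links in a fully connected mesh of n PSPs."""
    return n * (n - 1) // 2


def mesh_outbound_connectors(n):
    """Outbound connector configurations: each PSP points at every peer."""
    return n * (n - 1)


def hub_links(n):
    """Links when each of the n PSPs connects only to one central hub."""
    return n


for n in (3, 5, 10):
    print(n, mesh_links(n), mesh_outbound_connectors(n), hub_links(n))
```

With three regions the two layouts need the same number of links, which is why the mesh is workable for small deployments; at ten PSPs the mesh needs 45 links against 10 for a hub.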

Hub-and-Spoke Topology for Provisioning Service Providers

In this topology, there is one central provisioning service hub (PSH) and all the PSPs connect to that PSH.

Figure 3: Hub and spoke topology for bi-directional syncing

The PSH can operate in two modes:

  1. Stateless hub
  2. Stateful hub

In the stateless mode, the PSH only propagates the change requests to the connected PSPs. It doesn’t persist inbound change requests. If one of the forwarded outbound requests fails, all the other successful outbound change requests have to be rolled back. In the stateful mode, the PSH persists the inbound change requests in its persistent storage, so that it can retry syncing if an outbound change request fails. This provides reliable provisioning and a central place to check for unfulfilled outbound change requests and manually retry them against individual PSPs.
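The stateful mode can be sketched roughly as below. This is a toy model and not how WSO2 IS implements provisioning: the StatefulHub class, the sender callables, and the in-memory queue standing in for persistent storage are all assumptions made for illustration.

```python
from collections import deque


class StatefulHub:
    """Toy provisioning hub that keeps failed outbound requests for retry."""

    def __init__(self, senders):
        # senders maps a PSP name to a callable that delivers a change to
        # that PSP and raises an exception on failure (hypothetical API).
        self.senders = senders
        # An in-memory queue stands in for the hub's persistent storage.
        self.pending = deque()

    def propagate(self, source, change):
        """Forward a change from the source PSP to every other PSP."""
        for name in self.senders:
            if name != source:  # do not echo the change back to its origin
                self._deliver(name, change)

    def _deliver(self, name, change):
        try:
            self.senders[name](change)
        except Exception:
            # Remember the unfulfilled request instead of rolling back.
            self.pending.append((name, change))

    def retry_pending(self):
        """Retry every queued request once; failures are queued again."""
        for _ in range(len(self.pending)):
            name, change = self.pending.popleft()
            self._deliver(name, change)
```

The pending queue is the single place an operator would inspect for unfulfilled outbound change requests. A stateless hub would have no such queue and would instead need to undo the deliveries that had already succeeded.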

By introducing a central PSH, we are introducing a single point of failure. The logical next step is to make the PSH highly available by having a mirrored deployment in another region. However, by doing this we are once again faced with the same challenge of syncing data between the regions of the PSH.

In the stateless mode, there is no operational data stored in the PSH and therefore operational data replication is not a concern. Only configuration data replication is needed, which can either be done manually in each region, as the number of regions is manageable, or be handled with a primary-secondary replication technique, as configuration changes are made very rarely in the PSH.

In the stateful mode, however, the failed outbound change requests are stored at runtime. To ensure fault tolerance of the PSH across regions, configuration as well as operational data replication is needed. Configuration can be done manually in each region. For operational data, on the other hand, we can either apply the application-level replication technique, or continue with no replication, with the limitation that the failed outbound change request queue in the passive region would always have to be emptied manually. Provided that the database supports the multi-master technique across regions, using it to sync configuration and operational data isn’t a bad solution either, because the number of PSH regions will typically be limited to 2.

Figure 4: Highly available hub and spoke topology for bi-directional syncing

The deployment architecture for the hub-and-spoke model will be very similar to figure 1, but in addition there will be the PSH deployment similar to US, EU, and APAC deployments as shown in the diagram.

A Special Look at OAuth2

Before delving into this section further, I’d like to introduce the term regional authorization server, which refers to the closest authorization server in the region, where relying party applications are registered. This has to be specified relative either to the physical user or application instance.

Although this article does not particularly focus on the region of the resources protected by WSO2 IS, this topic must be discussed for certain use cases. One such use case is the OAuth2 authorization code grant flow which consists of two requests. The first is the authorization request which is a redirect via user agent from the OAuth2 client application to the authorization endpoint of WSO2 IS. The other is the token request which is a backchannel request from the OAuth2 client application to the token endpoint of WSO2 IS. The authorization request, being a redirect via the user agent, goes to the regional authorization server of the physical user in this multi-region deployment. However, the token request, being a backchannel request from the application instance, goes to the regional authorization server of the application instance. These regions may not necessarily be the same.

For example, in a situation where the user is accessing an application, and the application instance is running in a different region to that of the physical user, there is a possibility for the following problem to occur. The authorization call will go to the regional authorization server of the physical user and successfully return the authorization code, because the user account data and OAuth2 client application data are found in the regional authorization server of the physical user. However, since the application instance is running in a different region, the token call will go to the regional authorization server of the application instance and fail because it won’t be able to validate the authorization code as the authorization codes are not synced across regions.

To overcome this problem, the applications have to make the token calls using the region-specific DNS URL. The following solutions can be considered; they’re listed in increasing order of implementation complexity.

Some of these solutions use self-contained authorization codes to encode the region that issued the authorization code. For example, a self-contained authorization code may be a JSON String in the following format:

{"random":"abcd…1234","region":"asia"}

  1. Have the logic on the application side to find the user agent’s region based on IP, and invoke the corresponding token endpoint.

    Advantages:

    • No extensions required in WSO2 IS
    • There is always only one call over the network to the token endpoint to get an access token

    Disadvantage:

    • Additional code required in the application to find the user agent’s region based on IP
  2. Use self-contained authorization codes, so that the application can read the code and invoke the corresponding token endpoint.

    Advantage:

    • There is always only one call over the network to the token endpoint to get an access token

    Disadvantages:

    • As of the current version of WSO2 IS, extensions are required to implement self-contained authorization codes
    • Customizations are required in the application to parse the self-contained authorization codes
    • Since the authorization code is not opaque to the application, the flow deviates from OAuth2 best practices
  3. Use self-contained authorization codes and respond with either an HTTP/1.0 302 or an HTTP/1.1 307 status code from the regional authorization server of the application, to inform the client to repeat the request with the specified regional authorization server.

    Advantage:

    • No customizations required in the application, as long as the application can support the 302 or 307 status codes

    Disadvantages:

    • As of the current version of WSO2 IS, extensions are required to implement self-contained authorization codes
    • In the worst case scenario, there will be two network calls to the token endpoint to get an access token
  4. Use self-contained authorization codes, so that the regional authorization server of the application instance can bridge the token call to the regional authorization server of the physical user.

    Advantage:

    • No customizations required in the application

    Disadvantages:

    • Extensions are required in WSO2 IS and, in the worst case scenario, there will be two network calls to two different WSO2 IS instances to get an access token
    • This design is generally discouraged from a security point of view because the access token which belongs to one domain is proxied through another domain, exposing it to another party apart from the client and the authorization server which issued it
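For the option in which the application itself reads the self-contained authorization code, the application-side logic might look like the sketch below. It assumes the code arrives as the plain JSON string format shown earlier (a real extension might well encode or sign it), and the region names and example.com host names are hypothetical. Only the /oauth2/token path follows the usual WSO2 IS token endpoint.

```python
import json

# Hypothetical region-specific DNS URLs of the token endpoints.
TOKEN_ENDPOINTS = {
    "asia": "https://asia.is.example.com/oauth2/token",
    "eu": "https://eu.is.example.com/oauth2/token",
    "us": "https://us.is.example.com/oauth2/token",
}


def token_endpoint_for(authorization_code):
    """Pick the token endpoint of the region that issued the code."""
    region = json.loads(authorization_code)["region"]
    return TOKEN_ENDPOINTS[region]


# Example with a made-up code value, issued by the "asia" region.
code = '{"random":"r4nd0m-v4lu3","region":"asia"}'
print(token_endpoint_for(code))
```

The application would then send its token request, including the unmodified authorization code, to the returned URL. This is also where the deviation from OAuth2 best practices shows up: the client has to understand the internal structure of the code.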

Sequence diagrams will help in understanding the flows better. After applying either solution 1 or 2, the sequence diagram will look like the one below - it is identical to the regular OAuth2 flow:

Figure 5: Sequence diagram for OAuth2 flow after applying solution 1 or 2

The sequence diagram after applying solution 3 will be as follows:

Figure 6: Sequence diagram for OAuth2 flow after applying solution 3
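The decision made by the regional authorization server of the application instance in the redirect-based flow can be sketched as follows. This is an assumption-laden illustration and not a WSO2 IS extension point: the function name, the return shape, the region names, and the example.com hosts are invented, and the authorization code is assumed to be the plain JSON string described earlier.

```python
import json

# Hypothetical region-specific token endpoints.
TOKEN_ENDPOINTS = {
    "asia": "https://asia.is.example.com/oauth2/token",
    "us": "https://us.is.example.com/oauth2/token",
}


def handle_token_request(local_region, authorization_code):
    """Return (status, headers) for a token request received in local_region."""
    issuing_region = json.loads(authorization_code)["region"]
    if issuing_region == local_region:
        # The code was issued here, so the request can be processed locally
        # (the actual code validation and token issuance are omitted).
        return 200, {}
    # Ask the client to repeat the same request at the issuing region.
    return 307, {"Location": TOKEN_ENDPOINTS[issuing_region]}


print(handle_token_request("us", '{"random":"r4nd0m-v4lu3","region":"asia"}'))
```

The sketch uses 307 because that status code requires the client to repeat the request with the same method and body, which matters for a POST to the token endpoint. Some HTTP clients turn a 302 response to a POST into a GET, so client behavior should be checked before relying on 302.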

The sequence diagram after applying solution 4 will be as follows:

Figure 7: Sequence diagram for OAuth2 flow after applying solution 4

Central Authentication Service (CAS) is a similar authentication protocol which also consists of two calls: one redirection via the user agent and one direct call to the identity provider.

Another use case where the region of the resource becomes important is the OAuth2 bearer access token introspection call. The bearer access token that is issued by the authorization server has to be validated at the resource server by sending it to the authorization server for each API invocation. This is generally acceptable for a single-region solution. However, when you expand the same solution to multiple regions, the delay introduced by an access token introspection call that goes across regions is significant. One of the mechanisms which can be used to reduce this delay is an access token cache in the gateway. This makes sure that only the first API invocation using a particular token has to go over the network for introspection. Subsequent invocations with the same access token will be validated from the gateway cache.
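A gateway-side cache of this kind can be sketched as below. The IntrospectionCache class, its parameters, and the shape of the introspection result are assumptions for illustration; the introspect callable stands in for the real network call to the authorization server.

```python
import time


class IntrospectionCache:
    """Cache introspection results per access token for a limited time."""

    def __init__(self, introspect, ttl_seconds, clock=time.monotonic):
        self.introspect = introspect  # callable(token) -> result (network call)
        self.ttl = ttl_seconds
        self.clock = clock            # injectable so the cache can be tested
        self.entries = {}             # token -> (result, expiry time)

    def validate(self, token):
        now = self.clock()
        entry = self.entries.get(token)
        if entry is not None and entry[1] > now:
            return entry[0]           # served from the gateway cache
        result = self.introspect(token)  # first use or expired entry
        self.entries[token] = (result, now + self.ttl)
        return result
```

The time-to-live is the trade-off in this design: a token that is revoked at the authorization server keeps being accepted by the gateway until its cache entry expires. A production cache would also need to bound its size and keep entries no longer than the token's own lifetime.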

Caching in the gateway would be acceptable for certain organizations depending on their security policies. However, it may not be acceptable in some organizations due to their stringent security policies. In such situations you could use self-verifiable/self-contained access tokens [2]. The advantage of this kind of access token is that the resource server does not have to introspect it by calling the authorization server that issued it. Instead, it verifies the authenticity of the token by checking its signature and then parses the information encoded in it.
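The idea of local verification can be illustrated with a small JWT-style token. To keep the sketch runnable with only the Python standard library, it signs with a shared HMAC-SHA256 key. That is an assumption of this example: deployments more commonly use an asymmetric signature so that resource servers hold only a public key. The function names and claim values are invented for illustration.

```python
import base64
import hashlib
import hmac
import json


def _b64url(data):
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text):
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input, key):
    return _b64url(hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest())


def issue(claims, key):
    """Authorization server side: build header.payload.signature."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode("utf-8"))
    payload = _b64url(json.dumps(claims).encode("utf-8"))
    return f"{header}.{payload}.{_sign(header + '.' + payload, key)}"


def verify(token, key, now):
    """Resource server side: return the claims, or None if invalid or expired."""
    parts = token.split(".")
    if len(parts) != 3:
        return None
    header, payload, signature = parts
    expected = _sign(header + "." + payload, key)
    # Constant-time comparison; the algorithm is fixed, not read from the header.
    if not hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8")):
        return None
    claims = json.loads(_b64url_decode(payload))
    if claims.get("exp", 0) <= now:
        return None
    return claims


key = b"demo-shared-secret"  # hypothetical key, for illustration only
token = issue({"sub": "alice", "scope": "read", "exp": 2000}, key)
print(verify(token, key, now=1000))
```

No network call to the issuing region is involved in verify, which is what removes the cross-region introspection delay. The cost is that such a token cannot easily be revoked before it expires, so short lifetimes are usual.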

In this article I’ve provided a high-level overview of the different multi-region deployment use cases for WSO2 Identity Server. We specifically deep-dived into use case 1, where we replicate identities, entitlements, and configurations across regions instead of relying on the multi-master database replication technique, with a special focus on organizations that have global IT administration.

In part 2 of this article series, we will deep-dive into use cases 2 and 3, where we replicate only configuration data across regions but partition user data by region, and partition all data by region respectively.

References

[1] https://www.youtube.com/watch?v=4XL1VZymTA8

[2] https://wso2.com/library/blog-post/2014/10/blog-post-self-issued-access-tokens/

About Author

  • Johann Nallathamby
  • Associate Director/Solutions Architect
  • WSO2