What Does It Take to Deliver a Successful API?
By Nuwan Dias
- 9 Apr, 2020
Modern businesses are highly consumer driven. Delivering value to our customers is, therefore, our top priority. Making customers' tasks more convenient and efficient should be our primary goal. To do that we need ways to figure out “what” exactly makes our customers more efficient and bring them convenience in their tasks. This requires a lot of trial and error. It also requires us to build and experiment with systems and features to see if these capabilities actually bring significant value to our customers. This is the primary motivation that drives enterprise architecture to be much more disaggregated and compose-able. Heard about microservices anyone? Well, there you go. This is why microservices have become quite popular. Microservices enable traditional businesses, riding heavily on monoliths, to be disaggregated into much smaller independent units. This is what enables us to introduce new capabilities into our systems much faster and with a much lesser impact on other areas of our systems. This is what creates a platform that can help us find out “what” exactly gives value to our customers much faster and easier than we could before.
While microservices are great at making our businesses much more agile and efficient, APIs are what delivers the value of our microservices to our consumers and customers. APIs sit at the edge between our customers and microservices, connecting the two to create amazing user experiences. APIs, therefore, sit at the forefront of delivering value to our customers. In this article, we look at what we need to consider as architects, CxOs, and other decision makers to deliver a successful API system. We talk about the following points that are vital in delivering a successful API ecosystem:
- The right delivery model for APIs
- API governance
- Composability of APIs
- API security
- Scaling APIs
- Availability of APIs
- Insights through APIs
Flipping the Traditional Delivery Model of APIs on Its Head
As an organization/enterprise that delivers APIs to your consumers, you are probably familiar with the below model of delivering APIs.
Traditional model for delivering APIs
This is a model where we use a portal to design, implement, and document our APIs first. This model is deployed to our gateways and developer portals through publishing. The API then becomes available for discovery and consumption by application developers. While this model has served us well, it is now becoming a bottleneck to the agility of processes that deliver value to customers. The processes we have in place to deliver microservices are much more efficient, easier to automate, and convenient to roll-back in case of failure. As such, we need to adopt a similar model for our APIs as well. For that, we need to change our development and deployment process to follow a more of a bottom-up approach rather than a top-down approach as above. The following diagram gives an illustration of what that is:
Bottom-up approach for delivering APIs
Similar to deploying code, our APIs need to have continuous integration (CI) and continuous deployment (CD) from day one. We need to empower API developers to build, deploy, and test APIs until they are satisfied with the outcome before they can be deployed on to the portals and production gateways. The code of the APIs needs to be version controlled and managed using SCMs (Github). We need to use build automation tools such as Jenkins, Travis CI, and so on to automate the deployment of APIs into their respective environments. We also need to ensure that our developers are equipped with the proper tools required from day one to enable CI/CD on APIs and manage them similar to managing their source code of applications.
One of the key bottlenecks for adopting a bottom-up delivery model of APIs is the perceived lack of governance on APIs. API product managers are worried they would lose authority and governance of the APIs being published. This is a real problem to deal with. Organizations have a responsibility to ensure the APIs they publish follows correct standards, best practices, secured in the correct manner, and so on. Failing to do so will have a negative impact on an organization. So how do we deliver APIs in an agile manner and still have proper governance on our APIs?
The governance of APIs is essential. They are provided through a control plane that has powerful lifecycle management capabilities for APIs.
This is where the value of a good ‘Control Plane’ for APIs becomes important. The API control plane needs to support good lifecycle management capabilities of APIs so that API product managers can review and consent before APIs are published to the portals and propagated to upper environments through CI/CD. The ability to approve the design of APIs through configurable workflows, the ability to validate the schemas used for best practices and security, and so on are essential capabilities to have.
An API is a uniform interface that integrates heterogeneous microservices
APIs are no longer simple interfaces for HTTP (only) microservices. A modern enterprise architecture consists of many different types of microservices. You could have teams developing microservices that are exposed to gRPC, WebSockets, as serverless functions such as AWS Lambdas, and so on. An application that needs to consume these services however requires an easy to understand, uniform interface to access these services. An application would require a single authentication endpoint that grants the required access to the services, a single SDK for these services, a consistent interface that has a single source of consistent documentation, and so on. These are all granted to an application by an API layer. An API is therefore not a simple proxy to a set of services. It is a unit that deals with the nuances of integrating heterogeneous microservices and composes them into expo-sable uniform interfaces to be consumed by applications.
The success of APIs has encouraged many organizations to expose their business capabilities as public APIs. Many would start by exposing APIs for internal purposes only and then grow them to be publicly exposed as well. This rich adoption of APIs has seen massive growth in the last few years and resulted in APIs gaining huge popularity. This has naturally caused APIs to be a rich hunting ground of attackers to try and steal sensitive information or cause harm to organizations in other possible ways. We need to be on high alert regarding the security matters of APIs and consider API security as a high priority item. It should never be an afterthought and must always be a prominent checkbox to tick off whenever you deploy an API, even for the first time.
API security is almost always thought about in the forms of OAuth2.0 based authentication and authorization. While it is perfectly true that OAuth2.0 has established itself as the de-facto standard for API security, the security of APIs has to be given thought well beyond authentication and authorization. I view the security of APIs in 3 folds.
- Prevention of malicious content and DOS attacks
- Authentication and authorization
- Security through continuous learning of patterns identification of anomalies
The 3 folds of API security
Malicious content and DOS attacks
A client (attacker) making a request to an API has full control over the messages that it sends. These messages can go through many layers of services and if malicious, can cause potential harm deep into a system. These could be messages that are intended to perform injection attacks (SQL injections), very large messages that result in consuming a lot of server resources, messages that contain XML bombs, and so on. A malicious client app could also make a huge number of API requests causing the servers to run out of resources to serve the genuine users of the system (DOS attacks). A web application firewall or API gateway could be used to prevent these types of attacks on APIs. These are systems that could inspect the content of messages and validate them against predefined schemas or rules (patterns) and only accept messages that fall within the defined boundaries. They are also capable of rate limiting client requests to prevent clients from sending a huge number of messages within a very short time frame, thus preventing potential DOS attacks.
Although it is technically possible to use either an API gateway or a web application firewall to prevent these types of attacks, a web application firewall is better suited for the purpose. This is because web application firewalls are specialized in these types of security domains whereas API gateways are generally responsible for a lot more tasks in these systems. Security is a domain that should be specialized and something that requires a full time commitment to research and innovation. Security gateways should be updated as soon as new vulnerabilities are uncovered and patches are issued for them. This is something that is best done by systems that are specialized and focused on the domain. This blog post by Alissa Knight compares the responsibilities of each in detail and discusses why it makes sense to use both layers in an enterprise architecture.
Identity verification and authorization
The verification of identity and access control is something most of us in the API domain is familiar with. This consists of granting access to API resources based on a valid credential that could range from anything between an OAuth2.0 access token, API key, Basic Authentication header, client certificate, and so on. It also involves checking if the presented credential has the required level of permission to access the resource being requested. A system should not allow global access to its resources based on a valid credential only. It should have its systems designed in a way that either checks or at least leave provisions for performing further access control. For example, any user with a valid username and password should be allowed to read product details on an online retail store. But only users with admin permissions should be allowed to update its product detail. Access control sometimes goes well beyond role-based checks. There are also systems that would access control based on date and time (access allowed only between 8 am to 5 pm on weekdays), based on request quotas and so on.
API gateways specialize in these types of authentication and authorization checks. They abstract out these requirements into standard specifications and protocols and allow client applications to interact with them using these mechanisms, such as OAuth2.0 for example. Most API gateways solutions are also capable of propagating user context to downstream (back-end) APIs. Since these authentication and authorization checks terminate at the API gateway, the downstream APIs would by default have no context of the user making these requests. Downstream APIs sometimes need to know details of the user accessing the API to execute its own logic. It therefore becomes the responsibility of the API gateway to propagate user context to downstream APIs.
Continuous learning to identify patterns and detect anomalies
Stolen credentials are hard to track. If someone hacks your API key or access token, our firewall or identity verification layer alone is not going to be sufficient to detect that someone is using a hacked credential. This is one reason why OAuth2.0 access tokens are far safer in use compared to API keys or basic authentication credentials. OAuth2.0 access tokens have a (relatively) short time span and even if hacked it can only be used until the token expires. The detection of stolen credentials or improper use of credentials can only be detected by observing access patterns of users. If a token is used by a user in a particular country, and if the same token is used by someone in a different country a few minutes later, chances are that the token has been stolen and our systems should be able to detect such scenarios and either block the suspected sessions or require further authentication such as through MFA (multi factor authentication).
API gateways alone cannot protect systems against these types of attacks. API gateways are generally clustered across different networks and they may not necessarily share state and access history of users between themselves. API gateways, however, can work with data analytics solutions which include some form of machine learning and pattern analysis solutions. These systems would track user access history and patterns and alert the API gateways when something doesn’t seem right. The gateway can then take appropriate action with regards to validating user requests.
Auto scaling APIs
Many organizations are moving their infrastructures to Cloud. This move doesn’t come for free. It costs a good deal to run all or most of your enterprise IT on third party IaaS providers such as AWS, Google, Azure, and so on. All IaaS providers adopt a pay as you go model. Meaning that you pay for the number of resources that you use. It is therefore critical that you consume the optimal number of resources that are necessary for running your systems with minimal over-provisioning of servers. With traditional enterprise IT, we would capacity plan our system to be able to cater to a peak load. If our systems require 10 servers to cater average system load but require an extra 10 to cater peak load, we would run our systems with 20 servers to be on the safe side. When the organization itself owned the infrastructure this wouldn’t be a problem. But when consuming infrastructure from third party IaaS providers this would mean that we’re spending money on something that we barely use. To solve this problem we need our APIs (and API gateways) to be able to scale up and scale down on demand, fast. Autoscaling is one key characteristic of cloud-native enterprise architectures. And almost all IaaS providers provide facilities for autoscaling. However, for autoscaling to be effective our software needs to be able to scale up and down fast as well. Some points to consider with scaling our software are:
- Boot up delay
- Dependencies on other systems
- State replication
The faster your processes boot up, the easier it is going to be to scale up your system. If a process takes 30 seconds or more to startup, you need to start scaling your system at least 30 seconds before you actually need the process up and running. The longer a process takes to start, the earlier you need to start the scaling process. Scaling sometimes cannot be done for a single process alone. Your APIs and API gateways may depend on other helper processes for executing its functionality. This means that you need to consider scaling up/down these helper processes as well. The more independent your APIs and API gateways are, the easier it is going to be to scale your system. If your APIs and API gateways maintain state, either within themselves or externally, you will need to consider replicating the state of the system when scaling APIs. Stateless systems are usually much easier to scale compared to stateful ones. A fast booting, independent and stateless API is therefore ideal for auto-scaling systems.
The availability of systems in today’s world is becoming absolutely critical. The impact on a business due to downtime is increasingly becoming unbearable. We should aim for our APIs therefore to be 100% available, and that is no easy task. Assuming that we’ve taken care of the scale related matters discussed earlier, the key to focus upon here is resiliency. We’re all naturally adapt to creating robust systems. But something that we have to admit and accept is that at some point in time, something will fail. Something you did not anticipate is bound to happen and cause trouble. It is therefore crucial that we think about what happens on failure, how do we recover, and what kind of back-up systems do we have in place for our APIs. Everything we discussed in the previous section on scale is important to build resilient systems. In addition to that we also need to think about:
- How fast we can recover a system when/if it fails
- High availability of systems (data center availability, regional availability, and IaaS provider availability)
Recovering a system can be harder than it could appear. The more dependencies a system has, the harder it gets to recover upon failure. This is an area where containers and platforms such as Kubernetes can be a lifesaver. These kinds of platforms have auto-healing capabilities that provide much needed robustness to the system. Of course, they have their own complexities and limitations, but the level of robustness and ease of management they provide to our APIs are well worth it. As it is with scaling, the boot-up time of your APIs, their level of independence, and statefulness are important factors for recovering an API system. The more cloud native your APIs are, the easier they are going to be to recover upon failure.
High availability is something we’re all familiar with. It simply means having a backup for each server, process, filesystem, database, etc. in your system. But we also need to think about data center and regional availability zones. What happens if our entire data center or region goes down? If you are running your infrastructure on-premise, you need to plan to have a backup infrastructure in a different physical location. Your APIs need to be deployed in both locations. For that to be possible you need to build systems and processes which allow you to easily deploy APIs across multiple data-centers without adding more work-overhead to your API developers. And these need to be done efficiently with as much automation as possible. The same applies even if you are relying on infrastructure by IaaS providers. You need to think about availability zones and how you can replicate your data across various availability zones easily, with as little work-overhead as possible.
The availability of cloud service providers has also been put into a question a lot lately. What happens if a particular service of an IaaS provider fails, globally, even for a short time? Are your systems resilient enough so that you have a backup running on a different IaaS provider that you can cut over to? For example, what if the AWS RDS service fails on a given region for a short while? Would you have a back-up on Azure that you can cut over to? It is definitely not a good idea to put all your eggs in a single basket. I’ve worked with a number of customers who have successfully deployed their APIs across IaaS providers in different regions. It may sound like an expensive alternative to an unlikely problem. And yes, it would be unless carefully thought and designed. The key here is to build a scalable system that is distributed across IaaS providers and distribute the system load across the entire infrastructure. This way, you pay for what you use only and leave provisions to scale whenever necessary.
3 Type of insights for APIs
APIs are the driving force behind the economy/revenue of many digital enterprises today. As such, knowing how your APIs perform, knowing what works, what doesn’t and having a good set of data to do accurate course corrections is critical with APIs. API based insights can be split into 3 different categories such as:
- Operational insights
- Error diagnosis
- Business insights
Operational insights, also known as monitoring, is critical for any organization to ensure their APIs are healthy, thus ensuring a smooth-running business. Having a monitoring system for your APIs more often than not makes it possible for you to be informed something is about to go wrong before it happens. The opposite, of course, is that the absence of a monitoring system would only make you aware of failure after it has happened, typically through customer complaints. Needless to say, knowing a failure before it happens makes it much easier for you to deal with it, and with much less pressure as well. The best part about it is that your customers will never know there was a failure, given that you had taken necessary steps to negate the impact on customers caused by such failures.
Assume a situation where one of your services/APIs start running out of memory. A good monitoring system would detect the gradual growth of memory consumption of the API and alert when a particular threshold passes. This opens up a window of opportunity for system administrators to take immediate action that prevents your customers from being impacted by this incident. In such a situation, the typical course of action would be to start a few backup processes of the same. Since we are yet to identify and fix the root cause of the failure, it's highly likely the backup processes will start growing out of memory after some time as well. But having a set of backup processes gives us the opportunity to keep restarting the faulty processes and let the backup processes handle customer requests, and continue doing this in a round-robin fashion until the root cause is identified and fixed. So although we are running with faulty processes, there is zero customer impact due to the incident, which is a major win from a business perspective.
Once an incident has occurred and identified either through operational monitoring or customer complaints, the next immediate step to take is to identify the root cause of the incident and fix it. Error diagnosis plays a critical role in identifying the root cause. The speed at which you can get to the data for diagnostic purposes, the ease of doing so, and the amount of data you collect about the failure are all important factors to consider and implement. System logs are the first thing to look at to identify the cause of an incident. As such, it is important you collect all run time logs from your APIs and services and have them in an indexed form for easy and fast searching. The ability to get to the logs of system events that occurred during a particular time frame is also important. Once you have got to the logs of an incident, the logs itself might unveil the cause of an incident, such as an insufficient disk space error for example. However, there will be cases where the logs only give a hint of a cause and don’t necessarily unveil the actual cause itself. In such situations it is necessary to enable further logging, tracing, get memory dumps, network (TCP) dumps, and so on. You should aim to build systems that enable such troubleshooting at ideally zero, or minimal customer impact. One popular pattern for doing so is to isolate a set of faulty nodes into a separate cluster that either doesn’t receive customer traffic or only receives a small portion of customer traffic and perform troubleshooting on them.
As mentioned before in this article as well, APIs are the key revenue driver for many digital enterprises today. It is therefore natural for businesses to measure business success and growth through the use and adoption of its APIs. It also makes total sense to make APIs the driver of an organization’s business strategies. In doing so we need to make sure our APIs are delivering the business values and growth we are aiming for. To do that, it is vital we measure the business impact of APIs. We need a system that captures all data of APIs relevant to achieving our business goals. This could be things such as the number of new API consumers for a month, the number of new apps built on our APIs, response time improvements, user growth in a particular region (after a marketing campaign targeted towards that region), and so on. Since APIs are the entry point to an organization’s digital services, they are a great source to tap into and measure business KPIs.
Here is a summary of the key points discussed in this article
- Delivering value to our customers is our top priority. Microservice based architectures are built to be able to deliver reliable software at a faster pace, thus delivering better value to customers.
- APIs are the fundamental layer in enterprise architectures makes it possible to create digital experiences.
- Modern APIs are developed bottom-up, giving more prominence to the API developer and CI/CD processes.
- API governance plays a major role in ensuring APIs are delivered right and the right APIs are delivered.
- APIs should be composable. An organization should have the capability to compose heterogeneous collections of services into APIs.
- API security is 3 fold - content inspection, identity verification and authorization, and pattern analysis for abnormalities.
- APIs should be scalable to meet demands of the cloud native era.
- 9’s availability is no more. The demand is for 100% availability.
- API insights are vital in sustaining and growing a digital enterprise.