Exploring Event-Driven Architecture: A Beginner's Guide for Cloud Native Developers
- Srinath Perera
- Chief Architect - WSO2 Inc.
Introduction
Classical request/response architecture is driven by procedure calls: a caller waits for the call to finish, and the call continues until the work is done. Each incoming call is broken into smaller procedure calls, which may in turn call other procedures. In contrast, event-driven architecture (EDA) is driven by events such as user actions, sensor outputs, or messages from other programs or threads, which determine the execution flow. In this programming style, the program sets up event handlers (i.e., functions that should be executed when an event occurs) and waits for events to occur, executing the associated event handler when one does. A handler may emit new events that trigger more event handlers. A single program may involve many events triggering one another, and there may or may not be any response.
The goal of cloud native applications is to make use of the cloud’s unique features, such as self-service, scalability, and pay-per-use. A typical cloud native application consists of services and data stores, where the users or their clients call some of the services (we call them APIs) and services talk to data stores and each other to serve those requests.
Request/response style services are natural extensions of procedure calls to distributed systems (in this case, to the cloud). Adding EDA can enrich your architecture by enabling patterns such as fire and forget, publish/subscribe, event queues, etc.
For example, consider communicating via a call vs. over email. In the former, we wait for the call to be done, and we can’t deliver the message if the other side isn’t available to pick up the phone. The latter doesn’t wait, and with emails, delivery happens even if the other side isn’t available at the time of delivery. If you want to agree on something, you’ll call someone. However, if you’re coordinating a project, you would often do it via status updates through emails. I’m not arguing that EDA is better than request/response architecture; rather, each is the better fit for some use cases. There is a reason our mobile phones support both calls and SMS, as there is a place for each.
Hence, EDA is a useful tool in your cloud native architecture toolbox. It’s typically used for use cases such as alerts, fraud and anomaly detection, monitoring, managing complex processes, delivering events to IoT/ML systems, and event-based integration.
Key Concepts
Figure 1: EDA concepts
An event is a "significant change in state" sent from one actor of the architecture to another (https://en.wikipedia.org/wiki/Event-driven_architecture). There may or may not be a response, and, in some cases, there is an acknowledgment of reception. An event source is anything that can generate events. Event handlers listen to events and when they run, they execute all or part of the EDA program.
To avoid event handlers and event sources having to be available at the same time (to avoid coupling in time) and having to know each other (to avoid coupling in space), we use an intermediary known as an event broker or message broker. Event sources send events to the broker, and the broker in turn sends events to the event handlers. Based on how the broker delivers events to event handlers, there are two models.
In the publish/subscribe model, interested event handlers subscribe to a topic in the broker. A topic is a string that represents something meaningful in the application domain. When the event source sends an event to a topic, the broker will deliver the event to all active subscribers (the event handlers that have subscribed). This model is also called the event bus. We sometimes use the term notification model to refer to the delivery part of the publish/subscribe model, where event handlers subscribe for events with an event source (with or without a topic), and the event source sends events to subscribers.
In the queue model, interested event handlers listen to a queue in the broker. A queue is also identified by a string that represents something meaningful in the application domain. When the event source sends an event to a queue, the broker will deliver the event to only one of the event handlers listening to that queue.
In the publish/subscribe model, an event is sent to all subscribers, while in the queue model, an event is sent to one subscriber at most.
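The difference between the two delivery models can be sketched with a toy in-memory broker. This is only an illustration under stated assumptions: the `InMemoryBroker` class, its method names, the round-robin choice of queue consumer, and the topic and queue names are all hypothetical, and a real broker would add networking, persistence, and acknowledgments.

```python
from collections import defaultdict


class InMemoryBroker:
    """Toy broker for illustration only (hypothetical, not a real broker API)."""

    def __init__(self):
        self._topic_handlers = defaultdict(list)  # topic name -> all subscribers
        self._queue_handlers = defaultdict(list)  # queue name -> competing listeners
        self._next_listener = defaultdict(int)    # queue name -> round-robin counter

    def subscribe(self, topic, handler):
        self._topic_handlers[topic].append(handler)

    def listen(self, queue, handler):
        self._queue_handlers[queue].append(handler)

    def publish(self, topic, event):
        # Publish/subscribe: every active subscriber receives the event.
        for handler in self._topic_handlers[topic]:
            handler(event)

    def enqueue(self, queue, event):
        # Queue model: at most one listener receives the event.
        handlers = self._queue_handlers[queue]
        if not handlers:
            return  # this toy drops the event; a real broker would hold it
        index = self._next_listener[queue] % len(handlers)
        self._next_listener[queue] += 1
        handlers[index](event)


broker = InMemoryBroker()
received = defaultdict(list)

# Two subscribers on one topic: both should see every event.
broker.subscribe("reports", lambda e: received["billing"].append(e))
broker.subscribe("reports", lambda e: received["archive"].append(e))

# Two listeners on one queue: each event should go to only one of them.
broker.listen("lab-jobs", lambda e: received["worker-1"].append(e))
broker.listen("lab-jobs", lambda e: received["worker-2"].append(e))

broker.publish("reports", "report-1")
broker.enqueue("lab-jobs", "job-1")
broker.enqueue("lab-jobs", "job-2")
```

After this runs, both `billing` and `archive` hold `report-1`, while the two jobs are split between the workers. Round-robin is just one possible policy for picking the single receiver in the queue model.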
The broker can provide additional guarantees, known as quality of service (QoS) guarantees.
- Persistence - Save all messages to disk so that messages are not lost even if the broker or event handlers fail.
- At least once delivery - Event handlers will receive each message at least once (no message loss, but there may be duplicates).
- Exactly once - Event handlers will receive the message exactly once (no duplicates and no message loss).
- In-order delivery - Event handlers will receive events in the same order they were sent.
Event handlers should choose the guarantees they require as part of their configuration. If the broker supports it, an event handler can process the event first and acknowledge it to the broker only after processing has finished, which extends the delivery guarantee into an end-to-end guarantee of message processing.
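The interaction between acknowledging after processing and at-least-once delivery can be sketched as follows. This is a minimal simulation, not a real broker client: `RedeliveringQueue`, the message ids, and the simulated crash are assumptions made for illustration. It also shows one common way to cope with duplicates, which is to remember the ids of messages already handled.

```python
class RedeliveringQueue:
    """Toy at-least-once queue: a message stays pending until it is acknowledged."""

    def __init__(self, messages):
        self._pending = dict(messages)  # message id -> payload

    def deliver(self, handler):
        # Hand every pending message to the handler; remove it only on success.
        for message_id, payload in list(self._pending.items()):
            try:
                handler(message_id, payload)
            except Exception:
                continue  # no ack: the message stays pending and is redelivered
            del self._pending[message_id]  # ack only after processing finished

    def has_pending(self):
        return bool(self._pending)


processed_ids = set()     # stands in for a durable record of handled message ids
ledger = []               # the side effect we want applied once per message
attempts = {"count": 0}   # number of handler invocations, for illustration


def handle(message_id, payload):
    attempts["count"] += 1
    if message_id in processed_ids:
        return  # duplicate delivery: already handled, so just let it be acked
    ledger.append(payload)
    processed_ids.add(message_id)
    if attempts["count"] == 1:
        # Simulate a crash after processing but before the ack reaches the broker.
        raise RuntimeError("crashed before ack")


queue = RedeliveringQueue({"m1": "charge $10", "m2": "charge $20"})
while queue.has_pending():
    queue.deliver(handle)
```

The first message is delivered twice because its acknowledgment was lost, yet the ledger records each charge once because the handler skips ids it has already seen. In a real system the set of processed ids would have to live in durable storage.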
Two examples of message brokers are Apache Kafka and RabbitMQ. We can use a hosted version of a broker or use a platform service provided by a cloud provider, such as Azure Event Hubs.
An EDA wires the event handlers using the above concepts to deliver the logic of the EDA program. For example, the following diagram shows an EDA in which interested parties handle the medical reports generated by a hospital.
Figure 2: EDA-based medical report processing for a hospital
Implementing EDA
There are many ways to implement EDA and Figure 3 shows a typical architecture.
Figure 3: Event-driven architecture
With this design, we implement EDA using a broker. Event handlers listen to topics or queues, process events, and generate new events that trigger the next steps. We can configure the broker to persist all events to prevent message loss due to failures.
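How handlers generate new events that trigger the next steps can be sketched with a small in-process dispatcher. Everything here is an assumption for illustration: the `Dispatcher` class stands in for a broker, and the topic names and handlers are hypothetical rather than taken from the figures.

```python
from collections import defaultdict, deque


class Dispatcher:
    """Toy stand-in for a broker: routes events to handlers registered per topic."""

    def __init__(self):
        self._handlers = defaultdict(list)
        self._pending = deque()

    def on(self, topic, handler):
        self._handlers[topic].append(handler)

    def emit(self, topic, payload):
        self._pending.append((topic, payload))

    def run(self):
        # Keep processing until no handler emits anything further.
        while self._pending:
            topic, payload = self._pending.popleft()
            for handler in self._handlers[topic]:
                handler(payload)


dispatcher = Dispatcher()
log = []


def validate_report(report):
    log.append(f"validated {report['id']}")
    # Emitting a new event is what triggers the next step.
    dispatcher.emit("report.validated", report)


def notify_doctor(report):
    log.append(f"notified doctor about {report['id']}")


def archive_report(report):
    log.append(f"archived {report['id']}")


dispatcher.on("report.created", validate_report)
dispatcher.on("report.validated", notify_doctor)
dispatcher.on("report.validated", archive_report)

dispatcher.emit("report.created", {"id": 7})
dispatcher.run()
```

No handler calls another handler directly; the only link between the steps is the event each one emits, which is what lets the steps be deployed and scaled independently.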
A typical design does not expose brokers outside the organization, and communication needs to happen through a proxy defined using other protocols. For example, a proxy may be any one of the following.
- REST/JSON call or some other incoming remote procedure call.
- Events received through a WebSocket or gRPC call (both incoming and outgoing connections may push events in this case).
- XMPP – Extensible Messaging and Presence Protocol is an open, XML-based communication protocol for instant messaging, presence information, and real-time communication (e.g., chats).
- Polling Client – a client that runs periodically and checks a web page or a service.
- WebSub event – WebSub is a distributed content notification protocol between publishers and subscribers.
- GraphQL endpoint – a GraphQL endpoint may send notifications.
- Webhook – a client can register an HTTP endpoint, and the server will notify when an event has occurred.
- AsyncAPI – describes an event source as an API. The description includes message formats, message topics or channels, and other configurations such as security. The client can connect and receive events using the description.
An outgoing proxy can also be one of the above, except for the polling client.
Because a typical design does not expose brokers outside the organization, this communication needs to happen through HTTP or a related protocol such as WebSocket or HTTP/2.
We can implement event handlers using any programming language and run them as containers, services, or cloud functions (FaaS). The system will create a new container per event if we use containers. We can use a service configured with scale-to-zero settings if we want to reuse the container. However, we need a bridge that pulls data from the message broker and calls the service.
Since we use containers to implement event handlers and start them on demand, event handlers should not assume that in-memory state is preserved between processing one event and the next. If an event handler has any state, it needs to store the data in a database or some other storage, such as a Redis cache.
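A stateless handler that keeps its state in external storage might look like the sketch below. The `KeyValueStore` class is a hypothetical stand-in for a database or cache, and the failed-login scenario and threshold of three are assumptions chosen only to make the example concrete.

```python
class KeyValueStore:
    """Stand-in for external storage such as a database or a Redis cache."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def set(self, key, value):
        self._data[key] = value


def handle_login_failed(event, store):
    """Count failed logins per user; all state lives in the store, none in the process."""
    key = f"failed-logins:{event['user']}"
    # Read-modify-write is not atomic here; a real store would need an
    # atomic increment or a transaction when handlers run concurrently.
    count = store.get(key, 0) + 1
    store.set(key, count)
    return count >= 3  # True means the caller should raise an alert


store = KeyValueStore()
events = [{"user": "alice"}, {"user": "alice"}, {"user": "alice"}]

# Each call could run in a freshly started container; only the store is shared.
alerts = [handle_login_failed(event, store) for event in events]
```

Because the handler takes the store as a parameter and keeps nothing in module-level variables, it behaves the same whether one long-lived process or three short-lived ones handle the three events.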
As we discussed, we can use a message broker for events within the organization. However, crossing organizational boundaries can be done using many protocols and thus requires careful consideration. Let us consider several common use cases.
- An external system triggers an event into our system - We typically do this using a webhook. For example, let’s assume we want to be notified when a new pull request is submitted to GitHub. We must write a webhook and register it with GitHub. When the event happens, GitHub calls the webhook.
- A client submits a task and receives updates - There are two ways to implement this. The first, recommended way is to use a full duplex channel such as WebSocket, gRPC, or HTTP/2, where we can keep the connection open, and either side can push messages into the connection. The second, less ideal method is to submit the task as a service call and then use polling to get the updates.
- Our system triggers events for clients who have shown interest - This is the reverse of the first use case. The best way to implement this is to include webhooks in our APIs and let users register webhooks that we will call when events occur.
- A complex asynchronous interaction between the client and the system (e.g., a game) - Full duplex protocols like WebSocket, gRPC, or HTTP/2 are the best way to implement this use case. Either the client or server can trigger events to the other side as needed.
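For the first use case, the core of a webhook is a small function that turns an incoming HTTP request into an internal event. The sketch below leaves out the HTTP server itself. The `X-GitHub-Event` header and the `action`, `repository.full_name`, and `pull_request.number` payload fields reflect my understanding of GitHub's webhook format and should be checked against its documentation; the function name and the topic name are hypothetical.

```python
import json


def webhook_to_event(headers, body):
    """Translate an incoming webhook request into (topic, event), or None to ignore it."""
    if headers.get("X-GitHub-Event") != "pull_request":
        return None  # some other kind of notification
    payload = json.loads(body)
    if payload.get("action") != "opened":
        return None  # e.g., the pull request was closed or edited
    event = {
        "repo": payload["repository"]["full_name"],
        "number": payload["pull_request"]["number"],
    }
    return "pull-requests.opened", event


# A hand-written sample request, reduced to the fields this sketch reads.
sample_headers = {"X-GitHub-Event": "pull_request"}
sample_body = json.dumps({
    "action": "opened",
    "repository": {"full_name": "example/repo"},
    "pull_request": {"number": 42},
})

published = []  # stands in for publishing to the internal message broker
result = webhook_to_event(sample_headers, sample_body)
if result is not None:
    published.append(result)
```

Keeping the translation separate from the HTTP layer means the broker stays behind the proxy, and the function can be tested without a running server.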
Advanced EDA Implementations
When an event processor needs to wait for another specific matching event to occur, we call that event correlation. Handling event correlation in an event handler is complex. The following are some examples of use cases that require event correlation.
- Scatter-gather - waiting for all work to finish
- Reordering events
- Detecting patterns such as happened-before
- Event window-based processing
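As a rough illustration of the first item, a scatter-gather correlation can be sketched as a handler that records which parts of a job it is still waiting for and emits a combined event once the last one arrives. The `ScatterGather` class, the event fields (`job_id`, `part`, `result`), and the plain dictionary standing in for a database are all assumptions for illustration.

```python
class ScatterGather:
    """Correlate completion events by job id until every expected part has arrived."""

    def __init__(self, store):
        self._store = store  # dict-like; stands in for a database

    def start(self, job_id, parts):
        self._store[job_id] = {"waiting": set(parts), "results": {}}

    def on_part_done(self, event):
        state = self._store.get(event["job_id"])
        if state is None or event["part"] not in state["waiting"]:
            return None  # unknown job, or a duplicate or unexpected part
        state["waiting"].discard(event["part"])
        state["results"][event["part"]] = event["result"]
        if state["waiting"]:
            return None  # still waiting for other parts
        del self._store[event["job_id"]]
        # All parts are in: this is the combined event to emit.
        return {"job_id": event["job_id"], "results": state["results"]}


gatherer = ScatterGather(store={})
gatherer.start("job-1", ["a", "b"])

# Completion events may arrive in any order, and may be duplicated.
incoming = [
    {"job_id": "job-1", "part": "b", "result": 2},
    {"job_id": "job-1", "part": "b", "result": 2},
    {"job_id": "job-1", "part": "a", "result": 1},
]
outcomes = [gatherer.on_part_done(event) for event in incoming]
```

Only the last call returns a combined event. Even this simple case needs per-job state, duplicate handling, and cleanup, which gives a sense of why correlation is considered complex.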
To implement event correlation, event handlers need to save state in a database and check future events against it. Sometimes it’s possible to create a topic for the event you need to wait for. However, often you need to wait for dynamic conditions, which can’t be handled by topics. To handle complex event correlation, we use stream processing or workflows. Both have special algorithms supporting event correlation.
Workflows (e.g., BPMN specification or an implementation like camunda.io) provide persistent long-running executions. Using correlations, you can restart an execution when a specific message has arrived (based on conditions about message content or header content). You can implement full EDA programs using workflows, or you can choose to mix them with the aforementioned message broker-based architecture.
Stream Processing (a.k.a. CEP or complex event processing) provides the most power, including features such as happened-before patterns and windows. We can use event processing or CEP tools such as Siddhi, Azure Stream Analytics, or Kafka Streams when we require such complex event processing. Typically, we use event processing with event broker-based architecture as a specialized event handler for detecting complex events.
Advantages and Disadvantages
We already discussed that EDA is a better fit for some use cases than request/response architecture. As discussed, it can provide:
- Decoupling in time (participants do not have to be available at the same time).
- Decoupling in synchronization (participants do not need to wait for each other).
- Decoupling in space (participants do not need to know each other’s address).
I encourage you to refer to the ACM Computing Surveys paper, The Many Faces of Publish/Subscribe, for more details.
Many advantages result from the aforementioned decoupled nature, for example:
- Scalability - by placing event handlers in many nodes that communicate via a message broker, EDA is easier to scale. Often, we can do this on-demand without changing the code.
- If appropriately implemented, the applications can be more fault tolerant thanks to the exactly once message guarantee.
- EDA is resilient to message bursts. An EDA system can save messages to disk and process them lazily, and as long as the system can keep up with the long-term average message rate, the system will work. Also, the asynchronous nature (clients are not waiting for a response) avoids timeouts due to bursts.
However, EDA carries three primary disadvantages.
- Complexity – The EDA programming model is perceived to be complex. Envisioning an application as a set of event handlers connected by events requires practice. Also, EDA does not let us reason about logic as a series of statements like in request/response architectures. Complex use cases that involve correlation are even more challenging.
- Debugging is hard – The flow of execution is nontrivial (e.g., there may be no response). Furthermore, there is no simple equivalent of a stack trace that lets you track where the execution failed. You will have to search logs or search queues (if persistent queues are used) to find the execution’s outcome.
- Lack of skilled developers – Few developers have experience with EDA and understand it deeply.
For these reasons, we should only use EDA when its advantages are overwhelming. Otherwise, the associated complexity will eat into whatever gains you achieve through it.
Use Cases
EDA is not a great choice for client-facing applications, where we need to respond within a few seconds. However, it is often a good choice for backend processing, especially when the system waits for an event to occur and acts in response.
One rule of thumb to remember is that EDA is optimized for throughput (how much work is done per unit of time), while the request/response design is optimized for latency (how fast the system will respond). Here are some common use cases.
- Act on an event – The most common EDA use case is waiting for an event and acting (e.g., notify someone, update a database).
- Alerts – Listen to a series of events and generate an alert in response to a condition. We may use events from live data or stored data.
  - Detecting anomalous events
  - Sending product availability alerts and recommendations to the users
- Data transformation – Convert data from one format to another. In data warehousing, transformations are called extract, transform, load (ETL) processes.
- Fraud and anomaly detection – Monitoring financial transactions and instantaneously responding to potentially fraudulent transactions.
- Monitoring and managing complex processes, responding to any problems, and reducing risk. The following are some example processes:
  - Supply chains and logistics networks
  - Patient care processes
  - Customer service processes
  - Manufacturing production lines
  - Energy grid
  - Inventory management
- Gaming – EDA can act as the backbone for multiplayer gaming, allowing soft real-time updates and interactions with other players and providing a more immersive gaming experience.
- Monitoring and responding to events for IoT sensor networks.
- Collecting, processing, and delivering events to IoT systems.
- Collecting, processing, and delivering events for machine learning and analytics systems.
- Sensing patterns by listening to the firehose of events. The following are examples of data:
  - Stock alerts
  - Real estate listings, updates, and transactions
  - Sensing social media, looking for consumer behavior and preferences
  - Sensing customer transactions and touchpoints, looking for patterns
- Backend processing – Responding to events (e.g., marketing pipelines triggered by user activities).
- Microservices architecture – We often use EDA to update data across microservices.
- Edge computing – We can use EDA with edge computing to filter and/or aggregate data at the edge.
- Merging and processing multiple streams of data – We use EDA to listen, merge, and process data coming from various streams.
- Event-based integration – We use EDA with enterprise integration patterns (EIP), such as pipes and filters, scatter gather, aggregator, etc. (I’ve omitted patterns that are already mentioned).
Challenges and Considerations
Most of the data handling challenges apply to EDA as well. The following are some of the common challenges.
- Developers need to know available data streams and their formats. Like API documentation, we need to document and communicate available streams and event formats so that others can reuse them.
- Multiple event streams may use similar schemas, and schemas may evolve. Our data models and event handlers also need to handle those.
- At runtime, some events may be missing, arrive late, or arrive out of order. Our design needs to handle those.
- Just like any architecture, EDA needs to consider and handle security.
- Debugging within EDA is complex. The team needs to use best practices and tooling to enable tracing and troubleshooting of EDA applications.
If you’re starting an EDA project, I recommend considering how you will handle each challenge.
Conclusion
- For some use cases, EDA is better than request/response architectures. Hence, it’s a useful tool to have in your cloud native architecture toolbox.
- Event sources generate events. Event handlers receive them directly or through a message broker. Message brokers support publish/subscribe or queue-based communication, with QoS such as persistence, at least once, exactly once, and in-order delivery.
- An EDA wires the event handlers using the above concepts to deliver the logic of the EDA program.
- We discussed a default EDA design done with a message broker that works for most use cases. More complex designs are possible.
- To support cloud native architectures, we need to keep event handlers stateless and run them on top of a Kubernetes cluster or a FaaS. Also, we typically proxy incoming calls via a standard protocol rather than exposing the message brokers beyond the firewall.
Due to their loosely coupled nature, EDA programs fit well with cloud native architectures. They can scale easily with stability and fault tolerance. It’s a great addition to your cloud native design toolbox. However, EDA is complex and harder to debug, and it’s hard to find developers who know it well.