Streaming Data Integration: The Evolution of ETL
By Sajith Ravindra
- 16 Aug, 2019
The availability of accurate information on time is a crucial factor for a business to thrive. Organizations generate massive amounts of data about various activities and business operations they perform. However, data in raw format does not provide much value and it has to be processed using correct techniques to convert it into valuable information that’s beneficial to the business.
Converting data to information is just a part of the problem. In reality, an organization will consist of multiple operating units and it also interacts with external parties. Therefore, it’s essential to refer to data generated by multiple sources, both internal and external. Different sources will generate data in different formats and expose data in different ways. Integrating data from heterogeneous sources is an inevitable and challenging task.
Over time, the nature of data, data sources, and user demands have evolved. This has forced data integration and processing technologies to change substantially and fundamentally.
Traditional ETL for data integration
Data integration mainly involves delivering data from a source or multiple sources to a destination in an expected format. ETL aims to solve this problem by:
- Extracting data from a source or multiple sources
- Transforming data in different formats to a unified form expected by the destination
- Loading or storing the transformed data in the destination
Initially, when organizations start to realize the importance of data and data integration, the data architecture was mainly built around databases. Typically, an organization would have multiple operational databases, which stores data pertaining to different business functions and operations of an organization such as Sales, Marketing, Production, etc.
Data stored in different databases can be stored in various formats; even the same data can be stored in different formats in different systems. Therefore, the extracted data has to be transformed into a unified format, such that the data from different sources can be joined later. In this step, data is cleansed and validated against a predefined set of rules.
Figure 1: ETL
The transformed data is loaded into the target database, which is often referred to as a ‘data warehouse’. Data in the data warehouse is expected to be reliable and accurate. Business intelligence rules are run on top of the data in the data warehouse to generate reports and dashboards that help organizations make critical business decisions. Furthermore, techniques such as online analytical processing (OLAP) and data mining can also be used on top of this data to gather more insights.
The ETL tools eased the burden on the organizations who had to invest in developing their own tools for moving and analyzing data. They provide interactive UIs to design data pipelines and generate and view outputs. Mapping data between the source and the target, which could be a tedious task, can be done easily using an ETL tool with a graphical editor. Reports and dashboards could be designed and generated by business users who do not have technical knowledge thanks to such tools.
There are two major characteristics that should be noted when looking at traditional ETL.
- Due to the fact that early data architectures were predominantly based on databases, traditional ETL tools were designed to extract data from databases and load data into databases.
- The extraction, transformation, and loading of data are done as periodic batch jobs, which run only a few times a day, if not once. Data is consumed after a delay of several hours or days.
Changes in the data integration landscape
Need data in real-time
With increasing competition and demand for high-quality services and products, modern enterprises can’t afford to wait for days or hours for data to be available. In order to live up to the expectations of customers and to keep a competitive edge, it’s critical that up-to-date information is available. In other words, the availability of real-time data is an important requirement.
This adds another dimension to traditional data integration. Initially, data integration mainly focused on delivering data, but with new trends and requirements, this is not sufficient, rather, data has to be delivered on time.
Heterogeneous data sources
Modern applications disseminate data using various mechanisms in contrast to legacy applications, which stored data mostly in databases. Now, there are many types of data sources communicating through various mechanisms and protocols, such as mobile applications that fire out events, sensors that keep publishing metrics, or an audit log that records information about a system.
This requires pulling data from one data source and listening to events pushed by another data source. The data may be packed as XML, JSON, binary, or some other proprietary format. Moreover, this has to be delivered over TCP, JMS, HTTP, or simply written to a file through NFS. This diversity in data sources brings in more challenges to be solved when implementing data integration solutions.
Figure 2: Different types of data sources
Big data and fast data
As the value and importance of data continue to increase, systems are designed to be more verbose and produce more data so that valuable insights can be gathered by processing them. Integrating with multiple such sources is a norm in modern real-world deployments. This means that systems will have to capture, process, and integrate large volumes of data.
Another attribute that makes handling this large volume of data challenging is the high velocity. A system deployed in the modern enterprise will often be exposed to data generated at rapid rates as high as tens of thousands of records per second. Nowadays, it’s common that a system executes thousands of transactions per second owing to the large client bases an organization has to cater to. Consider the amount of information Amazon, eBay, or Facebook receives per second. Despite the high volume and velocity, data needs to be integrated in a graceful manner, without losing or interrupting the source with adverse effects such as back pressure.
Agile development and the need to scale
A modern-day real-world software system is a composition of a large number of components performing different tasks and it is often integrated with other external systems as well. From a data integration perspective, each component or an external system can act as both a source or/and a destination. With the dynamic and agile nature of software development practices in modern enterprises, data sources and destinations will require to join or leave systems frequently to allow rapid development.
It should be allowed to plug components to the system without substantial changes, and destinations should be able to continue work depending only on data without making assumptions about its publisher.
Furthermore, as the business grows, software systems will have to facilitate adding more functional requirements. This means more components will be added and it will require interacting with more external parties, which will result in many different complex data flows within the system. Thus, it’s essential that the system and the data integration architecture is scalable, such that it can scale up as the business grows. A poorly architectured system might end up in a tangled spaghetti state over time when more complexity is added. Therefore, it’s important that the systems are designed with agility and scalability as important design goals, which allow building robust data pipelines that can adapt to rapidly evolving environments.
Traditional ETL alone is not enough!
Designed for databases
Traditional ETL tools are designed targeting moving data from one database to another. This is due mainly to the fact that at early stages, data systems were based on databases. However, it’s not the case now. There are many different ways of storing and publishing data that have different characteristics.
Different types of data sources bring in different requirements and challenges to data integration systems. It’s essential that a modern data integration system should be able to extract data from various types of data sources, and they should be able to integrate data in-flight flowing through wire as well as data at rest in a database or a log file in a seamless and transparent manner.
Uses batch processing
The availability of up-to-date information is critical to make accurate and timely business decisions. Some use cases demand sub-second latencies. This demands data integration technologies to integrate data in real-time.
However, traditional ETL is based on batch processing. Extraction, transformation, and loading were done periodically as batch processes, where a large chunk of records are processed at once. These periodical tasks are executed only a few times a day, as these batch processes are resource-intensive and time-consuming; most organizations run them at night when computing resources are at a spare. Once data is loaded to the destination data warehouse, BI tools and other data analysis tools are used to analyze them periodically or on-demand to generate human consumable information.
The whole process of traditional ETL might take hours or days to deliver the intended final output; however, that’s not an acceptable wait time now. It’s critical that data is processed and integrated on the fly in real time to make actual use out of it.
Lack of scalability
In general, traditional ETL is about moving data from one point, that is the operational DBs, to another point, i.e., the data warehouse. Therefore, it suffers by inherent deficiencies of a point-to-point integration system. Scalability is one of the main architectural qualities a modern system should have to facilitate rapidly evolving environments.
For example, if another data tool that consumes the same data needs to be integrated into the system, then it needs to update the system to load data to the new tool. And, if the new tool expects the same data in a different format, it also requires to introduce a new transformation job. It’s possible that multiple sources may generate the same data in different formats, which will require multiple jobs to transform the same data. If one source changes its data format, it might require changes across the system. Over time, this will lead to a very messy and hard-to-maintain system.
This rigid nature of ETL architecture demands considerable effort to add and remove components to the system seamlessly. Therefore, the degree of scalability required in modern environments is challenging to achieve with traditional ETL.
The emergence of streaming data integration
In wake of the need of integrating data in real-time, the concepts of real-time ETL and enterprise application integration (EAI) became popular. Early generation real-time ETL tools were mainly built around techniques such as log-based change data capture (CDC) for databases, which allows applications to be notified about changes happening in the databases in real time. EAI technologies enabled applications to talk to each other in real time by passing messages.
Today, systems have to handle large volumes of data flowing at high speeds generated by many types of sources other than databases. Also, it should be anticipated that the amount of data that needs to be handled will increase over time. Furthermore, it’s important that data sources and consumers are able to connect or leave the system easily. The massive scale of data, diverse types of data sources, and requirements for high scalability have pushed early real-time ETL tools and EAI technologies to their limits. Streaming data integration emerged as a solution to overcome these limitations.
A stream can be thought of as a constant flow of data. Streaming data integration is all about integrating data as streams. In streaming data integration, all sources should publish their data as streams regardless of their nature. From the receiver's perspective, they only see data streams that they can subscribe to. The same stream can be subscribed by multiple clients at their will, without impacting the source. Similarly, producers can join and leave the system as required. This clear decoupling of sources and subscribers makes it horizontally scalable.
The delivery of data streams is done via a messaging system with pub-sub capability. This messaging system should be scalable and fault tolerant. This is where products such as Apache Kafka and NATS come into the picture.
Figure 3: Streaming data integration
The transformation of data is done on the fly in memory with streaming data integration and often uses stream processing techniques to process the data. Stream processing techniques can be used to perform a simple transformation of data or build a complex business logic involving temporal calculations using time windows or pattern matching. This is one of the major advantages of treating all data as streams, because it enables applying stream processing techniques on all data regardless of the nature of the data source. General-purpose stream processing engines, such as Apache Storm, Apache Flink, and WSO2 Stream Processor, can be used for ingesting data streams as they are often equipped with a large number of connectors.
Decades ago, organizations began to use ETL to integrate data and gather insights. Traditional ETL is designed to operate only with databases and uses batch processing. However, modern-day organizations can’t wait for days, hours, or sometimes even minutes for data to be available. Instead, data needs to be integrated in real time. Also, as databases are not the only sources for modern-day enterprises, traditional ETL techniques are not adequate to fulfill the data integration needs. EAI and real-time ETL came into use to fill these gaps. However, owing to increasing scale and diversity of data demands and performance, streaming data integration, which is a scalable data integration architecture, emerged as the solution that can handle large amounts of data. It treats all data as streams and in real time, performs transformations on the fly, and empowers users to utilize stream processing to build complex business logic.
- Overview on ETL Tools and Processes 2019: https://datavirtuality.com/blog-etl-tools-and-processes/
- 10 Charts that Will Change Your Perspective of Big Data: Growthhttps://www.forbes.com/sites/louiscolumbus/2018/05/23/10-charts-that-will-change-your-perspective-of-big-datas-growth/#596d23ae2926
- Sajith Ravindra
- Senior Software Engineer