What is Stream Processing?
- Srinath Perera
- Chief Architect - WSO2
Stream processing is a technology that lets users query a continuous data stream and quickly detect conditions, within a short period from the time the data is received. The detection period may vary from a few milliseconds to minutes.
For example, with stream processing, you can query a data stream coming from a temperature sensor and receive an alert when the temperature reaches the freezing point.
Stream processing is also known as real-time analytics, streaming analytics, complex event processing, real-time streaming analytics, and event processing. Although some terms historically had differences, tools have now converged under the term stream processing (see Stream Processing vs. CEP for details).
It is one of the big data technologies, popularized by Apache Storm, and it now has many contenders.
Why is Stream Processing Needed?
Big data established the value of insights derived from processing data. Such insights are not all created equal: some insights are much more valuable shortly after an event has happened, and that value diminishes very fast with time. Stream processing targets such scenarios. Its key strength is that it can provide insights faster, often within milliseconds to seconds.
Stream processing was introduced and popularized as a “technology like Hadoop, but one that can give you results faster”.
The following are some of the secondary reasons for using stream processing.
- Some data naturally comes as a never-ending stream of events. To do batch processing, you need to store it, stop data collection at some point, and process the data. Then you have to do the next batch, and then worry about aggregating across multiple batches. In contrast, streaming handles never-ending data streams gracefully and naturally. You can detect patterns, inspect results, look at multiple levels of focus, and easily look at data from multiple streams simultaneously.
- Stream processing naturally fits with time series data and detecting patterns over time. For example, if you are trying to detect the length of a web session in a never-ending stream (this is an example of trying to detect a sequence), it is very hard to do with batches, as some sessions will fall into two batches. Stream processing handles this easily (see the sketch after this list). If you take a step back and consider, most continuous data series are time series data. For example, almost all IoT data are time series data. Hence, it makes sense to use a programming model that fits them naturally.
- Batch processing lets the data build up and tries to process it all at once, while stream processing handles data as it comes in and hence spreads the processing over time. Because of this, stream processing can work with a lot less hardware than batch processing. Furthermore, stream processing also enables approximate query processing via systematic load shedding, so it fits naturally into use cases where approximate answers are sufficient.
- Sometimes the amount of data is so huge that it is not even possible to store it. Stream processing lets you handle large, firehose-style data and retain only the useful bits.
- Finally, there is a lot of streaming data available (e.g. customer transactions, activities, website visits), and it will grow faster with IoT use cases (all kinds of sensors). Streaming is a much more natural model for thinking about and programming those use cases.
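As a sketch of the time-series point above, here is a minimal windowed session query in a Siddhi-style streaming SQL (the dialect used by WSO2 Stream Processor, discussed later in this article). The stream name, attributes, and window size are illustrative assumptions, not a real schema.

```sql
-- Illustrative page-visit stream (names are assumptions for this sketch)
define stream PageVisitStream (sessionId string, url string);

-- Count page views per session over a sliding 30-minute window;
-- updated counts are emitted as new events arrive
from PageVisitStream#window.time(30 min)
select sessionId, count() as pageViews
group by sessionId
insert into SessionActivityStream;
```

Because the window slides continuously with time, a session that would straddle two batch boundaries is handled naturally.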
Stream processing, however, is not a tool for all use cases. One good rule of thumb is that if processing needs multiple passes through the full dataset, or requires random access (think a graph dataset), then it is tricky to apply streaming. Likewise, training machine learning models usually needs multiple passes over the data, so it can’t use stream processing. On the other hand, if processing can be done with a single pass over the data, or has temporal locality (processing tends to access recent data), then it is a good fit for streaming.
How Can You Do Stream Processing?
If you want to build an app that handles streaming data and makes decisions in real time, you can either use a tool or build it yourself. The right choice depends on, among other things, how much complexity you plan to handle, how much you want to scale, and how much reliability and fault tolerance you need.
If you want to build the app yourself, place events in a message broker topic (e.g. ActiveMQ, RabbitMQ, or Kafka), write code to receive events from the topics in the broker (they become your stream), and then publish the results back to the broker. Such code is called an actor.
However, instead of coding the above scenario from scratch, you can use a stream processor and save time. An event stream processor lets you write logic for each actor, wire the actors up, and hook up the edges to the data source(s). You can either send events directly to the stream processor or send them via a broker.
An event stream processor does the hard work of collecting data, delivering it to each actor, making sure the actors run in the right order, collecting results, scaling if the load is high, and handling failures. Some examples are Storm, Flink, and Samza. If you would like to build an app this way, please check out the respective user guides.
Since 2016, a new idea called streaming SQL has emerged: a language that enables users to write SQL-like queries against streaming data. There are many streaming SQL languages on the rise:
- Projects such as WSO2 Stream Processor and SQLStreams have supported SQL for more than five years
- Apache Storm added support for Streaming SQL in 2016
- Apache Flink has supported Streaming SQL since 2016
- Apache Kafka added support for SQL (which they called KSQL) in 2017
- Apache Samza added support for SQL in 2017
- Apache Beam added Streaming SQL in 2017
With streaming SQL languages, developers can rapidly incorporate streaming queries into their apps. By 2018, most stream processors supported processing data via a streaming SQL language.
Let’s understand how SQL maps to streams. Think of a never-ending table where new data appears as time goes on. A stream is such a table of data in motion. A record or a row in a stream is called an event; it has a schema and behaves just like a database row. To understand these ideas, Tyler Akidau’s talk at Strata is a great resource.
The first thing to understand about streaming SQL is that it replaces tables with streams.
When you write SQL queries, you query data stored in a database. In contrast, when you write a streaming SQL query, you write it over the data that is currently there as well as data that will arrive in the future. Hence, streaming SQL queries never end. But isn’t that a problem? No, because the output of those queries is also a stream: an event is placed in the output stream as soon as it matches the query, so output events are available right away.
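For example, here is a minimal sketch of such a filter query, written in a Siddhi-style streaming SQL; the BoilerStream schema and the threshold are illustrative assumptions.

```sql
-- A boiler temperature stream (illustrative schema)
define stream BoilerStream (bid string, temperature double);

-- Emit an alert event whenever a reading exceeds 350
from BoilerStream[temperature > 350]
select bid, temperature
insert into BoilerAlertStream;
```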
A stream represents all events that can come through a logical channel, and it never ends. For example, if we have a temperature sensor in a boiler, we can represent the output from the sensor as a stream. Classical SQL ingests data stored in a database table, processes it, and writes the results back to a database table. In contrast, the query above ingests a stream of data as it arrives and produces a stream of data as output. For example, assume there is an event in the boiler stream once every 10 minutes; the filter query produces an event in the result stream as soon as an incoming event matches the filter.
So you can build your app as follows: you send events to the stream processor, either directly or via a broker; you write the streaming part of the app using streaming SQL; and finally, you configure the stream processor to act on the results. The last step is done either by invoking a service when the stream processor triggers, or by publishing events to a broker topic and listening to that topic.
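The following sketch shows how these parts might fit together, again in a Siddhi-style streaming SQL. Here a log sink stands in for the “act on the results” step; in a real deployment it would be a broker topic or a service invocation, and the annotation parameters shown are assumptions for illustration.

```sql
-- The output stream is wired to a sink; 'log' simply prints the results,
-- standing in for a broker topic or an HTTP callback in a real deployment
@sink(type = 'log', prefix = 'BoilerAverage')
define stream BoilerAverageStream (bid string, avgTemp double);

define stream BoilerStream (bid string, temperature double);

-- Average temperature per boiler over a sliding 1-minute window
from BoilerStream#window.time(1 min)
select bid, avg(temperature) as avgTemp
group by bid
insert into BoilerAverageStream;
```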
There are many stream processors available (see the Quora question: What are the best stream processing solutions out there?). I would recommend the one I have helped build, WSO2 Stream Processor. It can ingest data from Kafka, HTTP requests, and message brokers, and you can query the data streams using a streaming SQL language. WSO2 Stream Processor is open source under the Apache License. With just two commodity servers, it can provide high availability and handle a throughput of 100K+ TPS. It can scale up to millions of TPS on top of Kafka.
Who Uses Stream Processing?
In general, stream processing is useful for use cases where we can detect a problem and have a reasonable response to improve the outcome. It also plays a key role in a data-driven organization.
Following are some of the use cases:
- Algorithmic trading and stock market surveillance
- Smart patient care
- Monitoring a production line
- Supply chain optimizations
- Intrusion, surveillance and fraud detection (e.g. Uber)
- Most smart device applications: Smart car, smart home, etc.
- Smart grid - (e.g. load prediction and outlier plug detection - see Smart grids, 4 billion events, throughput in range of 100Ks)
- Geofencing, vehicle, and wildlife tracking - e.g. TFL London
- Sports analytics - Augment sports with real-time analytics (e.g. this is work we did with a real football game - Overlaying Real-time Analytics on Football Broadcasts)
- Context-aware promotions and advertising
- Computer system and network monitoring
- Traffic monitoring
- Predictive maintenance (for example, check out this InfoQ article: Machine Learning Techniques for Predictive Maintenance)
- Geospatial data processing
For more discussions about how to use stream processing, please refer to 13 Stream Processing Patterns for Building Streaming and Real-time Applications.
I hope this was useful! To learn more about streaming SQL, please check out the article on Stream Processing 101: From SQL to Streaming SQL. To try out stream processing, check out WSO2 Stream Processor.