sa
2014/12/22
22 Dec, 2014

[Article] Bloom Filter Integration for Joins in Large Siddhi Windows

  • WSO2 Team
  • Content Writer - WSO2
Archived Content
This article is provided for historical perspective only, and may not reflect current conditions. Please refer to the WSO2 analytics page for more up-to-date product information and resources.

Contributors to this article

  • Ramindu De Silva
  • Dhanushka Priyasad
  • Thilini Anoratna

Table of contents

  • Introduction
  • Overview
    • Simple window
    • Bloom filter based-window
      • Counting bloom filter (CBF)
      • Overlapping/multiple bloom filter (OBF)
  • Implementation
  • Performance analysis
  • Limitations
  • Conclusion
  • References

Introduction

Today, enterprises have continuous transactions that need real-time processing to generate valuable information. The fast-moving market makes it inevitable for enterprises to be more agile and responsive. WSO2 Complex Event Processor (CEP) identifies the most meaningful events within the event cloud, analyzes their impacts, and acts on them in real time.

Siddhi is the event processing engine that is used by WSO2 CEP. It processes events that are triggered by various event sources and notifies appropriate complex events according to the user specified queries [1].

A Siddhi Window contains a limited subset of events. Events can be selected either using a time window or a length window. Output can be obtained by using a batch window or a sliding window [2]. This project focuses on the sliding window type that triggers whenever a new event is added.

Below is a sample window that contains a maximum of 2,000 events and Events will be added from the TickEvent stream.

  
from TickEvent[symbol==’IBM’]#window.length(2000) into filterdEvents

When an event arrives at the NewsEvent that will be matched with the last 2000 TickEvents, and if the TickEvent’s symbol == NewsEvent’s company, the output event will be generated and sent via JoinStream.

  
from TickEvent[symbol==’IBM’]#window.length(2000) as t 
join NewsEvent as n 
on t.symbol == n.company 
insert into JoinStream 

After analyzing test data results, we have explained how performance of the large length and time windows can be increased with the integration of bloom filters.

Overview

Simple window

Since a comparison should take place between window implementation with and without a bloom filter integrated, a sample Window was implemented with the basic functions of add and search for an event. In the implemented simple window, all the incoming events enter and the search is done for the join condition throughout the window.

Bloom filter-based window

A Bloom filter is a data structure designed to check, rapidly and memory-efficiently, whether an element is present in a set. The price paid for this efficiency is that a Bloom filter is a probabilistic data structure, which tells whether an element is “definitely not” in the set or “may be in the set” [3].

An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined, each of which maps or hashes some set of elements to one of the m array positions with a uniform random distribution [4]. When choosing a hash function, “MurmurHash” was used because its non-cryptographic, fast, uniformly distributed and simple to use [5]. A bloom filter has several types and extensions for its implementation. We have used Hadoop implementation of a simple bloom filter and a counting bloom filter.

If a Bloom filter is used and whenever a join event comes, searching the entire window will not be necessary. A Bloom Filter can be used in order to check whether an event exists or not. A Bloom filter will result with a boolean value as follows;

  • true - the searching value “maybe” exist in the window, So that we need to search the window and join with the events if it exists.
  • false - the searching value “definitely” doesn't exist in the window, So that we can skip the searching in the window [6].

Hence, with the usage of a Bloom Filter for a Siddhi Window, join will be efficient.

Counting bloom filter (CBF)

Counting filters provide a way to implement a delete operation on a Bloom filter without recreating the filter a fresh. In a counting filter the array positions (buckets) are extended from being a single bit to being an n-bit counter.

Figure 1

According to Figure 1, the first event stream will go through both Window and Filter. The window can only accommodate for a specific limit (Time or Length). When it reaches the limit, it will remove the oldest event from the window and insert the newest member. When the second event stream comes, it is checked with the first stream using the bloom filter whether it has matching events or not.

The filter gives a boolean output where “false” means that the event has no matches. But if the filter gives “true”, it means an event might have a matched event in the first stream. If an event is passed from the membership test of the filter, next it is checked with the Window and if the event actually matches, a matching event list will be returned.

Overlapping/multiple bloom filter (OBF)

Figure 2

According to Figure 2, in the overlapping bloom filter implementation, each and every bloom filter will be initiated and added to an array list, right after its previous filter completing adding number of events, which is equal to the non-overlapping percentage of the window size. And whenever a new filter is added, the “newestAddingFilter” will be increased by one and whenever the oldest filter contains the maximum amount of keys, “oldestAddingFilter” will be increased by one. New events will be added to the bloom filters from “oldestAddingFilter” to “newestAddingFilter”.

Note: “oldestAddingFilter” and “newestAddingFilter” track the upper and lower bounds of the filters respectively.

Figure 3: Non-overlapping events

Searching will be done from the 0th bloom filter to the “oldestAddingFiter” since all the current events reside between those two filters (filters with expired events gets removed accordingly to the requirements). As per the return values,

  • true - search will be carried through the eventQueue because the searched key might be in the queue.
  • false - search will not be carried out because definitely the searched key wont be in the queue.

Implementation

Below is a sample of how to initialize an Overlapping-based window.

  
window = new OverlappingBloomFilterBasedWindow();

Expression[] expressions = {  new IntConstant(10), 
                              new IntConstant(1), 
                              new IntConstant(0), 
                              new IntConstant(15825), 
                              new IntConstant(0) };

window.init(expressions);
window.add(StreamEvent);
window.getLast(true);

Parameters are as follows;

Expression[] is a data structure which passes data for the window.

expressions[0]: Expected number of events in the window

expressions[1]: Array which contains the join attribute ID

0 - beforeWindowData

1 - onOrbeforeWindowData

2 - outputDataWindowData

expressions[2]: Array location of the join attribute id, since each above array might contain several other attributes of the StreamEvent as well.

expressions[3]: Bloom filter size (can be calculated using the following equation)

filter size = (-1 * (noOfElements *log(falsePositiveRate)) /(log(2)2)

Note: “falsePositiveRate” defines the number of false positives produced by the filter but not

from the window

expressions[4]: Overlap percentage of the filter (for overlapping based window)

Percentage: 0 - 100 (excluding 100)

Implemented CBF based window can be initialised in a similar manner except not having the expressions[4] parameter because it does not overlap.

Performance analysis

Throughput was obtained for 3 window types under the following conditions assuming the windows perform as length windows;

  • Window Type = Simple, Counting Bloom Filter based, Overlapping Bloom Filter based
  • Window Size - No of events in a window at a given length
    • 10000, 20000, 50000, 100000
  • Match Rate - The two events streams will have a specific event match rate
    • 1%, 10%, 20%
  • Overlap Percentage for Overlapping Bloom Filter - Percentage of overlapping number of events between two filters
    • 0%, 10%, 20%, 50%, 40%, 60%, 80%
  • False Positive Rate for OBF - Expectancy of the false positive ratio.
    • 0.0005

Figure 4: Throughput for all window types given a specific match rate

According to Figure 1, it is depicted that both bloom filter-based windows perform way better than a simple Window without a filter implementation. Throughput was high for both of the filter-based windows and among them overlapping bloom filter based-window showed the best performance.

Figure 5: Overlapping Bloom Filter

For the OBF-based window, the throughput depends on the overlapping percentage of the filters. Figure 5 shows the results that were obtained for several overlap percentages with a specific match rate.

We have initially implemented a simple window that represents the basic functions of a Siddhi window. Then we implemented a counting bloom filter-based window (CBF) to search whether a specific element exists or not. By comparing the throughput of the simple window with the CBF, we observed that the throughput has immensely improved for the search using a bloom filter. Next, we implemented an overlapping bloom filter (OBF) that reduces the removal cost of elements from the filter. After comparing the three types, it was observed that the highest throughput is given by OBF given that the windows operate as length windows.

Limitations

Expected number of events (passed within the init method) should be a value approximately equal to the actual window size.

  • If the window size is very large than the expected value;

    CBF - When the events are kept longer than expected in the window, the bloom filter gets filled with keys and results with a ‘maybe’ every time a join event comes. In that particular scenario, the performance will be reduced drastically even more than the window without a filter implementation.

    OBF - Filter will be created expecting less number of events than the real scenario. Hence, it might need more filters to add the events. More filters will consume more memory and time since the searching has to be done from oldestAddingFilters to newestAddingFilters. So the overall performance might reduce.

  • If the window size is very small than the expected value;

    The filter will be created expecting a large number of events, but in a real scenario, the filter will be filled with several adding sessions so it might contain already expired events from the window. Hence, the filter will give more false positives, which might exceed the defined false positive rate.

Figure 6

Conclusion

When windows are large in the Siddhi Engine, that information needs to be stored in a disk or in an in-memory grid. The objective is to support efficient joining by using bloom filters.

An integrated bloom filter-based window is capable of creating the bloom filters according to the given expected number of events to the window. Before searching the window as usual, it checks with the bloom filter and returns a boolean. Hence, the search for a non-existing event in the window will be more efficient than the prevailing method. Implemented counting bloom filter and overlapping bloom filter-based windows perform an efficient search using their membership tests and outputs a matched event list for a specific complex event request. The above-mentioned filter-based windows were implemented assuming as length windows. They can be also used in time windows, but there are several facts to be concerned as mentioned in the limitations.

References

  1. https://srinathsview.blogspot.com/2011/12/siddhi-second-look-at-complex-event.html
  2. https://docs.wso2.com/display/CEP310/Windows
  3. https://billmill.org/bloomfilter-tutorial/
  4. https://ilyasterin.com/blog/2010/02/implementing-bloom-filter-with-a-murmur-hash-function.html
  5. https://burtleburtle.net/bob/hash/doobs.html
  6. https://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html
 

About Author

  • WSO2 Team
  • Content Writer
  • WSO2