The ACM DEBS Grand Challenge is a yearly competition where the participants implement an event-based solution to solve a real-world high-volume streaming data problem.
This year’s grand challenge involves developing a solution to two real-world problems by analyzing a social-network graph that evolves over time. The data for the DEBS 2016 Grand Challenge has been generated using the Linked Data Benchmark Council (LDBC) social network data generator. The ranking of the solutions is carried out by measuring their performance using two metrics: (1) throughput and (2) average latency.
WSO2 has been submitting solutions to the grand challenge since 2013, and our previous solutions have consistently ranked among the top submissions. This year, too, we submitted a solution using WSO2 CEP/Siddhi. Based on its performance, this year’s solution has also been selected as one of the best solutions. As a result, we’ve been invited to submit a full paper to the DEBS 2016 conference, to be held from June 20 to June 24.
In this blog post I’ll present some details of the DEBS queries, a brief overview of our solution, and some performance results.
As pointed out earlier, DEBS 2016 involves developing an event-based solution for two real-world use cases.
The first problem (query) deals with the identification of posts that currently trigger the most activity in a social network. This query accepts two input streams, namely posts and comments.
Think of a Facebook post with comments. Our goal is to compute the top three active posts, where the total score of a post is the sum of its own score and the scores of its related comments. The initial score of a post is 10, and it decreases by 1 every 24 hours. Similarly, the initial score of a comment is also 10 and decreases by 1 in the same manner.
Note that the score of a post or comment cannot drop below zero; a post whose total score is greater than zero is defined as an active post.
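The scoring rule above can be sketched as follows. This is a minimal illustration in Python, not our actual Siddhi implementation; the function and variable names are our own for this example.

```python
from datetime import datetime, timedelta

INITIAL_SCORE = 10
DECAY_PERIOD = timedelta(hours=24)  # score drops by 1 every full 24 hours

def current_score(created_at, now):
    """Score of a single post or comment: starts at 10, loses 1 per
    elapsed 24-hour period, and never goes below zero."""
    elapsed_periods = (now - created_at) // DECAY_PERIOD
    return max(0, INITIAL_SCORE - elapsed_periods)

def total_post_score(post_time, comment_times, now):
    """Total score = the post's own score plus the scores of its comments."""
    return current_score(post_time, now) + sum(
        current_score(t, now) for t in comment_times)

def top_three_active(posts, now):
    """posts: dict post_id -> (post_time, [comment_times]).
    Returns the top three posts with a total score greater than zero."""
    scored = [(pid, total_post_score(pt, cts, now))
              for pid, (pt, cts) in posts.items()]
    active = [(pid, s) for pid, s in scored if s > 0]
    active.sort(key=lambda x: -x[1])
    return active[:3]
```

For example, a post created one day ago contributes 9, and a comment created twelve hours ago still contributes its full 10.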
The second query deals with the identification of large communities that are currently involved in a topic.
This query accepts three input streams: 1) comments, 2) likes, and 3) friendships.
The aim is to find the k comments with the largest range, where the comments were created more than d seconds ago. Range here is defined as the size of the largest connected component in the graph formed by the persons who have liked that comment and know each other.
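Computing the range of a comment therefore reduces to a largest-connected-component computation over the subgraph induced by that comment's likers. A minimal sketch (a plain BFS, not the optimized incremental structures of our actual solution; the names are illustrative):

```python
from collections import defaultdict, deque

def largest_component_size(likers, friendships):
    """Size of the largest connected component in the graph induced by
    `likers` (a set of person ids) under `friendships` (a list of id pairs)."""
    adj = defaultdict(set)
    for a, b in friendships:
        if a in likers and b in likers:  # only edges between likers matter
            adj[a].add(b)
            adj[b].add(a)
    seen, best = set(), 0
    for start in likers:
        if start in seen:
            continue
        # BFS over one component, counting its members
        queue, size = deque([start]), 0
        seen.add(start)
        while queue:
            node = queue.popleft()
            size += 1
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        best = max(best, size)
    return best
```

A comment's range is this value; the query then keeps the k comments with the largest ranges among those older than d seconds.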
The friendship stream plays an important role in this query, as it establishes the friendships between the users in the system. The following figures show the friendship graph when the system has received 10 and 100 friendship events, respectively.
Figure 1: Friendship Graph (Number of Events = 10)
Figure 2: Friendship Graph (Number of Events = 100)
Further analysis of the friendship graph indicates that its degree distribution is long-tailed (see Figure 3). This means that a very small number of users have a large number of friends, while a large number of users have only a few friends.
Figure 3: Degree Distribution of Friendship Graph
We implemented the solution using WSO2 CEP, as an extension to Siddhi. The solution is multi-threaded: it processes the two queries in parallel.
Each query is processed as a pipeline consisting of three phases: 1) data loading, 2) event ordering, and 3) processing. Each phase is handled by one or more threads. In the data-loading phase, the data streams are loaded from files (i.e., disk) and placed in separate buffers. Each event stream has its own buffer, which is implemented as a blocking queue.
The purpose of the event-ordering phase is to order events by timestamp before sending them to the event processor. (Events within a single event buffer are already ordered by timestamp; the ordering done in this phase ensures that the merged event stream sent to the event processor is also ordered by timestamp.) The core calculation modules of the queries are implemented in the processing thread.
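Since each per-stream buffer is already sorted, the event-ordering phase amounts to a k-way merge. A minimal heap-based sketch (again illustrative Python, with events modeled as `(timestamp, payload)` tuples; our actual implementation consumes the blocking queues directly):

```python
import heapq

def merge_by_timestamp(streams):
    """Merge several streams, each already sorted by timestamp, into one
    globally ordered stream. A min-heap holds the current head of each
    stream, so the oldest pending event is always emitted next."""
    heap = []
    iters = [iter(s) for s in streams]
    for i, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first[0], i, first))  # (ts, stream idx, event)
    while heap:
        ts, i, event = heapq.heappop(heap)
        yield event
        nxt = next(iters[i], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt[0], i, nxt))
```

The stream index in the heap tuple breaks timestamp ties deterministically, and only one head per stream is buffered at any time.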
The solution was tested on a four-core/8 GB virtual machine running Ubuntu Server 15.10. As discussed earlier, the two performance metrics used for evaluating the system are throughput and mean latency. The performance evaluation was carried out using two data sets of different sizes (see here and here).
The throughput and mean latency of query 1 for the small data set are 96,004 events/second and 6.11 ms, respectively. For the large data set, the throughput and mean latency of query 1 are 71,127 events/second and 13 ms.
The throughput and mean latency of query 2 for the small data set are 215,642 events/second and 0.38 ms, respectively. For the large data set, the throughput and mean latency of query 2 are 327,549 events/second and 0.73 ms.
A detailed description of the queries and the specific optimization techniques we used can be found in our paper titled Continuous Analytics on Graph Data Streams using WSO2 Complex Event Processor, which will be presented at DEBS 2016: the 10th ACM International Conference on Distributed and Event-Based Systems, June 2016.