[Article] Integrating Apache UIMA with WSO2 Complex Event Processor
By WSO2 Team
- 8 Mar, 2015
In this article we will focus on how to integrate the Apache UIMA framework with the WSO2 Complex Event Processor (CEP) in order to create a powerful system that extracts meaningful structured data from massive unstructured, dynamic data sources and processes them to provide meaningful results in real time.
Before going into the integration details, it is useful to have a basic understanding on how Apache UIMA framework works in order to identify the integration points with the WSO2 CEP.
|WSO2 CEP||4.0.0 and above (this version will be released soon)|
|Apache UIMA||2.5.0 and above|
|Apache OpenNLP||1.5.3 and above|
|Apache ActiveMQ||5.10 and above|
- Apache UIMA
- Apache ActiveMQ
- WSO2 Complex Event Processor (WSO2 CEP)
- Overview of System Components
- Descriptive View of Apache UIMA Pipeline
- Apache UIMA - WSO2 CEP Main Integration Points
- Configuring the Road.lk Traffic Analysis Sample
Apache UIMA is an Apache-licensed open source implementation of the UIMA specification; get the PDF or download the doc. Apache UIMA serves as a base for unstructured information management applications that analyze large volumes of unstructured information (text, images, videos etc.) in order to discover knowledge that is relevant to an end user. An example UIM application might extract peoples names, emails, addresses and other contact info from a large set of documents.
Apache UIMA provides a framework, components and infrastructure to develop UIM applications and provides capabilities to wrap components as network services, and can scale to very large volumes by replicating processing pipelines over a cluster of networked nodes.
The framework provides an easy-to-use pipeline in order to analyse input (unstructured) data known as the CPE (Collection Processing Engine). Key components of this pipeline can be identified as follows.
- Collection reader
- Analysis engine
- CAS Consumer
Within the CPE, the analysis engine together with the CAS Consumer creates a CAS processor.
CAS (Common Analysis Structure) is the data structure used by UIMA to store both the unstructured data and the extracted (structured) data. CAS stores the extracted data as types which are similar to objects. Types contain features that are similar to object’s attributes.
CASes are created in the collection reader component and fed through the pipeline. Each CAS processor is able to access the unstructured text within a CAS and process or consume extracted data within the CAS.
Interfaces to a collection of data items (e.g., documents) to be analyzed. Collection readers return CASes that contain the documents to be analyzed, possibly along with additional metadata.
It takes in a CAS, analyzes its contents, and produces an enriched CAS with extracted information (known as annotators). Analysis engines are composed of building blocks called annotators which contain the logic to extract the information. Analysis engines can be chained together to form aggregate analysis engines.
It consumes the enriched CAS that was produced by the sequence of analysis engine and produces an application-specific data structure such as a search engine index or simply writes to a database or a file.
For more information refer Collection Processing Engine Developer’s Guide
Apache ActiveMQ is an open source messaging and integration pattern server which represents message oriented middleware. This implements JMS, which is widely used in modern day applications in order to share messages among several application while maintaining a common format and semantic.
ActiveMQ is used as a message store for the collection reader in this instance. The collection reader dequeues and retrieves the messages periodically.
WSO2 Complex Event Processor (WSO2 CEP)
WSO2 CEP is a lightweight, easy-to-use, open source complex event processing server available under Apache Software License v2.0. WSO2 CEP identifies the most meaningful events within the event cloud, analyzes their impact and acts on them in real-time.
The event flow inside the CEP can be graphically illustrated as follows.
The input event adaptor can be identified as the initial point of contact for the outside UIMA CAS Consumer. Input event adaptors can be implemented using different event adaptor typesand the respective CAS Consumer should be used to publish events to the input event adaptor. Events then go through an assigned event builder, defined event streams, an execution plan, an event formatter and finally the output event adaptor.
An important part in the above shown flow takes place within the execution plan where Siddhi queries are executed. This governs the output of the events that pass through the execution plan. Siddhi Query Language (SiddhiQL) is designed to process streams and identify complex event occurrences. Siddhi queries describe how to combine existing event streams to create new event streams.
Overview of system components
In the implemented system a Twitter client streams/extracts tweets from public tweet timelines that share traffic information and publishes them to an ActiveMQ topic.
Collection reader which acts as a subscriber to the topic will consume the tweets published into ActiveMQ. Once a tweet is received the collection reader creates a CAS, sets the document test and passes it to the analysis engine.
This project uses an aggregate analysis engine which is composed of two annotators (primary analysis engines) to extract the location and traffic level respectively. Apache OpenNLP models were used in-order to extract traffic locations and respective traffic levels.
In the scenario depicted above, CAS consumers perform the operation of publishing events to the WSO2 CEP using different event adaptors available in the CEP (WSO2Event, SOAP, HTTP, MQTT and JMS are a few event adaptor types that are available in WSO2 CEP).
Descriptive view of Apache UIMA pipeline
Main points in Apache UIMA - WSO2 CEP integration
The integration of Apache UIMA framework with the WSO2 CEP has two key connection points;
- CAS Consumer publishing to WSO2 CEP input event adaptor
- Using WSO2 CEP to publish messages to ActiveMQ, to be consumed by the collection reader of the UIMA CPE.
Using the above two connection points it is possible to feed events into CEP from UIMA and vice versa. These two connection points, and components which communicate during event transfer can be identified as follows;
CAS Consumer publishing to WSO2 CEP input event adaptors
As shown in the above diagram, different CAS Consumers can be developed to publish events to WSO2 CEP.
CAS Consumer mainly consists of an ‘initialize’ method and a ‘process’ method. When the CPE is initialized, all the components in it, including the collection reader, analysis engine and CAS Consumer, gets initialized (one-time operation) and for each iteration of the analysis process, the process() method gets executed. Hence, for different event adaptor types, the initialize() method can be used to configure the connection and the process() method can be used to publish events accordingly.
- For a CEP WSO2Event Publisher:
initialize() method would include DataPublisher initialization and the data stream definition
process() method would be used to send events
Using WSO2 CEP to publish messages to ActiveMQ
WSO2 CEP can be used to publish messages (via a topic) to ActiveMQ, which will store published messages in a queue and later dequeue it (or for topics, send it to subscribers) using a client/subscriber.
In this project, this connection point is simulated using a simple ActiveMQ publisher, which is used in place of WSO2 CEP.
Configuring the Road.lk traffic analysis sample
The Apache UIMA - WSO2 CEP event transfer process can be tested using the Road.lk traffic analysis sample provided with this project.
This sample flow includes;
- Extracting the real-time/timeline Twitter feeds of road_lk from Twitter
- Publishing the Twitter feeds as messages to a topic in ActiveMQ
- Configuring the WSO2 CEP with relevant input/output event adaptors, stream definitions, event builders/formatters and execution plans
- Receiving messages from the subscribed topic in ActiveMQ
- Feeding the received messages into the Apache UIMA pipeline
- Publishing annotated CASes to WSO2 CEP
- Observing the web UI for traffic updates
- Install Apache ActiveMQ (Follow this guide)
- Setup Apache Ant version 1.7.0 or later
- Download CEP 4.0.0 pre release from here
- Download the complete sample pack with necessary libraries here
Executing the Sample
Extract the downloaded sample pack and copy the “3001” folder to <CEP_HOME>/samples/artifacts directory
Copy both twitter-UIMA and twitterClient-ActiveMQ folders found inside the “producers” folder of the extracted sample pack to <CEP_HOME>/samples/producers
Copy the above mentioned libraries to <CEP_HOME>/samples/libs directory. You can find them inside the “libs” folder of the extracted sample pack.
(Alternatively you could find all necessary libs inside the github repo provided for the sample)
Start the ActiveMQ server. Use the command given below to do this;
Deploy the GeoCode Siddhi Extension in WSO2 CEP. This can be done as follows;
Copy geocoder-java-0.16.jar and geocode-1.0.0.jar files found inside the sample pack to <CEP_HOME>/repository/components/lib folder
Add ‘org.wso2.siddhi.extension.geocode.GeocodeTransformer’ entry to the <CEP_HOME>/repository/conf/siddhi/siddhi.extension file
WSO2 CEP should be started in the required configurations, by running the provided package as a CEP sample.
Open a terminal from <CEP_HOME>/bin and type;
On Linux: ./wso2cep-samples.sh -sn 3001
On Windows: wso2cep-samples.bat -sn 3001
The execution plan for the above configuration should look like this once configured;
This configuration would include several input/output adaptors in order to demonstrate different input event types as well as different event presentation methodologies (eg: database representation, eeb UI, simple output streams etc.). All these adaptors can be tested using different consumers, available in UIMA-CEP-Sample. Follow the Readme file within the Consumers directory to enable different types of CAS Consumers.
Start TwitterCPE by typing ant within the TwitterCPE directory (if the ActiveMQ topic name is changed, make relevant changes)
Create a Twitter app, which will be used to extract tweets from the Twitter timeline. You can follow this link on how to set up a Twitter app.
You will need the consumer key, consumer secret, access token and access token secret credentials of the app in order to configure the twitterConfig.xml for the Twitter ActiveMQ client to extract/stream the tweets and publish them to the configured topic.
You have to choose between the ActiveMQ streaming publisher or ActiveMQ search publisher from twitterClient-ActiveMQ. You have to configure the Twitter app related settings by changing the Twitter app consumer key, consumer secret, access token, access token secret (use the credentials of the app created in Step 3) and the follower name (for the search) or Twitter user IDs (to filter the streaming) in the twitterConfig.xml file.
To run the selected publisher, type the following inside TwitterClient-ActiveMQ directory;
- To run the client in the streaming API mode (tweets will be received in real-time)
- ant -DjmsUrl=tcp://localhost:61616 -DtopicName=Feed -Dstreaming=true
- ant -Dstreaming=true (will take above url as the default url and above topic name as the default topic name)
- To run the client in the extractor mode (past tweets will be extracted from the timeline)
- ant -DjmsUrl=tcp://localhost:61616 -DtopicName=Feed
- jmsUrl defines the ActiveMQ connection URL (if the argument is not set, default is set to tcp://localhost:61616)
- topicName defines the ActiveMQ queue name to which the messages are enqueued. (if the argument is not set, default is set to Feed)
- streaming toggles between the two types of tweet publishers (if the argument is not set, default is set to false)
By default the topic name should be "Feed". If you define another name, then the relevant change should be done in the descriptor file in TwitterCPE/descriptors/readers/TwitterActiveMQReader.xml
Extracted traffic information will be displayed on the web UI.
You can access the UI by typing in localhost:9763/TrafficAnalyzer in your browser
Note: If you choose to run the client in Streaming mode, please note that tweets will be received in real-time, and you will only see an output if a certain tweet is received at the time you run the client (from the filtered Twitter IDs in the twitterConfig.xml file).
Therefore you have the option of waiting for a tweet to be posted or adding your twitterID to the twitterConfig.xml file and posting some tweets yourself. In order to find your twitterID from your username you can use this link or any other online website.
If you would like to know more details about our project like how we developed the UIMA Collection Processing Engine to extract the traffic data, how we connected the UIMA pipeline and WSO2 CEP and the corresponding source codes please refer to the blog links we have listed under the additional resources section below.
Apache UIMA Framework provides a convenient path to develop analysis engines which can be used to extract useful information as shown in the WSO2 CEP sample. The pluggable nature of UIMA allows users to plug in different components such as OpenNLP; which makes analysis more meaningful as well as more convenient. All the extracted information can be further processed using the WSO2 CEP to make them more presentable with a low latency output.
- Creating a Twitter client and using it with Twitter4j
- Introduction to Apache UIMA
- Hands on development - UIMA Pipeline
- Using OpenNLP with UIMA
- Using Apache OpenNLP Document Categorizer
- Writing CAS Consumers to send events to WSO2 CEP
Contributors to this article
- Farasath Ahamed, Intern, WSO2 Inc.
- Achintha Reemal, Intern, WSO2 Inc.
- WSO2 Team