[Article] Natural Language Processing with WSO2 CEP

  • By Malithi Edirisinghe
  • 11 Feb, 2015

Applies to

WSO2 CEP 4.0.0 (this version will be released soon. WSO2 CEP 3.1.0 is available here)

Introduction

Natural Language Processing (NLP) relates to a system’s ability to process and understand natural language input. With NLP, we can write computer programs that extract information and relationships, and derive meaning, from natural languages such as English. WSO2 CEP enables users to perform such NLP operations on English language input through a set of Siddhi extensions based on the Stanford Natural Language Processing Library. In this article, we will discuss the Stanford NLP library and the six Siddhi NLP query operations, with examples.

Stanford Natural Language Processing Library

The Stanford Natural Language Processing Library includes a set of statistical NLP toolkits for various computational linguistics problems. The full set of Stanford library distributions can be found at [1]. Among these, the WSO2 CEP Siddhi extension set of NLP operations uses the following toolkits from the Stanford NLP Library distribution.

Stanford CoreNLP

This distribution contains an integrated suite of natural language processing tools, such as tokenization, part-of-speech tagging, named entity recognition and parsing. With it, we can identify the base forms of words and their parts of speech, determine whether words are names of companies or people, and mark up the structure of sentences.

This distribution helps to define a pipeline that enables the above tools with a simple configuration as shown below.

  
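import java.util.Properties;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

// Creates a StanfordCoreNLP object with tokenization, sentence splitting,
// POS tagging, lemmatization, NER and parsing
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);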

As you can see, a property named annotators is defined, with a comma-separated list as its value. Each name in this list maps to an annotator that performs one NLP processing step. Here, tokenize refers to the PTBTokenizerAnnotator, which tokenizes the input text; ssplit refers to the WordsToSentencesAnnotator, which splits a sequence of tokens into sentences; pos refers to the POSTaggerAnnotator, which labels each token with a POS tag; lemma refers to the MorphaAnnotator, which generates word lemmas for all tokens in the corpus; ner refers to the NERClassifierCombiner, which recognizes named entities and labels them as PERSON, LOCATION, ORGANIZATION, MONEY, etc.; and parse refers to the ParserAnnotator, which provides the syntactic analysis of each sentence, including its dependency representation. Passing this comma-separated value aggregates all these annotators into an annotator pipeline in the order defined.

When a query is initialized, each NLP Siddhi query operation instantiates a Stanford pipeline that enables tokenization, sentence splitting, POS tagging, lemmatization, named entity recognition and parsing. That pipeline is then used to process each raw language input received with events. For more information on the Stanford CoreNLP library, refer to [2].

Stanford TokensRegex

This framework enables defining patterns over text. Rather than working at the character level as standard regular expression packages do, it describes text as a sequence of tokens (words, punctuation marks, etc.), which may carry additional attributes such as POS tags, and supports writing patterns over those tokens. The syntax is very similar to that of Java regular expressions; the main difference is the syntax used to match tokens.

In TokensRegex syntax, each token is represented by [<expression>], where <expression> specifies the token attributes. We can define several attributes as

 {<attr1>; <attr2> ...}

Here, each <attr> consists of a <name> <matchfunc> <value>.

Below you can see some basic examples of expressions that we can use with TokensRegex.

We can write simple expressions as well as compound expressions using !, & and |. Further, we can combine tokens using syntax similar to that of Java regular expressions, such as quantifiers, grouping and capturing.

The table below shows a summary of syntax that can be used.

So with these capabilities, we can write more advanced regular expressions with TokensRegex to extract patterns from NLP processed text.

Below is an example of such a regular expression, which we have used in our samples to extract phrases from a Twitter feed.

([ner:/PERSON|ORGANIZATION|LOCATION/]+) (?:[]* [lemma:donate]) ([ner:MONEY]+)

This expression checks for one or more tokens with the named entity tag PERSON, ORGANIZATION or LOCATION, followed by zero or more arbitrary tokens, followed by a token whose lemma is donate. After that, there must be one or more tokens with the named entity tag MONEY. There are three groups here, but the middle one is marked as a non-capturing group. Therefore, with this expression we can extract any phrase matching the full expression, as well as the phrases matching the two capturing groups within that matched phrase.

E.g. Tweet: Bill Gates donates $31 million to fight Ebola http://t.co/Lw8iJUKlmw

Match for the full regex: Bill Gates donates $31 million

Match for first group: Bill Gates

Match for second group: $31 million

The NLP Siddhi extension in CEP exposes this capability via the findTokensRegexPattern function, so a user can use the TokensRegex pattern language to extract phrases matching a given regular expression. We will discuss this query function later. For more information on TokensRegex, refer to [3].
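To make the group semantics of the donate pattern concrete, here is a minimal sketch in plain Java. It is a stand-in, not the TokensRegex API: each token is encoded as one character (E for PERSON/ORGANIZATION/LOCATION, D for the lemma donate, M for MONEY, O for anything else) and java.util.regex is run over the encoded string. The class name, the encoding and the hand-assigned tags and lemmas are all assumptions made for illustration; in the real extension these annotations come from the Stanford pipeline.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in for token-level matching; this is NOT the TokensRegex API.
class TokenPatternSketch {

    /** Maps one hand-annotated token to a single character class. */
    static char encode(String ner, String lemma) {
        if (ner.equals("PERSON") || ner.equals("ORGANIZATION") || ner.equals("LOCATION")) {
            return 'E';
        }
        if (ner.equals("MONEY")) {
            return 'M';
        }
        return lemma.equals("donate") ? 'D' : 'O';
    }

    /** Joins words[from..to) with single spaces. */
    static String join(String[] words, int from, int to) {
        return String.join(" ", Arrays.copyOfRange(words, from, to));
    }

    /** Returns {full match, first group, second group}, or null when nothing matches. */
    static String[] match(String[] words, String[] ner, String[] lemmas) {
        StringBuilder codes = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            codes.append(encode(ner[i], lemmas[i]));
        }
        // One character per token, so character offsets equal token offsets.
        // Mirrors ([ner:/PERSON|ORGANIZATION|LOCATION/]+) (?:[]* [lemma:donate]) ([ner:MONEY]+)
        Matcher m = Pattern.compile("(E+)(?:.*D)(M+)").matcher(codes);
        if (!m.find()) {
            return null;
        }
        return new String[] {
            join(words, m.start(), m.end()),
            join(words, m.start(1), m.end(1)),
            join(words, m.start(2), m.end(2))
        };
    }

    public static void main(String[] args) {
        // Tokens, tags and lemmas are written by hand for this example.
        String[] words  = {"Bill", "Gates", "donates", "$", "31", "million", "to", "fight", "Ebola"};
        String[] ner    = {"PERSON", "PERSON", "O", "MONEY", "MONEY", "MONEY", "O", "O", "O"};
        String[] lemmas = {"Bill", "Gates", "donate", "$", "31", "million", "to", "fight", "Ebola"};
        for (String part : match(words, ner, lemmas)) {
            System.out.println(part);
        }
    }
}
```

The middle group is non-capturing in both the original pattern and this sketch, which is why only two groups are reported in addition to the full match.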

Stanford Semgrex

This framework enables defining patterns over the Stanford NLP semantic graph structure. It uses regular-expression-like patterns to match graph nodes, their attributes, and the grammatical relations identified by the Stanford dependencies utility. The parsed semantic graph is a directed graph in which nodes represent the words and each edge represents the grammatical relationship between two nodes.
For example, let’s consider the sentence below.

My brother and I went to the market.

The Stanford parser will parse this sentence and will build the following semantic graph.

Each grammatical relation holds between a governor and a dependent. For example, if we consider the prep_to relationship in the graph for this sentence, the edge is directed from the node went to the node market. Thus, went is the governor and market is the dependent of prep_to. There are about 50 grammatical relationships defined in Stanford dependencies. For more information on Stanford dependencies, refer to their manual at [4].

With the Semgrex utility, we can identify patterns in the semantic graph structure described above. Semgrex patterns are composed of nodes and the relations between them. Here, a node is represented as {<expression>}, where <expression> specifies the attributes. We can define several attributes as {<attr1>; <attr2> ... }

The syntax for defining attributes is the same as in TokensRegex, described before.
As in TokensRegex, regular expressions are written within /<regex>/
Below you can see some examples of defining nodes in Semgrex.

Expression                           Description
{}                                   any node in the graph
{$}                                  any root in the graph
{#}                                  empty word
!{lemma:red}                         any word that isn’t “red”
({lemma:locate} | {ner:LOCATION})    a node that is either a word with the lemma locate or a word with the named entity tag LOCATION; here ( ) groups the expression

In Semgrex, we can specify the relationships between nodes as below.


Syntax        Description
A <reln B     A is the dependent of B on relation reln
A >reln B     A is the governor of B on relation reln
A <<reln B    Some node on the dependent-to-governor chain from A to B is a dependent of B on relation reln
A >>reln B    Some node on the governor-to-dependent chain from A to B is a governor of B on relation reln
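The relation operators can be pictured with a toy dependency graph in plain Java. This is only a sketch under stated assumptions, not the Semgrex API: the class and method names are hypothetical, and apart from the prep_to edge from went to market, which the article mentions, the edges are written by hand rather than produced by the Stanford parser.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy dependency graph; this is NOT the Semgrex API.
class RelationSketch {

    /** A directed, labelled edge: governor --reln--> dependent. */
    static final class Edge {
        final String governor;
        final String reln;
        final String dependent;

        Edge(String governor, String reln, String dependent) {
            this.governor = governor;
            this.reln = reln;
            this.dependent = dependent;
        }
    }

    /** A >reln B: A directly governs B on relation reln. */
    static boolean governs(List<Edge> graph, String a, String reln, String b) {
        for (Edge e : graph) {
            if (e.governor.equals(a) && e.reln.equals(reln) && e.dependent.equals(b)) {
                return true;
            }
        }
        return false;
    }

    /** A <reln B: A is the dependent of B on relation reln. */
    static boolean dependsOn(List<Edge> graph, String a, String reln, String b) {
        return governs(graph, b, reln, a);
    }

    /** A >>reln B: some node on the governor-to-dependent chain from A governs B on reln. */
    static boolean governsInChain(List<Edge> graph, String a, String reln, String b) {
        return !a.equals(b) && search(graph, a, reln, b, new HashSet<String>());
    }

    private static boolean search(List<Edge> graph, String node, String reln,
                                  String target, Set<String> seen) {
        if (!seen.add(node)) {
            return false; // already visited
        }
        for (Edge e : graph) {
            if (!e.governor.equals(node)) {
                continue;
            }
            if (e.reln.equals(reln) && e.dependent.equals(target)) {
                return true;
            }
            if (search(graph, e.dependent, reln, target, seen)) {
                return true;
            }
        }
        return false;
    }

    /** Hand-written edges for "My brother and I went to the market." */
    static List<Edge> sampleGraph() {
        return Arrays.asList(
            new Edge("went", "nsubj", "brother"),
            new Edge("went", "prep_to", "market"),
            new Edge("market", "det", "the"),
            new Edge("brother", "poss", "My"));
    }

    public static void main(String[] args) {
        List<Edge> g = sampleGraph();
        System.out.println(governs(g, "went", "prep_to", "market"));   // went >prep_to market
        System.out.println(dependsOn(g, "market", "prep_to", "went")); // market <prep_to went
        System.out.println(governsInChain(g, "went", "det", "the"));   // went >>det the
    }
}
```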

With the nodes and relations described above, we can build complex expressions.

ex 1:
{} >nsubj {} >dobj {}

Here, each relation is matched relative to the first node specified in the expression. Therefore, this expression means a node that is the governor of both an nsubj relation and a dobj relation. It is equivalent to
{} >nsubj {} & >dobj {}

ex 2:
{} <agent {} | <nsubj {}

This means a node that is a dependent on either an agent relation or an nsubj relation in the graph.

Further, we can group relations with ( ) just like grouping nodes.
ex 3:
{tag:nn} (<prep_in {} | <prep_on {})

This means a node that is a noun and is a dependent on either a prep_in relation or a prep_on relation.

Semgrex also provides the ability to name relations and nodes, so that they can be extracted by name or referred to elsewhere in the same expression. Both relations and nodes are named by adding =<tname> after the relation or node.
ex 1:
{} <nsubj {}=verb

Here, the pattern means a node that is a dependent on the relation nsubj, i.e. the subject of some node. Since we have named the governor node “verb”, the pattern stores the governor node as well, and we can retrieve it via the name “verb”. We can name relations in the same way.

ex 2:
{} >/.*subj|agent/=reln {}

This pattern returns a node that is a governor on either a subject or an agent relation. Since we have named the relation “reln”, we can retrieve the matched relation by that name.

Below you can see a sample Semgrex pattern that we have used in our samples.
{lemma:die} >/.*subj|num.*/=reln {}=diedsubject

This expression looks for nodes whose lemma is die (such as dies or died) that govern some other node, named diedsubject, on relations like nsubj, csubj, xsubj, num or number. So diedsubject simply represents the subject that has died; it may be a number or some noun acting as the subject of the sentence.

ex:
Tweet : Over 150 nurses and healthcare workers have died doing their job #ebola @nswnma @GlobalNursesU

Here we get two matches for diedsubject, so matching the above expression gives the following.

Match for the full expression: died diedsubject: nurses
Match for the full expression: died diedsubject: workers
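The way a named node and a named relation are collected per match can be sketched in plain Java. Again, this is an illustration and not the Semgrex API: the class and method names are hypothetical, and the edges are written by hand to be consistent with the two matches shown above, not taken from actual parser output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustrates {lemma:die} >/.*subj|num.*/=reln {}=diedsubject on a hand-written graph.
class DiedSubjectSketch {

    /** governor (identified by its lemma) --reln--> dependent word. */
    static final class Edge {
        final String governorLemma;
        final String reln;
        final String dependent;

        Edge(String governorLemma, String reln, String dependent) {
            this.governorLemma = governorLemma;
            this.reln = reln;
            this.dependent = dependent;
        }
    }

    /**
     * For every edge whose governor has the given lemma and whose relation
     * fully matches relnRegex, returns "reln:dependent".
     */
    static List<String> match(List<Edge> edges, String lemma, String relnRegex) {
        Pattern relation = Pattern.compile(relnRegex);
        List<String> results = new ArrayList<String>();
        for (Edge e : edges) {
            if (e.governorLemma.equals(lemma) && relation.matcher(e.reln).matches()) {
                results.add(e.reln + ":" + e.dependent);
            }
        }
        return results;
    }

    /** Hand-written edges for the tweet about nurses and healthcare workers. */
    static List<Edge> sampleEdges() {
        return Arrays.asList(
            new Edge("die", "nsubj", "nurses"),
            new Edge("die", "nsubj", "workers"),
            new Edge("die", "aux", "have"),
            new Edge("nurse", "num", "150"),
            new Edge("worker", "nn", "healthcare"));
    }

    public static void main(String[] args) {
        // Each entry pairs the named relation (reln) with the named node (diedsubject).
        System.out.println(match(sampleEdges(), "die", ".*subj|num.*"));
    }
}
```

Note that the num edge is ignored here because its governor is nurse, not die; the pattern only inspects edges that leave the node matching {lemma:die}.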

The NLP Siddhi extension in CEP exposes the Stanford Semgrex utility via three functions: findRelationshipByRegex, findRelationshipByVerb and findSemgrexPattern. The findSemgrexPattern function exposes the Semgrex utility directly, so that we can define any Semgrex pattern and use it to find matching events, whereas findRelationshipByRegex and findRelationshipByVerb wrap that capability to make things simpler for the user. We will discuss these extensions in detail later.

NLP Functions

We will now discuss the NLP query functions supported in Siddhi and look at some samples of how to use them. Six NLP operations are supported.

  • findNameEntityType(entityType:string, groupSuccessiveEntities:boolean, text:string) Extracts nouns in the text that match a predefined entity type, such as PERSON, LOCATION or DATE.
  • findNameEntityTypeViaDictionary(entityType:string, dictionaryFilePath:string, text:string) Extracts all matches in the text for the entries defined in the dictionary XML file under the given entity type.
  • findRelationshipByRegex(regex:string, text:string) Extracts the (subject, object, verb) relationship from the text that matches the given regular expression.
  • findRelationshipByVerb(verb:string, text:string) Extracts the (subject, object, verb) relationship from the text that matches any form of the verb.
  • findTokensRegexPattern(regex, text) Extracts phrases that match the given NLP regular expression pattern.
  • findSemgrexPattern(regex, text) Extracts words that match the given grammatical relationship regular expression pattern.

Each of these query functions wraps the core Stanford library capabilities that we discussed earlier. Let’s look at each function in detail.

Setting Up

First, check whether you have installed the WSO2 Carbon GPL - Siddhi NLP Extension Feature. You can check this by logging in to the management console and looking under Installed Features at Home > Configure > Features. If it is not available, you can build the extension from the source. To do so, first build the orbit at [5] with Maven and copy the artifact stanford-nlp-3.4.0-wso2v1.jar to /repository/components/dropins. Then build [6] with Maven and copy the artifact nlp-2.2.0-SNAPSHOT.jar at /siddhi-extensions/nlp/ to /repository/components/dropins.

If you face any problem building the targets, use the attached CEP server, stanford-nlp-3.4.0-wso2v1.jar and nlp-2.2.0-SNAPSHOT.jar.

Download the attached zip that bundles the six samples. Extract the zip and copy the sample folders ‘1101’ to ‘1106’ to /samples/artifacts. Then copy the dictionary.xml in the resource folder to /samples/resource/dictionary.xml. Finally, update the /repository/conf/siddhi/siddhi.extension file by adding the following.

org.wso2.siddhi.gpl.extension.nlp.NameEntityTypeTransformProcessor

org.wso2.siddhi.gpl.extension.nlp.NameEntityTypeViaDictionaryTransformProcessor

org.wso2.siddhi.gpl.extension.nlp.RelationshipByRegexTransformProcessor

org.wso2.siddhi.gpl.extension.nlp.RelationshipByVerbTransformProcessor

org.wso2.siddhi.gpl.extension.nlp.SemgrexPatternTransformProcessor

org.wso2.siddhi.gpl.extension.nlp.TokensRegexPatternTransformProcessor

In order to run each sample copied, go to /bin directory, and enter

sh wso2cep-samples.sh -sn 110X to start the CEP server.

ex:

sh wso2cep-samples.sh -sn 1101

Each sample artifact will load the corresponding execution plan and a sample CSV file that contains a set of tweets obtained from a Twitter stream on Ebola. Each entry in the sample CSV file is in the format below, where the separator “~” separates the event attributes.

<username>~<text>

ex:

encomium~Patrick sSawyer’s chain of Ebola victims

Playing this file generates one event for each entry in the file. Each generated event has two attributes: the Twitter username and the tweet of that user.

You can simply play this file using the Event Simulator at Tools > Event Simulator and see the tweet and the result that gets generated.
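As an illustration of the entry format, the following minimal Java sketch splits one line of the sample file into its two attributes. The class and field names are hypothetical and are not part of WSO2 CEP; the Event Simulator does this parsing for you.

```java
// Hypothetical helper, not part of WSO2 CEP: splits one line of the sample file.
class TweetLineParser {

    /** Holds the two event attributes: the username and the tweet text. */
    static final class TweetEvent {
        final String username;
        final String text;

        TweetEvent(String username, String text) {
            this.username = username;
            this.text = text;
        }
    }

    /** Splits on the first "~" only, so a "~" inside the tweet stays in the text. */
    static TweetEvent parse(String line) {
        String[] parts = line.split("~", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected two attributes separated by '~': " + line);
        }
        return new TweetEvent(parts[0], parts[1]);
    }

    public static void main(String[] args) {
        TweetEvent event =
            parse("going info~Trail Blazers owner Paul Allen donates $9 million to fight Ebola");
        System.out.println(event.username + " -> " + event.text);
    }
}
```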

findNameEntityType

The findNameEntityType function takes a user-given string constant as the entity type, a user-given boolean constant that controls whether successive matches of that entity type are grouped, and a text stream. It returns the noun(s) in the text stream that are of the given entity type. The entity type must be one of PERSON, LOCATION, ORGANIZATION, MONEY, PERCENT, DATE or TIME. If grouping of successive matches is set to true, the result aggregates successive words of the same entity type.

E.g.

Bill Gates donates £31million to fight Ebola

If we pass true for the group successive matches parameter with the entity type PERSON, we get only one event, and the output contains “Bill Gates” as the match. If we pass false, we get two events: one for the word “Bill” and the other for the word “Gates”.
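The effect of the grouping flag can be sketched in plain Java as follows. This is only an illustration of the behaviour described above, not the extension’s code: the class and method names are hypothetical, and the entity tags are assigned by hand instead of coming from the Stanford NER annotator.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: the tags are supplied by hand, not produced by Stanford NER.
class EntityGrouping {

    /**
     * Returns the words tagged with entityType. When groupSuccessive is true,
     * adjacent words carrying that tag are joined into one match.
     */
    static List<String> findEntities(String[] words, String[] tags,
                                     String entityType, boolean groupSuccessive) {
        List<String> matches = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            boolean isMatch = entityType.equals(tags[i]);
            if (isMatch && groupSuccessive) {
                // Extend the run of successive matches.
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(words[i]);
            } else if (isMatch) {
                matches.add(words[i]);
            } else if (current.length() > 0) {
                // A non-matching word ends the current run.
                matches.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            matches.add(current.toString());
        }
        return matches;
    }

    public static void main(String[] args) {
        // \u00a3 is the pound sign, written as an escape to avoid source-encoding issues.
        String[] words = {"Bill", "Gates", "donates", "\u00a331million", "to", "fight", "Ebola"};
        String[] tags  = {"PERSON", "PERSON", "O", "MONEY", "O", "O", "O"};
        System.out.println(findEntities(words, tags, "PERSON", true));  // one grouped match
        System.out.println(findEntities(words, tags, "PERSON", false)); // one match per word
    }
}
```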

You can try this operation with the sample 1101. Start the server with the sample artifact as described above. In Main > Manage > Event Processor > Execution Plans you will see an execution plan as NlpFindNameEntityTypeExecutionPlan. This execution plan will contain the below query.

  
from InEventStream #transform.nlp:findNameEntityType("PERSON",true,text) 
select *
insert into OutEventStream ;

Here the input event stream named as InEventStream will stream the events that you generate by playing the sample csv file uploaded, via the Event Simulator. The output event stream named as OutEventStream will grab the output events generated from the function. These events are then logged in the console since the output event stream is associated with a LoggerOutputAdaptor.

Input from File:

  
encomium~Patrick sSawyer’s chain of Ebola victims
Atlanta News~Obama to visit Atlanta Tuesday for Ebola update: President Barack Obama is scheduled Tuesday to...  #Atlanta #GA
Professeur Jamelski~RT @DailyMirror: Bill Gates donates £31million to fight Ebola
#BRINGBACKOURGIRLS#~RT @metronaija: #Nigeria #News Ebola Survivor Dr. Kent Brantly Donates Blood to Treat Another Infected Doctor
going info~Trail Blazers owner Paul Allen donates $9 million to fight Ebola


Output:

So here you can clearly see the events generated. The attribute match gives the NLP operation result along with the input stream attributes username and text. The function checks for words of entity type “PERSON” and for each found word generates an output event. Since we have passed true for the parameter group successive matches, the function has combined successive words for the given entity type. So it has generated only a single event for names like “Bill Gates”, “Kent Brantly” and “Paul Allen”.

findNameEntityTypeViaDictionary

The findNameEntityTypeViaDictionary function takes a user-given string constant as the entity type, a user-given dictionary file path and a text stream. It returns the word(s) or phrase(s) in the text stream that match the dictionary entries for the given entity type. The entity types are the same as those described for findNameEntityType.

There is a defined format for the dictionary file as below, and you can specify your own as required.

Dictionary Definition:

  

    
        Bill
        Addison
    
    
        Mississippi
        Independence Square
    
    
        WSO2
    


Here, the element that carries the ‘id’ attribute reflects the entity type. The value of the ‘id’ attribute can only be one of the following; note that this value is case sensitive.

PERSON, LOCATION, ORGANIZATION, MONEY, PERCENT, DATE, TIME

Element <entry> can contain any text that you want to match or identify as a particular entity.
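A dictionary lookup of this kind can be sketched in plain Java with the JDK’s XML parser. This is a simplified illustration, not the extension’s implementation: it relies only on the <entry> element and the ‘id’ attribute mentioned above, uses a plain substring test for matching, and the element names dictionary and entity in the example XML are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Simplified sketch, not the extension's implementation.
class DictionaryLookup {

    /** Collects the text of every <entry> whose parent element has the given id. */
    static List<String> entriesFor(String xml, String entityType) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse dictionary XML", e);
        }
        List<String> entries = new ArrayList<String>();
        NodeList nodes = doc.getElementsByTagName("entry");
        for (int i = 0; i < nodes.getLength(); i++) {
            Node parent = nodes.item(i).getParentNode();
            // String.equals makes the id comparison case sensitive.
            if (parent instanceof Element
                    && entityType.equals(((Element) parent).getAttribute("id"))) {
                entries.add(nodes.item(i).getTextContent().trim());
            }
        }
        return entries;
    }

    /** Returns the dictionary entries that occur in the text (plain substring test). */
    static List<String> findMatches(String text, List<String> entries) {
        List<String> found = new ArrayList<String>();
        for (String entry : entries) {
            if (text.contains(entry)) {
                found.add(entry);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // "dictionary" and "entity" are assumed element names for this example.
        String xml = "<dictionary>"
                + "<entity id=\"PERSON\"><entry>Bill Gates</entry><entry>Obama</entry></entity>"
                + "<entity id=\"LOCATION\"><entry>Atlanta</entry></entity>"
                + "</dictionary>";
        List<String> people = entriesFor(xml, "PERSON");
        System.out.println(findMatches("Obama to visit Atlanta Tuesday for Ebola update", people));
    }
}
```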

You can try this operation with the sample 1102. First make sure that you have copied the dictionary.xml in resource folder of the attached zip to /samples/resource/ folder. This file will define some entity types.

  

    
        Obama
        Kent Brantly
        Bill Gates
        Paul Allen
    
    
        Africa
        Atlanta
        Mississippi
        Morocco
        Zimbabwean
    
    
        yesterday
        Tuesday
        Today
        September
        Friday
    
    
        million
        LKR
        USD
    


Start the server with the sample artifact.

In Main > Manage > Event Processor > Execution Plans you will see an execution plan as

NlpFindNameEntityTypeViaDictionaryExecutionPlan.

This execution plan will contain the below query.

  
from InEventStream #transform.nlp:findNameEntityTypeViaDictionary("PERSON","samples/resource/dictionary.xml",text) 
select *
insert into OutEventStream;

Play the uploaded sample file via the Event Simulator to generate events. You will see the output in the console.

Input from File:

  
encomium~Patrick sSawyer’s chain of Ebola victims
Atlanta News~Obama to visit Atlanta Tuesday for Ebola update: President Barack Obama is scheduled Tuesday to...  #Atlanta #GA
Professeur Jamelski~RT @DailyMirror: Bill Gates donates £31million to fight Ebola
#BRINGBACKOURGIRLS#~RT @metronaija: #Nigeria #News Ebola Survivor Dr. Kent Brantly Donates Blood to Treat Another Infected Doctor
going info~Trail Blazers owner Paul Allen donates $9 million to fight Ebola

Output:

In the query we have defined the entity type as ‘PERSON’. So this will match the entries that we have defined in the dictionary under ‘PERSON’ with the input event texts. So in this case it should match Obama, Kent Brantly, Bill Gates and Paul Allen. Check the output. You can see that events are generated for above matches.

findRelationshipByVerb

The findRelationshipByVerb function takes a user-given string constant as a verb, and a text stream. It returns the subject, object, verb relationships that can be extracted from the text stream for any form of that verb. Internally, this query uses two Semgrex patterns to find relationships in both active and passive voice: one finds the subject, verb, object relationship where the subject may be optional, and the other finds it where the object may be optional.

You can try this with sample 1103. The execution plan will contain the below query.

  
from InEventStream #transform.nlp:findRelationshipByVerb('say', text) 
select * 
insert into OutEventStream;

This query looks for subject, verb, object relationships for any form of the word ‘say’ acting as a verb. Play the uploaded sample file via the Event Simulator to generate events. You will see the output in the console.

Input from File:

  
Democracy Now!~@Laurie_Garrett says the world response to Ebola outbreak is extremely slow & lacking.
Zul~No Ebola cases in the country, says Ministry of Health Malaysia
Mainstreamedia~Precaution taken though patient does not have all Ebola symptoms, says minister
Charlie Lima Whiskey~Not Ebola, ministry says of suspected case in Sarawak  via @sharethis
Bob Ottenhoff~Scientists say Ebola outbreak in West Africa likely to last 12 to 18 months more & could infect hundreds of thousands
TurnUp Africa~Information just reaching us says another Liberian With Ebola Arrested At Lagos Airport
_newsafrica~Sierra Leone Says Ebola Saps Revenue, Hampers Growth
susan schulman~An aid worker says #Ebola outbreak in Liberia demands global help
Naoko Aoki~Story says virologist was asked to return to Ebola area w/o pay. Hope I'm missing something  via @washingtonpost
Marc Antoine~U.S. scientists say Ebola epidemic will rage for another 12-18 months
UMI Wast~Massive global response needed to prevent Ebola infection, say experts

Output:

findRelationshipByRegex

The findRelationshipByRegex function takes a user-given regular expression that follows the Semgrex pattern syntax, and a text stream. It returns the subject, object and verb from the text stream that match the named nodes of the Semgrex pattern. We discussed how to name nodes when writing Semgrex patterns earlier. Here, we have to make sure that the regular expression contains three named nodes, named subject, verb and object.

ex:

 ‘{}=verb >/nsubj|agent/ {}=subject >/dobj/ {}=object’

You can try this operation with the sample 1104. The execution plan will contain the below query.

  
from InEventStream #transform.nlp:findRelationshipByRegex
('({} nsubj {}=subject)) >/nsubj|num/ {}=object', text) 
select * 
insert into OutEventStream;

Look at the regular expression passed to the query. There are three named nodes: verb, subject and object. This expression marks words whose lemma is donate as the verb, and the nouns that depend on this word through the nsubj relation as the subject. It then marks the nouns or numbers that depend on this group as the object.

Play the uploaded csv file from the EventSimulator to generate events and check the output.

Input from File:

  
Professeur Jamelski~Bill Gates donates $31million to fight Ebola
going info~Trail Blazers owner Paul Allen donates $9million to fight Ebola
WinBuzzer~Microsoft co-founder Paul Allen donates $9million to help fight Ebola in Africa
theSun~Malaysia to donate 20.9m medical gloves to five Ebola-hit countries
CIDI~@gatesfoundation donates $50M in support of #Ebola relief. Keep up the #SmartCompassion, @BillGates & @MelindaGates!
African Farm Network~Ellen Donates 100 Bags of Rice, U.S.$5,000 to JFK[Inquirer]In the wake of the deadly Ebola virus

Output:

findSemgrexPattern

The findSemgrexPattern function simply exposes the Stanford Semgrex utility. It takes a user-given regular expression that follows the Semgrex pattern syntax, and a text stream. It returns the word(s)/phrase(s) from the text stream that match the Semgrex pattern, along with the word(s)/relation(s) that match each named node and each named relation defined in the regular expression.

Try this operation with sample 1105. The execution plan will contain the below query.

  
from InEventStream #transform.nlp:findSemgrexPattern 
('{lemma:die} >/.*subj|num.*/=reln {}=diedsubject', text) 
select * 
insert into OutEventStream;

Look at the Semgrex pattern given. It looks for words whose lemma is die and which are governors on any subject or numeric relation. The dependent is marked as diedsubject and the relationship is marked as reln. Thus, the query returns an output stream containing the full match of this expression, i.e. the governing word whose lemma is die. In addition, it outputs the named node diedsubject and the named relation reln for each match it finds.

Play the sample csv from Event Simulator and check the output in console.

Input from File:

  
Leighton Early~4th Doctor Dies of Ebola in Sierra Leone
☺Brenda Muller~If the Ebola Virus Goes Airborne, 1.2 million Will Die Expert Predicts -  via
Berkley Bear~Sierra Leone doctor dies of Ebola after failed evacuation
Gillian Taylor~BritishRedCross: #Ebola: "The deputy matron has worked 93 days straight while 23 colleagues have died"
Takashi Katagiri~These scientists made huge discoveries about Ebola--but 5 died before the paper was published.
*Jessica ♔~Over 150 nurses and healthcare workers have died doing their job #ebola @nswnma @GlobalNursesU

Output:

findTokensRegexPattern

The findTokensRegexPattern function exposes the Stanford TokensRegex utility. It takes a user-given regular expression that follows the TokensRegex pattern syntax, and a text stream. It returns the word(s)/phrase(s) from the text stream that match the TokensRegex pattern, along with the word(s)/phrase(s) that match each group defined in the regular expression.

Try this with sample 1106. The execution plan will contain the below query.

  
from InEventStream #transform.nlp:findTokensRegexPattern ('([ner:/PERSON|ORGANIZATION|LOCATION/]+) (?:[]* [lemma:donate]) ([ner:MONEY]+)', text) 
select * 
insert into OutEventStream;

Look at the TokensRegex pattern passed to the query. It defines three groups, and the middle group is defined as a non-capturing group. The first group looks for one or more successive words that are entities of type PERSON, ORGANIZATION or LOCATION. The second group represents any number of words followed by a word whose lemma is donate, such as donates, donated or donating. The third group looks for one or more successive entities of type MONEY.

Play the sample csv file from the Event Simulator and check the output in console.

Input from File:

  
Professeur Jamelski~Bill Gates donates $31million to fight Ebola http://t.co/Lw8iJUKlmw http://t.co/wWVGNAvlkC
going info~Trail Blazers owner Paul Allen donates $9million to fight Ebola
WinBuzzer~Microsoft co-founder Paul Allen donates $9million to help fight Ebola in Africa
Lillie Lynch~Canada to donate $2.5M for protective equipment http://t.co/uvRcHSYY0e
Sark Crushing On Me~Bill Gates donate $50 million to fight ebola in west africa http://t.co/9nd3viiZbe

Output:

References

  1. http://nlp.stanford.edu/software/index.shtml
  2. http://nlp.stanford.edu/software/corenlp.shtml
  3. http://nlp.stanford.edu/software/tokensregex.shtml
  4. http://nlp.stanford.edu/software/stanford-dependencies.shtml
  5. https://github.com/wso2-gpl/orbit
  6. https://github.com/wso2-gpl/siddhi

Note:

  1. Download the sample setup zip here.
  2. Download CEP server here.
  3. Download stanford-nlp-3.4.0-wso2v1.jar here.
  4. Download nlp-2.2.0-SNAPSHOT.jar here.
