March 04, 2019

Conceptualizing the Knowledge Graph Construction Pipeline

The advent of the internet has enabled a vast number of content creators to generate information, and as a result, a massive amount of data is now present on the web. To provide useful insights, we need an efficient way to represent all this data. One such efficient knowledge representation method is the knowledge graph. In brief, a knowledge graph is a large network of interconnected data. Knowledge graphs are constructed from knowledge bases, which gather their information from free text on web pages, databases, and audio and video content. The basic pipeline of a knowledge graph’s construction process is shown in Figure 1.

Figure 1. The knowledge graph construction pipeline

Now, let’s go through the processes that take place within this pipeline in detail.

During the first phase of the pipeline, we identify facts from free text. We scour the internet and filter useful information by identifying entities, and the relationships those entities are involved in, from free text. This identification relies on natural language processing techniques such as named entity recognition, lemmatization, and stemming. The data extracted from free text in this first step may resemble the following statement.

“The Louvre is located in Paris”
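A minimal sketch of this extraction step, using a single hand-written pattern in place of a full NLP stack (the pattern, predicate name, and function are illustrative assumptions, not a real extraction system):

```python
import re

# One hand-written pattern standing in for NER, lemmatization, etc.
# It captures "X is located in Y" sentences as (subject, predicate, object).
PATTERN = re.compile(
    r"(?:The\s+)?(?P<subject>[A-Z][\w ]*?)\s+is located in\s+(?P<object>[A-Z][\w ]*)"
)

def extract_fact(sentence):
    """Return a (subject, predicate, object) tuple if the sentence matches, else None."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    return (match.group("subject").strip(), "isLocatedIn", match.group("object").strip())

print(extract_fact("The Louvre is located in Paris"))
# → ('Louvre', 'isLocatedIn', 'Paris')
```

Real systems generalize this with learned extractors rather than fixed patterns, but the output shape is the same: candidate facts ready to be normalized into triples.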

Proceeding to the second phase of the pipeline, these statements are generalized as triples within knowledge bases; the triples are then categorized under different ontologies using an ontology extraction process that can also harness natural language processing techniques. A triple is composed of a subject, a predicate, and an object: the subject and object are entities involved in a relationship defined by the predicate. Hence, the previous statement identified from free text breaks down into the following triple for the knowledge base.

Subject : Louvre

Predicate : is located

Object : Paris

So, within a knowledge base, the above relationship is stored as isLocated(Louvre, Paris). This is a single triple within a knowledge base. In practice, knowledge bases include millions of such triples, which we also term ‘facts’. These facts are grouped under ontologies in knowledge bases. An ontology is an identifying category for a particular domain of facts; hence, an ontology describes what sort of entities exist within that category. For example, if the ontology is ‘airport’, some of the entities that fall under this category may include ‘addison airport’, ‘charles de gaulle airport’, ‘mandelieu airport’, and so on.
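As a toy illustration of this storage model, the sketch below keeps facts as (subject, predicate, object) triples and maintains an ontology index grouping entities by category; the data structures and names are illustrative, not any real KB’s schema:

```python
from collections import defaultdict

# Facts are (subject, predicate, object) triples; the ontology index
# maps a category name to the set of entities classified under it.
facts = set()
ontology = defaultdict(set)

def add_fact(subject, predicate, obj):
    facts.add((subject, predicate, obj))

def classify(entity, category):
    ontology[category].add(entity)

add_fact("Louvre", "isLocated", "Paris")
classify("addison airport", "airport")
classify("charles de gaulle airport", "airport")
classify("mandelieu airport", "airport")

print(("Louvre", "isLocated", "Paris") in facts)  # membership check for a fact
print(sorted(ontology["airport"]))                # all entities under one ontology
```

Real knowledge bases add schemas, provenance, and indexes on top of this, but the triple-plus-ontology shape is the core.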

Knowledge bases can be either domain-specific or generic. Medical knowledge bases and academic research paper knowledge bases are examples of domain-specific knowledge bases. Generic knowledge bases, however, do not constrain their knowledge to a particular domain; they have a broader coverage of general worldly facts across multiple domains.

Before we move on to the final phase of the pipeline, which is the knowledge graph, refer to the table below for some characteristics of various knowledge bases, as comprehended from their original papers [1][2][3][4]. The table lists knowledge bases that have been of prime importance over the past decades.

Table 1. Knowledge bases and their characteristics

| Feature | DBpedia | Cyc | Freebase | Wikidata |
| --- | --- | --- | --- | --- |
| Data extraction | Extracts data from unstructured information on the semantic web, specifically Wikipedia infoboxes. | Extracts common-sense knowledge rather than directly searchable facts on the internet. | Modeled with the knowledge of its community members and Wikipedia. | Powers Wikipedia, Wikisource, Wikivoyage, and related Wikimedia content. |
| Fact storage format | Wikipedia articles are converted into structured content in the form of RDF triples. | Stored as triples with three major components: predicate, instance, and collection. | Stored as RDF triples. | Stored in an RDF format with XML properties. |
| Content language | Mainly in English; however, since it derives its data from Wikipedia, it holds links to local versions in various languages. | All in English, built by experts. | Mainly in English, but derives information from versions in various languages. | Since Wikimedia content exists in several languages, Wikidata is also not constrained to English. |
| Knowledge base update | A static knowledge base that has been updated once a year since its creation in 2007. | The facts are static and are not continuously or periodically updated. | Underwent continuous updates until it was deprecated in 2015. | Continuously updated as Wikimedia content is created on a daily basis. |
| Querying for information | Queried using SPARQL. | Proprietarily owned by Cycorp, so only a portion of the KB has been openly released for use as OpenCyc. | Queried using MQL (Metaweb Query Language). Before being deprecated, the facts in Freebase were accessible through an open API, searchable through the interface shown in Figure 2. | Queried using SPARQL or, mainly, through its API, the Wikibase API1. |
| Validity of extracted information | The quality of the extracted DBpedia facts depends on the quality of the Wikipedia content and the extraction methods applied. | The quality needs to be verified by validating the actual queries where the extracted facts are applicable. | Highly verifiable compared to the other KBs, as expert knowledge is an input in addition to automatically extracted data. | The verifiability of Wikidata depends on the Wikimedia content created by its contributors. |

1Wikibase API : https://en.wikipedia.org/w/api.php

Figure 2. The Freebase Topic Page, where a user adds a sibling property (Source: Bollacker et al. [3])

With regard to knowledge bases, let’s further examine the NELL knowledge base, since we’ll later use the way NELL handles its facts as a sample for the knowledge graph construction phase of the pipeline.

NELL

The Never-Ending Language Learner (NELL) is a project that was initiated at Carnegie Mellon University in 2010 [5]. It was modeled to bridge the gap between a learning system and actual human learning, based on the premise that continuous learning of facts shapes expertise. NELL has been continuously learning facts since 2010. This knowledge base primarily performs two tasks.

  1. Information extraction: scouring the semantic web to discover new facts, accumulating those facts, and continuously extending its knowledge base.
  2. Learning improvement: based on its previous experience in extracting information, NELL tries to improve its learning ability by returning to the pages from which it learned facts the previous day and searching for newer facts.

NELL’s facts follow an ontological classification based on either the entity or the relation. Entity-based ontological classification consists of subdomains of instances that could occur in a given domain, whereas relation-based ontological classification comprises subdomains of facts based on the relationship that connects the entity instances. The facts in NELL are in the form of triples (subject-predicate-object). For example,

Sample fact : “The Statue of liberty is located in New York”

As a triple, the above fact can be represented as locatedIn (statueOfLiberty, newYork) where,

Subject: statueOfLiberty

Predicate: locatedIn

Object: newYork

NELL’s facts are extracted using text context patterns, orthographic classifiers, URL-specific ML patterns, learned embeddings, image classifiers, and ontology extenders. Currently, NELL is constrained in that it cannot modify its defined learning process. If that process could be dynamically enhanced based on previous learning experiences, NELL could improve both the quality of its facts and the performance with which it accrues them.

Now, let’s move on to the final phase of the pipeline to see how the triples in knowledge bases are converted into a knowledge graph.

Knowledge Graphs

A knowledge graph is a large network of interconnected entities. The connections are created based on the triples from knowledge bases. The main intent of the knowledge graph is to identify the missing links between entities. In order to explicate this further, let’s consider the following sample relationships that we have gathered from the knowledge base.

Friends (Anne, Jane)

Friends (Jane, Jim)

LivesIn (Anne, Paris)

LivesIn (Jim, Brazil)

LivesIn (Jane, Brazil)

BornIn (Anne, Paris)

BornIn (Jim, Paris)

If we try to build a basic knowledge graph based on only the above relationships, we will be able to visualize the following graph.

Figure 3. A knowledge graph constructed only using the observed facts
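The graph above can be sketched with plain dictionaries: each node maps to its outgoing (predicate, neighbour) edges, and a membership check tells us which links are observed and which are absent (the data is exactly the fact list above; the helper function is an illustrative assumption):

```python
from collections import defaultdict

# Observed facts from the knowledge base, as listed above.
observed = [
    ("Friends", "Anne", "Jane"),
    ("Friends", "Jane", "Jim"),
    ("LivesIn", "Anne", "Paris"),
    ("LivesIn", "Jim", "Brazil"),
    ("LivesIn", "Jane", "Brazil"),
    ("BornIn", "Anne", "Paris"),
    ("BornIn", "Jim", "Paris"),
]

# Adjacency structure: node -> list of (predicate, neighbour) edges.
graph = defaultdict(list)
for predicate, subject, obj in observed:
    graph[subject].append((predicate, obj))

def has_fact(predicate, subject, obj):
    """True only if the fact was explicitly observed."""
    return (predicate, obj) in graph[subject]

print(has_fact("LivesIn", "Jane", "Brazil"))          # observed: True
print(has_fact("Friends", "Anne", "Jim"))             # not observed: False
print(any(p == "BornIn" for p, _ in graph["Jane"]))   # Jane has no birthplace: False
```

The two `False` checks correspond to links the observed facts alone cannot answer, which is exactly where link prediction comes in.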

On the other hand, there are some unknown relationships that were not explicitly retrieved from the knowledge bases, such as,

  • Are Anne and Jim friends?
  • What is Jane’s birthplace?

This means that such relationships can be considered as missing links.

Figure 4. The missing links in the knowledge graph

These missing links are inferred using statistical relational learning (SRL) frameworks, which compute the relational confidence of an inferred/predicted link. Previous works have attempted, in different ways, to discover new or missing information and to compute the confidence in inferring it. These are discussed briefly in the following paragraphs.

In the first phase of the pipeline, where we extract facts from free text, we often end up with erroneous facts as well. To identify a stable knowledge graph from these facts, Cohen et al. proposed a methodology to jointly evaluate the extracted facts [6]. The issue with this method was that it considered only a trivial set of the possible errors that could occur in extracted facts.

In the second phase of the pipeline, we form triples from the extracted facts, and these triples make up the knowledge base. Subsequently, during the final phase, we need to discover new facts by inferring missing links from the knowledge base triples. For this purpose, following Cohen, Jiang et al. resorted to Markov Logic Networks to discover relationships between extracted facts [7]. They defined ontological constraints specified in the form of first-order logic rules; these constraints govern the possible relationships that can be inferred. However, with the Markov Logic Network, the logical relationships that we term ‘predicates’ could only take boolean values for their variables, which posed a disadvantage in inferring a confidence for the facts.
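To make the boolean limitation concrete, here is a toy hard ontological constraint in the Markov Logic Network spirit: a first-order rule that either holds or fails, with no room for a degree of confidence (the categories, constraint, and function are illustrative assumptions, not Jiang et al.’s actual rules):

```python
# A toy "hard" ontological constraint over boolean-valued predicates:
# LivesIn(x, y) requires that x is a person and y is a place.
# Entity categories and names are illustrative.
person = {"Anne", "Jane", "Jim"}
place = {"Paris", "Brazil"}

def satisfies_domain_constraint(predicate, subject, obj):
    """Boolean check: a fact either satisfies the constraint or it doesn't."""
    if predicate == "LivesIn":
        return subject in person and obj in place
    return True  # no constraint defined for other predicates

print(satisfies_domain_constraint("LivesIn", "Anne", "Paris"))  # True
print(satisfies_domain_constraint("LivesIn", "Paris", "Anne"))  # False
```

Every answer is a crisp True/False; there is no way to express that a candidate fact is, say, 70% believable, which motivates the soft-logic relaxation described next in the text.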

This led to the definition of Probabilistic Soft Logic (PSL), which builds on the concepts of Jiang et al.’s Markov Logic Network and defines a sophisticated statistical relational framework that jointly reasons over all the facts to discover new or missing information based on the previous facts [8]. In addition, PSL probabilistically computes a confidence value, a soft truth value within the range [0, 1], inclusive, to indicate how strongly the PSL program believes the fact is true, given what has been provided.

Once the new or missing information is discovered and its confidences are calculated, we can build a knowledge graph from the highly confident facts. This provides us a graph in which new information that cannot be explicitly derived is available in addition to the original facts that were extracted. And this is how we build a knowledge graph from the facts in knowledge bases and the newly discovered facts based on the available observations.

Finally, to summarize these cascaded steps of the knowledge graph pipeline at a higher level, the following are the processes that take place in building a knowledge graph [9].

Phase 1: Extracting facts from free text

  1. Data is extracted from free text, unstructured data sources, and semi-structured data sources.
  2. This raw data is processed in order to extract information. This involves the extraction of entities, relations, and attributes, which are the properties that further define entities and relations.
  3. If data is already structured, unlike in step 1, that data will directly proceed to be fused with information from third-party knowledge bases.
  4. Following this, various natural language processing techniques are applied on top of the fused knowledge and the processed data. These include coreference resolution, named entity resolution, entity disambiguation, and so on.

Phase 2: Formulating triples from extracted facts

  1. The above steps conclude the preprocessing of information for knowledge bases. Next, an ontology extraction process is carried out to categorize the extracted entities and relations under their respective ontologies.
  2. Following the ontology formalization, the facts are refined and stored as triples in the knowledge base.

Phase 3: Constructing the knowledge graph with new links and confidences

  1. To construct the knowledge graph from the knowledge base, statistical relational learning (SRL) is applied to these triples.
  2. The SRL process computes a confidence for each fact, as opposed to the entire domain, in order to identify to what extent those facts hold true.
  3. In constructing the knowledge graph, missing links are identified using these confidences, and the newly inferred relational links are formed.
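The final materialization step of Phase 3 amounts to keeping only the facts whose confidence clears a threshold. A toy sketch (the confidence values and the 0.7 cut-off are illustrative assumptions, not output of a real SRL run):

```python
# Candidate facts with SRL-style confidences: observed facts get 1.0,
# inferred links get the confidence computed for them.
inferred = {
    ("LivesIn", "Anne", "Paris"): 1.0,    # observed fact
    ("Friends", "Anne", "Jim"): 0.72,     # predicted missing link
    ("BornIn", "Jane", "Brazil"): 0.35,   # weakly supported guess
}

THRESHOLD = 0.7  # illustrative cut-off for admission into the graph

# Materialize the knowledge graph from the high-confidence facts only.
graph_facts = {fact: conf for fact, conf in inferred.items() if conf >= THRESHOLD}

print(sorted(graph_facts))
# → [('Friends', 'Anne', 'Jim'), ('LivesIn', 'Anne', 'Paris')]
```

Keeping the confidences attached to the surviving facts (rather than discarding them) is what lets downstream consumers re-threshold or rank answers later.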

Since the confidences of the inferences are incorporated in the knowledge graph, once the graph has been constructed, the decision on the extent to which facts are considered true can also be based on these confidences. As such, a sample knowledge graph of a movie actors domain, generated by Cayley [10], is shown below.

Figure 5. A sample knowledge graph of a movie actors domain

Subsequently, such knowledge graphs can be used in information retrieval systems, chatbots, web applications, knowledge management systems, etc., to efficiently provide responses to user queries.

Conclusion

Thus far, we’ve provided an abstract explanation of how the entire knowledge graph pipeline works. Using the techniques specified in these phases enables the discovery of missing links. Nevertheless, an open question that still floats around the knowledge graph community is the identification of erroneous facts or triples according to human perspectives. Currently, we have methods that compute the confidence of existing and discovered relationships based on the domain and the set of facts. However, this does not provide a sure-footed way to say whether a fact would be judged valid by an actual human evaluator. Hence, in our following post, we’ll look further into how missing links are inferred using a statistical relational framework such as probabilistic soft logic, and how a sufficient level of supervision can be incorporated into the model to align the facts with crowdsourced truths in the knowledge graph.

References

[1] Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., & Ives, Z. (2007). Dbpedia: A nucleus for a web of open data. In The semantic web (pp. 722-735). Springer, Berlin, Heidelberg.

[2] Lenat, D. B., & Guha, R. V. (1991). The evolution of CycL, the Cyc representation language. ACM SIGART Bulletin, 2(3), 84-87.

[3] Bollacker, K., Evans, C., Paritosh, P., Sturge, T., & Taylor, J. (2008, June). Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data (pp. 1247-1250). ACM.

[4] Vrandečić, D., & Krötzsch, M. (2014). Wikidata: a free collaborative knowledgebase. Communications of the ACM, 57(10), 78-85.

[5] Betteridge, J., Carlson, A., Hong, S. A., Hruschka Jr, E. R., Law, E. L., Mitchell, T. M., & Wang, S. H. (2009). Toward Never Ending Language Learning. In AAAI spring symposium: Learning by reading and learning to read (pp. 1-2).

[6] Cohen, W. W., Kautz, H., & McAllester, D. (2000, August). Hardening soft information sources. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 255-259). ACM.

[7] Jiang, S., Lowd, D., & Dou, D. (2012, December). Learning to refine an automatically extracted knowledge base using markov logic. In Data Mining (ICDM), 2012 IEEE 12th International Conference on (pp. 912-917). IEEE.

[8] Brocheler, M., Mihalkova, L., & Getoor, L. (2012). Probabilistic similarity logic. arXiv preprint arXiv:1203.3469.

[9] Liu, Q., Li, Y., Duan, H., Liu, Y., & Qin, Z. (2016). Knowledge graph construction techniques. Journal of Computer Research and Development, 53(3), 582-600.

[10] Cayley, an open-source graph database: https://cayley.io/

Image credits: https://www.iconfinder.com/