Manuela Nayantara Jeyaraj, Srinath Perera, Malith Jayasinghe, Nadheesh Jihan
International Conference on Computational Linguistics and Intelligent Text Processing (CiCLing’19)
This post discusses the methodology we proposed to efficiently filter erroneous facts out of knowledge bases in order to refine knowledge graphs. The project harnesses the capabilities of machine learning and probabilistic soft logic (PSL).
With new content flooding the internet on a daily basis, there is a dire need to dynamically discover useful information from the facts that are directly available. As stated in our paper:
“A knowledge extraction pipeline takes in data, converts it to a knowledge base, and finally provides the outcome of knowledge extraction as a knowledge graph”
We have already looked at the knowledge extraction pipeline in my previous blog post titled Conceptualizing the Knowledge Graph Construction Pipeline. So, you may have encountered all the buzz about knowledge graphs and the need and efforts to refine them. Since knowledge graphs comprise millions of facts, manually identifying true facts from false ones is not an option. On the other hand, a fully automated inspection of knowledge graphs loses the perspective of actual humans and how they process facts, as opposed to how a machine processes the same facts on its own. Hence, this paper meets in the middle and leverages the human ability to judge facts to train a model that can learn to distinguish true facts from false ones.
“We use machine learning to verify the correctness of the triples based on a set of features: subject, object, predicate, and the probabilistic soft truth confidence values."
Considering the problem of identifying the true facts/triples among false ones, we require the perspective of actual human evaluators. While gathering human-evaluated truth values for each and every fact/triple in the knowledge base would be strenuous as well as unimaginably expensive, obtaining evaluations for a subset of the triples is a comparably easier task, made possible through crowdsourcing platforms such as Amazon's Mechanical Turk. So, with these human-evaluated truths for our dataset, we applied classification algorithms such as the support vector machine, stochastic gradient descent, and the random forest classifier to knowledge bases such as NELL and YAGO.

To train the model, we initially took a feature set made up of the triples themselves (i.e., the subject, object, and predicate). Then we turned to probabilistic soft logic (PSL) to introduce a quantitative feature that could illustrate the importance of each fact/triple within the entire dataset. PSL is a joint probabilistic reasoning framework that computes a soft truth value for each fact/triple, lying in the range [0, 1]. This soft truth value represents the confidence the system has that the fact is most probably true, based on the evidence it was provided with (i.e., the entire dataset). However, the uncertainty in manually finding a threshold, dictated by the test accuracy that maximizes the prediction performance, makes a clear and concise labeling of the facts as true or false quite questionable. (If you are further interested in how PSL computes these soft truth values, refer to my blog "Inferring New Relationships using the Probabilistic Soft Logic".) The final feature used is the set of truth values (boolean truths) for each fact, as evaluated by actual human evaluators.
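Since classifiers need numeric inputs, the subject, predicate, and object strings must be mapped to numbers before they can serve as features. The following is a minimal sketch of one such encoding; the triples and soft truth values are made up for illustration and are not from NELL or YAGO.

```python
# Toy illustration: encode string triples as integer ids and attach the
# PSL soft truth value as a numeric feature. The triples and soft truth
# values here are invented for demonstration.
triples = [
    ("barack_obama", "born_in", "hawaii", 0.92),
    ("paris", "capital_of", "france", 0.97),
    ("paris", "capital_of", "germany", 0.18),
]

def build_vocab(values):
    """Map each distinct string to a stable integer id."""
    return {v: i for i, v in enumerate(sorted(set(values)))}

subj_ids = build_vocab(t[0] for t in triples)
pred_ids = build_vocab(t[1] for t in triples)
obj_ids = build_vocab(t[2] for t in triples)

# Each row: [subject id, predicate id, object id, PSL soft truth value]
features = [
    [subj_ids[s], pred_ids[p], obj_ids[o], soft]
    for s, p, o, soft in triples
]
print(features)
```

The human-evaluated boolean truths would then form the label vector paired with these rows.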
Since we train the model on only part of the entire dataset, acquiring the human-evaluated truth values for this subset of triples was not expensive, as opposed to a completely manual evaluation.
So, with these five features, the initial training was conducted on the NELL and YAGO datasets using the following classifiers.
One question I am frequently asked is whether a stochastic gradient descent (SGD) classifier actually exists. YES, it does. Most of us know SGD merely as the process of nudging weights and biases to optimize predictions when training a neural network. Building on the same concept, the scikit-learn library provides the SGDClassifier, which is essentially a classifier similar to a linear support vector machine or logistic regression that uses the principles of SGD to address classification problems. Refer to the scikit-learn documentation here.
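A minimal sketch of the SGDClassifier in action, using synthetic data in place of the encoded triples (the hyperparameters here are illustrative, not the paper's configuration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic binary-classification data standing in for the encoded triples.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# loss="hinge" yields a linear SVM trained via stochastic gradient descent;
# a log-loss would yield logistic regression instead.
clf = SGDClassifier(loss="hinge", max_iter=1000, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))
```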
Moving on, the first experiment was to train the above three classifiers on the feature set of the subject, object, predicate, and PSL soft truth values, with the human-evaluated truth values as the target. The performance metrics (precision, recall, and F1) were evaluated for these experiments on the NELL and YAGO datasets. As such, we were able to observe the following results.
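This experimental setup can be sketched as follows, again with synthetic data in place of the encoded NELL/YAGO feature matrix; the classifier settings are defaults chosen for illustration, not the paper's exact configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for the encoded feature matrix: in the real setup the
# columns would be subject id, object id, predicate id, PSL soft truth value.
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

classifiers = {
    "svm": LinearSVC(),
    "sgd": SGDClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
}

scores = {}
for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    scores[name] = (
        precision_score(y_te, pred),
        recall_score(y_te, pred),
        f1_score(y_te, pred),
    )
print(scores)
```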
Fig 1. Performance Index Values for the NELL dataset (experiment 1)
Fig 2. Performance Index Values for the YAGO dataset (experiment 2)
“The Baseline here is the performance index values for the triples, based on solely their soft truth values.”
According to the Knowledge Graph Identification paper by Pujara et al., the true facts were identified by means of a threshold on the soft truth values. For example, they considered different soft truth thresholds such as 0, 0.1, 0.2, …, 0.9, 1.0. The authors identified facts scoring a soft truth value greater than the threshold as true and facts scoring lower than the threshold as false. Based on these assumptions, the performance index metrics for the various threshold-based predictions were computed. Then, the threshold for which the test accuracy (F1 score) was the greatest was taken as the baseline. This is the same baseline concept used in our experiments with the NELL and YAGO datasets.
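The baseline procedure can be sketched in a few lines; the soft truth values and human labels below are made-up examples, not values from the datasets.

```python
# Sketch of the threshold baseline: label a triple true when its PSL soft
# truth value exceeds a threshold, then keep the threshold with the best F1.
soft_truths = [0.95, 0.80, 0.62, 0.40, 0.15, 0.88, 0.30]
human_labels = [1, 1, 1, 0, 0, 1, 1]

def f1(pred, gold):
    """F1 score of boolean predictions against boolean gold labels."""
    tp = sum(p and g for p, g in zip(pred, gold))
    fp = sum(p and not g for p, g in zip(pred, gold))
    fn = sum(g and not p for p, g in zip(pred, gold))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Sweep thresholds 0.0, 0.1, ..., 1.0 and keep the best-scoring one.
best = max(
    ((t / 10, f1([s > t / 10 for s in soft_truths], human_labels))
     for t in range(0, 11)),
    key=lambda pair: pair[1],
)
print(best)  # (best threshold, its F1 score)
```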
Now let’s observe the results. While there is a clear improvement when using a classifier to predict the most-probably-true facts, the random forest classifier shows a much better performance, with an 11.96% increase in precision, a 2.53% increase in recall, and a 6.11% increase in test accuracy (F1) over the baseline. With the YAGO dataset, the test accuracy improvement was a mere 1.38%, but an improvement nonetheless.
This led to the question of whether using only these five features (subject, object, predicate, PSL soft truth values, and the human-evaluated truths) would be enough to consistently yield an improved prediction performance across all datasets.
“Moreover, we extend our solution model to introduce a novel empirical feature to quantify the importance of each triple. This feature was a derivative of the number of relationships or interaction between the subject-predicate-object triple. We term this as the fact strength, for referential purposes.”
Remember how we said that ‘the PSL soft truth value is considered as a confidence that the system has, that the fact is most probably true, based on the evidence it was provided with’? Well, the fact strength that we compute can be considered the influence that one fact has within the entire dataset relative to all the other facts. We compute it using the equation defined in our paper.
The value of the fact strength is always a positive integer. For a worked example of how the fact strength is computed, refer to our paper here.
Once the fact strength is computed for each and every fact, training proceeds with six features: the original five along with the newly computed fact strength. The evaluations were then repeated under the same conditions with this newly introduced empirical feature, yielding the new set of evaluation results shown below.
Fig 3. Performance Index Values for the NELL dataset with the fact strength (experiment 3)
Fig 4. Performance Index Values for the YAGO dataset with the fact strength (experiment 4)
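Mechanically, extending the feature set amounts to appending one more column and retraining. The sketch below uses synthetic data and random placeholder values for the fact strength (the real values come from the equation in our paper), so it only illustrates the plumbing, not the reported gains.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the original feature matrix.
X, y = make_classification(n_samples=400, n_features=5, random_state=0)

# Placeholder fact-strength column (positive integers); the actual values
# would come from the fact-strength equation in the paper.
rng = np.random.default_rng(0)
fact_strength = rng.integers(1, 20, size=len(X)).reshape(-1, 1)

# Append fact strength as the additional feature column and retrain.
X_ext = np.hstack([X, fact_strength])
X_tr, X_te, y_tr, y_te = train_test_split(X_ext, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("F1 with fact strength:", f1_score(y_te, clf.predict(X_te)))
```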
Since these two experiments clearly show an improvement in the performance index metrics, let us contrast the results against the classification model that didn’t use the fact strength, rather than against the baseline. Taking the best-performing classifier (the random forest), on average across both datasets, there is an improvement of 4.44% in the F1 score, an average precision improvement of 2.07%, and a trivial 0.42% increase in recall.
Although all three classifiers showed notable improvements in the performance metrics, we proceeded to verify which classifier performed optimally. Hence, a classifier calibration test was done.
“The classifier calibration models the classifiers’ alignment as opposed to the best or ideal calibration for the actual predictions. The predicted truth value against the fraction of positives plots the graph in the form of y = mx + c.”
Fig 5. Classifier Calibration Plot.
As depicted, the random forest classifier’s predictions lie closest to the ideal, perfectly calibrated line, demonstrating the efficient prediction capabilities of the random forest classifier.
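A calibration plot of this kind can be produced with scikit-learn's `calibration_curve`, which bins the predicted probabilities and compares each bin's mean prediction against the observed fraction of positives. A sketch with toy predicted probabilities in place of the classifiers' actual outputs:

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Toy labels and predicted probabilities standing in for a classifier's
# test-set outputs.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
y_prob = np.clip(y_true * 0.7 + rng.uniform(0, 0.3, size=500), 0, 1)

# frac_pos vs mean_pred gives the calibration curve; a perfectly
# calibrated classifier lies on the diagonal y = x.
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=5)
print(list(zip(mean_pred.round(2), frac_pos.round(2))))
```

Plotting these pairs against the diagonal reproduces the kind of comparison shown in the calibration figure.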
Hence, we conclude that the random forest classifier, trained on a feature set of the subject, object, predicate, PSL soft truth values, a subset of human-evaluated truths, and the fact strength, can outperform prior approaches in detecting erroneous facts in knowledge graphs.
We are currently looking into fine-tuning the input feature set as well as the prospects of Bayesian statistics in this regard.