How to perform entity linking to local knowledge graph?


Question


I'm building my own knowledge base from scratch, using articles online.

I am trying to map the entities from my scraped SPO triples (the Subject and potentially the Object) to my own record of entities which consist of listed companies which I scraped from some other website.

I've researched most of the available libraries, and their methods are focused on mapping entities to big knowledge bases like Wikipedia, YAGO, etc., but I'm not really sure how to apply those techniques to my own knowledge base.

Currently, I've found the NEL Python package, which claims to be able to do this, but I don't quite understand the documentation, and it focuses only on a Wikipedia data dump.

Are there any techniques or libraries that allow me to do this?


Answer 1:


I assume you have something similar to the Wikidata knowledge base, that is, a giant list of concepts with aliases.

More or less, this can be represented as follows:

C1 new york
C1 nyc
C1 big apple

Now, to link the spans of a sentence to the above KB: for single words it is easy, you just have to set up an index that maps each single-word concept to an identifier.
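To illustrate (this is only a sketch, not part of the original code; the names alias_index and lookup are made up for the example), such an index can be an ordinary Python dictionary:

# Toy alias index for the single-word aliases of the example KB above.
alias_index = {
    'nyc': 'C1',
}

def lookup(word):
    """Return the concept id for a single word, or None if it is unknown."""
    return alias_index.get(word.lower())

print(lookup('NYC'))   # 'C1'
print(lookup('york'))  # None -- 'york' on its own is not an alias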

The difficult part is linking multi-word or phrasal concepts like "new york" or "big apple".

To achieve that, I use an algorithm that splits a sentence into all possible slices; I call those "spans". It then tries to match each individual span, or group of words, against a concept from the database (single-word or multi-word).

For instance, here are all the spans for a simple sentence. Each line is a list that stores lists of strings:

[['new'], ['york'], ['is'], ['the'], ['big'], ['apple']]
[['new'], ['york'], ['is'], ['the'], ['big', 'apple']]
[['new'], ['york'], ['is'], ['the', 'big'], ['apple']]
[['new'], ['york'], ['is'], ['the', 'big', 'apple']]
[['new'], ['york'], ['is', 'the'], ['big'], ['apple']]
[['new'], ['york'], ['is', 'the'], ['big', 'apple']]
[['new'], ['york'], ['is', 'the', 'big'], ['apple']]
[['new'], ['york'], ['is', 'the', 'big', 'apple']]
[['new'], ['york', 'is'], ['the'], ['big'], ['apple']]
[['new'], ['york', 'is'], ['the'], ['big', 'apple']]
[['new'], ['york', 'is'], ['the', 'big'], ['apple']]
[['new'], ['york', 'is'], ['the', 'big', 'apple']]
[['new'], ['york', 'is', 'the'], ['big'], ['apple']]
[['new'], ['york', 'is', 'the'], ['big', 'apple']]
[['new'], ['york', 'is', 'the', 'big'], ['apple']]
[['new'], ['york', 'is', 'the', 'big', 'apple']]
[['new', 'york'], ['is'], ['the'], ['big'], ['apple']]
[['new', 'york'], ['is'], ['the'], ['big', 'apple']]
[['new', 'york'], ['is'], ['the', 'big'], ['apple']]
[['new', 'york'], ['is'], ['the', 'big', 'apple']]
[['new', 'york'], ['is', 'the'], ['big'], ['apple']]
[['new', 'york'], ['is', 'the'], ['big', 'apple']]
[['new', 'york'], ['is', 'the', 'big'], ['apple']]
[['new', 'york'], ['is', 'the', 'big', 'apple']]
[['new', 'york', 'is'], ['the'], ['big'], ['apple']]
[['new', 'york', 'is'], ['the'], ['big', 'apple']]
[['new', 'york', 'is'], ['the', 'big'], ['apple']]
[['new', 'york', 'is'], ['the', 'big', 'apple']]
[['new', 'york', 'is', 'the'], ['big'], ['apple']]
[['new', 'york', 'is', 'the'], ['big', 'apple']]
[['new', 'york', 'is', 'the', 'big'], ['apple']]
[['new', 'york', 'is', 'the', 'big', 'apple']]

Each sublist may or may not map to a concept. To find the best mapping, you can score each of the above lines based on the number of concepts that match.

Here are the two of the above lists of spans that have the best score according to the example knowledge base:

2  ~  [['new', 'york'], ['is'], ['the'], ['big', 'apple']]
2  ~  [['new', 'york'], ['is', 'the'], ['big', 'apple']]

So it guessed that "new york" is a concept and that "big apple" is also a concept.

Here is the full code:

# The input sentence to link, as a list of tokens.
sentence = 'new york is the big apple'.split()


def spans(lst):
    """Yield every possible segmentation of lst into contiguous slices."""
    if len(lst) == 0:
        yield None
    for index in range(1, len(lst)):
        # Take the first `index` tokens as one slice, then segment the rest.
        for span in spans(lst[index:]):
            if span is not None:
                yield [lst[0:index]] + span
    # The whole list as a single slice is also a valid segmentation.
    yield [lst]


# Toy knowledge base: each entry is the tokenized form of a known concept.
knowledgebase = [
    ['new', 'york'],
    ['big', 'apple'],
]

out = []
scores = []

# Score every segmentation by how many of its slices match a KB concept.
for span in spans(sentence):
    score = 0
    for candidate in span:
        for entity in knowledgebase:
            if candidate == entity:
                score += 1
    out.append(span)
    scores.append(score)

# Sort ascending by score, so the best segmentations are printed last.
leaderboard = sorted(zip(out, scores), key=lambda x: x[1])

for winner in leaderboard:
    print(winner[1], ' ~ ', winner[0])

This can be improved to associate each list that matches a concept with its concept identifier, and to find a way to spell-check everything against the knowledge base.
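As a rough sketch of what that first improvement could look like (the alias_to_id mapping and the link function below are assumptions of this example, not part of the original answer; it reuses the spans() generator from the code above):

# Alias -> concept id mapping for the same toy knowledge base.
alias_to_id = {
    ('new', 'york'): 'C1',
    ('big', 'apple'): 'C1',
}

def link(text):
    """Return the best-scoring segmentation with a concept id attached to each slice."""
    best_score, best_linked = -1, None
    for span in spans(text.split()):
        # Pair every slice with the concept it matches, or None.
        linked = [(chunk, alias_to_id.get(tuple(chunk))) for chunk in span]
        score = sum(1 for _, concept in linked if concept is not None)
        if score > best_score:
            best_score, best_linked = score, linked
    return best_linked

print(link('new york is the big apple'))
# one of the two segmentations with two matches, e.g.:
# [(['new', 'york'], 'C1'), (['is'], None), (['the'], None), (['big', 'apple'], 'C1')]

For the spell-checking part, fuzzy matching of unknown tokens against the knowledge base vocabulary is one option, for example with difflib.get_close_matches from the Python standard library.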



Source: https://stackoverflow.com/questions/52046394/how-to-perform-entity-linking-to-local-knowledge-graph
