How to get the Wikipedia corpus text with punctuation using gensim WikiCorpus?

半阙折子戏 2021-01-19 00:39

I'm trying to get the text with its punctuation, as it is important to consider the latter in my doc2vec model. However, WikiCorpus retrieves only the text. After search

2 answers
  •  我在风中等你
    2021-01-19 01:26

    The problem lies in the tokenize function you defined:

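    from gensim import utils

    def tokenize(content):
        # utils.tokenize yields alphabetic tokens only, so punctuation is lost here
        return [token.encode('utf8')
                for token in utils.tokenize(content, lower=True, errors='ignore')
                if len(token) <= 15 and not token.startswith('_')]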
    

    The call utils.tokenize(content, lower=True, errors='ignore') simply splits the article into a sequence of tokens. However, the implementation of this function in .../site-packages/gensim/utils.py ignores punctuation.

    For example, list(utils.tokenize("I love eating banana, apple")) gives ["I", "love", "eating", "banana", "apple"]: the comma is gone. (utils.tokenize itself returns a generator, hence the list() call.)
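
To see why the comma disappears, here is a minimal sketch in plain Python that imitates this behavior with a regular expression matching only runs of letters. The name alphabetic_tokens and the pattern are assumptions made for illustration, not gensim's actual code.

```python
import re

# Assumed stand-in pattern: runs of word characters that are not digits.
# Punctuation and whitespace never match, so they only act as separators.
ALPHA_RUN = re.compile(r"[^\W\d]+")

def alphabetic_tokens(text, lower=False):
    """Yield alphabetic tokens from text, dropping everything else."""
    if lower:
        text = text.lower()
    for match in ALPHA_RUN.finditer(text):
        yield match.group()

print(list(alphabetic_tokens("I love eating banana, apple")))
# ['I', 'love', 'eating', 'banana', 'apple']
```

Because the pattern can only ever match letters, no filtering step afterwards can bring the punctuation back; the tokenizer itself has to change.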

    Anyway, you can define your own tokenize function as follows to retain punctuation.

    def tokenize(content):
        # Override the original method in wikicorpus.py.
        # Splitting on whitespace leaves punctuation attached to the words.
        return [token.encode('utf8') for token in content.split()
                if len(token) <= 15 and not token.startswith('_')]
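
Note that splitting on whitespace keeps punctuation glued to the neighboring word, so "banana, apple" becomes "banana," and "apple". If you would rather have each punctuation mark as its own token, one possible variant is sketched below. The name tokenize_keep_punct and the regular expression are my own assumptions, and the sketch returns Python 3 str tokens instead of UTF-8 encoded bytes.

```python
import re

# Assumed pattern: either a run of word characters, or one single character
# that is neither a word character nor whitespace (a punctuation mark).
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]")

def tokenize_keep_punct(content, max_len=15):
    """Split content into words and standalone punctuation tokens."""
    return [token for token in WORD_OR_PUNCT.findall(content)
            # Same filters as the functions above: length cap, no leading "_".
            if len(token) <= max_len and not token.startswith('_')]

print(tokenize_keep_punct("I love eating banana, apple."))
# ['I', 'love', 'eating', 'banana', ',', 'apple', '.']
```

Whether punctuation should be attached or separate depends on what your doc2vec model is supposed to learn from it, so treat this as a starting point.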
    
