I'm trying to get the text with its punctuation, since punctuation matters for my doc2vec model. However, WikiCorpus retrieves only the text. After search
The problem lies in the tokenize function you defined:
    def tokenize(content):
        return [token.encode('utf8')
                for token in utils.tokenize(content, lower=True, errors='ignore')
                if len(token) <= 15 and not token.startswith('_')]
The function utils.tokenize(content, lower=True, errors='ignore') simply splits the article into a sequence of tokens. However, its implementation in .../site-packages/gensim/utils.py ignores punctuation.
For example, utils.tokenize("I love eating banana, apple") yields the tokens "I", "love", "eating", "banana", "apple"; the comma is dropped.
You can define your own tokenize function as follows to retain punctuation. Note that content.split() splits on whitespace only, so each punctuation mark stays attached to its neighbouring word (for example "banana,").
    def tokenize(content):
        # override the original method in wikicorpus.py
        return [token.encode('utf8') for token in content.split()
                if len(token) <= 15 and not token.startswith('_')]
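If you would rather have each punctuation mark as its own token instead of glued to a word, a regex-based variant is one option. This is only a minimal sketch using the standard library: the name tokenize_keep_punct, the pattern, and the max_len parameter are my own assumptions, not part of gensim, and unlike the snippet above it returns plain strings rather than UTF-8 encoded bytes.

```python
import re

# Hypothetical pattern (not from gensim): match either a run of word
# characters or a single non-space, non-word character (punctuation).
TOKEN_RE = re.compile(r"\w+|[^\w\s]", re.UNICODE)

def tokenize_keep_punct(content, max_len=15):
    """Lowercase the text and return words and punctuation as separate tokens."""
    return [
        token
        for token in TOKEN_RE.findall(content.lower())
        # Same filters as the original tokenize: drop very long tokens
        # and tokens that start with an underscore.
        if len(token) <= max_len and not token.startswith('_')
    ]

print(tokenize_keep_punct("I love eating banana, apple."))
# ['i', 'love', 'eating', 'banana', ',', 'apple', '.']
```

Whether separate punctuation tokens help depends on your doc2vec setup, so it is worth trying both variants on a small sample first.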