I'm trying to get the text with its punctuation, as it is important to consider the latter in my doc2vec model. However, the WikiCorpus retrieves only the text. After search
In gensim/utils.py you will find this method:
    def save_as_line_sentence(corpus, filename):
        with smart_open(filename, mode='wb', encoding='utf8') as fout:
            for sentence in corpus:
                line = any2unicode(' '.join(sentence) + '\n')
                fout.write(line)
that you can use to write the corpus to a text file. You can override it, or take it as an example and write your own version (maybe you want to break the lines at each punctuation mark), like this:
    def save_sentence_each_line(corpus, filename):
        with utils.smart_open(filename, mode='wb', encoding='utf8') as fout:
            for sentence in corpus:
                line = utils.any2unicode(' '.join(sentence) + '\n')
                line = line.replace('. ', '\n').replace('!', '\n').replace('?', '\n')  # <- !!
                ...
You can call it like this:
    save_sentence_each_line(wiki.get_texts(), out_f)
But you also need to override PAT_ALPHABETIC from utils, because that's where the punctuation gets deleted:
    PAT_ALPHABETIC = re.compile(r'(((?![\d])[\w.!?])+)', re.UNICODE)
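If you want to see what that change does before touching gensim, here is a small check that only needs `re`. It is a sketch under a few assumptions: `PAT_ALPHABETIC_DEFAULT` is what I believe the stock pattern in gensim/utils.py looks like, the character class is written without backslashes (inside `[...]` the characters `.`, `!` and `?` need no escaping), and `find_tokens` and the sample sentence are made up for illustration.

```python
import re

# Assumed stock pattern from gensim/utils.py: runs of word characters without digits.
PAT_ALPHABETIC_DEFAULT = re.compile(r'(((?![\d])\w)+)', re.UNICODE)

# Modified pattern: '.', '!' and '?' are also allowed inside a token.
PAT_WITH_PUNCT = re.compile(r'(((?![\d])[\w.!?])+)', re.UNICODE)


def find_tokens(pattern, text):
    """Hypothetical helper: return every full match of `pattern` in `text`."""
    return [match.group() for match in pattern.finditer(text)]


sample = "Hello world. Is it 2020? Yes!"

default_tokens = find_tokens(PAT_ALPHABETIC_DEFAULT, sample)
punct_tokens = find_tokens(PAT_WITH_PUNCT, sample)

print(default_tokens)  # ['Hello', 'world', 'Is', 'it', 'Yes']
print(punct_tokens)    # ['Hello', 'world.', 'Is', 'it', '?', 'Yes!']
```

Note that the punctuation stays glued to the preceding word ("world.") unless a digit or a space separates it, as with the lone "?" after "2020". If you need punctuation as separate tokens, the pattern would have to be changed further.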
You may then need to override utils.tokenize and utils.simple_tokenize in case you want to make further changes to the code.
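To make the "one sentence per line" idea concrete, here is a minimal standalone sketch. It is not the gensim code: it uses the built-in `open` instead of `smart_open`, and it splits with a regular expression after `.`, `!` or `?` instead of chained `replace` calls, so the punctuation is kept rather than turned into a newline. The names `split_sentences` and `save_one_sentence_per_line` are hypothetical, and the splitting rule is deliberately naive (abbreviations like "e.g." will be split too).

```python
import re

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
SENTENCE_END = re.compile(r'(?<=[.!?])\s+')


def split_sentences(tokens):
    """Join the tokens of one document and split the result into sentences."""
    text = ' '.join(tokens)
    return [s for s in SENTENCE_END.split(text) if s]


def save_one_sentence_per_line(corpus, filename):
    """Write every sentence of every document in `corpus` on its own line."""
    with open(filename, mode='w', encoding='utf8') as fout:
        for tokens in corpus:
            for sentence in split_sentences(tokens):
                fout.write(sentence + '\n')


# Tokens as they might look once punctuation survives tokenization.
print(split_sentences(['Hello', 'world.', 'Is', 'it', '?', 'Yes!']))
# ['Hello world.', 'Is it ?', 'Yes!']
```

Assuming `wiki.get_texts()` yields lists of string tokens, you could call it as `save_one_sentence_per_line(wiki.get_texts(), out_f)`. It only helps once the tokenizer stops stripping the punctuation, so the PAT_ALPHABETIC change still has to come first.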