Problems with gensim WikiCorpus - aliasing chunkize to chunkize_serial; (__mp_main__ instead of __main__?)

Submitted by 一曲冷凌霜 on 2021-01-05 06:48:32

Question


I'm quite new to Python and coding in general, and I have run into an issue.

I'm trying to run this code (credit to Matthew Mayo, whole thing can be found here):

# import warnings
# warnings.filterwarnings(action = 'ignore', category = UserWarning, module = 'gensim')
import sys
from gensim.corpora import WikiCorpus

def make_corpus (in_f, out_f):
    print(0)
    output = open(out_f, 'w', encoding = 'utf-8')
    print(1)
    wiki = WikiCorpus(in_f)
    print(2)
    i = 0
    for text in wiki.get_texts():
        output.write(' '.join(text) + '\n')  # one article per line
        i += 1
        if i % 10000 == 0:
            print('Processed {} articles!'.format(i))
    print(3)
    output.close()
    print('Process complete!')



print('start')
if __name__ == '__main__':
    if len(sys.argv) != 3:
        print('Usage: python make_wiki_corpus.py <wikipedia_dump_file> <processed_text_file>')
        sys.exit(1)
    in_f = sys.argv[1]
    out_f = sys.argv[2]
    make_corpus(in_f, out_f)
else:
    print(__name__)

However, make_corpus only seems to run partway: it stops at wiki = WikiCorpus(in_f) and never reaches print(2). The script then appears to start over from the top several times and produces no results. No error is raised, only a warning (UserWarning: detected Windows; aliasing chunkize to chunkize_serial).

The output is this:

start
0
1
C:\Users\name\Anaconda3\lib\site-packages\gensim\utils.py:1254: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
start
__mp_main__
start
__mp_main__
start
__mp_main__

I've tried uninstalling all required packages (numpy, smart_open), as well as gensim itself (in an active conda environment), but nothing has changed. Also, what is the difference between __main__ and __mp_main__ (the multiprocessing one)?

-- Specifications: win64, py 3.7.3

Edit: after enabling logging at the DEBUG level, the log file contains:

2020-02-16 22:49:00,061:start: :13396 
2020-02-16 22:49:00,061:0 :13396 
2020-02-16 22:49:00,061:1 :13396 
2020-02-16 22:49:01,493:start: :22356 
2020-02-16 22:49:01,493:3 :22356 
2020-02-16 22:49:01,496:start: :25332 
2020-02-16 22:49:01,497:3 :25332 
2020-02-16 22:49:01,530:start: :7120 
2020-02-16 22:49:01,530:3 :7120 
2020-02-16 22:49:01,541:adding document #0 to Dictionary(0 unique tokens: []):13396

(Note: the '3' entries are logged from the else branch:)

else:
    logging.debug('3 ')

Answer 1:


Windows may be a contributing factor: much of Python's multiprocessing behaves differently there, and gensim gets far more use and testing on other operating systems. If you can test your code under another OS, or switch to one entirely, this problem and other potential future problems may become irrelevant.
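
As a rough illustration of one such difference (not necessarily the cause here): on Windows, multiprocessing starts workers with the "spawn" method, which re-imports the main script in each worker under the name __mp_main__. As far as I know, WikiCorpus creates a worker pool while it builds its dictionary, so module-level code such as print('start') would run again in every worker, while the if __name__ == '__main__': block would not. The sketch below forces "spawn" on any OS so the effect can be reproduced elsewhere; the file name spawn_demo.py and the helpers module_name and run_demo are made up for this demo.

```python
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path

# A tiny stand-in for the question's script (hypothetical, not gensim code).
SCRIPT = textwrap.dedent('''
    import multiprocessing as mp

    # Module-level code: runs in the parent AND again in every spawned worker.
    print("start", __name__, flush=True)

    def module_name(_):
        # Reports the name under which this module was loaded in the worker.
        return __name__

    if __name__ == "__main__":
        # Only the parent enters this block; workers are loaded as "__mp_main__".
        ctx = mp.get_context("spawn")  # the start method Windows uses
        with ctx.Pool(2) as pool:
            seen = sorted(set(pool.map(module_name, range(4))))
        print("workers saw:", seen, flush=True)
''')


def run_demo():
    """Write the demo script to a temp file, run it, and return its stdout lines."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "spawn_demo.py"
        path.write_text(SCRIPT, encoding="utf-8")
        result = subprocess.run(
            [sys.executable, str(path)],
            capture_output=True, text=True, check=True, timeout=120,
        )
    return result.stdout.splitlines()


if __name__ == "__main__":
    for line in run_demo():
        print(line)
```

If the repeated start / __mp_main__ lines in your output come from the same mechanism, they are the workers starting up rather than your script restarting, and the main process may simply still be working through the dump.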

Other things to check & try:

  • does the wiki_en.txt file get created, or receive any output, at all?

  • does it help if you supply processes=1 as an argument to WikiCorpus, so that only one worker process is used?

  • if you test some code that doesn't use WikiCorpus at all, but tries to read through the raw wiki dump, using BZ2File to decompress it in the same style as gensim's wikicorpus.py source code, does that work, or does it show a similar problem? (If there is a similar problem, then it's a usefully smaller triggering case that focuses attention on BZ2File's operation on Windows.)

  • are you by chance using Wikipedia's "multistream" BZ2 file, and if so, could you try the non-multistream alternative & see if the same problem persists (in case this is an issue with BZ2File & multistream on Windows)?
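
For the BZ2File check above, a minimal sketch could look like the following. It only decompresses the dump and counts bytes, with no gensim and no multiprocessing involved; the function name count_decompressed and the 1 MiB chunk size are arbitrary choices of mine, not anything taken from gensim.

```python
import bz2


def count_decompressed(path, chunk_size=1 << 20):
    """Decompress a .bz2 file chunk by chunk and return the total byte count.

    Reading in fixed-size chunks keeps memory use flat even for a full
    Wikipedia dump. Python's BZ2File also handles multistream archives.
    """
    total = 0
    with bz2.BZ2File(path) as stream:
        # iter(callable, sentinel) keeps calling read() until it returns b"".
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            total += len(chunk)
    return total


# Example call (the path is whatever dump file you pass to your script):
# print(count_decompressed(r"path\to\your-dump.xml.bz2"), "bytes decompressed")
```

If this runs to completion on both the multistream and the non-multistream dump, BZ2File is probably not the culprit, and attention can go back to the worker processes.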



Source: https://stackoverflow.com/questions/60248118/problems-with-gensim-wikicorpus-aliasing-chunkize-to-chunkize-serial-mp-ma
