Read large file in parallel?

Backend · unresolved · 6 · 1889
被撕碎了的回忆 2020-12-05 04:50

I have a large file which I need to read in and make a dictionary from. I would like this to be as fast as possible. However, my code in Python is too slow. Here is a minimal example.
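A baseline of the kind described, assuming one "key value" pair per line (the function name and line format below are assumptions, not the original snippet), might look like:

    # Hypothetical baseline: read the file line by line and build a dict,
    # assuming one "key value" pair per line.
    def build_dict(path):
        d = {}
        with open(path) as f:
            for line in f:
                key, _, value = line.rstrip("\n").partition(" ")
                d[key] = value
        return d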

6 Answers
  •  南方客 (OP)
     2020-12-05 05:47

    It may be possible to parallelize this to speed it up, but doing multiple reads in parallel is unlikely to help.

    Your OS is unlikely to usefully do multiple reads in parallel (the exception is with something like a striped raid array, in which case you still need to know the stride to make optimal use of it).

    What you can do is run the relatively expensive string/dictionary/list operations in parallel with the read.

    So, one thread reads and pushes (large) chunks onto a synchronized queue, and one or more consumer threads pull chunks from the queue, split them into lines, and populate the dictionary.

    (If you go for multiple consumer threads, as Pappnese says, build one dictionary per thread and then join them).


    Hints:

    • ... push chunks to a synchronized queue ...
    • ... one or more consumer threads ...
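
    A minimal sketch of that layout in Python, using threading and queue.Queue: one reader thread pushes chunks, each consumer thread builds its own dictionary, and the partial dictionaries are merged at the end. The "key value" per-line format, the chunk size, and the queue bound are assumptions, and in CPython the GIL limits how much extra consumer threads buy you.

        import queue
        import threading

        def build_dict_parallel(path, num_consumers=2, chunk_size=1 << 20):
            # One reader thread, num_consumers consumer threads, one dict per
            # consumer, merged at the end. Assumes "key value" pairs per line.
            chunks = queue.Queue(maxsize=8)           # bounded, synchronized queue
            partials = [dict() for _ in range(num_consumers)]

            def reader():
                leftover = b""
                with open(path, "rb") as f:
                    while True:
                        block = f.read(chunk_size)
                        if not block:
                            break
                        block = leftover + block
                        nl = block.rfind(b"\n")
                        if nl == -1:                  # no complete line yet
                            leftover = block
                            continue
                        leftover = block[nl + 1:]     # keep the trailing fragment
                        chunks.put(block[:nl + 1])
                if leftover:
                    chunks.put(leftover)
                for _ in range(num_consumers):
                    chunks.put(None)                  # sentinel: stop each consumer

            def consumer(d):
                while True:
                    chunk = chunks.get()
                    if chunk is None:
                        break
                    for line in chunk.splitlines():
                        key, _, value = line.partition(b" ")
                        d[key] = value

            threads = [threading.Thread(target=reader)]
            threads += [threading.Thread(target=consumer, args=(d,)) for d in partials]
            for t in threads:
                t.start()
            for t in threads:
                t.join()

            merged = {}
            for d in partials:
                merged.update(d)                      # join the per-thread dictionaries
            return merged

    Bounding the queue keeps the reader from racing far ahead of the consumers and holding too many chunks in memory at once.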

    Re. bounty:

    C obviously doesn't have the GIL to contend with, so multiple consumers are likely to scale better. The read behaviour doesn't change, though. The downside is that C lacks built-in support for hash maps (assuming you still want a Python-style dictionary) and synchronized queues, so you have to either find suitable components or write your own. The basic strategy of multiple consumers, each building its own dictionary and then merging them at the end, is still likely the best.

    Using strtok_r instead of str.split may be faster, but remember that you'll also need to manage the memory for all your strings manually, and you'll need logic to handle line fragments as well. Honestly, C gives you so many options that I think you'll just have to profile it and see.
