I have a large file which I need to read in and build a dictionary from. I would like this to be as fast as possible, but my code in Python is too slow. Here is a minimal example:
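(The original snippet was cut off; as a stand-in, assume something like the following, where `data.txt` and the one-pair-per-line, whitespace-separated format are illustrative assumptions, not part of the question.)

```python
# Stand-in baseline: one whitespace-separated key/value pair per line.
# The filename and the format are illustrative assumptions.
d = {}
with open("data.txt") as f:
    for line in f:
        key, value = line.split()
        d[key] = value
```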
It may be possible to parallelize this to speed it up, but doing multiple reads in parallel is unlikely to help.
Your OS is unlikely to usefully service multiple reads in parallel (the exception is something like a striped RAID array, in which case you still need to know the stride to make optimal use of it).
What you can do is run the relatively expensive string/dictionary/list operations in parallel with the read.
So, one thread reads and pushes (large) chunks to a synchronized queue, and one or more consumer threads pull chunks from the queue, split them into lines, and populate the dictionary.
(If you go for multiple consumer threads, as Pappnese says, build one dictionary per thread and then join them at the end; there's a sketch below.)
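A minimal sketch of that arrangement, assuming the same hypothetical whitespace-separated format as above; the chunk size, thread count, and names like `build_dict` are all illustrative, not prescriptive:

```python
import queue
import threading

CHUNK_SIZE = 1 << 20    # 1 MiB per read; purely illustrative, tune it
NUM_CONSUMERS = 2       # also illustrative
SENTINEL = None         # tells a consumer to stop

def producer(path, q):
    """Read large chunks, cut them at the last newline so consumers only
    ever see whole lines, and push them to the queue."""
    leftover = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            chunk = leftover + chunk
            last_nl = chunk.rfind(b"\n")
            if last_nl == -1:        # no complete line yet, keep accumulating
                leftover = chunk
                continue
            q.put(chunk[:last_nl])
            leftover = chunk[last_nl + 1:]
    if leftover:
        q.put(leftover)
    for _ in range(NUM_CONSUMERS):
        q.put(SENTINEL)

def consumer(q, results):
    """Pull chunks, split them into lines, and build a private dictionary."""
    d = {}
    while True:
        chunk = q.get()
        if chunk is SENTINEL:
            break
        for line in chunk.split(b"\n"):
            if line:
                key, value = line.split(None, 1)   # assumed two-field format
                d[key] = value
    results.append(d)    # list.append is atomic under the GIL

def build_dict(path):
    q = queue.Queue(maxsize=8)   # bounded, so the reader can't outrun parsing
    results = []
    consumers = [threading.Thread(target=consumer, args=(q, results))
                 for _ in range(NUM_CONSUMERS)]
    for t in consumers:
        t.start()
    producer(path, q)            # read on the main thread
    for t in consumers:
        t.join()
    merged = {}                  # join the per-thread dictionaries
    for d in results:
        merged.update(d)
    return merged
```

Bear in mind that CPython's GIL means the pure-Python parsing in the consumers won't overlap perfectly; that limitation is part of what makes C attractive below.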
Re. bounty:
C obviously doesn't have the GIL to contend with, so multiple consumers are likely to scale better. The read behaviour doesn't change, though. The downside is that C lacks built-in support for hash maps (assuming you still want a Python-style dictionary) and synchronized queues, so you have to either find suitable components or write your own. The basic strategy of multiple consumers each building their own dictionary and then merging them at the end is still likely the best.
Using `strtok_r` instead of `str.split` may be faster, but remember you'll need to manage the memory for all your strings manually too. Oh, and you need logic to manage line fragments too. Honestly, C gives you so many options that I think you'll just need to profile it and see.