External shuffle: shuffling a large amount of data that does not fit in memory
I am looking for a way to shuffle a large amount of data (approx. 40 GB) that does not fit into memory. I have around 30 million entries of variable length, stored in one large file, and I know the start and end positions of each entry in that file. The only solution I have thought of is to shuffle an array containing the numbers from 1 to N, where N is the number of entries, with the Fisher-Yates algorithm, and then copy the entries into a new file in that order. Unfortunately, this solution involves a lot of seek operations,