I have written the program (below) to:
pandas dataframegroupby using a specific column value
When you use multiprocessing.Pool a number of child processes will be created using the fork() system call. Each of those processes start off with an exact copy of the memory of the parent process at that time. Because you're loading the csv before you create the Pool of size 3, each of those 3 processes in the pool will unnecessarily have a copy of the data frame. (gen_matrix_df as well as gen_matrix_df_list will exist in the current process as well as in each of the 3 child processes, so 4 copies of each of these structures will be in memory)
Try creating the Pool before loading the file (at the very beginning actually) That should reduce the memory usage.
If it's still too high, you can:
Dump gen_matrix_df_list to a file, 1 item per line, e.g:
import os
import cPickle
with open('tempfile.txt', 'w') as f:
for item in gen_matrix_df_list.items():
cPickle.dump(item, f)
f.write(os.linesep)
Use Pool.imap() on an iterator over the lines that you dumped in this file, e.g.:
with open('tempfile.txt', 'r') as f:
p.imap(matrix_to_vcf, (cPickle.loads(line) for line in f))
(Note that matrix_to_vcf takes a (key, value) tuple in the example above, not just a value)
I hope that helps.
NB: I haven't tested the code above. It's only meant to demonstrate the idea.