Question
I keep hitting an issue with the multicore package and big objects. The basic idea is that I'm using a Bioconductor function (readBamGappedAlignments) to read in large objects. I have a character vector of filenames, and I've been using mclapply to loop over the files and read them into a list. The function looks something like this:
library(multicore)  # provides mclapply

objects <- mclapply(files, function(x) {
  on.exit(message(sprintf("Completed: %s", x)))
  message(sprintf("Started: '%s'", x))
  readBamGappedAlignments(x)
}, mc.cores = 10)
However, I keep getting the following error: Error: serialization is too large to store in a raw vector. Yet I can read the same files in one at a time without this error. I've found mention of this issue here, without resolution.
Any suggestions for a parallel solution would be appreciated; this has to be done in parallel. I could look into snow, but I have a very powerful server with 15 processors, 8 cores each, and 256GB of memory that I can do this on. I'd rather do it on this machine across cores than use one of our clusters.
Answer 1:
The integer limit on vector length is rumored to be addressed very soon in R. In my experience that limit can block datasets with well under 2 billion cells (roughly the maximum integer), because low-level functions like sendMaster in the multicore package pass results back as serialized raw vectors. I had around 1 million processes representing about 400 million rows of data and 800 million cells in the data.table format, and when mclapply was sending the results back it ran into this limit.
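To see why a dataset with fewer than 2 billion cells can still hit the limit, here is a rough back-of-the-envelope sketch. It assumes the cells are stored as 8-byte doubles, which is an assumption for illustration and not something stated above; integer cells would take 4 bytes each and still exceed the limit at this size.

```r
# Largest length a raw vector can have under the integer limit: 2^31 - 1 bytes.
max_raw_bytes <- .Machine$integer.max

# A serialized object is roughly as large as the data itself:
# about 8 bytes per double, plus a small header.
x <- numeric(1000)
serialized_size <- length(serialize(x, NULL))

# Assumption: 800 million cells, each stored as an 8-byte double.
cells <- 800e6
estimated_bytes <- cells * 8

# TRUE means the serialized result would not fit in a single raw vector.
too_large <- estimated_bytes > max_raw_bytes
```

The limit is on bytes in the serialized result, not on the number of cells, so the cell count alone understates how close a result is to the ceiling.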
A divide-and-conquer strategy is not that hard, and it works. I realize this is a hack and one should be able to rely on mclapply.
Instead of one big list, create a list of lists, where each sub-list is smaller than the one that broke, and feed the sub-lists to mclapply one at a time. Call this list of lists file_map. The results are also a list of lists, which you can flatten afterwards with two nested do.call("c", ...) calls. This way, each time mclapply finishes, the serialized raw vector it sends back is of a manageable size.
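One way to build file_map is sketched below. The file names and the chunk size of 10 are hypothetical placeholders; pick a chunk size small enough that the results from one chunk serialize comfortably under the limit.

```r
# Hypothetical input: a character vector of 25 file names.
files <- sprintf("sample_%02d.bam", 1:25)

# Hypothetical chunk size: how many files go to mclapply per round.
chunk_size <- 10L

# Assign each file a chunk number (1, 1, ..., 2, 2, ...) and split on it.
# The result is a list of character vectors, one per chunk.
file_map <- split(files, ceiling(seq_along(files) / chunk_size))

# Number of files in each chunk.
chunk_lengths <- sapply(file_map, length)
```

Each element of file_map is then a plain character vector of file names, which is the shape mclapply expects as its first argument.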
Just loop over the smaller pieces:
collector <- vector("list", length(file_map))  # preallocate the result list for speed
for (index in seq_along(file_map)) {
  reduced_set <- mclapply(file_map[[index]], function(x) {
    on.exit(message(sprintf("Completed: %s", x)))
    message(sprintf("Started: '%s'", x))
    readBamGappedAlignments(x)
  }, mc.cores = 10)
  collector[[index]] <- reduced_set
}
# Double concatenate: the inner call flattens the list of lists into one list,
# the outer call then concatenates its elements.
output <- do.call("c", do.call("c", collector))
Alternatively, save the output to a database such as SQLite as you go.
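The same save-as-you-go idea can be sketched without a database. Note that this swaps SQLite for plain .rds files, which is a simplification for illustration and not what the answer proposes: each worker writes its result to disk with saveRDS and returns only the file path, so nothing large has to be sent back to the parent process. The read_big function and the file names are hypothetical stand-ins.

```r
library(parallel)  # mclapply also ships in the base 'parallel' package

# Hypothetical stand-in for an expensive reader such as readBamGappedAlignments().
read_big <- function(path) data.frame(file = path, value = seq_len(5))

files <- sprintf("sample_%02d.bam", 1:4)  # hypothetical file names
out_dir <- file.path(tempdir(), "chunks")
dir.create(out_dir, showWarnings = FALSE)

# mclapply relies on forking, so fall back to one core on Windows.
cores <- if (.Platform$OS.type == "windows") 1L else 2L

paths <- mclapply(files, function(x) {
  out <- file.path(out_dir, paste0(basename(x), ".rds"))
  saveRDS(read_big(x), out)  # the large object goes to disk
  out                        # only a short string returns to the parent
}, mc.cores = cores)

# Load results one at a time in the parent when they are needed.
first <- readRDS(paths[[1]])
```

The trade-off is extra disk I/O and a second pass to load the results, in exchange for keeping every value returned by a worker tiny.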
Source: https://stackoverflow.com/questions/5775064/mclapply-with-big-objects-serialization-is-too-large-to-store-in-a-raw-vector