Question
I'm doing some analysis in R where I need to work with some large datasets (10-20 GB) stored as .csv files, which I read with the read.csv function.
As I will also need to merge and transform the large .csv files with other data frames, I don't have the computing power or memory to import the entire file.
I was wondering if anyone knows of a way to import a random percentage of the csv.
I have seen some examples where people have imported the entire file and then used a separate function to create another data frame that is a sample of the original; however, I am hoping for something a little less intensive.
Answer 1:
I don't think there is a good R tool for reading a file at random (perhaps it could be added as an extension to read.table or fread from the data.table package).
Using perl, you can do this easily. For example, to read roughly 1% of your file's lines at random:
xx <- system(paste("perl -ne 'print if (rand() < .01)'", big_file), intern = TRUE)
Here I am calling perl from R using system. xx now contains roughly 1% of your file's lines, as a character vector.
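Note that system(..., intern = TRUE) returns a character vector of lines, not a data frame. A minimal sketch of parsing it with read.csv (this assumes the header line happened to survive the random sampling, which is not guaranteed; see the header-preserving variant after the function below):

# Stitch the sampled lines back together and let read.csv parse them.
# header = TRUE assumes the original header line was kept by the sampler.
df <- read.csv(text = paste(xx, collapse = "\n"), header = TRUE)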
You can wrap all this in a function:
read_partial_rand <- function(big_file, percent) {
  # 'percent' is a proportion between 0 and 1 (e.g. 0.01 for 1%)
  cmd <- paste0("perl -ne 'print if (rand() < ", percent, ")'")
  cmd <- paste(cmd, big_file)
  system(cmd, intern = TRUE)
}
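A usage sketch with a hypothetical file name. Because rand() is applied to every line, the header is only kept by chance; a common fix is perl's $. variable (the current line number), which lets you always print line 1:

# Sample ~1% of a hypothetical big_data.csv (percent is a proportion).
lines <- read_partial_rand("big_data.csv", 0.01)

# Header-preserving variant: $. == 1 always prints the first line,
# so read.csv can recover the column names from the sample.
cmd <- paste("perl -ne 'print if ($. == 1 || rand() < .01)'", "big_data.csv")
lines_with_header <- system(cmd, intern = TRUE)
df <- read.csv(text = paste(lines_with_header, collapse = "\n"))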
Source: https://stackoverflow.com/questions/27981460/importing-and-extracting-a-random-sample-from-a-large-csv-in-r