Load a small random sample from a large csv file into R data frame

Submitted by 一笑奈何 on 2019-11-27 05:47:25

Question


The CSV file to be processed does not fit into memory. How can I read ~20,000 random lines from it and run basic statistics on the resulting data frame?


Answer 1:


You can also just do it in the terminal with Perl.

perl -ne 'print if (rand() < .01)' biglist.txt > subset.txt

This won't necessarily get you exactly 20,000 lines. (Here it keeps each line with probability .01, i.e. about 1% of the total.) It will, however, be very fast, and you keep both the original file and the sample in your directory. You can then load the smaller file into R however you want.
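If you need exactly a fixed number of lines rather than a ~1% sample, GNU `shuf` (an assumption: it is available on your system, as it is on most Linux installs) can draw them in a single streaming pass. A small sketch with a generated demo file; `biglist_demo.txt` and the counts are illustrative only:

```shell
# Build a 1000-line demo file, then draw exactly 5 random lines from it.
seq 1 1000 > biglist_demo.txt
shuf -n 5 biglist_demo.txt > subset_demo.txt
wc -l < subset_demo.txt
```

On a real multi-gigabyte file you would replace the demo names with your own file and `-n 20000`, then load the subset into R as before.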




Answer 2:


Try this, based on examples 6e and 6f on the sqldf GitHub home page:

library(sqldf)
DF <- read.csv.sql("x.csv", sql = "select * from file order by random() limit 20000")

See ?read.csv.sql and pass other arguments as needed for the particulars of your file.




Answer 3:


This should work:

RowsInCSV <- 10000000  # or however many rows there are

List <- lapply(1:20000, function(x)
  read.csv("YourFile.csv", nrows = 1, skip = sample(RowsInCSV, 1), header = FALSE))
DF <- do.call(rbind, List)

Note that this reopens and rescans the file once per sampled row, so it is slow for 20,000 draws, and because each row number is drawn independently it samples with replacement, so duplicate rows are possible.
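A faster variant of the same idea (a sketch, not the original answer's code) is to pre-sample the row numbers once, then stream the file in chunks and keep only the sampled lines, reading the file a single time. The demo below builds a small CSV so it is self-contained; `demo.csv`, the chunk size, and the counts are illustrative assumptions:

```r
# Write a small demo CSV: a header plus 1000 data rows.
write.csv(data.frame(x = 1:1000, y = rnorm(1000)), "demo.csv", row.names = FALSE)

n_rows <- 1000                           # total data rows (assumed known)
n_keep <- 100                            # sample size
keep   <- sort(sample(n_rows, n_keep))   # distinct row numbers, sorted

con    <- file("demo.csv", "r")
header <- readLines(con, n = 1)          # keep the header line
out    <- character(n_keep)
seen   <- 0L; got <- 0L
repeat {
  chunk <- readLines(con, n = 250)       # stream in chunks to bound memory
  if (length(chunk) == 0L) break
  idx <- keep[keep > seen & keep <= seen + length(chunk)] - seen
  if (length(idx) > 0L) {
    out[got + seq_along(idx)] <- chunk[idx]
    got <- got + length(idx)
  }
  seen <- seen + length(chunk)
}
close(con)

DF <- read.csv(text = paste(c(header, out), collapse = "\n"))
nrow(DF)   # 100
```

Sampling without replacement up front also guarantees no duplicate rows, which the one-row-at-a-time version above cannot.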



Answer 4:


The following can be used if your data have an ID column (or something similar): take a sample of IDs, then subset the data to the sampled IDs. Note that this assumes the data are already loaded into memory.

sampleids <- sample(data$id, 1000)
newdata <- subset(data, id %in% sampleids)


Source: https://stackoverflow.com/questions/22261082/load-a-small-random-sample-from-a-large-csv-file-into-r-data-frame
