Ways to read only select columns from a file into R? (A happy medium between `read.table` and `scan`?) [duplicate]

百般思念 submitted on 2019-12-17 15:16:36

Question


I have some very big delimited data files and I want to process only certain columns in R without taking the time and memory to create a data.frame for the whole file.

The only options I know of are read.table which is very wasteful when I only want a couple of columns or scan which seems too low level for what I want.

Is there a better option, either with pure R or perhaps by calling out to a shell script to do the column extraction and then using scan or read.table on its output? (Which leads to the question of how to call a shell script and capture its output in R.)


Answer 1:


Sometimes I do something like this when I have the data in a tab-delimited file:

df <- read.table(pipe("cut -f1,5,28 myFile.txt"))

That lets cut do the column selection, which it can do without using much memory at all.

See the question "Only read limited number of columns" for a pure R version, which uses "NULL" in the colClasses argument to read.table.




Answer 2:


One possibility is to use pipe() in lieu of the filename and have awk or similar filters extract only the columns you want.

See help(connection) for more on pipe and friends.
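
As a rough sketch of the pipe() idea, assuming a Unix-like system with awk on the PATH: the sample file, its contents, and the column names id and value below are made up for illustration and are not from the original question.

```r
# Write a small tab-delimited sample file (hypothetical data).
tmp <- tempfile(fileext = ".txt")
writeLines(c("1\ta\t10.5\tx",
             "2\tb\t20.5\ty",
             "3\tc\t30.5\tz"), tmp)

# awk prints only fields 1 and 3, so R never sees the other columns.
cmd <- sprintf("awk -F'\\t' 'BEGIN { OFS = \"\\t\" } { print $1, $3 }' %s",
               shQuote(tmp))

# read.table opens the pipe, reads the filtered stream, and closes it.
df <- read.table(pipe(cmd), sep = "\t", col.names = c("id", "value"))
str(df)
```

shQuote() is there so that a file path containing spaces does not break the shell command.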

Edit: read.table() can also do this for you if you are very explicit about colClasses -- a value of "NULL" (the character string, not the NULL object) for a given column skips that column altogether. See help(read.table). So there we have a solution in base R without additional packages or tools.
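
A minimal sketch of the colClasses approach, using a made-up four-column file (the column names and data are assumptions for illustration only):

```r
# Write a small tab-delimited sample file with a header (hypothetical data).
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\tname\tscore\tflag",
             "1\ta\t10.5\tTRUE",
             "2\tb\t20.5\tFALSE"), tmp)

# One entry per column in the file, in order.
# The string "NULL" drops a column; a class name keeps it with that type.
classes <- c("integer", "NULL", "numeric", "NULL")

df <- read.table(tmp, header = TRUE, sep = "\t", colClasses = classes)
str(df)  # only id and score are kept
```

Note that read.table still has to read every line of the file, so this mainly saves memory for the skipped columns rather than avoiding the read entirely.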




Answer 3:


I think Dirk's approach is straightforward as well as fast. An alternative that I've used is to load the data into SQLite, which loads MUCH faster than read.table(), and then pull out only what you want. The sqldf package makes this all quite easy. A previous Stack Overflow answer gives code examples for sqldf.
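
For a rough idea of what this can look like, here is a sketch using read.csv.sql() from sqldf, assuming the package is installed. The sample file, its columns, and the query are invented for illustration; inside the SQL statement the input file is referred to as the table named file.

```r
# Only run if sqldf is available (assumption: it may not be installed).
if (requireNamespace("sqldf", quietly = TRUE)) {
  library(sqldf)

  # Write a small comma-separated sample file (hypothetical data).
  tmp <- tempfile(fileext = ".csv")
  write.csv(data.frame(id = 1:3,
                       name = c("a", "b", "c"),
                       score = c(10.5, 20.5, 30.5)),
            tmp, row.names = FALSE, quote = FALSE)

  # The file is loaded into a temporary SQLite database, and only the
  # selected columns and rows are returned to R as a data.frame.
  df <- read.csv.sql(tmp, sql = "select id, score from file where score > 15")
  print(df)
}
```

The file is written with quote = FALSE because read.csv.sql does not strip quotes from fields by default.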




Answer 4:


This is probably more than you need, but if you're operating on very large data sets then you might also have a look at the HadoopStreaming package which provides a map-reduce routine using Hadoop.



Source: https://stackoverflow.com/questions/2193742/ways-to-read-only-select-columns-from-a-file-into-r-a-happy-medium-between-re
