Read specific (non-consecutive) rows from csv

半城伤御伤魂 提交于 2021-01-27 13:07:05

问题


I have a large csv file and would like to read only certain lines, defined by a vector of row numbers to be read. Is there any way to read these rows without reading the whole csv into memory?

The only solutions I've found seem to allow reading consecutive lines (e.g. lines 2-100).


回答1:


A simple example of how you might combine the sed approach I linked to into an R function:

read_rows <- function(file,rows,...){
  tmp <- tempfile()
  row_cmd <- paste(paste(rows,"p",sep = ""),collapse = ";")
  cmd <- sprintf(paste0("sed -n '",row_cmd,"' %s > %s"),file,tmp)
  system(command = cmd)
  read.table(file = tmp,...)
}

write.csv(x = mtcars,file = "~/Desktop/scratch/mtcars.csv")

    > read_rows(file = "~/Desktop/scratch/mtcars.csv",rows = c(3,6,7),sep = ",",header = FALSE,row.names = 1)
                    V2 V3  V4  V5   V6    V7    V8 V9 V10 V11 V12
Mazda RX4 Wag     21.0  6 160 110 3.90 2.875 17.02  0   1   4   4
Hornet Sportabout 18.7  8 360 175 3.15 3.440 17.02  0   0   3   2
Valiant           18.1  6 225 105 2.76 3.460 20.22  1   0   3   1
> read_rows(file = "~/Desktop/scratch/mtcars.csv",rows = c(1,5,9),sep = ",",header = TRUE,row.names = 1)
                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2

Note the difference with row 1 as the column headers.




回答2:


sqldf will read it into a database (which it will create and then delete for you) and then read only the rows you want into R. Assuming the csv file created in the Note at the end define the desired Rows and then use read.csv.sql. We have used a temporary file for the database but if the data is sufficiently small you can omit the dbname argument and it will use memory.

library(sqldf)

Rows <- c(3, 5, 10)

s <- toString(Rows)
fn$read.csv.sql("Letters.csv", "select * from file where rowid in ($s)",
   dbname = tempfile())

giving:

     X Letters
1  "3"     "c"
2  "5"     "e"
3 "10"     "j"

If the number of rows desired is very large then rather than embedding the row numbers in the SQL statement create a data frame from them and join it:

library(sqldf)

Rows <- c(3, 5, 10)
RowsDF <- data.frame(Rows)

s <- toString(Rows)
fn$read.csv.sql("Letters.csv", 
  "select file.* from file join RowsDF on file.rowid = RowsDF.Rows",
  dbname = tempfile())

Note

Letters <- data.frame(Letters = letters, stringsAsFactors = FALSE)
write.csv(Letters, "Letters.csv")


来源:https://stackoverflow.com/questions/57481017/read-specific-non-consecutive-rows-from-csv

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!