问题
I have a large csv file and would like to read only certain lines, defined by a vector of row numbers to be read. Is there any way to read these rows without reading the whole csv into memory?
The only solutions I've found seem to allow reading consecutive lines (e.g. lines 2-100).
回答1:
A simple example of how you might combine the sed approach I linked to into an R function:
read_rows <- function(file,rows,...){
tmp <- tempfile()
row_cmd <- paste(paste(rows,"p",sep = ""),collapse = ";")
cmd <- sprintf(paste0("sed -n '",row_cmd,"' %s > %s"),file,tmp)
system(command = cmd)
read.table(file = tmp,...)
}
write.csv(x = mtcars,file = "~/Desktop/scratch/mtcars.csv")
> read_rows(file = "~/Desktop/scratch/mtcars.csv",rows = c(3,6,7),sep = ",",header = FALSE,row.names = 1)
V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
> read_rows(file = "~/Desktop/scratch/mtcars.csv",rows = c(1,5,9),sep = ",",header = TRUE,row.names = 1)
mpg cyl disp hp drat wt qsec vs am gear carb
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Note the difference with row 1 as the column headers.
回答2:
sqldf will read it into a database (which it will create and then delete for you) and then read only the rows you want into R. Assuming the csv file created in the Note at the end define the desired Rows and then use read.csv.sql. We have used a temporary file for the database but if the data is sufficiently small you can omit the dbname argument and it will use memory.
library(sqldf)
Rows <- c(3, 5, 10)
s <- toString(Rows)
fn$read.csv.sql("Letters.csv", "select * from file where rowid in ($s)",
dbname = tempfile())
giving:
X Letters
1 "3" "c"
2 "5" "e"
3 "10" "j"
If the number of rows desired is very large then rather than embedding the row numbers in the SQL statement create a data frame from them and join it:
library(sqldf)
Rows <- c(3, 5, 10)
RowsDF <- data.frame(Rows)
s <- toString(Rows)
fn$read.csv.sql("Letters.csv",
"select file.* from file join RowsDF on file.rowid = RowsDF.Rows",
dbname = tempfile())
Note
Letters <- data.frame(Letters = letters, stringsAsFactors = FALSE)
write.csv(Letters, "Letters.csv")
来源:https://stackoverflow.com/questions/57481017/read-specific-non-consecutive-rows-from-csv