Counting rows with fread without reading the whole file [duplicate]

Posted by 99封情书 on 2019-12-01 13:09:12

Question


I want to use data.table to process a very big file that doesn't fit in memory. I've thought of reading the file in chunks using a loop, increasing the skip parameter appropriately on each iteration:

fread("myfile.csv", skip=loopindex, nrows=chunksize) 

then processing each of these chunks and appending the resulting output with fwrite.

In order to do it properly I need to know the total number of rows, without reading the whole file.

What's the proper/faster way to do it?

I can only think of reading just the first column, but maybe there is a special command or trick, or maybe there is an automatic way to detect the end of the file.


Answer 1:


1) count.fields Not sure if count.fields reads the whole file into R at once. Try it to see if it works.

length(count.fields("myfile.csv", sep = ","))

If the file has a header subtract one from the above.
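As a minimal sketch of the header adjustment, the snippet below builds a small temporary file (the file contents and the names path, fields and n_rows are made up for illustration) and counts its data rows. Note that count.fields returns an integer vector with one element per line, so the result itself grows with the number of lines.

```r
# Hypothetical demo file: a header line plus three data rows.
path <- tempfile(fileext = ".csv")
writeLines(c("id,value", "1,a", "2,b", "3,c"), path)

# One entry per line, giving the number of fields found on that line.
fields <- count.fields(path, sep = ",")

# Subtract one for the header line to get the number of data rows.
n_rows <- length(fields) - 1
n_rows
```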

2) sqldf Another possibility is:

library(sqldf)
read.csv.sql("myfile.csv", sep = ",", sql = "select count(*) from file")

You may need other arguments as well depending on header, etc. Note that this does not read the file into R at all -- only into sqlite.

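3) wc Use the system command wc, which is available on Unix-like systems (on Windows it comes with Rtools, see below). Note that shell() is only available in R on Windows; elsewhere use system() with the same arguments.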

shell("wc -l myfile.csv", intern = TRUE)

or to directly get the number of lines in the file

read.table(pipe("wc -l myfile.csv"))[[1]]

or

read.table(text = shell("wc -l myfile.csv", intern = TRUE))[[1]]

Again, if there is a header subtract one.
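The wc calls can be wrapped in a small helper. This is only a sketch: the function name count_lines_wc and the demo file are assumptions, and it requires wc to be on the PATH. Keep in mind that wc -l counts newline characters, so a file whose last line lacks a trailing newline is reported as one line short.

```r
# Sketch: count lines by calling the external `wc` program.
# Assumes `wc` is on the PATH (Unix-like systems, or Windows with Rtools).
count_lines_wc <- function(path) {
  # shQuote() protects paths containing spaces or shell metacharacters.
  out <- system2("wc", c("-l", shQuote(path)), stdout = TRUE)
  # Output looks like "  1234 myfile.csv"; keep only the leading number.
  as.numeric(strsplit(trimws(out), "[[:space:]]+")[[1]][1])
}

# Hypothetical demo file: a header line plus three data rows.
demo_path <- tempfile(fileext = ".csv")
writeLines(c("id,value", "1,a", "2,b", "3,c"), demo_path)

# Subtract one for the header line.
n_data_rows <- count_lines_wc(demo_path) - 1
```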

If you are on Windows be sure that Rtools is installed and use this:

read.table(pipe("C:\\Rtools\\bin\\wc -l myfile.csv"))[[1]]

Alternately on Windows without Rtools try this:

read.table(pipe('find /v /c "" myfile.csv'))[[3]]

See How to count no of lines in text file and store the value into a variable using batch script?




Answer 2:


The answer by @G. Grothendieck about using wc -l is a good one, if you can rely on it being present.

You might also want to look into iterating through the file in chunks, e.g. by employing something like this answer that only relies on base R functions.

Since you don't need to read single lines, you can read in a batch from a connection. For instance:

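count_lines = function(filepath, batch) {
    con = file(filepath, "r")
    on.exit(close(con))  # close the connection even if reading fails
    n = 0
    while (TRUE) {
        # read up to `batch` lines; fewer come back only at end of file
        lines = readLines(con, n = batch)
        present = length(lines)
        n = n + present
        if (present < batch) {
            break
        }
    }
    return(n)
}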

Then you could count the lines by reading the file, say, 1,000 lines at a time:

count_lines("filename.txt", 1000)
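The same connection-based pattern can also do the chunk processing itself, in which case the total row count is never needed: the loop simply stops when a read comes back short. The sketch below is an illustration rather than part of the original answer. The name process_in_chunks and its arguments are hypothetical, and it assumes the first line is a header and that no quoted field contains an embedded newline.

```r
# Sketch: process a CSV file chunk by chunk from an open connection.
# Assumptions: the first line is a header, and no quoted field spans lines.
process_in_chunks <- function(filepath, chunk_size, fun) {
  con <- file(filepath, "r")
  on.exit(close(con))                    # always release the connection
  header <- readLines(con, n = 1)        # keep the header for every chunk
  rows_done <- 0
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break        # nothing left to read
    # Re-attach the header so each chunk parses with the right column names.
    chunk <- read.csv(text = c(header, lines), stringsAsFactors = FALSE)
    fun(chunk)                           # user-supplied processing step
    rows_done <- rows_done + nrow(chunk)
    if (length(lines) < chunk_size) break  # a short read means end of file
  }
  rows_done                              # number of data rows processed
}

# Hypothetical demo: ten data rows processed four at a time.
demo_file <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:10), demo_file, row.names = FALSE)
total <- 0
rows_seen <- process_in_chunks(demo_file, 4, function(d) total <<- total + sum(d$x))
```

Because the connection stays open between reads, each chunk continues where the previous one stopped, so earlier lines are not rescanned the way they are with an increasing skip value.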


Source: https://stackoverflow.com/questions/39691133/counting-rows-with-fread-without-reading-the-whole-file
