Question
I am developing an application that ingests data from .csv files and then performs some calculations on it. The challenge is that the .csv files can be very large. I have reviewed a number of posts here discussing the import of large .csv files using various functions and libraries. Some examples are below:
### size of csv file: 689.4MB (7,009,728 rows * 29 columns) ###
system.time(read.csv('../data/2008.csv', header = T))
# user system elapsed
# 88.301 2.416 90.716
library(data.table)
system.time(fread('../data/2008.csv', header = T, sep = ','))
# user system elapsed
# 4.740 0.048 4.785
library(bigmemory)
system.time(read.big.matrix('../data/2008.csv', header = T))
# user system elapsed
# 59.544 0.764 60.308
library(ff)
system.time(read.csv.ffdf(file = '../data/2008.csv', header = T))
# user system elapsed
# 60.028 1.280 61.335
library(sqldf)
system.time(read.csv.sql('../data/2008.csv'))
# user system elapsed
# 87.461 3.880 91.447
The challenge I am having is this: the .csv in question has its headers in the second row, and the first row is filled with useless information. My initial approach, which worked on smaller files (less than 5 MB), was to use the following code to strip the first row before importing:
# Read the raw lines, drop the useless first row, then parse the rest from memory
report_query_X_all_content = readLines("C:/Users/.../report_queryX_XXX-XXX-XXXX.csv")
skip_first = report_query_X_all_content[-1]
report_query_X = read.csv(textConnection(skip_first), header = TRUE, stringsAsFactors = FALSE)
Unfortunately, once the base file exceeds 70 or 80 MB, the import time seems to increase exponentially. Most of the functions I have been looking at, such as fread(), require you to pass in the .csv file directly; as you can see in my implementation, I instead pass skip_first through textConnection() after removing the unwanted row. The problem is the disproportionate lag for 70-80 MB files: I started one import for a 79 MB file nearly 55 minutes ago and it is still running. For context, skip_first occupies about 95 MB of internal memory, and my next import is about 785 MB. Does anyone have suggestions or recommendations on how to accomplish this with larger data files? Eventually this solution will be applied to .csv files as large as 1-4 GB, and I am worried that the textConnection() step is causing a bottleneck.
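For what it's worth, one way to confirm whether textConnection() is really the bottleneck is to time the read and parse steps separately; a minimal sketch reusing the same calls (the path is the placeholder from above):
path <- "C:/Users/.../report_queryX_XXX-XXX-XXXX.csv"
system.time(all_content <- readLines(path))   # step 1: raw read from disk
skip_first <- all_content[-1]                 # step 2: drop the first row
system.time(
  report_query_X <- read.csv(textConnection(skip_first),
                             header = TRUE, stringsAsFactors = FALSE)
)                                             # step 3: parse from memory via textConnection()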
Answer 1:
Here is the solution that I ended up going with, and it worked nicely:
start_time <- Sys.time()  # Calculate time diff on the big files
library(data.table)       # provides fread()
library(bit64)            # integer64 support for large integer columns
report_query_X <- fread('C:/Users/.../report_queryX_XXX-XXX-XXXX.csv', skip = 1, sep = ",")
end_time <- Sys.time()
time_diff <- end_time - start_time  # Calculate the time difference
# time_diff = 1.068 seconds
The total time taken for this implementation was 1.068 seconds for a 78.9 MB file, which is excellent. The skip argument to fread() made a huge difference. I did get a warning message when I originally used fread(), noting that:
Warning message:
In fread("C:/Users/.../report_queryX_XXX-XXX-XXXX.csv", :
Some columns have been read as type 'integer64' but package bit64 isn't loaded. Those columns will display as strange looking floating point data. There is no need to reload the data. Just require(bit64) to obtain the integer64 print method and print the data again.
This is why I ended up installing bit64 with install.packages("bit64") and then loading it with library(bit64).
Edit: Note that I just tried using this call on a 251MB file and the total import time was 1.844106 secs.
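As a follow-up, a quick sanity check after an import like this (assuming the same report_query_X object) is to confirm the dimensions and see which columns were read as integer64; with bit64 loaded they also print correctly:
dim(report_query_X)            # rows and columns actually imported
sapply(report_query_X, class)  # which columns came in as integer64
head(report_query_X)           # integer64 values print correctly once bit64 is loaded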
Answer 2:
I would use awk for this sort of problem. Awk can be called inside R like this:
system("awk '{if (NR!=1) {print}}' a.csv > a2.csv")
where a.csv is your original file and a2.csv is the same file with the first row removed.
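To tie it back to the import itself, the cleaned file can then be read with any of the fast readers. The sketch below assumes data.table; the cmd variant additionally assumes a reasonably recent data.table version that supports fread(cmd = ...):
library(data.table)

# Strip the first row with awk into a second file, then read that file
system("awk '{if (NR!=1) {print}}' a.csv > a2.csv")
dt <- fread("a2.csv")

# Alternatively, recent data.table versions can run the command themselves
# and read its output directly, with no intermediate file on disk
dt <- fread(cmd = "awk '{if (NR!=1) {print}}' a.csv")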
Source: https://stackoverflow.com/questions/24921387/long-lag-time-importing-large-csvs-in-r-with-header-in-second-row