Long lag time importing large CSVs in R with the header in the second row

Submitted by 天涯浪子 on 2019-12-05 08:45:39
Nathaniel Payne

Here is the solution I ended up going with, which worked nicely:

start_time <- Sys.time() # Start timing the import of the big file

library(data.table) # Provides fread()
library(bit64)      # Provides the integer64 print method

# skip = 1 ignores the first line, so the header is read from the second row
report_query_X <- fread('C:/Users/.../report_queryX_XXX-XXX-XXXX.csv', skip = 1, sep = ",")

end_time <- Sys.time() # Stop timing
time_diff <- end_time - start_time # Calculate the time difference
# time_diff = 1.068 seconds

The import took 1.068 seconds for a 78.9 MB file, which is excellent. Passing skip = 1 to fread() made a huge difference. When I first used fread(), I did get the following warning:

Warning message:
In fread("C:/Users/.../report_queryX_XXX-XXX-XXXX.csv",  :
  Some columns have been read as type 'integer64' but package bit64 isn't loaded. Those columns will display as strange looking floating point data. There is no need to reload the data. Just require(bit64) to obtain the integer64 print method and print the data again.

This is why I ended up installing bit64 with install.packages("bit64") and then loading it with library(bit64).
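
If loading bit64 is not an option, fread() also has an integer64 argument that controls how such large integer columns are read. The following is only a minimal sketch under assumptions: the column names and values are made up for illustration, and it assumes the large values are identifiers that can safely be kept as text.

```r
library(data.table)

# Hypothetical data: the id values are too large for R's 32-bit integer type
csv_text <- "id,value\n9007199254740993,1\n9007199254740994,2\n"

# Assumption: the ids are only used as labels, so reading them as text is fine.
# The default, integer64 = "integer64", stores them as bit64::integer64 instead.
dt <- fread(text = csv_text, integer64 = "character")

print(class(dt$id))
```

Reading the column as character avoids the bit64 dependency but rules out arithmetic on it, so it is a trade-off rather than a general replacement for library(bit64).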


Edit: I just tried this call on a 251 MB file, and the total import time was 1.844106 seconds.
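
The skip approach above can be tried on a small file before pointing it at a large export. This is a sketch only: the temporary file and its contents are hypothetical stand-ins for a report whose real header sits on the second line, and system.time() is used here as a compact alternative to subtracting two Sys.time() values.

```r
library(data.table)

# Hypothetical sample file: a junk first line, with the real header on line 2
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Report generated by some export tool",
  "id,name,value",
  "1,alpha,10.5",
  "2,beta,20.25"
), tmp)

# skip = 1 drops the first line, so line 2 is used as the header
dt <- fread(tmp, skip = 1, sep = ",")

# Time a second read of the same file; "elapsed" is wall-clock seconds
elapsed <- system.time(fread(tmp, skip = 1, sep = ","))[["elapsed"]]

print(names(dt))
print(elapsed)
```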

I would use awk for this sort of problem. awk can be called from within R like this:

system("awk '{if (NR!=1) {print}}' a.csv > a2.csv")

where a.csv is your sample file and a2.csv is the same file with the first row removed.
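
awk is not always available (on a stock Windows machine, for example), so the same first-row removal can be sketched in base R. This is an illustrative alternative, not the method from the answer above: the temporary files stand in for a.csv and a2.csv, and because readLines() loads the whole file into memory it may be noticeably slower than awk on very large files.

```r
# Hypothetical stand-ins for a.csv (input) and a2.csv (output)
src <- tempfile(fileext = ".csv")
dst <- tempfile(fileext = ".csv")
writeLines(c("junk first line", "id,value", "1,10", "2,20"), src)

# Read every line, drop the first one, and write the rest to the new file
lines <- readLines(src)
writeLines(lines[-1], dst)

# The output file now starts with the real header row
result <- readLines(dst)
print(result)
```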
