Fast way to download a really big (14 million row) CSV from a zip file? unzip() and read_csv()/read.csv() never finish loading


Question


I am trying to download the dataset at the link below. It is about 14,000,000 rows long. I ran this code chunk and got stuck at unzip(). The code has been running for a really long time and my computer is running hot.

I tried a few different approaches that don't use unzip, and then I get stuck at the read.csv/vroom/read_csv step instead. Any ideas? This is a public dataset, so anyone can try.

library(vroom)

temp <- tempfile()
download.file("https://files.consumerfinance.gov/hmda-historic-loan-data/hmda_2017_nationwide_all-records_labels.zip", temp)

unzip(temp, "hmda_2017_nationwide_all-records_labels.csv")

df2017 <- vroom("hmda_2017_nationwide_all-records_labels.csv")

unlink(temp)


Answer 1:


Since the data set is quite large, here are two possible solutions:

With data.table (very fast, but only feasible if the data fits into memory):

require(data.table)

system('curl https://files.consumerfinance.gov/hmda-historic-loan-data/hmda_2017_nationwide_all-records_labels.zip > hmda_2017_nationwide_all-records_labels.zip && unzip hmda_2017_nationwide_all-records_labels.zip')

dat <- fread("hmda_2017_nationwide_all-records_labels.csv")
# System errno 22 unmapping file: Invalid argument
# Error in fread("hmda_2017_nationwide_all-records_labels.csv") : 
#   Opened 10.47GB (11237068086 bytes) file ok but could not memory map it.
# This is a 64bit process. There is probably not enough contiguous virtual memory available.
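
If fread() cannot memory-map the whole file like this, one workaround is to read it in fixed-size slices with skip/nrows, so each call only maps a small window. A sketch, not a tested recipe: it assumes a shell with wc is available and that the final bound table still fits in RAM (otherwise filter or aggregate each slice instead of keeping it whole):

library(data.table)

csv <- "hmda_2017_nationwide_all-records_labels.csv"

header  <- names(fread(csv, nrows = 0))      # column names only, no data rows
n_lines <- as.integer(system(paste("wc -l <", csv), intern = TRUE))

chunk  <- 1e6                                # rows per slice
starts <- seq(1, n_lines - 1, by = chunk)    # line 1 is the header

parts <- lapply(starts, function(s) {
  # each call maps only a ~1e6-row window of the file; filter or
  # aggregate here if the full table will not fit in memory
  fread(csv, skip = s, nrows = chunk, header = FALSE, col.names = header)
})

dat <- rbindlist(parts)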

With readLines (read the data step-wise):

f <- file("./hmda_2017_nationwide_all-records_labels.csv", "r")

# read the header line; split on the quoted separator "," first,
# then on plain commas for any unquoted fields
header <- unlist(strsplit(unlist(strsplit(readLines(f, n = 1), "\",\"")), ","))

# read the first 100 data rows and split each line into fields;
# transposing turns the per-line splits into rows of the data frame
dd <- as.data.frame(t(data.frame(strsplit(readLines(f, n = 100), "\",\""))))
colnames(dd) <- header
rownames(dd) <- 1:nrow(dd)

Repeat and append to the data frame as needed (a single-pass loop version is sketched after this recipe):

# next 10 lines, parsed the same way and appended
de <- t(as.data.frame(strsplit(readLines(f, n = 10), "\",\"")))
colnames(de) <- header
dd <- rbind(dd, de)
rownames(dd) <- 1:nrow(dd)

close(f)
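
If you do want to loop over all ~14 million rows this way, note that rbind() inside a loop re-copies the growing data frame on every pass. A sketch of the usual list-then-bind pattern, under the same naive assumption as above that every line splits cleanly on the quoted separator and yields the same field count as the header:

f <- file("./hmda_2017_nationwide_all-records_labels.csv", "r")
header <- unlist(strsplit(unlist(strsplit(readLines(f, n = 1), "\",\"")), ","))

chunks <- list()
i <- 1
repeat {
  lines <- readLines(f, n = 100000)   # next block of raw lines
  if (length(lines) == 0) break       # end of file
  # one character matrix per block; assumes equal field counts per line
  chunks[[i]] <- do.call(rbind, strsplit(lines, "\",\""))
  i <- i + 1
}
close(f)

dd <- as.data.frame(do.call(rbind, chunks))
colnames(dd) <- header               # assumes length(header) matches the field count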

Use seek() to jump around within the file rather than reading it front to back.
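
For example (a sketch: seek() takes a byte offset, not a row number, and is documented as unreliable on Windows):

f <- file("./hmda_2017_nationwide_all-records_labels.csv", "r")
seek(f, where = 5e9)            # jump ~5 GB into the file (byte offset)
invisible(readLines(f, n = 1))  # discard the partial line we landed in
batch <- readLines(f, n = 100)  # the next 100 complete lines
close(f)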




Answer 2:


I was able to download the file to my computer first, and then use vroom (https://vroom.r-lib.org/) to load it without unzipping it:

library(vroom)
df2017 <- vroom("hmda_2017_nationwide_all-records_labels.zip")

I get a warning about possible truncation (and indeed the row count below is well short of the ~14 million the question mentions), but the object has these dimensions:

> dim(df2017)
[1] 5448288      78

One nice thing about vroom is that it doesn't load the data straight into memory: it indexes the file and only materializes values as they are accessed.
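
Because nothing is materialized until you touch it, you can cut memory use further by selecting only the columns you care about up front. A small sketch (the column names here are illustrative, not checked against the file; see names(df2017) for the real ones):

library(vroom)

# col_select limits which columns are ever parsed and materialized
df_small <- vroom("hmda_2017_nationwide_all-records_labels.zip",
                  col_select = c(as_of_year, state_abbr))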



Source: https://stackoverflow.com/questions/65401851/fast-way-to-download-a-really-big-14-million-row-csv-from-a-zip-file-unzip-an
