Fast way to download a really big (14 million row) CSV from a zip file? unzip() and read_csv()/read.csv() never finish loading


Question


I am trying to download the dataset at the link below. It is about 14,000,000 rows long. I ran this code chunk and got stuck at unzip(). The code has been running for a really long time and my computer is running hot.

I tried a few different approaches that don't use unzip, and then I get stuck at the read.csv/vroom/read_csv step instead. Any ideas? This is a public dataset, so anyone can try.

library(vroom)

temp <- tempfile()
download.file("https://files.consumerfinance.gov/hmda-historic-loan-data/hmda_2017_nationwide_all-records_labels.zip", temp)

unzip(temp, "hmda_2017_nationwide_all-records_labels.csv")

df2017 <- vroom("hmda_2017_nationwide_all-records_labels.csv")

unlink(temp)


Answer 1:


Since the data set is quite large, here are two possible solutions:

With data.table (very fast, but only feasible if the data fits into memory):

require(data.table)

system('curl https://files.consumerfinance.gov/hmda-historic-loan-data/hmda_2017_nationwide_all-records_labels.zip > hmda_2017_nationwide_all-records_labels.zip && unzip hmda_2017_nationwide_all-records_labels.zip')

dat <- fread("hmda_2017_nationwide_all-records_labels.csv")
# System errno 22 unmapping file: Invalid argument
# Error in fread("hmda_2017_nationwide_all-records_labels.csv") : 
#   Opened 10.47GB (11237068086 bytes) file ok but could not memory map it.
# This is a 64bit process. There is probably not enough contiguous virtual memory available.
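
If fread() cannot memory-map the whole file like this, one workaround is to read it in fixed-size slices with skip/nrows, so each call only maps a small window. A sketch, not a tested recipe: it assumes a shell with wc is available and that the final bound table still fits in RAM (otherwise filter or aggregate each slice instead of keeping it whole):

library(data.table)

csv <- "hmda_2017_nationwide_all-records_labels.csv"

header  <- names(fread(csv, nrows = 0))      # column names only, no data rows
n_lines <- as.integer(system(paste("wc -l <", csv), intern = TRUE))

chunk  <- 1e6                                # rows per slice
starts <- seq(1, n_lines - 1, by = chunk)    # line 1 is the header

parts <- lapply(starts, function(s) {
  # each call maps only a ~1e6-row window of the file; filter or
  # aggregate here if the full table will not fit in memory
  fread(csv, skip = s, nrows = chunk, header = FALSE, col.names = header)
})

dat <- rbindlist(parts)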

With readLines (read the data step-wise):

f <- file("./hmda_2017_nationwide_all-records_labels.csv", "r")

# read the header line; split on the quoted separator "," first,
# then on plain commas for any unquoted fields
header <- unlist(strsplit(unlist(strsplit(readLines(f, n = 1), "\",\"")), ","))

# read the first 100 data rows and split each line into fields;
# transposing turns the per-line splits into rows of the data frame
dd <- as.data.frame(t(data.frame(strsplit(readLines(f, n = 100), "\",\""))))
colnames(dd) <- header
rownames(dd) <- 1:nrow(dd)

Repeat and append to the data frame as needed (a single-pass loop version is sketched after this recipe):

# next 10 lines, parsed the same way and appended
de <- t(as.data.frame(strsplit(readLines(f, n = 10), "\",\"")))
colnames(de) <- header
dd <- rbind(dd, de)
rownames(dd) <- 1:nrow(dd)

close(f)
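
If you do want to loop over all ~14 million rows this way, note that rbind() inside a loop re-copies the growing data frame on every pass. A sketch of the usual list-then-bind pattern, under the same naive assumption as above that every line splits cleanly on the quoted separator and yields the same field count as the header:

f <- file("./hmda_2017_nationwide_all-records_labels.csv", "r")
header <- unlist(strsplit(unlist(strsplit(readLines(f, n = 1), "\",\"")), ","))

chunks <- list()
i <- 1
repeat {
  lines <- readLines(f, n = 100000)   # next block of raw lines
  if (length(lines) == 0) break       # end of file
  # one character matrix per block; assumes equal field counts per line
  chunks[[i]] <- do.call(rbind, strsplit(lines, "\",\""))
  i <- i + 1
}
close(f)

dd <- as.data.frame(do.call(rbind, chunks))
colnames(dd) <- header               # assumes length(header) matches the field count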

Use seek() to jump around within the file rather than reading it front to back.
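
For example (a sketch: seek() takes a byte offset, not a row number, and is documented as unreliable on Windows):

f <- file("./hmda_2017_nationwide_all-records_labels.csv", "r")
seek(f, where = 5e9)            # jump ~5 GB into the file (byte offset)
invisible(readLines(f, n = 1))  # discard the partial line we landed in
batch <- readLines(f, n = 100)  # the next 100 complete lines
close(f)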




Answer 2:


I was able to download the file to my computer first, and then use vroom (https://vroom.r-lib.org/) to load it without unzipping it:

library(vroom)
df2017 <- vroom("hmda_2017_nationwide_all-records_labels.zip")

I get a warning about possible truncation (and indeed the row count below is well short of the ~14 million the question mentions), but the object has these dimensions:

> dim(df2017)
[1] 5448288      78

One nice thing about vroom is that it doesn't load the data straight into memory: it indexes the file and only materializes values as they are accessed.
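
Because nothing is materialized until you touch it, you can cut memory use further by selecting only the columns you care about up front. A small sketch (the column names here are illustrative, not checked against the file; see names(df2017) for the real ones):

library(vroom)

# col_select limits which columns are ever parsed and materialized
df_small <- vroom("hmda_2017_nationwide_all-records_labels.zip",
                  col_select = c(as_of_year, state_abbr))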



Source: https://stackoverflow.com/questions/65401851/fast-way-to-download-a-really-big-14-million-row-csv-from-a-zip-file-unzip-an
