Question
I have a 700MB .dta Stata file with 28 million observations and 14 variables.
When I attempt to import it into R using the foreign package's read.dta() function, I run out of RAM on my 8GB machine (page-outs shoot into the GBs very quickly).
staph <- read.dta("Staph_1999_2010.dta")
I hunted around and it sounds like a more efficient alternative would be to use the Stata.file() function from the memisc package.
When I call:
staph <- Stata.file("Staph_1999_2010.dta")
I get a segfault:
*** caught segfault ***
address 0xd5d2b920, cause 'memory not mapped'
Traceback:
1: .Call("dta_read_labels", bf, lbllen, padding)
2: dta.read.labels(bf, len.lbl, 3)
3: get.dictionary.dta(dta)
4: Stata.file("Staph_1999_2010.dta")
I find the documentation for Stata.file() difficult to follow.
(1) Am I using Stata.file() correctly?
(2) Does Stata.file() return a data frame like read.dta() does?
(3) If I'm using Stata.file() correctly, how can I fix the error I'm getting?
Answer 1:
If you have access to Stata, one solution is to export the .dta to .csv in Stata:
use "file.dta"
export delimited using "file.csv", replace
Then import it into R using read.csv() or data.table::fread().
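A minimal sketch of the R side, assuming Stata has exported the data to a CSV (the tiny temp file and its column names here are only stand-ins for the real export, e.g. "Staph_1999_2010.csv"):

```r
library(data.table)

# Stand-in for the CSV that Stata's `export delimited` would produce.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,year,count", "1,1999,5", "2,2010,7"), tmp)

# fread() is much faster and more memory-efficient than read.csv()
# on tens of millions of rows.
dt <- fread(tmp)

# If memory is still tight, read only the columns you need:
dt_sub <- fread(tmp, select = c("id", "count"))

print(dt)
```

fread() returns a data.table (which is also a data.frame), so downstream code written for read.dta() output usually works unchanged.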
Other ideas:
- Consider sampling a subset of the data using Stata's sample command.
- Stata's compress command attempts a lossless compression by changing variable types (not sure it would save much for the .csv and R, though).
- Pack the data tight by converting any dates or string IDs to integers, if possible.
- Use a cloud instance for a one-time import and initial cleansing, before sampling or keeping only the important part.
- Get more RAM...
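The "pack the data tight" idea above can be sketched in R as follows (the column names are hypothetical; the real dataset's columns are unknown):

```r
# Toy data.frame standing in for the imported dataset.
df <- data.frame(
  admit_date  = c("1999-01-05", "2010-12-31"),  # dates stored as strings
  hospital_id = c("HOSP-A", "HOSP-B"),          # repeated string IDs
  stringsAsFactors = FALSE
)
before <- object.size(df)

# Dates become integer day counts (days since 1970-01-01);
# string IDs become small integer codes via factor levels.
df$admit_date  <- as.integer(as.Date(df$admit_date))
df$hospital_id <- as.integer(factor(df$hospital_id))

after <- object.size(df)
cat("before:", format(before), "; after:", format(after), "\n")
```

On 28 million rows the savings from replacing character vectors with integers can be substantial, since each distinct string carries per-element overhead that a 4-byte integer does not.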
Source: https://stackoverflow.com/questions/19028165/using-memisc-to-import-stata-dta-file-into-r