Question
I have a 700MB .dta Stata file with 28 million observations and 14 variables.
When I attempt to import it into R using the foreign package's read.dta() function, I run out of RAM on my 8GB machine (page-outs shoot into the GBs very quickly).
staph <- read.dta("Staph_1999_2010.dta")
I hunted around and it sounds like a more efficient alternative would be to use the Stata.file() function from the memisc package.
When I call:
staph <- Stata.file("Staph_1999_2010.dta")
I get a segfault:
*** caught segfault ***
address 0xd5d2b920, cause 'memory not mapped'
Traceback:
1: .Call("dta_read_labels", bf, lbllen, padding)
2: dta.read.labels(bf, len.lbl, 3)
3: get.dictionary.dta(dta)
4: Stata.file("Staph_1999_2010.dta")
I find the documentation for Stata.file() difficult to follow.
(1) Am I using Stata.file() correctly?
(2) Does Stata.file() return a data frame like read.dta() does?
(3) If I'm using Stata.file() correctly, how can I fix the error I'm getting?
Answer 1:
If you have access to Stata, one solution is to export the .dta to .csv in Stata:
use "file.dta"
export delimited using "file.csv", replace
Then import it into R using read.csv() or data.table::fread().
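A minimal sketch of the R side, assuming Stata has exported the data to a CSV (the tiny temp file and its column names here are only stand-ins for the real export, e.g. "Staph_1999_2010.csv"):

```r
library(data.table)

# Stand-in for the CSV that Stata's `export delimited` would produce.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,year,count", "1,1999,5", "2,2010,7"), tmp)

# fread() is much faster and more memory-efficient than read.csv()
# on tens of millions of rows.
dt <- fread(tmp)

# If memory is still tight, read only the columns you need:
dt_sub <- fread(tmp, select = c("id", "count"))

print(dt)
```

fread() returns a data.table (which is also a data.frame), so downstream code written for read.dta() output usually works unchanged.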
Other ideas:
- Consider sampling a subset of the data using Stata's sample command.
- Stata's compress command attempts a lossless compression by changing variable types (not sure it would save much for the .csv and R, though).
- Pack the data tight by converting any dates or string IDs to integers, if possible.
- Use a cloud instance for a one-time import and initial cleansing, before sampling or keeping only the important part.
- Get more RAM...
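The "pack the data tight" idea above can be sketched in R as follows (the column names are hypothetical; the real dataset's columns are unknown):

```r
# Toy data.frame standing in for the imported dataset.
df <- data.frame(
  admit_date  = c("1999-01-05", "2010-12-31"),  # dates stored as strings
  hospital_id = c("HOSP-A", "HOSP-B"),          # repeated string IDs
  stringsAsFactors = FALSE
)
before <- object.size(df)

# Dates become integer day counts (days since 1970-01-01);
# string IDs become small integer codes via factor levels.
df$admit_date  <- as.integer(as.Date(df$admit_date))
df$hospital_id <- as.integer(factor(df$hospital_id))

after <- object.size(df)
cat("before:", format(before), "; after:", format(after), "\n")
```

On 28 million rows the savings from replacing character vectors with integers can be substantial, since each distinct string carries per-element overhead that a 4-byte integer does not.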
Source: https://stackoverflow.com/questions/19028165/using-memisc-to-import-stata-dta-file-into-r