Using memisc to import stata .dta file into R

Submitted on 2019-12-10 13:49:10

Question


I have a 700 MB Stata .dta file with 28 million observations and 14 column variables.

When I attempt to import it into R using foreign's read.dta() function, I run out of RAM on my 8 GB machine (page-outs shoot into the GBs very quickly).

staph <- read.dta("Staph_1999_2010.dta")

I hunted around and it sounds like a more efficient alternative would be to use the Stata.file() function from the memisc package.

When I call:

staph <- Stata.file("Staph_1999_2010.dta")

I get a segfault:

*** caught segfault ***
address 0xd5d2b920, cause 'memory not mapped'

Traceback:
 1: .Call("dta_read_labels", bf, lbllen, padding)
 2: dta.read.labels(bf, len.lbl, 3)
 3: get.dictionary.dta(dta)
 4: Stata.file("Staph_1999_2010.dta")

I find the documentation for Stata.file() difficult to follow.

(1) Am I using Stata.file() correctly?

(2) Does Stata.file() return a dataframe like read.dta() does?

(3) If I'm using Stata.file() correctly, how can I fix the error I'm getting?
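Regarding (1) and (2): memisc's Stata.file() does not return a data frame directly; it returns an importer object, from which you load data (or a subset of it) on demand. A minimal sketch of the documented pattern, where the selected variable names (var1, var2) are placeholders, not columns from the actual file:

```r
library(memisc)

# Stata.file() only scans the file's dictionary; no data are loaded yet
staph_imp <- Stata.file("Staph_1999_2010.dta")

# Load everything into memory as a data.set:
# staph_ds <- as.data.set(staph_imp)

# Or load only selected columns to keep memory usage down:
staph_ds <- subset(staph_imp, select = c(var1, var2))

# Convert to an ordinary data frame for use with other packages
staph_df <- as.data.frame(staph_ds)
```

The subset() route is the memory-saving point of the package: only the selected variables are ever read into RAM.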


Answer 1:


With access to Stata, one solution is to export the .dta to .csv from within Stata:

use "file.dta"

export delimited using "file.csv", replace

And then import in R using read.csv or data.table::fread.
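A minimal sketch of the R side, using data.table::fread (which is both faster and more memory-efficient than read.csv for a file this size); the column names passed to select are hypothetical:

```r
library(data.table)

# fread() reads the whole CSV efficiently in one pass
staph <- fread("Staph_1999_2010.csv")

# If 14 columns are still too much, read only the ones you need:
# staph <- fread("Staph_1999_2010.csv", select = c("id", "date", "result"))
```

Restricting columns with select is often the single biggest memory saving, since fread then never materializes the dropped columns at all.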

Other ideas:

  • Consider sampling a subset of the data using Stata's sample command.
  • Stata's compress command attempts lossless compression by changing
    variable types (though this may not save much once the data pass
    through .csv into R).
  • Pack the data tighter by converting dates or string IDs to integers where possible.
  • Use a cloud instance for a one-time import and initial cleansing, before sampling or keeping only the important part.
  • Get more RAM...
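The "pack the data tight" idea can be sketched in R with data.table; the column names (patient_id, admit_date) and sample values are invented for illustration:

```r
library(data.table)

# Toy stand-in for two expensive character columns
staph <- data.table(
  patient_id = c("P0001", "P0002", "P0001"),
  admit_date = c("1999-01-15", "2010-12-31", "2005-06-01")
)

# Store dates as IDate (an integer count of days since 1970-01-01)
# instead of character strings
staph[, admit_date := as.IDate(admit_date)]

# Replace string IDs with integer codes; keep levels if you need to map back
id_levels <- levels(factor(staph$patient_id))
staph[, patient_id := as.integer(factor(patient_id, levels = id_levels))]
```

An integer costs 4 bytes per value, versus the string storage plus pointer overhead of a character column, so on 28 million rows this adds up quickly.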


Source: https://stackoverflow.com/questions/19028165/using-memisc-to-import-stata-dta-file-into-r
