Extract the labels attribute from “labeled” tibble columns from a haven import from Stata

Question

Hadley Wickham's haven package, applied to a Stata file, returns a tibble with many columns of class "labelled". You can see these with str(), e.g.:

$ MSACMSZ    :Class 'labelled'  atomic [1:8491861] NA NA NA NA NA NA NA NA NA NA ...
  .. ..- attr(*, "label")= chr "metropolitan area size (cmsa/msa)"
  .. ..- attr(*, "labels")= Named int [1:7] 0 1 2 3 4 5 6
  .. .. ..- attr(*, "names")= chr [1:7] "not identified or nonmetropolitan" "100,000 - 249,999" "250,000 - 499,999" "500,000 - 999,999" ...

It would be nice if I could simply extract all these labeled vectors to factors, but I have compared the length of the labels attribute to the number of unique values in each vector, and it is sometimes longer and sometimes shorter. So I think I need to look at all of them and decide how to handle each one individually.

So I would like to extract the values of the labels attribute to a list. However, this function:

labels93 <- lapply(cps_00093.df, function(x){attr(X, which="labels", exact=TRUE)})

returns NULL for all variables.

Is this a tibble vs data frame problem? How do I extract these attributes from the tibble columns into a list?

Note that the labels vector is named, and I need both the labels and the names.

As per @Hack-R's request here is a tiny snippet of my data as converted by dput (which I had never used before). I applied this code:

filter(cps_00093.df, YEAR==2015) %>%
  sample_n(10)  %>%
  select(HHTENURE, HHINTYPE) -> tiny
dput(tiny, file = "tiny")

to produce the file tiny. Hey! That was easy! I thought it would be hard to break off a piece this small.

Opening tiny with Notepad++, this is what I found:

structure(list(HHTENURE = structure(c(2L, 1L, 1L, 2L, 1L, 1L, 
1L, 2L, 1L, 1L), labels = structure(c(0L, 1L, 2L, 3L, 6L, 7L), .Names = c("niu", 
"owned or being bought", "rented for cash", "occupied without payment of cash rent", 
"refused", "don't know")), class = "labelled"), HHINTYPE = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), labels = structure(1:3, .Names = c("interview", 
"type a non-interview", "type b/c non-interview")), class = "labelled")), row.names = c(NA, 
-10L), class = c("tbl_df", "tbl", "data.frame"), .Names = c("HHTENURE", 
"HHINTYPE"))

I suspect this could be made more readable with a little spacing, but I did not want to muck with it for fear of accidentally destroying relevant information.

Answer 1:

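The original question asks how 'to extract the values of the labels attribute to a list.' The code below collects the "label" attribute (the variable description) of each column; substitute "labels" in the map() call to get the named value labels instead. It assumes some_df was imported via haven and that every column has a "label" attribute: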

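library(purrr)
n <- ncol(some_df)
# one list element per column; NULL where a column has no "label" attribute
labels_list <- map(seq_len(n), function(x) attr(some_df[[x]], "label"))

# if a vector of character strings is preferable
# (map_chr() errors if any column lacks the attribute)
labels_vector <- map_chr(seq_len(n), function(x) attr(some_df[[x]], "label"))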
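As a rough base-R sketch of the difference between the two attributes, the snippet below pulls the per-column "label" (a single description) and the "labels" (a named vector of codes) separately. The data frame, the AGE column and the label texts "household tenure" and "age" are hypothetical stand-ins built by hand, not real haven output; only the attribute names come from the str() output in the question.

```r
# Hypothetical stand-in for a haven import, built by hand for illustration
some_df <- data.frame(HHTENURE = c(2L, 1L, 1L), AGE = c(34L, 51L, 27L))
attr(some_df$HHTENURE, "label")  <- "household tenure"   # made-up description
attr(some_df$HHTENURE, "labels") <- c("niu" = 0L,
                                      "owned or being bought" = 1L,
                                      "rented for cash" = 2L)
attr(some_df$AGE, "label") <- "age"                       # made-up description

# Variable descriptions: one string per column, NA where the attribute is absent
var_labels <- vapply(some_df, function(x) {
  lab <- attr(x, "label", exact = TRUE)
  if (is.null(lab)) NA_character_ else lab
}, character(1))

# Value labels: a named vector per column, kept only where the attribute exists
value_labels <- lapply(some_df, function(x) attr(x, "labels", exact = TRUE))
value_labels <- Filter(Negate(is.null), value_labels)
```

Here value_labels$HHTENURE keeps both the codes and their names, which is what the question asks for. Note also that the anonymous function in the question declares its argument as x but reads X; R is case-sensitive, so that mismatch may be part of why the original call did not return the attributes.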

Answer 2:

I'm going to take a go at answering this one, though my code isn't very pretty.

First I make a function to extract a named attribute from a single column.

ColAttr <- function(x, attrC, ifIsNull) {
  # Returns the column attribute named in attrC, if present, else ifIsNull.
  atr <- attr(x, attrC, exact = TRUE)
  atr <- if (is.null(atr)) {ifIsNull} else {atr}
  atr
}

Then a function to lapply it to all the columns:

AtribLst <- function(df, attrC, isNullC){
# Returns list of values of the col attribute attrC, if present, else isNullC
  lapply(df, ColAttr, attrC=attrC, ifIsNull=isNullC)
}

Finally I run it for each attribute.

stub93 <- AtribLst(cps_00093.df, attrC="label", isNullC=NA)

labels93 <- AtribLst(cps_00093.df, attrC="labels", isNullC=NA)
labels93 <- labels93[!is.na(labels93)]

All the columns have a "label" attribute, but only some are of class "labelled" and so have a "labels" attribute. The labels attribute is a named vector: its values match codes that can appear in the data, and its names tell you what those codes signify.
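Once the labels are extracted, the comparison the question describes (labels versus values actually present) can be sketched as follows. The codes and labels are copied from the HHTENURE column in the dput() output above; the variable names x, labs, unused_labels, unlabelled and f are my own, and the factor conversion is just one possible way to handle a column whose observed codes are all labelled.

```r
# HHTENURE codes and value labels, copied from the dput() output in the question
x <- c(2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L)
labs <- c("niu" = 0L,
          "owned or being bought" = 1L,
          "rented for cash" = 2L,
          "occupied without payment of cash rent" = 3L,
          "refused" = 6L,
          "don't know" = 7L)

# Labelled codes that never occur in the data (more labels than unique values)
unused_labels <- labs[!labs %in% x]

# Observed codes that have no label (more unique values than labels)
unlabelled <- setdiff(unique(x), labs)

# If every observed code is labelled, converting to a factor loses nothing
if (length(unlabelled) == 0) {
  f <- factor(x, levels = labs, labels = names(labs))
}
```

For this sample, unlabelled is empty and unused_labels holds the four codes that do not appear, so the column converts cleanly. A column with a non-empty unlabelled would need a decision about what to do with the unmatched codes before conversion.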

Answer 3:

Building on @omar-waslow's answer above, but using purrr's attr_getter.

If the data (some_df) is imported using read_dta from the haven package, then each column in the tibble has an attribute called "label". Mapping over the columns one by one builds a two-column lookup table of column names and their labels, which can be joined back to the data (after pivot_longer, for example).

library(tidyverse)
label_lookup_map <- tibble(
   col_name = some_df %>% names(),
   labels = some_df %>% map_chr(attr_getter("label"))
)
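To illustrate the "join back" step, here is a minimal base-R sketch that swaps in merge() and a hand-built long table in place of the tidyverse join and pivot_longer. The data frame, its AGE and HHINCOME columns, their values and their label texts are all hypothetical and exist only to make the example self-contained.

```r
# Hypothetical stand-in for a haven import (made-up columns, values and labels)
some_df <- data.frame(AGE = c(34L, 51L), HHINCOME = c(52000L, 81000L))
attr(some_df$AGE, "label")      <- "age"
attr(some_df$HHINCOME, "label") <- "household income"

# Two-column lookup: column name and its "label" attribute
label_lookup <- data.frame(
  col_name = names(some_df),
  label    = vapply(some_df, function(x) attr(x, "label", exact = TRUE),
                    character(1)),
  row.names = NULL
)

# Long format built by hand: one row per (column, value) pair
long <- data.frame(
  col_name = rep(names(some_df), each = nrow(some_df)),
  value    = unlist(some_df, use.names = FALSE)
)

# Join the labels back onto the long data by column name
long_labelled <- merge(long, label_lookup, by = "col_name")
```

The result has one row per original cell, with the variable description carried alongside each value. As with map_chr in the answer, the vapply call here assumes every column has a "label" attribute and stops with an error otherwise.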

Source: https://stackoverflow.com/questions/39671621/extract-the-labels-attribute-from-labeled-tibble-columns-from-a-haven-import-f

Tags

data-structures

attributes

stata

r-haven