R data.table how to replace positive values with column names across multiple binary data columns

Posted by 故事扮演 on 2019-12-23 04:42:17

Question


I'm using R v. 3.2.1 and data.table v 1.9.6. I have a data.table like the example below, which contains some coded binary columns classed as character with the values "0" and "1", and also a string vector that contains phrases with some of the same words as the binary column names. My ultimate goal is to create a wordcloud using both the words in the string vector and the positive responses in the binary vectors. To do this, I first need to convert the positive responses in the binary vectors to their column names, and that is where I'm getting stuck.

A similar question has been asked here, but it is not quite the same: that poster starts with a matrix, and the suggested solution does not seem to work with a more complicated data set. I also have columns other than my binary columns which have ones in them, so the solution needs to first accurately identify my binary columns.

Here is some example data:

id <- c(1,2,3,4,5)
age <- c("5", "1", "11", "20", "21")
apple <- c("0", "1", NA, "1", "0")
pear <- c("1", "1", "1", "0", "0")
banana <- c("0", "1", "1", NA, "1")
favfood <- c("i love pear juice", "i eat chinese pears and crab apples every sunday", "i also like apple tart", "i like crab apple juice", "i hate most fruit except bananas" )

df <- as.data.frame(cbind(id, age, apple, pear, banana, favfood), stringsAsFactors=FALSE)
dt <- data.table(df)
dt[, id := as.numeric(id)]

Here is what the data looks like:

    id age apple pear banana                                          favfood
1:  1   5     0    1      0                                i love pear juice
2:  2   1     1    1      1 i eat chinese pears and crab apples every sunday
3:  3  11    NA    1      1                           i also like apple tart
4:  4  20     1    0     NA                          i like crab apple juice
5:  5  21     0    0      1                 i hate most fruit except bananas

Thus the wordcloud should have a frequency of 1 for "apple" in a given row if apple==1, or favfood contains the string "apple", or both, and so on.
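To make that counting rule concrete, here is a minimal base R sketch. It assumes plain substring matching is acceptable (so "pears" counts as a mention of "pear") and that a missing flag counts as not coded. The names `flags` and `word_freq` are illustrative and not part of the original question.

```r
# Example data from the question (only the columns needed here).
apple  <- c("0", "1", NA, "1", "0")
pear   <- c("1", "1", "1", "0", "0")
banana <- c("0", "1", "1", NA, "1")
favfood <- c("i love pear juice",
             "i eat chinese pears and crab apples every sunday",
             "i also like apple tart",
             "i like crab apple juice",
             "i hate most fruit except bananas")

flags <- data.frame(apple, pear, banana, stringsAsFactors = FALSE)

# For each word, count the rows where the flag is "1" OR the free text
# mentions the word. A row is counted once even if both are true.
word_freq <- vapply(names(flags), function(word) {
  coded     <- flags[[word]] %in% "1"              # NA is treated as not coded
  mentioned <- grepl(word, favfood, fixed = TRUE)  # plain substring match
  sum(coded | mentioned)
}, integer(1))

print(word_freq)
```

The resulting named vector could then be passed to a wordcloud function as the frequency argument, with `names(word_freq)` as the words.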

Here is my attempt (which doesn't do what I want, but gets about half way):

# First define the logic columns.
# I've done this by name here, but in my real data set this won't work because there are too many columns
logicols <- c("apple", "pear", "banana")

# Next identify the location of the "1"s within the subset of logic columns:
ones <- which(dt==1 & colnames(dt) %in% logicols, arr.ind=T)

# Lastly, convert the "1"s in the subset to their column names:
dt[ones, ]<-colnames(dt)[ones[,2]]

This gives:

> dt
   id age apple pear banana                                          favfood
1:  1   5     0 pear      0                                i love pear juice
2:  2   1     1 pear banana i eat chinese pears and crab apples every sunday
3:  3  11    NA    1 banana                           i also like apple tart
4:  4  20     1    0     NA                          i like crab apple juice
5:  5  21     0    0      1                 i hate most fruit except bananas

There are two problems with this approach:

(a) Identifying the columns to convert by name is not convenient for my real data set because there are many of them. How can I identify this subset of columns without including other columns that contain 1s but have other values in them as well (in this example "age" contains a 1 but it is clearly not a logic column)? I have deliberately coded "age" as a character column in the example as in my real data set, there are character columns that contain 1s that are not logic columns. The feature that sets them apart is that my logic columns are character but only contain the values 0, 1 or are missing (NA).

(b) The index has not picked up all the 1s in the logic columns, does anyone know why this is (e.g. the 1 in the second row of the "apple" column is not converted)?
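On point (b), one likely explanation is vector recycling. `colnames(dt) %in% logicols` is a logical vector of length 6 (one entry per column), and `&` recycles it down the 5 x 6 comparison matrix in column-major order rather than applying one entry per column. The sketch below reproduces the effect on a plain data.frame in base R (an assumption made to keep it free of package dependencies; the free-text column is replaced by placeholder strings) and shows one way to build a mask that lines up with the columns.

```r
# Example data as a plain data.frame; favfood holds placeholder text only.
df <- data.frame(
  id      = c(1, 2, 3, 4, 5),
  age     = c("5", "1", "11", "20", "21"),
  apple   = c("0", "1", NA, "1", "0"),
  pear    = c("1", "1", "1", "0", "0"),
  banana  = c("0", "1", "1", NA, "1"),
  favfood = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)
logicols <- c("apple", "pear", "banana")

hits <- df == "1"                      # 5 x 6 logical matrix
mask <- colnames(df) %in% logicols     # length-6 vector: F F T T T F

# Original approach: the length-6 mask is recycled cell by cell down the
# columns, so it does not line up with the column it is meant to describe.
wrong <- which(hits & mask, arr.ind = TRUE)

# Expanding the mask row-wise gives every cell the flag of its own column.
mask_matrix <- matrix(mask, nrow = nrow(hits), ncol = ncol(hits), byrow = TRUE)
right <- which(hits & mask_matrix, arr.ind = TRUE)

nrow(wrong)   # fewer positions than expected
nrow(right)   # every "1" in the three binary columns
```

With the misaligned mask, only the four cells that were converted in the output above are found; the row-wise mask finds all eight.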

Many thanks for your help - I'm sure I'm missing something relatively simple, but quite stuck on this.


Answer 1:


Thanks to @Frank for pointing out that the logic/binary columns should have been converted to the correct class with as.logical().

This greatly simplifies identification of the values to change and the indexing now seems to work as well:

# Starting with the data in its original format:
id <- c(1,2,3,4,5)
age <- c("5", "1", "11", "20", "21")
apple <- c("0", "1", NA, "1", "0")
pear <- c("1", "1", "1", "0", "0")
banana <- c("0", "1", "1", NA, "1")
favfood <- c("i love pear juice", "i eat chinese pears and crab apples every sunday", "i also like apple tart", "i like crab apple juice", "i hate most fruit except bananas" )

df <- as.data.frame(cbind(id, age, apple, pear, banana, favfood), stringsAsFactors=FALSE)

# Convert the "0" / "1" character columns to logical with a function:

recode.multi <- function(data, recode.cols, old.var, new.var, format = as.numeric){
  # function to recode multiple columns
  #
  # Args:        data: a data.frame
  #       recode.cols: a character vector containing the names of those
  #                    columns to recode
  #           old.var: a character vector containing values to be recoded
  #           new.var: a character vector containing desired recoded values
  #            format: a function describing the desired format e.g.
  #                    as.character, as.numeric, as.factor, etc.

  # check old.var and new.var are of equal length
  if(length(old.var) != length(new.var)){
    stop("'old.var' and 'new.var' are of differing lengths")
  }

  # convert format of selected columns to character
  if(length(recode.cols) == 1){
    data[, recode.cols] = as.character(data[, recode.cols])
  } else {
    data[, recode.cols] = data.frame(lapply(data[, recode.cols], as.character), stringsAsFactors=FALSE)
  }

  # recode old variables to new variables for selected columns
  for(i in seq_along(old.var)){
    data[, recode.cols][data[, recode.cols] == old.var[i]] = new.var[i]
  }

  # convert recoded columns to desired format
  data[, recode.cols] = sapply(data[, recode.cols], format)

  data
}

df = recode.multi(data = df, recode.cols = c("apple", "pear", "banana"), old.var = c("0", "1", NA), new.var = c(FALSE, TRUE, NA), format = as.logical)
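The call above still lists the binary columns by name. Going by the description in the question (character columns that contain only "0", "1" or NA), the candidates could instead be detected automatically. The following is a minimal base R sketch of that idea; `is_binary_chr` is a hypothetical helper introduced here, not part of the original answer, and it also skips columns that are entirely missing.

```r
# Example data in its original, all-character form (id kept numeric).
df <- data.frame(
  id      = c(1, 2, 3, 4, 5),
  age     = c("5", "1", "11", "20", "21"),
  apple   = c("0", "1", NA, "1", "0"),
  pear    = c("1", "1", "1", "0", "0"),
  banana  = c("0", "1", "1", NA, "1"),
  favfood = c("a", "b", "c", "d", "e"),  # placeholder text
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE for a character vector whose values are all
# "0", "1" or NA, with at least one non-missing value.
is_binary_chr <- function(x) {
  is.character(x) && any(!is.na(x)) && all(x %in% c("0", "1", NA))
}

# Names of the columns that look like coded binary responses.
logicols <- names(df)[vapply(df, is_binary_chr, logical(1))]

# Convert them to logical; NA stays NA because NA == "1" is NA.
df[logicols] <- lapply(df[logicols], function(x) x == "1")

logicols
str(df[logicols])
```

Here "age" is excluded even though it contains a "1", because it also holds other values.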

dt <- data.table(df)
dt[, id := as.numeric(id)]

# Identify the values to swap with column names:
convtoname <- which(dt==TRUE, arr.ind=T)

# Make the swap:
dt[convtoname, ]<-colnames(dt)[convtoname[,2]]

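This gives the desired result for the fruit columns (note that the id in row 1 is also replaced with "id", because the column is numeric and 1 == TRUE evaluates to TRUE in R):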

> dt
   id age apple  pear banana                                          favfood
1: id   5 FALSE  pear  FALSE                                i love pear juice
2:  2   1 apple  pear banana i eat chinese pears and crab apples every sunday
3:  3  11    NA  pear banana                           i also like apple tart
4:  4  20 apple FALSE     NA                          i like crab apple juice
5:  5  21 FALSE FALSE banana                 i hate most fruit except bananas
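One way to keep columns such as id out of the swap is to loop over the logical columns only, instead of comparing the whole table with TRUE. The sketch below does this in base R on a data.frame that has already been recoded to logical. It is an alternative under stated assumptions, not the answer's method: non-positive and missing responses become NA rather than the string "FALSE", which is a design choice made here because NA is easier to drop before building a wordcloud.

```r
# Data after recoding the binary columns to logical.
df <- data.frame(
  id     = c(1, 2, 3, 4, 5),
  age    = c("5", "1", "11", "20", "21"),
  apple  = c(FALSE, TRUE, NA, TRUE, FALSE),
  pear   = c(TRUE, TRUE, TRUE, FALSE, FALSE),
  banana = c(FALSE, TRUE, TRUE, NA, TRUE),
  stringsAsFactors = FALSE
)

# Only columns of class logical are candidates, so id and age are skipped.
logicols <- names(df)[vapply(df, is.logical, logical(1))]

# Replace TRUE with the column name; FALSE and NA both become NA.
for (col in logicols) {
  df[[col]] <- ifelse(df[[col]] %in% TRUE, col, NA_character_)
}

print(df)
```

Using `%in% TRUE` rather than `== TRUE` keeps missing values from propagating into the test.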


Source: https://stackoverflow.com/questions/33266815/r-data-table-how-to-replace-positive-values-with-column-names-across-multiple-bi
