Convert column with pipe delimited data into dummy variables [duplicate]

Posted by 拥有回忆 on 2019-11-27 03:31:24

Question:


This question already has an answer here:

  • Split a column into multiple binary dummy columns [duplicate] 1 answer

I'm interested in taking a column of a data.frame where the values in the column are pipe delimited and creating dummy variables from the pipe-delimited values.

For example:

Let's say we start with

df = data.frame(a = c("Ben|Chris|Jim", "Ben|Greg|Jim", "Jim|Steve|Ben"))

> df
              a
1 Ben|Chris|Jim
2 Ben|Greg|Jim
3 Jim|Steve|Ben

I'm interested in ending up with:

df2 = data.frame(Ben = c(1, 1, 1), Chris = c(1, 0, 0), Jim = c(1, 1, 1), Greg = c(0, 1, 0), 
                 Steve = c(0, 0, 1))
> df2
  Ben Chris Jim Greg Steve
1   1     1   1    0     0
2   1     0   1    1     0
3   1     0   1    0     1

I don't know in advance how many potential values there are within the field. In the example above, the variable "a" can include 1 value or 10 values. Assume it is a reasonable number (i.e., < 100 possible values).

Any good ways to do this?


Answer 1:


Another way is to use cSplit_e from the splitstackshape package.

This splits column a on "|", fills the missing entries with 0, and drops the original column.

library(splitstackshape)
cSplit_e(df, "a", "|", type = "character", fill = 0, drop = TRUE)

#   a_Ben a_Chris a_Greg a_Jim a_Steve
#1     1       1      0     1       0
#2     1       0      1     1       0
#3     1       0      0     1       1



Answer 2:


Here is one option using dplyr and tidyr:

library(dplyr)
library(tidyr)
# unnest() needs the list-column named explicitly in tidyr >= 1.0
df %>% tibble::rownames_to_column(var = "id") %>% 
       mutate(a = strsplit(as.character(a), "\\|")) %>% 
       unnest(a) %>% table()

#    a
# id  Ben Chris Greg Jim Steve
#  1   1     1    0   1     0
#  2   1     0    1   1     0
#  3   1     0    0   1     1

The analogue in base R is:

df$a <- as.character(df$a)
s    <- strsplit(df$a, "|", fixed=TRUE)
table(id = rep(1:nrow(df), lengths(s)), v = unlist(s))

Data:

df = data.frame(a = c("Ben|Chris|Jim", "Ben|Greg|Jim", "Jim|Steve|Ben"))
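
If a data.frame is needed rather than a table (as in the df2 requested in the question), the base R result can be converted afterwards. The following is a minimal sketch, assuming the same example data; the names tab and df2 are only illustrative. Note that table returns counts, so a name repeated within one row would give a value greater than 1.

```r
# Example data from the question (kept as character, not factor)
df <- data.frame(a = c("Ben|Chris|Jim", "Ben|Greg|Jim", "Jim|Steve|Ben"),
                 stringsAsFactors = FALSE)

# Split each row on the literal pipe character
s <- strsplit(df$a, "|", fixed = TRUE)

# Cross-tabulate row id against name, as in the base R answer
tab <- table(id = rep(seq_len(nrow(df)), lengths(s)), v = unlist(s))

# Convert the 2-D table into a data.frame with one column per name
df2 <- as.data.frame.matrix(tab)
df2
```

The columns come out in alphabetical order rather than in order of first appearance, because table sorts the levels.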



Answer 3:


We can use mtabulate from qdapTools after splitting the 'a' column

library(qdapTools)
mtabulate(strsplit(as.character(df$a), "|", fixed = TRUE))
#  Ben Chris Greg Jim Steve
#1   1     1    0   1     0
#2   1     0    1   1     0
#3   1     0    0   1     1



Answer 4:


Here is a method in base R

# get unique set of names
myNames <- unique(unlist(strsplit(as.character(df$a), split="\\|")))
# get indicator data.frame
setNames(data.frame(lapply(myNames, function(i) as.integer(grepl(i, df$a)))), myNames)

which returns

  Ben Chris Jim Greg Steve
1   1     1   1    0     0
2   1     0   1    1     0
3   1     0   1    0     1

The first line uses strsplit to split the names on the pipe "|"; unlist and unique then reduce the result to a vector of unique names. The second line runs through these names with lapply and uses grepl to search for each name in df$a, and as.integer converts the logical result into 0/1 integers. The returned list is converted into a data.frame and given column names with setNames.
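
One caveat with the grepl approach is that it matches substrings and treats each name as a regular expression, so a name such as "Ben" would also match "Benjamin". The sketch below is one possible variant that compares whole tokens instead. The fourth row ("Benjamin") is hypothetical data added only to show the difference, and the names parts and dummies are illustrative.

```r
# Example data plus one hypothetical row to expose substring matching
df <- data.frame(a = c("Ben|Chris|Jim", "Ben|Greg|Jim", "Jim|Steve|Ben", "Benjamin"),
                 stringsAsFactors = FALSE)

# Split once and reuse the pieces for both steps
parts   <- strsplit(df$a, "|", fixed = TRUE)
myNames <- unique(unlist(parts))

# For each name, test whole-token membership in each row's pieces
dummies <- setNames(
  data.frame(lapply(myNames, function(nm) {
    as.integer(vapply(parts, function(p) nm %in% p, logical(1)))
  })),
  myNames
)
dummies
```

Here the Ben column is 0 for the fourth row, whereas grepl("Ben", "Benjamin") would return TRUE.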



Source: https://stackoverflow.com/questions/39461539/convert-column-with-pipe-delimited-data-into-dummy-variables
