Convert column with pipe delimited data into dummy variables [duplicate]

Question

This question already has an answer here:

Split a column into multiple binary dummy columns [duplicate] 1 answer

I'm interested in taking a column of a data.frame where the values in the column are pipe delimited and creating dummy variables from the pipe-delimited values.

For example:

Let's say we start with

df = data.frame(a = c("Ben|Chris|Jim", "Ben|Greg|Jim", "Jim|Steve|Ben"))

> df
              a
1 Ben|Chris|Jim
2 Ben|Greg|Jim
3 Jim|Steve|Ben

I'm interested in ending up with:

df2 = data.frame(Ben = c(1, 1, 1), Chris = c(1, 0, 0), Jim = c(1, 1, 1), Greg = c(0, 1, 0), 
                 Steve = c(0, 0, 1))
> df2
  Ben Chris Jim Greg Steve
1   1     1   1    0     0
2   1     0   1    1     0
3   1     0   1    0     1

I don't know in advance how many potential values there are within the field. In the example above, the variable "a" can include 1 value or 10 values. Assume it is a reasonable number (i.e., < 100 possible values).

Any good ways to do this?

Answer 1:

Another way is to use cSplit_e from the splitstackshape package.

It splits column a on the pipe, fills the missing entries with 0, and drops the original column.

library(splitstackshape)
cSplit_e(df, "a", "|", type = "character", fill = 0, drop = TRUE)

#   a_Ben a_Chris a_Greg a_Jim a_Steve
#1     1       1      0     1       0
#2     1       0      1     1       0
#3     1       0      0     1       1

Answer 2:

Here is one option using dplyr and tidyr:

library(dplyr)
library(tidyr)
df %>% tibble::rownames_to_column(var = "id") %>% 
       mutate(a = strsplit(as.character(a), "\\|")) %>% 
       unnest(a) %>% table()

#    a
# id  Ben Chris Greg Jim Steve
#  1   1     1    0   1     0
#  2   1     0    1   1     0
#  3   1     0    0   1     1

The analogue in base R is:

df$a <- as.character(df$a)
s    <- strsplit(df$a, "|", fixed=TRUE)
table(id = rep(1:nrow(df), lengths(s)), v = unlist(s))

Data:

df = data.frame(a = c("Ben|Chris|Jim", "Ben|Greg|Jim", "Jim|Steve|Ben"))
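
The base R version returns a table object rather than a data.frame. If a data.frame like the one in the question is needed, one possible follow-up step is sketched below. The names tab and dummies are only illustrative, and the columns come out in alphabetical order rather than in order of first appearance.

```r
# Same sample data as above; stringsAsFactors = FALSE keeps the column as character
df <- data.frame(a = c("Ben|Chris|Jim", "Ben|Greg|Jim", "Jim|Steve|Ben"),
                 stringsAsFactors = FALSE)

# Split each row on the literal pipe and cross-tabulate row id against name
s   <- strsplit(df$a, "|", fixed = TRUE)
tab <- table(id = rep(seq_len(nrow(df)), lengths(s)), v = unlist(s))

# as.data.frame.matrix() keeps the wide layout; plain as.data.frame()
# would return a long (id, v, Freq) table instead
dummies <- as.data.frame.matrix(tab)
dummies
```

Note that table() counts occurrences, so a name that appears twice in the same row would give 2 rather than 1.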

Answer 3:

We can use mtabulate from qdapTools after splitting the 'a' column:

library(qdapTools)
mtabulate(strsplit(as.character(df$a), "|", fixed = TRUE))
#  Ben Chris Greg Jim Steve
#1   1     1    0   1     0
#2   1     0    1   1     0
#3   1     0    0   1     1

Answer 4:

Here is a method in base R:

# get unique set of names
myNames <- unique(unlist(strsplit(as.character(df$a), split="\\|")))
# get indicator data.frame
setNames(data.frame(lapply(myNames, function(i) as.integer(grepl(i, df$a)))), myNames)

which returns

  Ben Chris Jim Greg Steve
1   1     1   1    0     0
2   1     0   1    1     0
3   1     0   1    0     1

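The first line uses strsplit to split each string on the pipe "|"; unlist and unique then reduce the result to a vector of unique names. The second line runs through these names with lapply and uses grepl to search for each name in df$a; as.integer converts the logical result into 0/1 integers. The returned list is converted into a data.frame and given column names with setNames. Note that grepl does pattern matching on the whole string, so a name that is contained in another name (for example "Ben" inside "Benjamin") would also be flagged as present.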
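
If partial matches are a concern, a possible variant is to compare against the split pieces with %in% instead of searching the original strings. This is only a sketch of that idea, not part of the original answer; parts and exact are illustrative names.

```r
# Same sample data as in the question
df <- data.frame(a = c("Ben|Chris|Jim", "Ben|Greg|Jim", "Jim|Steve|Ben"),
                 stringsAsFactors = FALSE)

# Split once on the literal pipe and keep the pieces for each row
parts   <- strsplit(df$a, "|", fixed = TRUE)
myNames <- unique(unlist(parts))

# For each name, test whether it is exactly one of the pieces in each row
exact <- setNames(
  data.frame(lapply(myNames, function(nm)
    as.integer(vapply(parts, function(p) nm %in% p, logical(1))))),
  myNames)
exact
```

Because %in% compares whole elements, "Ben" would not match a row that contains only "Benjamin", and names containing regex metacharacters need no escaping.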

Source: https://stackoverflow.com/questions/39461539/convert-column-with-pipe-delimited-data-into-dummy-variables

Tags

delimiter