Create new data frame columns with binary (0/1) data based on text strings in existing column in R

问题

I have an R data frame that looks like:

ID          YR      SC
ABX:22798   1976    A's Frnd; Cat; Cat & Mse
ABX:23798   1983    A's Frnd; Cat; Zebra Fish
ABX:22498   2010    Zebra Fish
ABX:22728   2010    Bear; Dog; Zebra Fish
ABX:22228   2011    Bear

example data:

df <- structure(list(ID = c("ABX:22798", "ABX:23798", "ABX:22498", "ABX:22728", "ABX:22228"), YR = c(1976, 1983, 2010, 2010, 2011), SC = c("A's Frnd; Cat; Cat & Mse", "A's Frnd; Cat; Zebra Fish", "Zebra Fish", "Bear; Dog; Zebra Fish", "Bear")), .Names = c("ID", "YR", "SC"), row.names = c(NA, 5L), class = "data.frame")

That I would like to transform by splitting the text string in the SC column by "; ". Then, I'd like to use the resulting lists of strings to populate new columns with binary data. The final data frame would look like this:

ID          YR   A's Frnd   Bear    Cat  Cat & Mse  Dog Zebra Fish
ABX:22798   1976    1       0       1    1          0   0
ABX:23798   1983    1       0       1    0          0   1
ABX:22498   2010    0       0       0    0          0   1
ABX:22728   2010    0       1       0    0          1   1
ABX:22228   2011    0       1       0    0          0   0

I'll be analyzing a number of different datasets individually. In any given data set, there are between about 100 and 230 unique SCs entries, and the number of rows per set ranges from about 500 to several thousand. The number of SCs per row ranges from 1 to about 6 or so.

I have had a couple of starts with this, most are quite ugly. I thought the approach below looked promising (it's similar to a python pandas implementation that works well). It would be great to learn a good way to do this in R!

My starter code:

# Get list of unique SCs
SCs <- df[,2]
SCslist <- lapply(SCs, strsplit, split="; ")
SCunique <- unique(unlist(SCslist, use.names = FALSE))
# Sort alphabetically, 
#     note that apostrophes could be a problem  
SCunique <- sort(SCunique)
# create a dataframe of 0s to add to the original df
df0 <- as.data.frame(matrix(0, ncol=length(SCunique), nrow=nrow(df)))
colnames(df0) <- SCunique

...(and then...?)

I've found similar questions/answers, including:

Dummy variables from a string variable
Split strings into columns in R where each string has a potentially different number of column entries

Edit: Found one more answer set of interest: Improve text processing speed using R and data.table

Thanks in advance for your answers.

回答1:

I think this should do what you're looking for:

library(reshape2)
a <- strsplit(as.character(df$SC), "; ", fixed = TRUE)
dfL <- merge(df, melt(a), by.x = "row.names", by.y = "L1", all = TRUE)
dcast(dfL, ID + YR ~ value, value.var="value", fun.aggregate=length, fill = 0)
#          ID   YR A's Frnd Cat Cat & Mse Zebra Fish Bear Dog
# 1 ABX:22228 2011        0   0         0          0    1   0
# 2 ABX:22498 2010        0   0         0          1    0   0
# 3 ABX:22728 2010        0   0         0          1    1   1
# 4 ABX:22798 1976        1   1         1          0    0   0
# 5 ABX:23798 1983        1   1         0          1    0   0

Factor the "value" column first if the column order is important to you.

I also have a package called "splitstackshape" that would work nicely with this if your data didn't have quotes in it. That's a bug that I'm looking into. Once it is resolved, it should be possible to do a command like:

library(reshape2)
library(splitstackshape)
dcast(concat.split.multiple(df, "SC", ";", "long"), 
      ID + YR ~ SC, value.var="SC", fun.aggregate=length, fill=0)

to get what you need.

回答2:

Continuing on your logic, you can do something like:

##your code
# Get list of unique SCs
SCs <- df[,3]  #NOTE: here, you have df[,2], but I guess should have been df[,3]
SCslist <- lapply(SCs, strsplit, split="; ")
SCunique <- unique(unlist(SCslist, use.names = FALSE))
# Sort alphabetically, 
#     note that apostrophes could be a problem  
SCunique <- sort(SCunique)
##

df0 <- sapply(SCunique, function(x) as.integer(grepl(x, SCs)))
df0
#     A's Frnd Bear Cat Cat & Mse Dog Zebra Fish
#[1,]        1    0   1         1   0          0
#[2,]        1    0   1         0   0          1
#[3,]        0    0   0         0   0          1
#[4,]        0    1   0         0   1          1
#[5,]        0    1   0         0   0          0

来源：https://stackoverflow.com/questions/20628038/create-new-data-frame-columns-with-binary-0-1-data-based-on-text-strings-in-ex

标签

string

dataframe

multiple-columns