Concatenate alternate characters from different columns in R programming

Submitted by 送分小仙女 on 2020-01-16 15:25:43

Question


I have a data frame with 2 columns. Each cell holds one or more items separated by ">". I need to combine Col1 and Col2 into Col3 by pairing the items in order: a1-b1 > a2-b2 > a3-b3 > ...

Example

| Col1            | Col2             | Col3                         |
| abcd > de > efg | ppppp > ppt > pp | abcd-ppppp > de-ppt > efg-pp |
| hij > kl > iiii | aaa > bbb > hhh  | hij-aaa > kl-bbb > iiii-hhh  |
| aa              | fff              | aa-fff                       |
| a > bbb         | pp > a           | a-pp > bbb-a                 |

....

How can I do that in R? Thanks.


Answer 1:


This was a pain in the ass to solve. In future, for everyone's sanity, please consider how you output your data: this would have been easy if the data had been generated with downstream analysis in mind. Anyway, enough whinging; here is the solution.

Let's generate your data:

Col1 <- c("abcd > de > efg", "hij > kl > iiii", "aa", "a > bbb")
Col2 <- c("ppppp > ppt > pp", "aaa > bbb > hhh", "fff", "pp > a")
dat <- data.frame(Col1, Col2, stringsAsFactors = FALSE)

Next, using apply, we split each column on ">", trim the whitespace and flatten the result into one long vector per column, then paste the paired items together with the first separator, "-":

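# Split every cell on ">", trim whitespace, and flatten each column into one long vector
l1 <- apply(dat, 2, function(x) trimws(unlist(strsplit(x, split = ">"))))
# Pair the i-th item from Col1 with the i-th item from Col2
l2 <- apply(l1, 1, function(x) paste0(x[1], "-", x[2]))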
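To make the intermediate objects concrete, here is a small self-contained check that rebuilds the example data and inspects l1 and l2. It also shows a vectorised paste as a possible alternative to the second apply call; that alternative is not part of the original answer. One assumption to keep in mind: the first apply only returns a matrix when both columns contain the same total number of items, which the pairing requires anyway.

```r
Col1 <- c("abcd > de > efg", "hij > kl > iiii", "aa", "a > bbb")
Col2 <- c("ppppp > ppt > pp", "aaa > bbb > hhh", "fff", "pp > a")
dat <- data.frame(Col1, Col2, stringsAsFactors = FALSE)

# One column per input column, one row per ">"-separated item
l1 <- apply(dat, 2, function(x) trimws(unlist(strsplit(x, split = ">"))))
dim(l1)  # 9 rows (3 + 3 + 1 + 2 items), 2 columns

# Row-wise pairing, as in the answer
l2 <- apply(l1, 1, function(x) paste0(x[1], "-", x[2]))

# Possible alternative: paste is vectorised, so the columns can be pasted directly
l2_alt <- paste(l1[, 1], l1[, 2], sep = "-")

unname(l2)
# "abcd-ppppp" "de-ppt" "efg-pp" "hij-aaa" "kl-bbb" "iiii-hhh" "aa-fff" "a-pp" "bbb-a"
```

At this point the items are paired correctly, but the information about which original row each pair came from has been lost. The rest of the answer is about recovering it.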

The next part was surprisingly difficult. After much googling I found a solution (a hack) for splitting a vector of strings into groups whose sizes are given by a numeric vector.

#thanks: https://techoverflow.net/2012/11/10/r-count-occurrences-of-character-in-string/
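# Counts the occurrences of `char` in each element of `s`, for later use
countCharOccurrences <- function(char, s) {
  # fixed = TRUE treats `char` literally rather than as a regular expression
  s2 <- gsub(char, "", s, fixed = TRUE)
  return (nchar(s) - nchar(s2))
}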

# Number of items in each original row = number of ">" separators + 1
o <- countCharOccurrences(">", dat$Col1) + 1
df <- as.data.frame(l2, stringsAsFactors = FALSE)
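As a sanity check on the counting step, the sketch below compares the separator-count approach with counting the pieces that strsplit returns, using base R's lengths(). The second approach is an alternative I am suggesting, not something from the original answer, and the two only agree when no cell starts or ends with a stray ">".

```r
countCharOccurrences <- function(char, s) {
  s2 <- gsub(char, "", s, fixed = TRUE)
  nchar(s) - nchar(s2)
}

Col1 <- c("abcd > de > efg", "hij > kl > iiii", "aa", "a > bbb")

# Approach from the answer: separators per row, plus one
o <- countCharOccurrences(">", Col1) + 1
o  # 3 3 1 2

# Possible alternative: count the pieces produced by strsplit
o_alt <- lengths(strsplit(Col1, ">", fixed = TRUE))
o_alt  # 3 3 1 2
```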

Split df into consecutive groups of rows whose sizes are the values of o (i.e. the number of ">"-separated items in each original row):

# Thanks to this SO answer:
# https://stackoverflow.com/questions/27132290/split-dataframe-by-row-number-in-r
l2a <- split(df, cumsum(c(TRUE,(1:nrow(df) %in% cumsum(o))[-nrow(df)])))
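The grouping expression inside split is dense, so here is a step-by-step sketch of what it computes for the example counts. The final line shows rep() as a possible simpler way to build the same grouping vector; that is my suggestion rather than part of the original answer.

```r
o <- c(3, 3, 1, 2)  # items per original row, as computed above
n <- sum(o)         # total number of paired items (9)

ends <- cumsum(o)            # index of the last item of each group: 3 6 7 9
is_end <- 1:n %in% ends      # TRUE where a group ends
# A new group starts at position 1 and right after every group end
starts <- c(TRUE, is_end[-n])
g <- cumsum(starts)          # running group id
g  # 1 1 1 2 2 2 3 4 4

# Possible alternative: repeat each row index as many times as it has items
g_alt <- rep(seq_along(o), times = o)
g_alt  # 1 1 1 2 2 2 3 4 4
```

Either vector can be passed as the second argument to split, so each paired item is assigned back to the row it came from.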

Finally, we collapse each data frame in the list into a single string, joining its items with the final separator, " > ":

l3 <- lapply(l2a, function(x) paste(x[,1], collapse = " > "))

Then combine the result with your starting data frame:

dat$Col3 <- l3

             Col1             Col2                         Col3
1 abcd > de > efg ppppp > ppt > pp abcd-ppppp > de-ppt > efg-pp
2 hij > kl > iiii  aaa > bbb > hhh  hij-aaa > kl-bbb > iiii-hhh
3              aa              fff                       aa-fff
4         a > bbb           pp > a                 a-pp > bbb-a

Tada!

Edit: I had forgotten that l3 is a list, so the assignment above makes Col3 a list column. Use unlist to flatten it into a character vector, like this:

dat$Col3 <- unlist(l3)
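For comparison, here is a more compact sketch that avoids flattening and regrouping by working row by row with mapply. This is a different technique from the one in the answer, and combine_pairs is a hypothetical helper name introduced only for this example. It assumes both cells in a row contain the same number of items; if they do not, paste silently recycles the shorter side.

```r
Col1 <- c("abcd > de > efg", "hij > kl > iiii", "aa", "a > bbb")
Col2 <- c("ppppp > ppt > pp", "aaa > bbb > hhh", "fff", "pp > a")
dat <- data.frame(Col1, Col2, stringsAsFactors = FALSE)

# Hypothetical helper: pair items with "-" and join the pairs with " > "
combine_pairs <- function(a, b) {
  paste(paste(a, b, sep = "-"), collapse = " > ")
}

# Split each cell into its trimmed items, keeping one list element per row
parts1 <- lapply(strsplit(dat$Col1, ">", fixed = TRUE), trimws)
parts2 <- lapply(strsplit(dat$Col2, ">", fixed = TRUE), trimws)

# Apply the helper to each row's pair of item vectors
dat$Col3 <- unname(mapply(combine_pairs, parts1, parts2))

dat$Col3
# "abcd-ppppp > de-ppt > efg-pp" "hij-aaa > kl-bbb > iiii-hhh" "aa-fff" "a-pp > bbb-a"
```

Because the items never leave their row, there is no need to count separators or rebuild the groups afterwards.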


Source: https://stackoverflow.com/questions/49910473/concatenate-alternate-characters-from-different-columns-in-r-programming
