splitting a column delimiter R

断了今生、忘了曾经 提交于 2020-01-25 15:12:26

问题


I have a dataframe as below. I want to split the last column into 2. Splitting needs to be done based upon the only first : and rest of the columns dont matter.

In the new dataframe, there will be 4 columns. 3 rd column will be (a,b,d) while 4th column will be (1,2:3,3:4:4)

any suggestions? 4th line of my code doesnt work :(. I am okay with completely new solution or corrections to the line 4

employee <- c('John Doe','Peter Gynn','Jolie Hope')
salary <- c(3, 2, 1)
df <- data.frame(employee, salary, originalColumn = c("a :1", "b :2:3", "d: 3:4:4"))

as.data.frame(do.call(rbind, strsplit(df,":")))

--------------------update1

Below solutions work well. But i need a modified solution as I just realized that some of the cells in column 3 wont have ":". In such case i want text in that cell to appear in only 1st column after splitting that column

employee <- c('John Doe','Peter Gynn','Jolie Hope')
salary <- c(3, 2, 1)
df <- data.frame(employee, salary, originalColumn = c("a :1", "b", "d: 3:4:4"))

回答1:


You could use cSplit. On your updated data frame,

library(splitstackshape)
cSplit(df, "originalColumn", sep = ":{1}")
#      employee salary originalColumn_1 originalColumn_2
# 1:   John Doe      3                a                1
# 2: Peter Gynn      2                b               NA
# 3: Jolie Hope      1                d            3:4:4

And on your original data frame,

df1 <- data.frame(employee, salary,   
                  originalColumn = c("a :1", "b :2:3", "d: 3:4:4"))
cSplit(df1, "originalColumn", sep = ":{1}")
#      employee salary originalColumn_1 originalColumn_2
# 1:   John Doe      3                a                1
# 2: Peter Gynn      2                b              2:3
# 3: Jolie Hope      1                d            3:4:4

Note: I'm using splitstackshape version 1.4.2. I believe the sep argument has been changed from version 1.4.0




回答2:


You could use extract from tidyr to split the originalColumn in to two columns. In the below code, I am creating 3 columns and removing one of the unwanted columns from the result.

 library(tidyr)
 pat  <- "([^ :])( ?:|: ?|)(.*)"
 extract(df, originalColumn, c("Col1", "ColN", "Col2"), pat)[,-4]
 #     employee salary Col1  Col2
 #1   John Doe      3    a     1
 #2 Peter Gynn      2    b   2:3
 #3 Jolie Hope      1    d 3:4:4

Using the updated df, (for better identification - df1)

 extract(df1, originalColumn, c("Col1", "ColN", "Col2"), pat)[,-4]
 #    employee salary Col1  Col2
 #1   John Doe      3    a     1
 #2 Peter Gynn      2    b      
 #3 Jolie Hope      1    d 3:4:4

Or without creating a new column in df

 extract(df, originalColumn, c("Col1", "Col2"), "(.)[ :](.*)") %>%
                                  mutate(Col2= gsub("^\\:", "", Col2))

 #   employee salary Col1   Col2
 #1   John Doe      3    a      1
 #2 Peter Gynn      2    b    2:3
 #3 Jolie Hope      1    d  3:4:4

Based on the pattern in df, the below code also works. Here, the regex used to extract the first column is (.). A dot is a single element at the beginning of the string inside the parentheses will be extracted for the Col1. Then .{2} two elements following the first are discarded and the rest within the parentheses (.*) forms the Col2.

extract(df, originalColumn, c("Col1", "Col2"), "(.).{2}(.*)")
#    employee salary Col1  Col2
#1   John Doe      3    a     1
#2 Peter Gynn      2    b   2:3
#3 Jolie Hope      1    d 3:4:4

or using strsplit

as.data.frame(do.call(rbind, strsplit(as.character(df$originalColumn), " :|: ")))
#   V1    V2
#1  a     1
#2  b   2:3
#3  d 3:4:4

For df1, here is a solution using strsplit

  lst <- strsplit(as.character(df1$originalColumn), " :|: ")
  as.data.frame(do.call(rbind,lapply(lst,  
          `length<-`, max(sapply(lst, length)))) )
  #  V1    V2
 #1  a     1
 #2  b  <NA>
 #3  d 3:4:4



回答3:


You were close, here's a solution:

library(stringr)

df[, c('Col1','Col2')] <- do.call(rbind, str_split_fixed(df$originalColumn,":",n=2))
df$originalColumn <- NULL

    employee salary Col1 Col2
1   John Doe      3  a       1
2 Peter Gynn      2  b     2:3
3 Jolie Hope      1   d  3:4:4

Notes:

  • stringr::str_split() is better than base::strsplit() because you don't have to do as.character(), also it has the n=2 argument you want to limit to only split on the first ':'


来源:https://stackoverflow.com/questions/26637864/splitting-a-column-delimiter-r

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!