split a dataframe column by regular expression on characters separated by a “.”

问题

In R, I have the following dataframe:

     Name     Category
1    Beans    1.12.5
2    Pears    5.7.9
3    Eggs     10.6.5

What I would like to have is the following:

     Name     Cat1    Cat2    Cat3
1    Beans    1       12      5
2    Pears    5       7       9
3    Eggs     10      6       5

Ideally some expression built inside plyr would be nice...

I will investigate on my side but as searching this might take me a lot of time I was just wondering if some of you do have some hints to perform this...

回答1:

I've written a function concat.split (a "family" of functions, actually) as part of my splitstackshape package for dealing with these types of problems:

# install.packages("splitstackshape")
library(splitstackshape)
concat.split(mydf, "Category", ".", drop=TRUE)
#    Name Category_1 Category_2 Category_3
# 1 Beans          1         12          5
# 2 Pears          5          7          9
# 3  Eggs         10          6          5

It also works nicely on "unbalanced" data.

dat <- data.frame(Name = c("Beans", "Pears", "Eggs"), 
                  Category = c("1.12.5", "5.7.9.8", "10.6.5.7.7"))
concat.split(dat, "Category", ".", drop = TRUE)
#    Name Category_1 Category_2 Category_3 Category_4 Category_5
# 1 Beans          1         12          5         NA         NA
# 2 Pears          5          7          9          8         NA
# 3  Eggs         10          6          5          7          7

Because "long" or "molten" data are often required in these types of situations, the concat.split.multiple function has a "long" argument too:

concat.split.multiple(dat, "Category", ".", direction = "long")
#     Name time Category
# 1  Beans    1        1
# 2  Pears    1        5
# 3   Eggs    1       10
# 4  Beans    2       12
# 5  Pears    2        7
# 6   Eggs    2        6
# 7  Beans    3        5
# 8  Pears    3        9
# 9   Eggs    3        5
# 10 Beans    4       NA
# 11 Pears    4        8
# 12  Eggs    4        7
# 13 Beans    5       NA
# 14 Pears    5       NA
# 15  Eggs    5        7

回答2:

The qdap package has the colsplit2df for just these sort of situations:

#recreate your data first:
dat <- data.frame(Name = c("Beans", "Pears", "Eggs"), Category = c("1.12.5", 
"5.7.9", "10.6.5"),stringsAsFactors=FALSE)

library(qdap)
colsplit2df(dat, 2, paste0("cat", 1:3))

## > colsplit2df(dat, 2, paste0("cat", 1:3))
##    Name cat1 cat2 cat3
## 1 Beans    1   12    5
## 2 Pears    5    7    9
## 3  Eggs   10    6    5

回答3:

If you have a consistent number of categories, then this will work:

#recreate your data first:
dat <- data.frame(Name = c("Beans", "Pears", "Eggs"), Category = c("1.12.5", 
"5.7.9", "10.6.5"),stringsAsFactors=FALSE)

spl <- strsplit(dat$Category,"\\.")
len <- sapply(spl,length)
dat[paste0("cat",1:max(len))] <- t(sapply(spl,as.numeric))

Result:

dat
   Name Category cat1 cat2 cat3
1 Beans   1.12.5    1   12    5
2 Pears    5.7.9    5    7    9
3  Eggs   10.6.5   10    6    5

If you have differing numbers of separated values, then this should account for it:

#example unbalanced data
dat <- data.frame(Name = c("Beans", "Pears", "Eggs"), Category = c("1.12.5", 
"5.7.9", "10.6.5"),stringsAsFactors=FALSE)
dat$Category[2] <- "5.7"

spl <- strsplit(dat$Category,"\\.")
len <- sapply(spl,length)
spl <- Map(function(x,y) c(x,rep(NA,max(len)-y)), spl, len)

dat[paste0("cat",1:max(len))] <- t(sapply(spl,as.numeric))

Result:

   Name Category cat1 cat2 cat3
1 Beans   1.12.5    1   12    5
2 Pears      5.7    5    7   NA
3  Eggs   10.6.5   10    6    5

来源：https://stackoverflow.com/questions/18525097/split-a-dataframe-column-by-regular-expression-on-characters-separated-by-a

标签

regex

plyr