R splitting a column of character separated by different number of spaces

Question

I have a data frame with a column of words separated by spaces; the number of words, and therefore the number of spaces, varies by row. For example:

head(lst)
'fff fffd ddd'
'sss dd'
'de dd'
'dds sssd eew rrr'
'dsds eed'

I would like two columns: the first holding the part before the first space, and the second holding the part after the last space. It should look like this:

V1       V2
'fff'   'ddd'
'sss'   'dd'
'de'    'dd'
'dds'   'rrr'
'dsds'  'eed'

I can get the first column, but the second one is a problem. This is the code I use:

  lst <- strsplit(athletes.df$V1, "\\s+")
  v1 <- sapply(lst ,`[`, 1)
  v2 <- sapply(lst, `[`, 2)

For column v2 I get the second word. I know that's because I passed 2 to sapply. How do I tell it to take only what comes after the last space?

Answer 1:

You can use tail to grab the last entry of each vector:

lst <- strsplit(athletes.df$V1, "\\s+")
v1 <- sapply(lst, head, 1) # example with head to grab first vector element
v2 <- sapply(lst, tail, 1) # example with tail to grab last vector element

Or perhaps the vapply version since you know your return type should be a character vector:

v2 <- vapply(lst, tail, 1, FUN.VALUE = character(1))
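
As a quick check, the head/tail idea can be run end to end on the sample strings from the question. This is only a minimal sketch: the vector `x` is a stand-in for `athletes.df$V1`, and `result` is a name chosen here for illustration.

```r
# Sample strings from the question; `x` stands in for athletes.df$V1
x <- c("fff fffd ddd", "sss dd", "de dd", "dds sssd eew rrr", "dsds eed")

# Split each string on runs of whitespace
lst <- strsplit(x, "\\s+")

# vapply checks that every result is exactly one character string;
# n = 1 is passed through to head() and tail()
v1 <- vapply(lst, head, FUN.VALUE = character(1), n = 1)
v2 <- vapply(lst, tail, FUN.VALUE = character(1), n = 1)

result <- data.frame(V1 = v1, V2 = v2, stringsAsFactors = FALSE)
print(result)
```

Unlike sapply, vapply stops with an error if any element yields something other than a single string, for example an empty input that splits to a zero-length vector.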

Another approach is to change the strsplit pattern so that it matches a whitespace character optionally followed by any run of characters ending at the final whitespace character. Everything between the first and last word then becomes part of the separator:

strsplit(df$V1, "\\s(?:.+\\s)?")
#[[1]]
#[1] "fff" "ddd"
#
#[[2]]
#[1] "sss" "dd" 
#
#[[3]]
#[1] "de" "dd"
#
#[[4]]
#[1] "dds" "rrr"
#
#[[5]]
#[1] "dsds" "eed"
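
To turn that list into two columns without extra packages, one option is to row-bind the pieces. This is a sketch under assumptions: `x` stands in for the column, `perl = TRUE` is added here so the non-capturing group is handled by PCRE, and every row is assumed to contain at least two words separated by single spaces.

```r
# Sample strings from the question; `x` stands in for the V1 column
x <- c("fff fffd ddd", "sss dd", "de dd", "dds sssd eew rrr", "dsds eed")

# The separator is the first space plus, optionally, everything up to
# and including the last space; perl = TRUE is a choice made for this sketch
parts <- strsplit(x, "\\s(?:.+\\s)?", perl = TRUE)

# Each element now has exactly two pieces, so rbind gives a 5 x 2 matrix
m <- do.call(rbind, parts)

out <- setNames(as.data.frame(m, stringsAsFactors = FALSE), c("V1", "V2"))
print(out)
```

A row with only one word would give a length-one element, and rbind would recycle it across both columns, so such rows would need separate handling.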

As Sumedh points out, this regex works nicely with tidyr's separate:

tidyr::separate(df, V1, c("V1", "V2"), "\\s(?:.+\\s)?")
#    V1  V2
#1  fff ddd
#2  sss  dd
#3   de  dd
#4  dds rrr
#5 dsds eed

Two stringi-based approaches:

library(stringi)
v1 <- stri_extract_first_regex(df$V1, "\\S+") # first run of non-space characters
v2 <- stri_extract_last_regex(df$V1, "\\S+")  # last run of non-space characters

stri_extract_all_regex(df$V1, "^\\S+|\\S+$", simplify = TRUE)
# this variant explicitly checks for the spaces with lookarounds:
stri_extract_all_regex(df$V1, "^\\S+(?=\\s)|(?<=\\s)\\S+$", simplify = TRUE)
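
For comparison, the same anchored patterns can be tried in base R with regexpr and regmatches when stringi is not available. This is a rough sketch rather than a drop-in replacement: `x`, `first`, and `last` are names chosen here, and regmatches silently drops elements that have no match, so the two results line up with the input only when every string matches.

```r
# Sample strings from the question; `x` stands in for df$V1
x <- c("fff fffd ddd", "sss dd", "de dd", "dds sssd eew rrr", "dsds eed")

# Leading run of non-space characters
first <- regmatches(x, regexpr("^\\S+", x))

# Trailing run of non-space characters
last <- regmatches(x, regexpr("\\S+$", x))

print(cbind(first, last))
```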

Answer 2:

Maybe this?

lst <- strsplit(athletes.df$V1, "\\s+")
v1 <- sapply(lst ,`[`, 1)
v2 <- sapply(lst, function(x) x[length(x)])

data.frame(t(sapply(strsplit(athletes.df$V1, "\\s+"), 
                    function(x) c(x[1], x[length(x)]))))

Answer 3:

Without using any packages, this can be done with read.table after using sub to reduce each string to its first and last word separated by a single space.

read.table(text=sub("^(\\S+)\\s+.*\\s+(\\S+)$", "\\1 \\2", df1$V1), 
                     header=FALSE, stringsAsFactors= FALSE)
#     V1  V2
#1  fff ddd
#2  sss  dd
#3   de  dd
#4  dds rrr
#5 dsds eed
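
It may help to look at the intermediate strings that sub produces before read.table parses them. In this sketch `x` stands in for `df1$V1`, and `colClasses = "character"` is an addition made here so that numeric-looking words are not converted.

```r
# Sample strings from the question; `x` stands in for df1$V1
x <- c("fff fffd ddd", "sss dd", "de dd", "dds sssd eew rrr", "dsds eed")

# Keep the first and last word; strings with only two words do not
# match the pattern and are returned unchanged
collapsed <- sub("^(\\S+)\\s+.*\\s+(\\S+)$", "\\1 \\2", x)
print(collapsed)

# Parse the two-word strings into two character columns
parsed <- read.table(text = collapsed, header = FALSE,
                     colClasses = "character")
print(parsed)
```

Two-word rows such as "sss dd" pass through sub untouched because the pattern needs two separate whitespace runs, and they are already in the shape read.table expects.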

Another convenient option is word from stringr:

library(stringr)
transform(df1, V1 = word(V1, 1), V2 = word(V1, -1))
#    V1  V2
#1  fff ddd
#2  sss  dd
#3   de  dd
#4  dds rrr
#5 dsds eed

data

df1 <- structure(list(V1 = c("fff fffd ddd", "sss dd", "de dd",
"dds sssd eew rrr", 
"dsds eed")), .Names = "V1", class = "data.frame", row.names = c(NA, 
-5L))

Source: https://stackoverflow.com/questions/38669267/r-splitting-a-column-of-character-separated-by-different-number-of-spaces

Tags

split