Question
I have a data frame with a column of words separated by a varying number of spaces, for example:
head(lst)
'fff fffd ddd'
'sss dd'
'de dd'
'dds sssd eew rrr'
'dsds eed'
What I would like to have is 2 columns: the first column is the part before the first space, and the second column is the part after the last space, meaning it should look like this:
V1 v2
'fff' 'ddd'
'sss' 'dd'
'de' 'dd'
'dds' 'rrr'
'dsds' 'eed'
I am able to get the first column, but the second one is a problem. This is the code I use:
lst <- strsplit(athletes.df$V1, "\\s+")
v1 <- sapply(lst ,`[`, 1)
v2 <- sapply(lst, `[`, 2)
What I get for column v2 is the second word. I know that is because I pass 2 to sapply. How do I tell it to take only what comes after the last space?
Answer 1:
You can use tail to grab the last entry of each vector:
lst <- strsplit(athletes.df$V1, "\\s+")
v1 <- sapply(lst, head, 1) # example with head to grab first vector element
v2 <- sapply(lst, tail, 1) # example with tail to grab last vector element
Or perhaps the vapply version, since you know the return type should be a character vector:
v2 <- vapply(lst, tail, 1, FUN.VALUE = character(1))
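For reference, the head/tail approach can be tried end to end on the sample strings from the question. This is a minimal sketch: `words` and `parts` are illustrative names that do not appear in the original code, and it assumes no string has leading whitespace (strsplit would then return an empty first element).

```r
# Sample strings copied from the question; `words` is a hypothetical name.
words <- c("fff fffd ddd", "sss dd", "de dd", "dds sssd eew rrr", "dsds eed")

# Split each string on runs of whitespace.
parts <- strsplit(words, "\\s+")

# head(x, n = 1) keeps the first word, tail(x, n = 1) keeps the last word.
# vapply checks that every result is a single character string.
v1 <- vapply(parts, head, FUN.VALUE = character(1), n = 1)
v2 <- vapply(parts, tail, FUN.VALUE = character(1), n = 1)

result <- data.frame(v1, v2, stringsAsFactors = FALSE)
print(result)
```

For a string that contains only one word, head and tail return the same element, so both columns would hold that word.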
Another approach is to change the strsplit split pattern so that it matches a whitespace character, optionally followed by any characters up to and including the final whitespace character:
strsplit(df$V1, "\\s(?:.+\\s)?")
#[[1]]
#[1] "fff" "ddd"
#
#[[2]]
#[1] "sss" "dd"
#
#[[3]]
#[1] "de" "dd"
#
#[[4]]
#[1] "dds" "rrr"
#
#[[5]]
#[1] "dsds" "eed"
As Sumedh points out, this regex works nicely with tidyr's separate:
tidyr::separate(df, V1, c("V1", "V2"), "\\s(?:.+\\s)?")
# V1 V2
#1 fff ddd
#2 sss dd
#3 de dd
#4 dds rrr
#5 dsds eed
Two stringi-based approaches:
library(stringi)
v1 <- stri_extract_first_regex(df$V1, "\\S+") # first run of non-space characters
v2 <- stri_extract_last_regex(df$V1, "\\S+") # last run of non-space characters
Or
stri_extract_all_regex(df$V1, "^\\S+|\\S+$", simplify = TRUE)
# this variant explicitly checks for the spaces with lookarounds:
stri_extract_all_regex(df$V1, "^\\S+(?=\\s)|(?<=\\s)\\S+$", simplify = TRUE)
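If stringi is not available, a similar first-word and last-word extraction can be approximated in base R with sub. This is a swapped-in technique rather than part of the answer above, and only a sketch: it assumes the strings have no leading or trailing whitespace, and `words`, `first`, and `last` are illustrative names.

```r
# Sample strings copied from the question; `words` is a hypothetical name.
words <- c("fff fffd ddd", "sss dd", "de dd", "dds sssd eew rrr", "dsds eed")

# Drop everything from the first whitespace character onward.
first <- sub("\\s.*$", "", words)

# Drop everything up to and including the last whitespace character
# (the greedy .* extends as far as it can before the final \\s).
last <- sub("^.*\\s", "", words)

print(cbind(first, last))
```

A string with no whitespace at all is left unchanged by both calls, so a single-word row would yield the same word in both columns.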
Answer 2:
Maybe this?
lst <- strsplit(athletes.df$V1, "\\s+")
v1 <- sapply(lst ,`[`, 1)
v2 <- sapply(lst, function(x) x[length(x)])
Or
data.frame(t(sapply(strsplit(athletes.df$V1, "\\s+"),
function(x) c(x[1], x[length(x)]))))
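The x[1] and x[length(x)] indexing can also be wrapped in a small helper that tolerates a few edge cases. The following is a sketch under stated assumptions: `first_last` is a hypothetical function name, and trimming surrounding whitespace and returning NA for empty strings are design choices made here, not something the answer specifies.

```r
# Hypothetical helper: returns the first and last word of each string.
first_last <- function(x) {
  # trimws() removes leading/trailing whitespace so that strsplit()
  # does not produce an empty first element.
  parts <- strsplit(trimws(x), "\\s+")
  # An empty string splits to character(0); return NA in that case.
  pick <- function(p, i) if (length(p)) p[i(p)] else NA_character_
  first <- vapply(parts, pick, character(1), i = function(p) 1L)
  last  <- vapply(parts, pick, character(1), i = function(p) length(p))
  data.frame(V1 = first, V2 = last, stringsAsFactors = FALSE)
}

res <- first_last(c("fff fffd ddd", "  sss   dd ", "single", ""))
print(res)
```

Whether a single-word row should repeat the word in both columns or leave V2 missing depends on the data, so that behaviour may need adjusting.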
Answer 3:
Without using any packages, this can be done with read.table after using sub to collapse everything between the first and last words into a single space.
read.table(text=sub("^(\\S+)\\s+.*\\s+(\\S+)$", "\\1 \\2", df1$V1),
header=FALSE, stringsAsFactors= FALSE)
# V1 V2
#1 fff ddd
#2 sss dd
#3 de dd
#4 dds rrr
#5 dsds eed
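To see what read.table actually receives, it can help to look at the intermediate strings produced by sub. A minimal sketch on the question's sample data follows; `words` and `collapsed` are illustrative names not used in the answer.

```r
# Sample strings copied from the question; `words` is a hypothetical name.
words <- c("fff fffd ddd", "sss dd", "de dd", "dds sssd eew rrr", "dsds eed")

# Rows with three or more words match the pattern and are reduced to
# "first last". Two-word rows separated by a single space do not match
# (the pattern needs two whitespace runs) and are returned unchanged,
# which is already the desired form.
collapsed <- sub("^(\\S+)\\s+.*\\s+(\\S+)$", "\\1 \\2", words)
print(collapsed)

# read.table then splits each line on whitespace into two columns.
result <- read.table(text = collapsed, header = FALSE, stringsAsFactors = FALSE)
print(result)
```

Note that read.table applies its usual type conversion, so words that look like numbers, or the literal string NA, would not come back as plain character values unless colClasses = "character" is passed.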
Another convenient option is word from stringr:
library(stringr)
transform(df1, V1 = word(V1, 1), V2 = word(V1, -1))
# V1 V2
#1 fff ddd
#2 sss dd
#3 de dd
#4 dds rrr
#5 dsds eed
Data:
df1 <- structure(list(V1 = c("fff fffd ddd", "sss dd", "de dd",
"dds sssd eew rrr",
"dsds eed")), .Names = "V1", class = "data.frame", row.names = c(NA,
-5L))
Source: https://stackoverflow.com/questions/38669267/r-splitting-a-column-of-character-separated-by-different-number-of-spaces