Question
I'm trying to separate a rather messy column into two columns containing period and description. My data resembles the extract below:
set.seed(1)
dta <- data.frame(indicator=c("someindicator2001", "someindicator2011",
                              "some text 20022008", "another indicator 2003"),
                  values = runif(n = 4))
Desired results
The desired results should look like this:
          indicator   period    values
1     someindicator     2001 0.2655087
2     someindicator     2011 0.3721239
3         some text 20022008 0.5728534
4 another indicator     2003 0.9082078
Characteristics
- Indicator descriptions are in one column
- Numeric values (everything from the first digit onward, including that first digit) are in the second column
Code
library(dplyr); library(tidyr); library(magrittr)
dta %<>%
  separate(col = indicator, into = c("indicator", "period"),
           sep = "^[^\\d]*(2+)", remove = TRUE)
Naturally this does not work, since the text matched by sep (everything up to and including the first run of 2s) is dropped:
> head(dta, 2)
  indicator period    values
1              001 0.2655087
2              011 0.3721239
Other attempts
- I have also tried the default separation method sep = "[^[:alnum:]]", but it breaks the column down into too many columns, as it appears to be matching all of the available digits.
- sep = "2*" also doesn't work, as there are too many 2s at times (example: 20032006).
What I'm trying to do boils down to:
- Identifying the first digit in the string
- Separating on that character. As a matter of fact, I would be happy to preserve that particular character as well.
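The two steps listed above can be sketched in base R, without tidyr. This is only an illustration under stated assumptions: split_at_first_digit() is a hypothetical helper name, and rows without any digit are assumed to get NA as their period.

```r
# Sketch in base R; split_at_first_digit() is a hypothetical helper,
# not a function from tidyr or dplyr.
split_at_first_digit <- function(x) {
  x <- as.character(x)                      # also handles factor columns
  pos <- as.integer(regexpr("[0-9]", x))    # index of the first digit, -1 if none
  has_digit <- pos > 0
  data.frame(
    # text before the first digit, with surrounding whitespace removed
    indicator = trimws(ifelse(has_digit, substr(x, 1, pos - 1), x)),
    # everything from the first digit onward (the digit itself is kept)
    period = ifelse(has_digit, substr(x, pos, nchar(x)), NA_character_),
    stringsAsFactors = FALSE
  )
}

parts <- split_at_first_digit(c("someindicator2001", "some text 20022008"))
print(parts)
```

The first digit stays in the period column, which matches the wish to preserve that character.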
Answer 1:
I think this might do it.
library(tidyr)
separate(dta, indicator, c("indicator", "period"), "(?<=[a-z]) ?(?=[0-9])")
#           indicator   period    values
# 1     someindicator     2001 0.2655087
# 2     someindicator     2011 0.3721239
# 3         some text 20022008 0.5728534
# 4 another indicator     2003 0.9082078
The following is an explanation of the regular expression, brought to you by regex101.
- (?<=[a-z]) is a positive lookbehind: it asserts that [a-z] (a single character in the range between a and z, case sensitive) can be matched
- " ?" (a space followed by ?) matches the space character literally, between zero and one time, as many times as possible, giving back as needed
- (?=[0-9]) is a positive lookahead: it asserts that [0-9] (a single character in the range between 0 and 9) can be matched
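To see where this pattern matches, one option is to replace the match with a visible marker using base R's sub(). This is a small sketch rather than part of the original answer; the "|" marker and the variable names are chosen only for illustration, and perl = TRUE is needed because base R's default regex engine does not support lookarounds.

```r
# Pattern taken from the answer above.
pattern <- "(?<=[a-z]) ?(?=[0-9])"
x <- c("someindicator2001", "some text 20022008", "another indicator 2003")

# Replace the first match in each string with "|" to show the split point.
# A zero-width match (no space present) simply inserts the marker.
marked <- sub(pattern, "|", x, perl = TRUE)
print(marked)
# "someindicator|2001" "some text|20022008" "another indicator|2003"
```

Because the letters and digits sit inside lookarounds, only the optional space is consumed, so nothing from either side of the split is lost.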
Answer 2:
You could also use unglue::unglue_unnest():
dta <- data.frame(indicator=c("someindicator2001", "someindicator2011",
                              "some text 20022008", "another indicator 2003"),
                  values = runif(n = 4))
# remotes::install_github("moodymudskipper/unglue")
library(unglue)
unglue_unnest(dta, indicator, "{indicator}{=\\s*}{period=\\d*}")
#>       values         indicator   period
#> 1 0.43234262     someindicator     2001
#> 2 0.65890900     someindicator     2011
#> 3 0.93576805         some text 20022008
#> 4 0.01934736 another indicator     2003
Created on 2019-09-14 by the reprex package (v0.3.0)
Source: https://stackoverflow.com/questions/34842528/separating-column-using-separate-tidyr-via-dplyr-on-a-first-encountered-digit