tidyr separate column values into character and numeric using regex

Submitted by 别等时光非礼了梦想 on 2019-12-01 17:55:00

You may use a lookaround-based regex, (?<=[a-z])(?=[0-9]), with tidyr::separate:

> tidyr::separate(df, A, into = c("name", "value"), "(?<=[a-z])(?=[0-9])")
    name value
1    enc     0
2    enc    10
3    enc    25
4    enc   100
5  harab     0
6  harab    25
7  harab   100
8  requi     0
9  requi    25
10 requi   100
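
The call above assumes a data frame df with a single character column A, which is not shown here. A minimal sketch for reproducing it follows; the values are inferred from the printed output, so the exact construction is an assumption. The separate() call is guarded so the snippet still runs when tidyr is not installed.

```r
# Hypothetical reconstruction of `df`, inferred from the output shown above
df <- data.frame(
  A = c("enc0", "enc10", "enc25", "enc100",
        "harab0", "harab25", "harab100",
        "requi0", "requi25", "requi100"),
  stringsAsFactors = FALSE
)

# Run the separate() call only if tidyr is available;
# convert = TRUE asks tidyr to convert the digit strings to integers
if (requireNamespace("tidyr", quietly = TRUE)) {
  res <- tidyr::separate(df, A, into = c("name", "value"),
                         sep = "(?<=[a-z])(?=[0-9])", convert = TRUE)
  str(res)
}
```

Without convert = TRUE, the value column stays character, which may or may not be what you want.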

The (?<=[a-z])(?=[0-9]) pattern matches a location in the string right between a lowercase ASCII letter ((?<=[a-z])) and a digit ((?=[0-9])). The (?<=...) is a positive lookbehind that requires the presence of its pattern immediately to the left of the current location, and (?=...) is a positive lookahead that requires the presence of its pattern immediately to the right of the current location. Because both lookarounds are zero-width, no characters are consumed, so the letters and digits are kept intact when splitting.
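
To see that the pattern matches a position rather than any characters, here is a small base R sketch. The vector x is made-up sample data, and cutting with substr() is only an illustration of the idea, not what tidyr does internally.

```r
x <- c("enc10", "harab100")

# perl = TRUE is needed: base R's default regex engine has no lookarounds
pos <- regexpr("(?<=[a-z])(?=[0-9])", x, perl = TRUE)

# The match is zero-width: it marks a position and consumes no characters
match_len <- attr(pos, "match.length")
match_len

# Cut at that position; neither half loses a character
pos   <- as.integer(pos)
name  <- substr(x, 1L, pos - 1L)
value <- substr(x, pos, nchar(x))
data.frame(name, value)
```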

Alternatively, you may use extract:

extract(df, A, into = c("name", "value"), "^([a-z]+)(\\d+)$")

Output:

    name value
1    enc     0
2    enc    10
3    enc    25
4    enc   100
5  harab     0
6  harab    25
7  harab   100
8  requi     0
9  requi    25
10 requi   100

The ^([a-z]+)(\\d+)$ pattern matches:

  • ^ - start of string
  • ([a-z]+) - Capturing group 1 (the name column): one or more lowercase ASCII letters
  • (\\d+) - Capturing group 2 (the value column): one or more digits
  • $ - end of string.
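
The two capturing groups can also be inspected in base R, which may help when debugging the pattern. This is a rough sketch on made-up sample values; regexec() and regmatches() are base R functions, while x, parts and res are just illustrative names.

```r
x <- c("enc0", "harab25", "requi100")

# regexec() records the full match plus each capturing group
m <- regmatches(x, regexec("^([a-z]+)(\\d+)$", x))

# Each list element is c(full match, group 1, group 2)
parts <- do.call(rbind, m)
res <- data.frame(name  = parts[, 2],
                  value = as.integer(parts[, 3]),
                  stringsAsFactors = FALSE)
res
```

Strings that do not match the whole pattern yield an empty element in m, so real data may need a check before the rbind step.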

If you really want to do it with separate, you can add one more step, although I don't see the point (using the same regex as @WiktorStribiżew):

library(dplyr)
library(tidyr)

df %>% 
  mutate(A = gsub('^([a-z]+)(\\d+)$', '\\1_\\2', A)) %>% 
  separate(A, into = c('name', 'value'), sep = '_')
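
What the gsub step does to each string can be checked on its own in base R. The sample vector x below is made up, and the sketch assumes the original values contain no underscore already, otherwise the later split would produce extra pieces.

```r
x <- c("enc10", "harab100")

# Put an underscore between the captured letters (\\1) and digits (\\2)
tagged <- gsub("^([a-z]+)(\\d+)$", "\\1_\\2", x)
tagged

# Splitting on the literal underscore now needs no regex at all
pieces <- strsplit(tagged, "_", fixed = TRUE)
pieces
```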

For a base R version without a lookaround-based regex, define the regular expression first:

> re <- "[a-zA-Z][0-9]"

Then use two substr() commands to separate and return the desired two components, before and after the matched pattern.

> with(df,
      data.frame(name=substr(A, 1L, regexpr(re, A)), 
                 value=substr(A, regexpr(re, A) + 1L, 1000L))
      )
    name value
1    enc     0
2    enc    10
3    enc    25
4    enc   100
5  harab     0
6  harab    25
7  harab   100
8  requi     0
9  requi    25
10 requi   100

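The regex here looks for any letter ([a-zA-Z]) immediately followed by any digit ([0-9]). regexpr() returns the position of that letter, which is the last character of the name, so the first substr() ends there and the second starts one character later. I believe this is the same pattern the reshape() function uses when its sep argument is specified as "".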
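
A short sketch of the positions involved, on made-up sample values. Using nchar(A) as the upper bound instead of 1000L is my own substitution, not part of the answer above.

```r
A  <- c("enc0", "harab25", "requi100")
re <- "[a-zA-Z][0-9]"

# regexpr() returns where the letter-digit pair starts,
# i.e. the index of the last letter of the name
pos <- as.integer(regexpr(re, A))
pos

name  <- substr(A, 1L, pos)
# nchar(A) as the upper bound replaces the arbitrary 1000L
value <- substr(A, pos + 1L, nchar(A))
data.frame(name, value)
```

Note that regexpr() returns -1 for strings with no letter-digit pair, so such rows would need separate handling.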

You could also use the unglue package:

library(unglue)
unglue_unnest(df, A, "{name=\\D+}{value}")
#>     name value
#> 1    enc     0
#> 2    enc    10
#> 3    enc    25
#> 4    enc   100
#> 5  harab     0
#> 6  harab    25
#> 7  harab   100
#> 8  requi     0
#> 9  requi    25
#> 10 requi   100

Created on 2019-10-08 by the reprex package (v0.3.0)
