I have a vector of character data. Most of the elements in the vector consist of one or more letters followed by one or more numbers. I wish to split each element in the
Late answer, but another option is to use strsplit with a regex pattern that uses lookarounds to find the boundary between numbers and letters:
var <- "ABC123"
strsplit(var, "(?=[A-Za-z])(?<=[0-9])|(?=[0-9])(?<=[A-Za-z])", perl=TRUE)
[[1]]
[1] "ABC" "123"
The above pattern matches (without consuming any characters) wherever the previous character is a letter and the following character is a digit, or vice versa. Note that strsplit must be called with perl=TRUE for lookarounds to be available.
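As a rough sketch of how the same split could be applied to a whole vector and collected into columns (the vector x and the names parts and split_df are made up for illustration, and the code assumes each element is one run of letters followed by one run of digits):

```r
# Hypothetical sample vector; not taken from the original answer
x <- c("ABC123", "b11", "ffffd13")

# Same lookaround pattern as above: a zero-width match at each letter/digit boundary
pattern <- "(?=[A-Za-z])(?<=[0-9])|(?=[0-9])(?<=[A-Za-z])"
parts <- strsplit(x, pattern, perl = TRUE)

# Assumption: one letter run followed by one digit run per element;
# a missing second piece would come out as NA
split_df <- data.frame(
  text = vapply(parts, function(p) p[1], character(1)),
  num  = as.numeric(vapply(parts, function(p) p[2], character(1))),
  stringsAsFactors = FALSE
)
print(split_df)
```

Elements with more than two runs (for example "a1b2") would need different handling, since only the first two pieces are kept here.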
For your regex you have to use:
gsub("[[:digit:]]","",my.data)
The [:digit:] character class only makes sense inside a set of [].
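A small sketch of the bracketed class in both directions, assuming a shortened version of the sample vector (letters_only and digits_only are illustrative names): negating the class inside the outer brackets removes everything that is not a digit.

```r
# Shortened sample data, for illustration only
my.data <- c("aaa", "b11", "ffffd13")

# [:digit:] sits inside an outer [] to form a valid bracket expression
letters_only <- gsub("[[:digit:]]", "", my.data)

# A leading ^ inside the outer brackets negates the class
digits_only <- gsub("[^[:digit:]]", "", my.data)

print(letters_only)
print(digits_only)
```

Elements without digits give an empty string in digits_only, so as.numeric() on that result would yield NA for them.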
You can also use colsplit from reshape2 to split your vector into character and digit columns in one step:
library(reshape2)
colsplit(my.data, "(?<=\\p{L})(?=[\\d+$])", c("char", "digit"))
Result:
   char digit
1   aaa    NA
2     b    11
3     b    21
4     b   101
5     b   111
6   ccc     1
7 ffffd     1
8   ccc    20
9 ffffd    13
Data:
my.data <- c("aaa", "b11", "b21", "b101", "b111", "ccc1", "ffffd1", "ccc20", "ffffd13")
With stringr, if you like (and slightly different from the answer to the other question):
# load library
library(stringr)
#
# load data
my.data <- c("aaa", "b11", "b21", "b101", "b111", "ccc1", "ffffd1", "ccc20", "ffffd13")
#
# extract numbers only
my.data.num <- as.numeric(str_extract(my.data, "[0-9]+"))
#
# check output
my.data.num
[1] NA 11 21 101 111 1 1 20 13
#
# extract characters only
my.data.cha <- str_extract(my.data, "[A-Za-z]+")
#
# check output
my.data.cha
[1] "aaa" "b" "b" "b" "b" "ccc" "ffffd" "ccc" "ffffd"
mydata.nub <- gsub("\\D", "", my.data)
mydata.text <- gsub("\\d", "", my.data)
This works well: it separates the digits from the text even when digits occur between letters.
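To illustrate the case of digits sitting between letters, here is a minimal sketch with made-up input (mixed, nums and text are illustrative names): every non-digit is deleted to get the numbers, and every digit is deleted to get the text, so separate runs are concatenated.

```r
# Hypothetical strings with digits interleaved between letters
mixed <- c("ab12cd34", "x1y2z3")

# \\D matches any non-digit, \\d matches any digit
nums <- gsub("\\D", "", mixed)
text <- gsub("\\d", "", mixed)

print(nums)
print(text)
```

Because the runs are merged, this approach suits cases where the order of letters and digits does not need to be preserved.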
Since none of the previous answers use tidyr::separate, here it goes:
library(tidyr)
df <- data.frame(mycol = c("APPLE348744", "BANANA77845", "OATS2647892", "EGG98586456"))
df %>%
  separate(mycol,
           into = c("text", "num"),
           sep = "(?<=[A-Za-z])(?=[0-9])"
  )
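If the numeric part should end up as a number rather than a string, one possible variation is to pass convert = TRUE, which asks separate to run type conversion on the new columns. This is a sketch on a shortened version of the data frame; res is an illustrative name.

```r
library(tidyr)

# Shortened sample data, for illustration only
df <- data.frame(mycol = c("APPLE348744", "BANANA77845"))

# convert = TRUE converts the new columns where possible,
# so "num" becomes numeric while "text" stays character
res <- separate(df, mycol,
                into = c("text", "num"),
                sep = "(?<=[A-Za-z])(?=[0-9])",
                convert = TRUE)
str(res)
```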