Ordering a complex string vector in order to obtain an ordered factor

Submitted by 半世苍凉 on 2020-01-11 07:46:55

Question


I'm working with a string vector with a structure corresponding to the one below:

messy_vec <- c("0 - 9","100 - 150","21 - abc","50 - 56","70abc - 80")

I want to convert this vector to a factor whose levels are ordered numerically by the leading digit(s) of each element. The code:

messy_vec_fac <- as.factor(messy_vec)

would produce

> messy_vec_fac
[1] 0 - 9      100 - 150  21 - abc   50 - 56    70abc - 80
Levels: 0 - 9 100 - 150 21 - abc 50 - 56 70abc - 80

whereas I'm interested in obtaining a factor with the following characteristics:

[1] 0 - 9      100 - 150  21 - abc   50 - 56    70abc - 80

Levels: 0 - 9 21 - abc 50 - 56 70abc - 80 100 - 150

As indicated, the levels are sorted by the sequence:

0 21 50 70 100

These are the leading numbers, i.e. the first run of digits, taken from each element of the messy vector.

Side points

This is not crucial, but ideally the solution should not assume a maximum number of digits in the first part of each element. Values such as the following may occur:

  • 8787abc - 89898 deff - in this case the value 8787 should be used to determine the order
  • 001 def - 1111 OHMG - in this case the value 1 should be used to determine the order
  • It can be safely assumed that every vector element contains a dash surrounded by whitespace: [[:space:]]-[[:space:]]
  • Duplicate values occur

Edits

Following a very useful suggestion by CathG, I'm trying to fit this solution into a larger dplyr pipeline:

# ... %>%
  mutate(very_needed_factor= factor(messy_vec,
                                      levels = messy_vec[
                                        order(
                                          as.numeric(
                                            sub("(\\d+)[^\\d]* - .*", "\\1",
                                                messy_vec)))]))
# %>% ...

But I keep getting the following warnings:

Warning messages:
1: In order(as.numeric(sub("(\\d+)[^\\d]* - .*", "\\1", c("12-14",  :
  NAs introduced by coercion
2: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels,  :
  duplicated levels in factors are deprecated
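
The two warnings above can be traced with a short check. This is only a sketch: the vector x is made up for illustration, combining a value without spaces around the dash ("12-14", as seen in the warning text) with a repeated entry. It relies on the fact that sub() returns its input unchanged when the pattern does not match.

```r
# Made-up data: "12-14" lacks the " - " separator, "0 - 9" appears twice.
x <- c("12-14", "0 - 9", "100 - 150", "0 - 9")

# Where the pattern does not match, sub() leaves the element as it is,
# so as.numeric() cannot convert it and yields NA.
extracted <- sub("(\\d+)[^\\d]* - .*", "\\1", x)
num <- suppressWarnings(as.numeric(extracted))

x[is.na(num)]     # elements behind "NAs introduced by coercion": "12-14"
x[duplicated(x)]  # elements behind "duplicated levels": "0 - 9"
```

Elements listed by the first check do not fit the assumed " - " separator; elements listed by the second need unique() around the levels.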

Answer 1:


If I understood correctly what you want to do, you can capture the leading digits of each string with sub and convert them to numeric; these numbers can then be used to order the levels in the factor call.

num_vec <- as.numeric(sub("(\\d+)[^\\d]* - .*", "\\1", messy_vec))
messy_vec_fac <- factor(messy_vec, levels=messy_vec[order(num_vec)])

messy_vec_fac
#[1] 0 - 9      100 - 150  21 - abc   50 - 56    70abc - 80
#Levels: 0 - 9 21 - abc 50 - 56 70abc - 80 100 - 150

NB: if the vector contains duplicated values, use levels=unique(messy_vec[order(num_vec)]) in the factor call.
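
The same idea can be wrapped in a small helper that also covers duplicates and the edge cases listed in the question. This is a sketch under assumptions, not part of the original answer: the name factor_by_leading_number is hypothetical, and the pattern looks only at the digits at the start of each element, so it does not depend on the form of the separator.

```r
# Hypothetical helper: factor whose levels follow the leading number.
factor_by_leading_number <- function(x) {
  # Keep only the run of digits at the start of each element
  # (optional leading whitespace is skipped); "001" becomes 1.
  lead <- as.numeric(sub("^\\s*(\\d+).*$", "\\1", x))
  # order() sorts numerically; unique() keeps one copy of each level.
  factor(x, levels = unique(x[order(lead)]))
}

messy <- c("8787abc - 89898 deff", "0 - 9", "001 def - 1111 OHMG",
           "100 - 150", "0 - 9")
res <- factor_by_leading_number(messy)
levels(res)
# "0 - 9" "001 def - 1111 OHMG" "100 - 150" "8787abc - 89898 deff"
```

An element that does not start with a digit would still produce NA (with a coercion warning), and order() would place it last by default.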




Answer 2:


Here is another solution

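library(magrittr)
messy_vec <- c("0 - 9","100 - 150","21 - abc","50 - 56","70abc - 80")
# Split each element at "-", drop spaces and letters, and arrange the
# remaining numbers in a 2-row matrix (row 1: first number, row 2: second).
ints <- strsplit(messy_vec, "-") %>%
  unlist() %>%
  gsub(pattern = "([[:space:]]|[[:alpha:]])*", replacement = "") %>%
  as.integer() %>%
  matrix(nrow = 2)
# Order the levels by the first number, breaking ties with the second.
factor(messy_vec, levels = messy_vec[order(ints[1, ], ints[2, ])])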
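
For comparison, the two-key ordering can also be written in base R without magrittr, with unique() added so that repeated elements do not produce duplicated levels. This is a rough sketch rather than the answer's own code: the vector x and the helper to_int are made up for illustration, and it assumes each element contains a single "-" with at most one number on either side.

```r
# Made-up data with a tie on the first number ("50") and a repeated element.
x <- c("50 - 56", "0 - 9", "50 - 52", "21 - abc", "0 - 9")

parts <- strsplit(x, "-", fixed = TRUE)

# Hypothetical helper: strip every non-digit; an empty result becomes NA.
to_int <- function(s) as.integer(gsub("\\D", "", s))

lower <- vapply(parts, function(p) to_int(p[1]), integer(1))
upper <- vapply(parts, function(p) to_int(p[2]), integer(1))

# Sort by the first number, then by the second; keep each level once.
f <- factor(x, levels = unique(x[order(lower, upper)]))
levels(f)
# "0 - 9" "21 - abc" "50 - 52" "50 - 56"
```

The second key matters only when two distinct elements share the same first number, as "50 - 52" and "50 - 56" do here.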


Source: https://stackoverflow.com/questions/33522278/ordering-a-complex-string-vector-in-order-to-obtain-a-ordered-factor
