How to split a character vector based on a numeric vector for positions

Submitted by 大兔子大兔子 on 2019-12-12 16:41:52

Question


I would like to split a character vector into substrings, using a second, numeric vector to give the split points:

vec <- "LAYRVCMTNEGHPWVSLVVQKTRLQISQDPSLNYEYLPTMGLKSFIQASLALLFGKHSQAIVENRVGGVHTVGDSGAFQLGVQFLRAWHKDARIVYIISSQKELHGLVFQDMGFTVYEYSVWDPKKLCMDPDILLNVVEQIPHGCVLVMGNIIDCKLTPSGWAKLMSM"
split.points <- c(25, 32, 55, 90, 124)

I would like to cut the above character vector at the positions given in the split.points vector into six different substrings.

It sounds very simple, but the split functions I know work only with a regex pattern or with a fixed substring length.

I would appreciate any help.


Answer 1:


We can try substring:

substring(
    vec,
    c(1, split.points + 1),
    c(split.points, nchar(vec))
)
# [1] "LAYRVCMTNEGHPWVSLVVQKTRLQ"                    "ISQDPSL"                                     
# [3] "NYEYLPTMGLKSFIQASLALLFG"                      "KHSQAIVENRVGGVHTVGDSGAFQLGVQFLRAWHK"         
# [5] "DARIVYIISSQKELHGLVFQDMGFTVYEYSVWDP"           "KKLCMDPDILLNVVEQIPHGCVLVMGNIIDCKLTPSGWAKLMSM"
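
The same substring call can be wrapped in a small reusable function. This is only a sketch: the name split_at and its input checks are assumptions for illustration and are not part of the original answer. It splits a single string after each position in pos and ignores positions that fall outside the string.

```r
# Hypothetical helper (not from the original answer): split one string
# after each position given in `pos`.
split_at <- function(x, pos) {
  stopifnot(is.character(x), length(x) == 1L)
  pos <- sort(unique(pos))
  # Keep only cut points that fall strictly inside the string
  pos <- pos[pos >= 1 & pos < nchar(x)]
  # Each piece starts one character after the previous cut point
  substring(x, c(1, pos + 1), c(pos, nchar(x)))
}

split_at("abcdefghij", c(3, 5))
# returns c("abc", "de", "fghij")
```

Called as split_at(vec, split.points), it should reproduce the six pieces shown above.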



Answer 2:


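Another alternative is to use read.fwf. Note that the widths used here cover only the first five pieces; passing widths = diff(c(0, split.points, nchar(vec))) instead would also return the sixth: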

unlist(read.fwf(textConnection(vec), 
                widths = c(25, diff(split.points)), 
                as.is = TRUE), 
       use.names = FALSE)

which gives:

[1] "LAYRVCMTNEGHPWVSLVVQKTRLQ"          
[2] "ISQDPSL"                            
[3] "NYEYLPTMGLKSFIQASLALLFG"            
[4] "KHSQAIVENRVGGVHTVGDSGAFQLGVQFLRAWHK"
[5] "DARIVYIISSQKELHGLVFQDMGFTVYEYSVWDP"
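
To get all six pieces with read.fwf, the widths need to cover the whole string. A minimal sketch, assuming the vec and split.points from the question; the variable names widths, con and pieces, and the use of colClasses = "character" to guard against type conversion, are illustrative choices rather than part of the original answer.

```r
vec <- "LAYRVCMTNEGHPWVSLVVQKTRLQISQDPSLNYEYLPTMGLKSFIQASLALLFGKHSQAIVENRVGGVHTVGDSGAFQLGVQFLRAWHKDARIVYIISSQKELHGLVFQDMGFTVYEYSVWDPKKLCMDPDILLNVVEQIPHGCVLVMGNIIDCKLTPSGWAKLMSM"
split.points <- c(25, 32, 55, 90, 124)

# One width per piece: the gaps between 0, the split points, and the end
widths <- diff(c(0, split.points, nchar(vec)))

con <- textConnection(vec)
pieces <- unlist(
  # colClasses = "character" keeps every field as text
  read.fwf(con, widths = widths, colClasses = "character"),
  use.names = FALSE
)
close(con)

length(pieces)   # number of pieces returned
```

Pasting the pieces back together with paste(pieces, collapse = "") is a quick way to check that nothing was dropped.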

I wouldn't be surprised if your character vector originated from a data file. In that case read.fwf would be especially useful. An example:

vec2 <- "LAYRVCMTNEGHPWVSLVVQKTRLQISQDPSLNYEYLPTMGLKSFIQASLALLFGKHSQAIVENRVGGVHTVGDSGAFQLGVQFLRAWHKDARIVYIISSQKELHGLVFQDMGFTVYEYSVWDPKKLCMDPDILLNVVEQIPHGCVLVMGNIIDCKLTPSGWAKLMSM
LAYRVCMTNEGHPWVSLVVQKTRLQISQDPSLNYEYLPTMGLKSFIQASLALLFGKHSQAIVENRVGGVHTVGDSGAFQLGVQFLRAWHKDARIVYIISSQKELHGLVFQDMGFTVYEYSVWDPKKLCMDPDILLNVVEQIPHGCVLVMGNIIDCKLTPSGWAKLMSM"

read.fwf(textConnection(vec2), 
         widths = c(25, diff(split.points)), 
         as.is=TRUE)

which will give:

                         V1      V2                      V3                                  V4                                 V5
1 LAYRVCMTNEGHPWVSLVVQKTRLQ ISQDPSL NYEYLPTMGLKSFIQASLALLFG KHSQAIVENRVGGVHTVGDSGAFQLGVQFLRAWHK DARIVYIISSQKELHGLVFQDMGFTVYEYSVWDP
2 LAYRVCMTNEGHPWVSLVVQKTRLQ ISQDPSL NYEYLPTMGLKSFIQASLALLFG KHSQAIVENRVGGVHTVGDSGAFQLGVQFLRAWHK DARIVYIISSQKELHGLVFQDMGFTVYEYSVWDP
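
When the records really do live in a file, read.fwf can read the file path directly. The sketch below uses a made-up two-line file with short 10-character records; the file contents, the cut points and the names tmp, cuts and segments are all assumptions for illustration.

```r
# Hypothetical two-record file, written to a temporary path for illustration
tmp <- tempfile(fileext = ".txt")
writeLines(c("abcdefghij", "klmnopqrst"), tmp)

cuts <- c(3, 5)                  # split after the 3rd and 5th character
widths <- diff(c(0, cuts, 10))   # 3, 2, 5: every record is 10 characters long

# One row per record, one column per piece, all kept as character
segments <- read.fwf(tmp, widths = widths, colClasses = "character")
unlink(tmp)

segments$V1
# returns c("abc", "klm")
```

This assumes every record has the same length; shorter lines would give shorter or missing final fields.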



Answer 3:


We can use separate from tidyr:

library(tidyverse)
# tibble() replaces the deprecated data_frame()
tibble(vec) %>%
      separate(vec, into = paste0('vec', 1:6), sep = split.points) %>% 
      unlist(use.names = FALSE)
#[1] "LAYRVCMTNEGHPWVSLVVQKTRLQ"                    "ISQDPSL"                                      "NYEYLPTMGLKSFIQASLALLFG"                     
#[4] "KHSQAIVENRVGGVHTVGDSGAFQLGVQFLRAWHK"          "DARIVYIISSQKELHGLVFQDMGFTVYEYSVWDP"
#[6] "KKLCMDPDILLNVVEQIPHGCVLVMGNIIDCKLTPSGWAKLMSM"

A base R option would be substr, applied once per start/stop pair with mapply:

unname(mapply(substr, vec, start = c(1, split.points+1), stop = c(split.points, nchar(vec))))
#[1] "LAYRVCMTNEGHPWVSLVVQKTRLQ"                    "ISQDPSL"                                      "NYEYLPTMGLKSFIQASLALLFG"                     
#[4] "KHSQAIVENRVGGVHTVGDSGAFQLGVQFLRAWHK"          "DARIVYIISSQKELHGLVFQDMGFTVYEYSVWDP"           "KKLCMDPDILLNVVEQIPHGCVLVMGNIIDCKLTPSGWAKLMSM"


Source: https://stackoverflow.com/questions/44262618/how-to-split-a-character-vector-based-on-a-numeric-vector-for-positions
