Extract a part of a string in R

Submitted by 微笑、不失礼 on 2020-01-05 04:02:22

Question


I have 5 million sequences (probes to be specific) as below. I need to extract the name from each string.

The names here are 1007_s_at:123:381, 1007_s_at:128:385, and so on.

I am using lapply, but it takes too much time, and I have several other similar files. Could you suggest a faster way to do this?

 nm = c(
  "probe:HG-Focus:1007_s_at:123:381; Interrogation_Position=3570; Antisense;",
  "probe:HG-Focus:1007_s_at:128:385; Interrogation_Position=3615; Antisense;",
  "probe:HG-Focus:1007_s_at:133:441; Interrogation_Position=3786; Antisense;",
  "probe:HG-Focus:1007_s_at:142:13; Interrogation_Position=3878; Antisense;" ,
  "probe:HG-Focus:1007_s_at:156:191; Interrogation_Position=3443; Antisense;",
  "probe:HTABC:1007_s_at:244:391; Interrogation_Position=3793; Antisense;")

extractProbe <- function(x) sub("probe:", "", strsplit(x, ";", fixed=TRUE)[[1]][1], ignore.case=TRUE)
pr = lapply(nm, extractProbe)

Desired output

1007_s_at:123:381
1007_s_at:128:385
1007_s_at:133:441
1007_s_at:142:13
1007_s_at:156:191
1007_s_at:244:391

Answer 1:


Using regular expressions:

sub("probe:(.*?):(.*?);.*$", "\\2", nm, perl = TRUE)

A bit of explanation:

  1. . means "any character".
  2. .* means "any number of characters" (zero or more).
  3. .*? means "any number of characters, but do not be greedy", i.e. match as few characters as possible.
  4. Patterns within parentheses are captured and assigned to \\1, \\2, etc.
  5. $ means end of the line (or string).

So here, the pattern matches the whole line, and captures two things via the two (.*?): the HG-Focus (or other) thing you don't want as \\1 and your id as \\2. By setting the replacement to \\2, we are effectively replacing the whole string with your id.
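To make the role of the capture groups and the lazy quantifier concrete, here is a small sketch on a single string from the question. The variable names (x, pattern, first_group, and so on) are only illustrative and do not come from the original answer.

```r
# One sample string from the question
x <- "probe:HG-Focus:1007_s_at:123:381; Interrogation_Position=3570; Antisense;"

# Lazy pattern from the answer: each (.*?) stops at the first delimiter it can
pattern <- "probe:(.*?):(.*?);.*$"

first_group  <- sub(pattern, "\\1", x, perl = TRUE)  # the array name
second_group <- sub(pattern, "\\2", x, perl = TRUE)  # the probe id

print(first_group)   # "HG-Focus"
print(second_group)  # "1007_s_at:123:381"

# Same pattern with greedy quantifiers: the first group now runs up to the
# LAST colon and the second group up to the LAST semicolon.
greedy <- sub("probe:(.*):(.*);.*$", "\\2", x, perl = TRUE)
print(greedy)        # "381; Interrogation_Position=3570; Antisense"
```

The greedy result shows why the ? matters here: without it, the captured text is no longer the probe id.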

I now realize it was not necessary to capture the first thing, so this would work just as well:

sub("probe:.*?:(.*?);.*$", "\\1", nm, perl = TRUE)
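Because sub is vectorised over its input, the whole character vector can be passed in one call, with no lapply loop. The sketch below checks this on a replicated sample. The names big, per_element and vectorised, and the replication factor, are assumptions made for illustration. Actual timings depend on the machine and data, so none are quoted here.

```r
# Two sample strings from the question, replicated to get a larger test vector
nm <- c(
  "probe:HG-Focus:1007_s_at:123:381; Interrogation_Position=3570; Antisense;",
  "probe:HTABC:1007_s_at:244:391; Interrogation_Position=3793; Antisense;")
big <- rep(nm, times = 5000)  # 10,000 strings; size chosen arbitrarily

pattern <- "probe:.*?:(.*?);.*$"

# Element-by-element, in the style of the question's lapply call
t_loop <- system.time(
  per_element <- unlist(lapply(big, function(s) sub(pattern, "\\1", s, perl = TRUE)))
)

# One vectorised call over the whole vector
t_vec <- system.time(
  vectorised <- sub(pattern, "\\1", big, perl = TRUE)
)

# Both approaches give the same ids; compare t_loop and t_vec on your own data
print(identical(per_element, vectorised))
print(head(vectorised, 2))
```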



Answer 2:


A roundabout technique using strsplit (note that the result keeps the array prefix, e.g. HG-Focus:1007_s_at:123:381):

sapply(strsplit(sapply(strsplit(nm, "e:"), "[[", 2), ";"), "[[", 1)
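The nested call is easier to follow when unpacked into steps. This is a sketch with illustrative intermediate names (after_probe, before_semicolon, ids) that are not part of the original answer. The last step, which drops the array prefix, is an added assumption about what the asker wants, based on the output listed in the question.

```r
nm <- c(
  "probe:HG-Focus:1007_s_at:123:381; Interrogation_Position=3570; Antisense;",
  "probe:HTABC:1007_s_at:244:391; Interrogation_Position=3793; Antisense;")

# Step 1: split on "e:" (the end of "probe:") and keep the second piece
after_probe <- sapply(strsplit(nm, "e:"), "[[", 2)

# Step 2: split on ";" and keep the first piece
before_semicolon <- sapply(strsplit(after_probe, ";"), "[[", 1)
print(before_semicolon)  # "HG-Focus:1007_s_at:123:381" "HTABC:1007_s_at:244:391"

# Extra step (not in the original answer): drop everything up to the first colon
ids <- sub("^[^:]*:", "", before_semicolon)
print(ids)               # "1007_s_at:123:381" "1007_s_at:244:391"
```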


Source: https://stackoverflow.com/questions/12443278/r-extract-a-part-of-a-string-in-r
