Regex Protein Digestion

Submitted by 牧云@^-^@ on 2019-12-04 03:46:00

The simplest way to support this is to split on the zero-width lookahead:

s = "MTMDKPSQYDKIEAELQDICNDVLELLDSKG"
p s.split(/(?=[BD])/)
#=> ["MTM", "DKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG"]
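As a small illustration of why the lookahead matters here, the sketch below contrasts it with a plain character-class split. The variable names are only for this example; the point is that a consuming pattern removes the B or D, while the zero-width lookahead keeps it at the start of each piece.

```ruby
s = "MTMDKPSQYDKIEAELQDICNDVLELLDSKG"

# A consuming pattern uses the B or D as the delimiter, so it is dropped.
consumed = s.split(/[BD]/)
p consumed
#=> ["MTM", "KPSQY", "KIEAELQ", "ICN", "VLELL", "SKG"]

# A zero-width lookahead splits at the position before the B or D,
# so the character stays at the start of the next piece.
kept = s.split(/(?=[BD])/)
p kept
#=> ["MTM", "DKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG"]
```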

To understand what was going wrong with your solution, first compare your regex with one that works:

p s.scan(/.*?(?=[BD]|$)/)
#=> ["MTM", "", "KPSQY", "", "KIEAELQ", "", "ICN", "", "VLELL", "", "SKG", ""]

p s.scan(/.+?(?=[BD]|$)/)
#=> ["MTM", "DKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG"]
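To see where the empty strings come from, it can help to print the starting index of each match. This is a minimal sketch; `match_positions` is a hypothetical helper written for this example, relying on the fact that the block form of `String#scan` sets `$~` for each match.

```ruby
# Hypothetical helper: returns [start_index, matched_text] for each scan match.
def match_positions(str, re)
  found = []
  # In block form, scan sets $~ (the MatchData) for every match.
  str.scan(re) { found << [$~.begin(0), $~[0]] }
  found
end

s = "MTMDKPSQYDKIEAELQDICNDVLELLDSKG"
p match_positions(s, /.*?(?=[BD]|$)/).first(4)
#=> [[0, "MTM"], [3, ""], [4, "KPSQY"], [9, ""]]
```

The empty match at index 3 sits right before the first D, and the next match starts at index 4, after that D.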

The problem is that if the pattern can match zero characters and still satisfy the zero-width lookahead, the match succeeds without advancing the scanning pointer. Let's look at a simpler but similar test case:

s = "abcd"
p s.scan(//)       # Match at every position, without advancing
#=> ["", "", "", "", ""]

p s.scan(/(?=.)/)  # Match at every position that is followed by a character, without advancing
#=> ["", "", "", ""]

A naive implementation of String#scan might get stuck in an infinite loop, repeatedly matching with the pointer before the first character. It appears that once a match occurs without advancing the pointer, the algorithm forcibly advances the pointer by one character. This explains the results in your case:

  1. First it matches all the characters up to a B or D,
  2. then it matches the zero-width position right before the B or D, without moving the character pointer,
  3. as a result the algorithm moves the pointer past the B or D, and continues on after that.
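The three steps above can be imitated with a hand-written loop. This is only a sketch of the behavior described, not Ruby's actual implementation: `naive_scan` is a hypothetical helper, and the "advance by one on an empty match" rule is the assumption taken from the observation above.

```ruby
# Hypothetical helper (not Ruby's real implementation): imitates the
# observed behavior of String#scan for patterns without capture groups.
def naive_scan(str, re)
  results = []
  pos = 0
  while pos <= str.length
    m = re.match(str, pos)   # search starting at character index pos
    break unless m
    results << m[0]
    if m.end(0) == m.begin(0)
      # Empty match: force the pointer forward by one character so the
      # loop cannot match the same position forever.
      pos = m.end(0) + 1
    else
      # Non-empty match: continue right after the matched text.
      pos = m.end(0)
    end
  end
  results
end

s = "MTMDKPSQYDKIEAELQDICNDVLELLDSKG"
p naive_scan(s, /.*?(?=[BD]|$)/) == s.scan(/.*?(?=[BD]|$)/)
#=> true
```

Without the forced advance in the empty-match branch, the loop would keep matching at index 3 and never terminate.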

Basically, you want to cut your string before each B or D?

"...".split(/(?=[BD])/)

This gives you:

["MTM", "DKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG", "DYFRYLSEVASG", "DN"]