Creating more complex regexes from TAG format

无人久伴 submitted on 2020-01-06 14:07:57

Question


So I can't figure out what's wrong with my regex here. (The original conversation, which includes an explanation of these TAG formats, can be found here: Translate from TAG format to Regex for Corpus).

I am starting with a string like this:

Arms_NNS folded_VVN ,_,

The NNS could also be NN, and the VVN could also be VBG. I just want to find that string and any others with the same tag sequence (NNS or NN, followed by VVN or VBG, followed by a comma).

The following regex is what I am trying to use, but it is not finding anything:

[\w-]+_(?:NN|NNS)\W+[\w-]+ _(?:VBG|VVN)\W+[\w-]+ _,

Answer 1:


Given the input string

Arms_NNS folded_VVN ,_,

the following regex

(\w+_(?:NN|NNS) \w+_(?:VBG|VVN) ,_,)

matches the whole string (and captures it - if you don't know what that means, that probably means it doesn't matter to you).
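
To make "captures it" concrete, here is a minimal sketch using Python's re module (the choice of Python is an assumption; the question does not name a regex engine). Group 0 is the whole match and group 1 is whatever the outer parentheses captured:

```python
import re

# The answer's pattern; the outer parentheses form capture group 1.
pattern = re.compile(r"(\w+_(?:NN|NNS) \w+_(?:VBG|VVN) ,_,)")

text = "Arms_NNS folded_VVN ,_,"
match = pattern.search(text)

whole = match.group(0)         # everything the regex matched
captured = match.group(1)      # what the outer (...) captured
all_groups = match.groups()    # (?:...) groups are not listed here

print(whole)
print(captured)
print(len(all_groups))
```

Only one group shows up in groups(), because the two inner (?:...) groups match without capturing.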

Given a longer string (which I made up)

Dog_NN Arms_NNS folded_VVN ,_, burp_VV

it still matches the part you want.
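
A quick way to check this, again assuming Python's re module. The outer capturing parentheses are dropped in this sketch so that findall returns the full matched text:

```python
import re

# Same pattern as above, without the outer capture group.
pattern = re.compile(r"\w+_(?:NN|NNS) \w+_(?:VBG|VVN) ,_,")

longer = "Dog_NN Arms_NNS folded_VVN ,_, burp_VV"

hits = pattern.findall(longer)   # every non-overlapping match
first = pattern.search(longer)   # first match, with its position

print(hits)
print(first.start(), first.end())
```

Dog_NN is skipped because the word after it is not tagged VBG or VVN, so the match starts at Arms_NNS.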

If the _VVN part is optional, you can use

(\w+_(?:NN|NNS) (?:\w+_(?:VBG|VVN) )?,_,)

which matches either without, or with exactly one, word_VVN / word_VBG part.
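
The optional variant can be tried on a few inputs like this. The sample strings other than the original one are made up for illustration, and Python's re module is again an assumption:

```python
import re

# The verb part sits inside (?:...)? so it may appear zero or one time.
optional_verb = re.compile(r"(\w+_(?:NN|NNS) (?:\w+_(?:VBG|VVN) )?,_,)")

samples = [
    "Arms_NNS ,_,",                         # no verb
    "Arms_NNS folded_VVN ,_,",              # exactly one verb
    "Arms_NNS folded_VVN folding_VBG ,_,",  # two verbs
]

# Map each sample to whether the pattern is found in it.
results = {s: optional_verb.search(s) is not None for s in samples}

for sample, found in results.items():
    print(found, sample)
```

The third sample is not matched, since ? allows at most one verb between the noun and the comma.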


Your more general questions:

I find it hard to explain how these things work. I'll try to explain the constituent parts:

  • \w matches word characters - characters you'd normally expect to find in words
  • \w+ matches one-or-more of them (\w* would match zero-or-more)
  • (NN|NNS) means "match NN or NNS"
  • ?: at the start of a group, as in (?:NN|NNS), means "match but don't capture" - I suggest googling what capturing means in relation to regexes.
  • ? alone means "match 0 or 1 of the thing before me" - so x? would match "" or "x" but not "xx".
  • None of the characters in ,_, are special, so we can match them just by putting them in the regex.
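
Each of the parts listed above can be checked in isolation. This is a small sketch with Python's re module (an assumption, as the question does not say which engine is in use); the variable names are only for illustration:

```python
import re

# \w+ needs at least one word character; \w* also accepts none at all.
plus_on_empty = re.fullmatch(r"\w+", "")   # no match
star_on_empty = re.fullmatch(r"\w*", "")   # matches the empty string

# (NN|NNS) captures which alternative matched; (?:NN|NNS) only groups.
capturing = re.search(r"\w+_(NN|NNS) ", "Arms_NNS ").groups()
non_capturing = re.search(r"\w+_(?:NN|NNS) ", "Arms_NNS ").groups()

# x? matches zero or one "x", so "xx" as a whole is rejected.
optional_x = [re.fullmatch(r"x?", s) is not None for s in ("", "x", "xx")]

# ,_, contains no special characters, so it matches itself literally.
literal = re.search(r",_,", "folded_VVN ,_,").group(0)

print(plus_on_empty, star_on_empty is not None)
print(capturing, non_capturing)
print(optional_x)
print(literal)
```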

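One problem with your regex is that \w will not match a comma (only "word characters"), so the last [\w-]+ cannot match the "," in front of the final _,. Another is the literal space before _ in " _(?:VBG|VVN)" and " _,": in your input there is no space between a word and its tag.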

[\w-] is a valid character class: it matches either a word character or a literal hyphen, so [\w-]+ also covers hyphenated words.
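
Both points about \w and [\w-] can be tested directly. The sketch below assumes Python's re module, and the word well-known is a made-up example:

```python
import re

# \w does not match a comma, and it does not match a hyphen either.
comma_as_word = re.fullmatch(r"\w+", ",")               # no match
hyphen_plain = re.fullmatch(r"\w+", "well-known")       # no match

# [\w-] is accepted as a class: a word character or a literal hyphen.
hyphen_class = re.fullmatch(r"[\w-]+", "well-known")    # match

# The regex from the question, tried on the question's input.
original = re.compile(
    r"[\w-]+_(?:NN|NNS)\W+[\w-]+ _(?:VBG|VVN)\W+[\w-]+ _,"
)
original_hit = original.search("Arms_NNS folded_VVN ,_,")  # no match

print(comma_as_word, hyphen_plain)
print(hyphen_class is not None)
print(original_hit)
```

The question's regex requires a space directly followed by an underscore, and the input contains no such sequence, so it cannot match.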

My solution assumes there is exactly one space, and nothing else, between your tagged words.
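
If the corpus might contain tabs or repeated spaces between tokens, one possible variation (not part of the answer above, and only a sketch) is to replace each literal space with \s+. The irregular sample string is made up:

```python
import re

# The answer's pattern: exactly one space between tagged words.
strict = re.compile(r"\w+_(?:NN|NNS) \w+_(?:VBG|VVN) ,_,")

# Hypothetical variation: any run of whitespace between tagged words.
flexible = re.compile(r"\w+_(?:NN|NNS)\s+\w+_(?:VBG|VVN)\s+,_,")

irregular = "Arms_NNS   folded_VVN\t,_,"

strict_hit = strict.search(irregular)       # no match
flexible_hit = flexible.search(irregular)   # match

print(strict_hit)
print(flexible_hit.group(0) if flexible_hit else None)
```

Whether this is needed depends on how the corpus was tokenized, which the question does not say.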



Source: https://stackoverflow.com/questions/29829132/creating-more-complex-regexes-from-tag-format
