Intelligent pattern matching in string

拈花ヽ惹草 submitted on 2019-12-05 11:50:42
Marc Tanti

What you are looking for is called grammar induction. It works by making a program figure out a regular expression (or some other type of pattern) that matches certain strings but not others. You have to supply the strings yourself, however, as a training set made up of positive examples (strings that should be matched) and negative examples (strings that shouldn't be matched).

An interesting technique is called boosting, where you learn a lot of simple patterns that are each precise (they do not match any negative examples) but match only a few positive examples. Combined together, however, they match a large number of positive examples.
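
As a rough illustration of combining precise patterns, here is a minimal Python sketch. It is a simplified greedy covering loop, not a full boosting algorithm (there is no example weighting), and the candidate patterns are hand-written assumptions rather than learned ones. The function name `combine_precise_patterns` is hypothetical.

```python
import re

def combine_precise_patterns(candidates, positives, negatives):
    """Greedily union patterns that match no negative example."""
    # Keep only precise candidates: those that match none of the negatives.
    precise = [p for p in candidates
               if not any(re.fullmatch(p, n) for n in negatives)]
    chosen = []
    uncovered = set(positives)
    while uncovered and precise:
        # Pick the precise pattern that covers the most uncovered positives.
        best = max(precise,
                   key=lambda p: sum(bool(re.fullmatch(p, s)) for s in uncovered))
        newly_covered = {s for s in uncovered if re.fullmatch(best, s)}
        if not newly_covered:
            break  # no remaining pattern helps
        chosen.append(best)
        uncovered -= newly_covered
    # The combined pattern is the alternation of the chosen simple patterns.
    combined = "|".join(f"(?:{p})" for p in chosen)
    return combined, uncovered

# Hand-written candidate patterns (assumptions made for this sketch).
candidates = [
    r"\[[A-Z]+\] ",                 # upper-case tag followed by a space
    r"\[[A-Za-z]+-[A-Za-z]+\] ",    # hyphenated tag followed by a space
    r"\[[a-z]+\]_",                 # lower-case tag followed by an underscore
    r"\[.*",                        # too general: also matches negatives
]
positives = ["[MAS] ", "[Leopard-Raws] ", "[BLAST] ", "[sage]_"]
negatives = ["[MAS] H", "[Leopard-Raws] Akat", "[BL",
             "[sage]_Mobile_Suit_Gundam_AGE_"]

combined, uncovered = combine_precise_patterns(candidates, positives, negatives)
```

Each chosen pattern covers only one or two positives on its own, but their alternation covers all four while still rejecting every negative. The overly general `\[.*` is discarded because it matches negative examples.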

Since you want to extract substrings rather than just match strings, the way I would go about it is to take prefixes of the file names and try to match those. In this way you'd know where the substring starts. Here's an example:

Positives:
[MAS] 
[Leopard-Raws] 
[BLAST] 
[sage]_

Negatives:
[MAS] H
[Leopard-Raws] Akat
[BL
[sage]_Mobile_Suit_Gundam_AGE_
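
To make the role of the training set concrete, the following Python sketch checks candidate patterns against the positives and negatives above and returns the first one that is consistent with both. This is only a toy stand-in for grammar induction: a real learner would generate its own hypotheses, whereas the candidate list here is hand-written, and the names `is_consistent` and `pick_consistent` are hypothetical.

```python
import re

POSITIVES = ["[MAS] ", "[Leopard-Raws] ", "[BLAST] ", "[sage]_"]
NEGATIVES = ["[MAS] H", "[Leopard-Raws] Akat", "[BL",
             "[sage]_Mobile_Suit_Gundam_AGE_"]

def is_consistent(pattern, positives, negatives):
    """True if the pattern matches every positive and no negative."""
    regex = re.compile(pattern)
    return (all(regex.fullmatch(s) for s in positives)
            and not any(regex.fullmatch(s) for s in negatives))

def pick_consistent(candidates, positives, negatives):
    """Return the first candidate consistent with the training set, or None."""
    for pattern in candidates:
        if is_consistent(pattern, positives, negatives):
            return pattern
    return None

# Hand-written hypotheses (assumptions for this sketch), from too general
# to too specific to one that fits the examples.
CANDIDATES = [
    r"\[.*",               # too general: matches the negatives as well
    r"\[[A-Z]+\] ",        # too specific: misses "[sage]_" and "[Leopard-Raws] "
    r"\[[^\]]+\][ _]?",    # bracketed tag plus an optional space or underscore
]

learned = pick_consistent(CANDIDATES, POSITIVES, NEGATIVES)
```

Here `learned` ends up as the third candidate, since it is the only one that matches all positives and rejects all negatives.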

If done correctly, you should obtain a regular expression which you can use on prefixes of the file names. By growing the prefix one letter at a time you can know where the content of interest starts. Like this:

[ False
[s False
[sa False
[sag False
[sage False
[sage] True
[sage]_ True
[sage]_M False

What happened here is that I increased the prefix of the file name one character at a time until the regular expression I learnt matched it. But I also wanted to find the longest prefix that matches (because otherwise I would have missed the underscore since [sage] is an acceptable prefix as well) so I continued moving forward until the regular expression stopped matching. In this way I would know that the prefix before the actual content starts is "[sage]_". You can do the same for matching where it ends as well by using prefixes which include the content of interest.
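
The prefix-growing procedure described above can be sketched in Python as follows. The regular expression is a hand-written stand-in for whatever a learner would produce (an assumption, chosen so that both "[sage]" and "[sage]_" match, as in the trace), and `longest_matching_prefix` is a hypothetical helper name.

```python
import re

# Hand-written stand-in for a learned pattern (an assumption):
# a bracketed tag, optionally followed by a space or an underscore.
LEARNED = re.compile(r"\[[^\]]+\][ _]?")

def longest_matching_prefix(name, pattern=LEARNED):
    """Grow the prefix one character at a time and return the longest
    prefix matched before the pattern stops matching, or None."""
    best = None
    for end in range(1, len(name) + 1):
        prefix = name[:end]
        if pattern.fullmatch(prefix):
            best = prefix   # keep going: a longer prefix may also match
        elif best is not None:
            break           # the pattern stopped matching; content starts here
    return best

name = "[sage]_Mobile_Suit_Gundam_AGE_"
prefix = longest_matching_prefix(name)
content_start = len(prefix) if prefix is not None else 0
```

For the file name above, `prefix` is "[sage]_", so the content of interest starts at index 7. A name with no bracketed tag yields `None`.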

To learn about regular expression learning, see this post. Keep in mind that automated learning will never be perfect, but the more examples you use, the more accurate it will be.
