Exclusively apply Java pattern matcher to extract HTML elements, ignore some characters

Submitted by 爷,独闯天下 on 2019-12-13 08:23:43

Question


I'm using this code:

Pattern pat_1 = Pattern.compile("class=\"\"pinyin\"\">(.*?)<script>");
Matcher mat_1 = pat_1.matcher( text );
while( mat_1.find() )
{
    System.out.println( mat_1.group(1) );
}

This is the input data being matched:

<br>
<span class=""b"">拼音:</span><span class=""pinyin"">xī<script>Setduyin('Duyin/xi1')</script></span> <span class=""b"">注音:</span><span class=""pinyin"">ㄒㄧ<script>Setduyin('Duyin/xi1')</script></span><br>
<span class=""b"">简体部首:</span>丨 <span class=""b"">部首笔画:</span>1 <span class=""b"">总笔画:</span>8<br><span class=""b"">繁体部首:</span>卜 <span class=""b"">部首笔画:</span>2 <span class=""b"">总笔画:</span>8<br><span class=""b"">康熙字典笔画</span>( 卥:8; )

The problem with my code is that it also picks up ㄒㄧ, because the elements before and after it are identical to those around xī. How could I exclude ㄒㄧ and select only xī? Maybe I can use the <br> tag, because that is unique to the first one, but that requires identifying a new line and also ignoring 拼音:. How can I do that? I've been playing around with regex101.com but have not yet been able to pin it down.

So, to be clear, the current output of that Java code is

xī
ㄒㄧ

but I want it to be only

xī

Answer 1:


You could try the regex below.

Pattern pat_1 = Pattern.compile("class=\"\"pinyin\"\">(.*?)<script>(?:(?!<script>).)*");
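A minimal sketch of why the extra (?:(?!<script>).)* helps: after the first capture it consumes everything up to the next <script>, which swallows the second pinyin span so find() cannot match it again. The class name TemperedMatchDemo, the helper captures and the shortened SAMPLE string are assumptions made for illustration; SAMPLE keeps the doubled quotes of the input above. This relies on both spans sitting on the same line, since . does not cross line terminators by default.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TemperedMatchDemo {
    // Shortened sample in the same shape as the input in the question (hypothetical).
    static final String SAMPLE =
        "<br>\n"
        + "<span class=\"\"b\"\">拼音:</span><span class=\"\"pinyin\"\">xī"
        + "<script>Setduyin('Duyin/xi1')</script></span>"
        + " <span class=\"\"b\"\">注音:</span><span class=\"\"pinyin\"\">ㄒㄧ"
        + "<script>Setduyin('Duyin/xi1')</script></span><br>\n";

    // Collects group(1) of every match of the regex in the text.
    static List<String> captures(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String original = "class=\"\"pinyin\"\">(.*?)<script>";
        // Same pattern, then consume every character that does not start another <script>.
        String tempered = original + "(?:(?!<script>).)*";

        System.out.println(captures(original, SAMPLE)); // both spans are captured
        System.out.println(captures(tempered, SAMPLE)); // only the first span is captured
    }
}
```

If a third pinyin span followed on the same line, this pattern would capture it as well, so it fits input with exactly two such spans per line.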

OR

(?m)^.*?class=\"\"pinyin\"\">(.*?)<script>

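(?m) is the multiline modifier. With it enabled, the anchors ^ and $ match at the start and end of every line instead of only at the start and end of the whole input, so the pattern can anchor at the start of the line that contains the first pinyin span.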
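The effect of the modifier can be checked with a small sketch. The class name MultilineAnchorDemo, the helper captures and the shortened SAMPLE string are assumptions made for illustration. Without (?m) the ^ anchor only matches at the very start of the input, and . cannot cross the line break after the leading <br>, so nothing matches; with (?m) the match starts at the beginning of the second line and stops at the first pinyin span.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultilineAnchorDemo {
    // Shortened sample in the same shape as the input in the question (hypothetical).
    static final String SAMPLE =
        "<br>\n"
        + "<span class=\"\"b\"\">拼音:</span><span class=\"\"pinyin\"\">xī"
        + "<script>Setduyin('Duyin/xi1')</script></span>"
        + " <span class=\"\"b\"\">注音:</span><span class=\"\"pinyin\"\">ㄒㄧ"
        + "<script>Setduyin('Duyin/xi1')</script></span><br>\n";

    // Collects group(1) of every match of the compiled pattern in the text.
    static List<String> captures(Pattern pattern, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String anchored = "^.*?class=\"\"pinyin\"\">(.*?)<script>";

        // No modifier: ^ matches only at the start of the whole input.
        System.out.println(captures(Pattern.compile(anchored), SAMPLE));
        // Inline modifier: ^ also matches after each line terminator.
        System.out.println(captures(Pattern.compile("(?m)" + anchored), SAMPLE));
        // Pattern.MULTILINE is the flag equivalent of the inline (?m).
        System.out.println(captures(Pattern.compile(anchored, Pattern.MULTILINE), SAMPLE));
    }
}
```

Because each later match would also have to begin at a line start, the second span on the same line is never reached.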



Source: https://stackoverflow.com/questions/28471220/exclusively-apply-java-pattern-matcher-to-extract-html-elements-ignore-some-cha
