HTML::PullParser splits up text element randomly

大兔子大兔子 submitted on 2019-12-07 16:36:06

Question


I'm using the Perl module HTML::PullParser. I noticed that it sometimes splits up a text element, as far as I can tell at random.

For example, if I have an HTML file test.html with the following content:

<html>
...
<FONT STYLE="font-family:Times New Roman" SIZE="2">THE QUICK BROWN FOX</FONT>
...
</html>

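and my Perl code looks something like this:

use strict;
use warnings;
use HTML::PullParser;
# Request only text events; each token is the array ref ["T", $text]
my $html = HTML::PullParser->new(file => 'test.html', text => '"T", text')
    or die "Cannot open test.html: $!";
while (my $token = $html->get_token) {
    print "$token->[1]\n";
}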

Then sometimes I get back

THE QUICK BROWN FOX    # correctly parsed

But other times I get

THE QUICK
 BROWN FOX

where the text element is split into two separate tokens. At other times, depending on the other content of the HTML file, I get

THE QUICK BROWN
 FOX

where the break falls at a different point. This behavior is extremely annoying, and I have tried my best to isolate the problem. It seems to depend on the entire file: if I delete the rest of the file so that only that element is left, the text is parsed correctly. However, I have not been able to identify which part of the rest of the file causes this. Has anyone had a similar experience, and does anyone know how to get around the issue? Thanks!

UPDATE: this errant behavior does NOT depend on a single section of HTML elsewhere in the file either. I was able to isolate two sections of HTML that precede the text element. When both are present, the error occurs; when only one of them is present, the problem goes away. I'm completely confused and annoyed.


Answer 1:


HTML::PullParser is a subclass of HTML::Parser. HTML::Parser has an unbroken_text attribute that controls whether it emits text events as soon as possible or buffers text until the parser knows that no more text is coming. The default is to generate text events as soon as possible. A $p->unbroken_text(1) call should make it buffer :)
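
As a minimal sketch of that suggestion, the snippet below enables unbroken_text on a HTML::PullParser object before pulling tokens. The in-memory document, the padding paragraph, and the variable names are assumptions made up for illustration; they are not from the original question. The padding only makes the document longer, since the splitting presumably happens when a run of text straddles one of the chunks in which the input is handed to the parser, which would also explain why it depends on everything that precedes the element.

```perl
use strict;
use warnings;
use HTML::PullParser;

# Hypothetical document held in memory instead of test.html.
# The padding paragraph only lengthens the input ahead of the text of interest.
my $padding = '<p>' . ('x' x 500) . '</p>';
my $doc = "<html>$padding"
        . '<FONT STYLE="font-family:Times New Roman" SIZE="2">THE QUICK BROWN FOX</FONT>'
        . '</html>';

# Same token spec as in the question: only text events, as ["T", $text].
my $p = HTML::PullParser->new(doc => \$doc, text => '"T", text')
    or die "Could not create parser";

# Buffer text until a non-text event (or end of input) is seen,
# so each run of text is reported as a single token.
$p->unbroken_text(1);

my @texts;
while (my $token = $p->get_token) {
    push @texts, $token->[1];
}

print "$_\n" for @texts;
```

With buffering enabled, adjacent pieces of the same text run are merged before they reach get_token, at the cost of text being reported slightly later than it otherwise would be.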



Source: https://stackoverflow.com/questions/7069923/htmlpullparser-splits-up-text-element-randomly
