Question
I am trying to parse a multi-line HTML file using a regex.
HTML code:
<td>Details</td></tr>
<tr class=d1>
<td>uss_vod_translator</td>
Regex Expression:
if ($line =~ m/Details<\/td>\s*<\/tr>\s*<tr\s*class=d1>\s*<td>(\w*)<\/td>/)
{
print "$1";
}
I am using \s* (whitespace) to match across the line breaks, but it is not working. I searched for a solution and even tried /\?
for multi-line matching, but that did not work either.
Can anyone please suggest how to parse multi-line HTML?
I know a regex is a poor tool for parsing HTML, but I have legacy HTML code that I need to parse and I have no other choice.
Answer 1:
Can anyone please suggest how to parse multi-line HTML?
Stop trying to use regular expressions and use a module that will parse it for you.
HTML::TreeBuilder is a good solution.
HTML::TreeBuilder::LibXML gives you the same API but backed by a fast parser.
HTML::TreeBuilder::XPath adds XPath support as well as a fast parser.
Answer 2:
As stated above, never use regexes to parse HTML.
I'm using HTML::TreeBuilder::XPath to parse HTML, and it dramatically decreases the development time for each of my scraping/parsing programs.
Here is how your task could be implemented:
use Modern::Perl;
use HTML::TreeBuilder::XPath;
my $html = <<END;
<tr><td>General Info</td></tr>
<tr class=d1>
<td>some info</td></tr>
<tr><td>Details</td></tr>
<tr class=d1>
<td>uss_vod_translator</td></tr>
<tr><td>Another header</td></tr>
<tr class=d1>
<td>some other info</td></tr>
END
my $tree = HTML::TreeBuilder::XPath->new_from_content($html);
my ($details) = $tree->findvalues('//tr[ td[ text() = "Details" ] ]/following-sibling::tr[1]/td[1]');
say $details;
Answer 3:
Try the line below before you match your pattern:
$line =~ s/>\s+</></g;
This removes the whitespace (including newlines and tabs) between tags, so the HTML string ends up on a single line.
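Note that this substitution only helps if $line really holds more than one line of the file. The question does not show how $line is read, so the following is a guess: if the file is read line by line, no pattern can ever see both the "Details" row and the row after it. A minimal sketch, assuming the whole document is available in a single string (the helper name extract_details is made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text of the <td> in the row that
# follows the "Details" row, or undef if the pattern does not match.
sub extract_details {
    my ($html) = @_;

    # \s already matches newlines and tabs, so the pattern can span
    # several lines as long as $html contains all of them.
    # m{...} delimiters avoid having to escape every "/".
    if ($html =~ m{Details</td>\s*</tr>\s*<tr\s+class=d1>\s*<td>(\w*)</td>}) {
        return $1;
    }
    return;
}

# Sample input copied from the question.
my $html = <<'END';
<td>Details</td></tr>
<tr class=d1>
<td>uss_vod_translator</td>
END

my $details = extract_details($html);
print defined $details ? "$details\n" : "no match\n";
```

To get a whole file into one string, the usual idiom is to localise the input record separator: my $html = do { local $/; <$fh> }; where $fh is an already opened filehandle. With the full text in one string, the original pattern from the question works without any pre-processing, although a real parser remains the more robust choice.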
Source: https://stackoverflow.com/questions/13249392/regex-to-parse-a-multiline-html