How can I parse only part of an HTML file and ignore the rest?

两盒软妹~` 提交于 2020-01-06 08:38:11

问题


In each of 5,000 HTML files I have to get only one line of text, which is line 999. How can I tell the HTML::Parser that I only have to get line 999?

</p><h1>dataset 1:</h1>

&nbsp;<table border="0" bgcolor="#EFEFEF"  leftmargin="15" topmargin="5"><tr>  
<td><strong>name:</strong>&nbsp;</td>  <td width=500> myname one         </td></tr><tr>  
<td><strong>type:</strong>&nbsp;</td>  <td width=500>       type_one  (04313488)        </td></tr><tr>
<td><strong>aresss:</strong>&nbsp;</td><td>Friedrichstr. 70,&nbsp;73430&nbsp;Madrid</td></tr><tr>  
<td><strong>adresse_two:</strong>&nbsp;</td>  <td>          no_value        </td></tr><tr>  
<td><strong>telefone:</strong>&nbsp;</td>  <td>         0000736111/680040        </td></tr><tr>  
<td><strong>Fax:</strong>&nbsp;</td>  <td>          0000736111/680040        </td></tr><tr>  
<td><strong>E-Mail:</strong>&nbsp;</td>  <td>       Keine Angabe        </td></tr><tr>      
<td><strong>Internet:</strong>&nbsp;</td><td><a href="http://www.mysite.es" target="_blank">www.mysite.es</a><br></td></tr><tr> <td><strong>the office:</strong>&nbsp;</td>   
<td><a href="http://www.mysite_two" target="_blank">mysite_two </a><br></td></tr><tr> 
<td><strong>:</strong>&nbsp;</td><td> no_value </td></tr><tr> 
<td><strong>officer:</strong>&nbsp;</td>  <td> no_value        </td>  </td></tr><tr>
<td><strong>employees:</strong>&nbsp;</td>  <td> 259        </td></tr><tr>  
<td><strong>offices:</strong>&nbsp;</td>  <td>     8        </td></tr><tr>  
<td><strong>worker:</strong>&nbsp;</td>  <td>     no_value        </td></tr><tr>  
<td><strong>country:</strong>&nbsp;</td>  <td>    contryname        </td></tr><tr>  
<td><strong>the_council:</strong>&nbsp;</td>  <td> 

Well, the question is, is it possible to do the search in the 5000 files with this attribute: That the line 999 is of interest. In other words, can I tell the HTML-parser that it has to look (and extract) exactly line 999?


Hello dear RedGritty Brick - i have little experience with HTML :: TokeParser

use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;

#use real file name here
open(my $fh, "<", "file.html") or die $!;

$tree->parse_file($fh);

my ($name) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});

print $name->as_text;

BTW; RedGrittyBrick: See one of the example sites: http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04313488 in the grey shadowed block you see the wanted information: 17 lines that are wanted. Note - i have 5000 different HTML-files - that all are structured in the very same way!

That means i would be happy to have a template that can be runned with HTML::TokeParser::Simple and DBI.

love to get hints


回答1:


Do you mean the 999th line or the 999th table row?

The former might be

perl -ne 'print if $. == 999' /path/to/*.dat

The latter would involve an HTML parser and some selection logic. A Sax parser might be better for fast processing of a large number of files. It probably depends which version of HTML is used and whether it is "well-formed".

Perl has many XML and HTML parsers - did you have any particular module in mind?


EDIT:

Your problem seems to be your XPath expression. The actual HTML is much more complex than your XPath suggests. The following expression works better

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use HTML::TreeBuilder::XPath;

#
# replace this with a loop over 5000 existing files
#
my $url = 'http://www.kultusportal-bw.de/'.
          'servlet/PB/menu/1188427/index.html'.
          '?COMPLETEHREF='.
          'http://www.kultus-bw.de/'.
          'did_abfrage/detail.php?id=04313488';
my $html = get $url;

my $tree = HTML::TreeBuilder::XPath->new();
#
# within the loop process the html like this
#
$tree->parse($html);
$tree->eof;
print $tree->findvalue('//table[@bgcolor]/tr[1]');

Try cutting the above and pasting into a file then running it with Perl.



来源:https://stackoverflow.com/questions/3946874/how-can-i-parse-only-part-of-an-html-file-and-ignore-the-rest

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!