Unix - parse an HTML file and list all its resources

Question

I have an HTML file and I need to generate a list of all the resources it uses: *.htm, *.html, *.css, *.js, *.jpg

I tried many options such as grep and sed, without much success. I am also not sure how to do it in Java.

This is an example file content:

--------------------------------


    <link rel="StyleSheet" href="css/webworks.css" type="text/css"
    media="all" />
    <script type="text/javascript" language="JavaScript1.2"   src="wwhdata/common/context.js">
    </script>
    <a class="WebWorks_Breadcrumb_Link" href="Page1.htm#1110364">Job Status</a> &gt;  Jobs tatus</div>
    <div class="Indented"><a name="1115395">The <img class="Default"  src="images/Pic.2.jpg" width="26" height="29" style="display: inline;
    float: none; left: 0.0; top: 0.0;" alt="" /> icon indicates that the
    job is recurring. Hover the mouse over the icon to display the
    schedule.</a></div>
    <div class="Body_Help_only"><a href="javascript:WWHClickedPopup('HelpSR2',   'Page4.htm#1110375', '');"
    title="fsafsa" name="1118038">abcde</a></div>
    <div class="Body_Help_only"><a href="javascript:WWHClickedPopup('HelpSR2',   'Page2.htm#1110547', '');"
    title="fsafsa" name="1118063">fsafsa</a></div>
    <div class="Body_Help_only"><a href="javascript:WWHClickedPopup('HelpSR2', 'Page3.htm#1110472', '');"
    title="fsafasb" name="1118082">fsafsa</a></div>

Output should be:

-----------------
css/webworks.css
wwhdata/common/context.js
Page1.htm
images/Pic.2.jpg
Page4.htm
Page2.htm
Page3.htm

Answer 1:

The following should get you some of the way:

% sed -n -E 's/.*(href|src)="([^#"]*).*/\2/p' input.html

The -n means don't print lines by default; the -E means use extended regular expressions (so we can use the vertical bar for alternation); the trailing p on the substitution means print any line on which the substitution succeeded. Together, this finds every line containing href=" or src=", replaces the entire line with the attribute value (everything after the opening " up to the closing " or the first #, whichever comes first), and prints the result.

On your input, this produces:

css/webworks.css
wwhdata/common/context.js
Page1.htm
images/Pic.2.jpg
javascript:WWHClickedPopup('HelpSR2',   'Page4.htm
javascript:WWHClickedPopup('HelpSR2',   'Page2.htm
javascript:WWHClickedPopup('HelpSR2', 'Page3.htm
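
To check the flag behaviour described above without touching a real file, a throwaway input can be piped straight into the command. This is only an illustrative sketch: the three sample lines and the function names sample and extract are made up here, and the pattern is the form that stops at a closing quote or a #.

```shell
# Made-up three-line sample: two lines carry an attribute, one does not.
sample() {
    printf '%s\n' \
        '<a href="Page1.htm#1110364">Job Status</a>' \
        'plain text, no attributes here' \
        '<img src="images/Pic.2.jpg" alt="" />'
}

# -n suppresses the default print; the trailing p prints only the lines
# on which the substitution succeeded, so the middle line is dropped.
extract() {
    sed -n -E 's/.*(href|src)="([^#"]*).*/\2/p'
}

sample | extract
```

This prints Page1.htm and images/Pic.2.jpg, one per line. Removing -n would print every line, and the two matching lines twice.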

Limitations of this simple version:

it won't work if there's more than one href or src on a line (the leading .* is greedy, so only the last one is reported);
it fails to extract the contents of the JavaScript argument;
it presumes that the input uses "..." rather than '...' to delimit file names.

Each of these could probably be improved by suitable additions to the sed script, though the second would probably be best done by sending the output through another sed script or...

% cat /tmp/t.sed
s/.*(href|src)="([^#"]*).*/\2/
s/javascript.*'//
t x
b
:x
p
% sed -n -E -f /tmp/t.sed /tmp/so.txt
css/webworks.css
wwhdata/common/context.js
Page1.htm
images/Pic.2.jpg
Page4.htm
Page2.htm
Page3.htm
%

That last one's a little bit special! I'll leave you and the manpage to work out the details.
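
The three limitations listed earlier can also be sidestepped by matching on the file extension rather than on the attribute name. The following is only a sketch under that assumption and is not part of the answer above: list_resources is a made-up helper name, the extension list is taken from the question, and the set of characters treated as "cannot appear in a path" is a guess that happens to suit markup like the sample.

```shell
# list_resources FILE  (hypothetical helper, not from the original answer)
# Prints every token in FILE that ends in one of the extensions from the
# question, one per line, with repeats removed and first-seen order kept.
list_resources() {
    # grep -o emits each match on its own line, so several resources on
    # one input line are all reported. The bracket expression excludes
    # both quote styles, '#', '=', whitespace, parentheses and angle
    # brackets, so a match starts where the path itself starts.
    grep -oE "[^\"'#=[:space:]()<>]+\.(html?|css|js|jpg)" "$1" |
        awk '!seen[$0]++'   # drop duplicates without sorting
}
```

It would be called as list_resources input.html. Because nothing anchors the end of the match, a name such as data.json would be reported as data.js, and file names that merely appear in body text are picked up as well, so the output is best treated as a starting point.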

Answer 2:

Use jsoup

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.

Source: https://stackoverflow.com/questions/11123408/unix-parse-html-file-and-get-all-his-resources-list

Tags

java

unix

sed

grep

html-parsing