Unix - parse an HTML file and list all the resources it uses

Posted by 瘦欲@ on 2019-12-11 08:25:45

Question


I have an HTML file and I need to generate a list of all the resources it uses: *.htm, *.html, *.css, *.js, *.jpg

I tried several tools, such as grep and sed, without much success. I am also not sure how to do it in Java.

This is an example file content:

--------------------------------


    <link rel="StyleSheet" href="css/webworks.css" type="text/css"
    media="all" />
    <script type="text/javascript" language="JavaScript1.2"   src="wwhdata/common/context.js">
    </script>
    <a class="WebWorks_Breadcrumb_Link" href="Page1.htm#1110364">Job Status</a> &gt;  Jobs tatus</div>
    <div class="Indented"><a name="1115395">The <img class="Default"  src="images/Pic.2.jpg" width="26" height="29" style="display: inline;
    float: none; left: 0.0; top: 0.0;" alt="" /> icon indicates that the
    job is recurring. Hover the mouse over the icon to display the
    schedule.</a></div>
    <div class="Body_Help_only"><a href="javascript:WWHClickedPopup('HelpSR2',   'Page4.htm#1110375', '');"
    title="fsafsa" name="1118038">abcde</a></div>
    <div class="Body_Help_only"><a href="javascript:WWHClickedPopup('HelpSR2',   'Page2.htm#1110547', '');"
    title="fsafsa" name="1118063">fsafsa</a></div>
    <div class="Body_Help_only"><a href="javascript:WWHClickedPopup('HelpSR2', 'Page3.htm#1110472', '');"
    title="fsafasb" name="1118082">fsafsa</a></div>

Output should be:

-----------------
css/webworks.css
wwhdata/common/context.js
Page1.htm
images/Pic.2.jpg
Page4.htm
Page2.htm
Page3.htm

Answer 1:


The following should get you some of the way:

% sed -n -E 's/.*(href|src)="([^#"]*).*/\2/p' input.html

The -n means don't print lines by default; the -E means use extended regular expressions (so we can use the vertical bar for alternation); the trailing p on the substitution means print any line on which the substitution succeeded. Together, this finds every line that contains an href= or src=, replaces the entire line with the text that follows the opening quote, up to the closing quote or a #, whichever comes first, and prints the result.

On your input, this produces:

css/webworks.css
wwhdata/common/context.js
Page1.htm
images/Pic.2.jpg
javascript:WWHClickedPopup('HelpSR2',   'Page4.htm
javascript:WWHClickedPopup('HelpSR2',   'Page2.htm
javascript:WWHClickedPopup('HelpSR2', 'Page3.htm

Limitations of this simple version:

  • it won't work if there's more than one href or src on a line;
  • it fails to extract the file name from the JavaScript call's argument;
  • it presumes that the input uses "..." rather than '...' to delimit file names.
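
As a rough sketch of how the first and third limitations might be worked around, grep -o can emit each match on its own line before sed trims it. The helper name list_refs is hypothetical, and -o is assumed to be available (GNU and BSD grep have it, but POSIX does not require it).

```shell
# Hypothetical helper: print every href/src value in the file named by $1,
# one per line, even when a line holds several attributes or uses '...'.
list_refs() {
    # Step 1: grep -o prints each matching attribute on a line of its own.
    grep -o -E "(href|src)=(\"[^\"]*\"|'[^']*')" "$1" |
        # Step 2: strip the attribute name and quotes, then any #fragment.
        sed -E "s/^(href|src)=[\"'](.*)[\"']\$/\2/; s/#.*//"
}
```

Running list_refs input.html would print one value per line. The javascript: links still come out unprocessed, so the second limitation remains.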

Each of these could probably be improved by suitable additions to the sed script, though the second would probably be best done by sending the output through another sed script or...

% cat /tmp/t.sed
s/.*(href|src)="([^#"]*).*/\2/
s/javascript.*'//
t x
b
:x
p
% sed -n -E -f /tmp/t.sed /tmp/so.txt
css/webworks.css
wwhdata/common/context.js
Page1.htm
images/Pic.2.jpg
Page4.htm
Page2.htm
Page3.htm
%

That last one's a little bit special! I'll leave you and the manpage to work out the details.
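
For readers who would rather not dig through the manpage, here is one reading of that script, wrapped in a shell function with comments. The function name extract_resources is hypothetical; the sed commands are the same ones used in the script above.

```shell
# Hypothetical wrapper around the sed script above, with each step annotated.
extract_resources() {
    sed -n -E '
        # Replace the line with the attribute value, stopping at a quote or "#".
        s/.*(href|src)="([^#"]*).*/\2/
        # For javascript: links, delete everything up to the last single quote.
        s/javascript.*'\''//
        # "t x" jumps to label x if a substitution succeeded on this line.
        t x
        # Otherwise "b" without a label jumps to the end: nothing is printed.
        b
        :x
        p
    ' "$1"
}
```

Calling extract_resources /tmp/so.txt should behave like the sed -n -E -f /tmp/t.sed /tmp/so.txt call above.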




Answer 2:


Use jsoup

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.



Source: https://stackoverflow.com/questions/11123408/unix-parse-html-file-and-get-all-his-resources-list
