HTML downloading and text extraction

Submitted by 雨燕双飞 on 2019-12-06 04:10:24

Question


What would be a good tool, or set of tools, to download a list of URLs and extract only the text content? Spidering is not required, but control over the downloaded file names and threading would be a bonus.

The platform is Linux.


Answer 1:


wget -O - <url> | html2ascii

Note: wget saves to a file by default, so -O - is needed to send the page to stdout for the pipe. html2ascii may also be installed as html2a or html2text (and I wasn't able to find a proper man page for it online).

See also: lynx.
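
A minimal sketch of that pipeline for a whole list of URLs, assuming a urls.txt file with one URL per line, GNU xargs, and the html2text name for the converter (all three are assumptions); the output file names are derived from the URLs, and -P 4 runs four downloads in parallel, which covers both bonus points from the question:

    # Fetch each URL in urls.txt (four at a time) and convert it to text.
    # The output name is the URL with "/" and ":" turned into "_".
    xargs -P 4 -I{} sh -c \
        'wget -q -O - "$1" | html2text > "$(echo "$1" | tr "/:" "__").txt"' _ {} \
        < urls.txt

Alternatively, lynx -dump <url> performs the fetch-and-render step on its own.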




Answer 2:


Python's Beautiful Soup library lets you build a nice extractor.




Answer 3:


I know that w3m can be used to render an HTML document and write the text content to a text file: w3m www.google.com > file.txt, for example.

For the downloading itself, I'm sure that wget can be used.
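
A minimal sketch along these lines, again assuming a urls.txt file with one URL per line (an assumption); -dump makes w3m print the rendered page to stdout instead of opening it interactively, and since w3m fetches URLs itself, wget is only needed if you also want to keep the raw HTML:

    # Render each URL in urls.txt to plain text, numbering the output files.
    n=0
    while read -r url; do
        n=$((n + 1))
        w3m -dump "$url" > "page-$n.txt"
    done < urls.txt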




Answer 4:


Look for the Simple HTML DOM parser for PHP on SourceForge. Use it to parse HTML that you have downloaded with cURL. Each DOM element has a "plaintext" attribute, which should give you only the text. I used this combination very successfully in a lot of applications for quite some time.




Answer 5:


Perl (Practical Extraction and Report Language) is a scripting language that is excellent for this type of work. http://search.cpan.org/ hosts a lot of modules with the required functionality (LWP for downloading and HTML::FormatText for text extraction, for example).




Answer 6:


Use wget to download the required HTML pages and then run html2text on the output files.
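
A minimal sketch of this two-step approach, once more assuming a urls.txt list (wget's -i reads URLs from a file and -P sets the download directory; whether html2text takes a file argument depends on which html2text is installed):

    # Step 1: download every URL listed in urls.txt into ./pages/
    wget -i urls.txt -P pages/
    # Step 2: convert each downloaded file to plain text alongside it
    for f in pages/*; do
        html2text "$f" > "$f.txt"
    done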



Source: https://stackoverflow.com/questions/435547/html-downloading-and-text-extraction
