Question
What would be a good tool, or set of tools, to download a list of URLs and extract only the text content? Spidering is not required, but control over the downloaded file names and threading would be a bonus.
The platform is Linux.
Answer 1:
wget | html2ascii
Note: html2ascii can also be called html2a or html2text (and I wasn't able to find a proper man page on the net for it).
See also: lynx.
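A minimal sketch along those lines, addressing the file-name control asked about above: it assumes the URLs sit one per line in a file called urls.txt (an assumed name) and that the converter is installed as html2text (substitute html2a or html2ascii if that is what your system ships).

    #!/bin/sh
    # Fetch each URL, strip the markup, and save the text as 001.txt, 002.txt, ...
    n=0
    while read -r url; do
        n=$((n + 1))
        wget -q -O - "$url" | html2text > "$(printf '%03d' "$n").txt"
    done < urls.txt

For the threading bonus, xargs -P (or GNU parallel) can run several of these wget | html2text pipelines at once.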
Answer 2:
Python Beautiful Soup allows you to build a nice extractor.
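As a minimal sketch of that idea, combined with wget from the other answers and assuming beautifulsoup4 is installed (pip install beautifulsoup4), with example.com standing in for a real URL:

    wget -q -O - https://example.com/ | python3 -c '
    import sys
    from bs4 import BeautifulSoup
    # get_text() drops the markup and keeps only the text content
    print(BeautifulSoup(sys.stdin.read(), "html.parser").get_text(separator="\n"))
    ' > page.txt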
Answer 3:
I know that w3m can be used to render an HTML document and put the text content in a text file, for example: w3m www.google.com > file.txt
For the remainder, I'm sure that wget can be used.
Answer 4:
Look for the Simple HTML DOM parser for PHP on SourceForge. Use it to parse HTML that you have downloaded with cURL. Each DOM element will have a "plaintext" attribute which should give you only the text. I have used this combination very successfully in a lot of applications for quite some time.
Answer 5:
Perl (Practical Extraction and Report Language) is a scripting language that is excellent for this type of work. http://search.cpan.org/ contains a lot of modules with the required functionality.
Answer 6:
Use wget to download the required HTML and then run html2text on the output files.
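A minimal sketch of that two-step approach, again assuming the URLs are listed in urls.txt (an assumed name): wget -i reads the list and -P chooses the download directory, after which html2text converts each saved file.

    # Step 1: download everything listed in urls.txt into pages/
    wget -q -i urls.txt -P pages/
    # Step 2: convert each downloaded page to plain text alongside it
    for f in pages/*; do
        html2text "$f" > "$f.txt"
    done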
Source: https://stackoverflow.com/questions/435547/html-downloading-and-text-extraction