Question
What would be a good tool, or set of tools, to download a list of URLs and extract only the text content? Spidering is not required, but control over the downloaded file names and threading would be a bonus.
The platform is Linux.
Answer 1:
wget | html2ascii
Note: html2ascii can also be called html2a or html2text (and I wasn't able to find a proper man page on the net for it).
See also: lynx.
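A minimal sketch along those lines, addressing the file-name control asked about above: it assumes the URLs sit one per line in a file called urls.txt (an assumed name) and that the converter is installed as html2text (substitute html2a or html2ascii if that is what your system ships).

    #!/bin/sh
    # Fetch each URL, strip the markup, and save the text as 001.txt, 002.txt, ...
    n=0
    while read -r url; do
        n=$((n + 1))
        wget -q -O - "$url" | html2text > "$(printf '%03d' "$n").txt"
    done < urls.txt

For the threading bonus, xargs -P (or GNU parallel) can run several of these wget | html2text pipelines at once.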
Answer 2:
Python Beautiful Soup allows you to build a nice extractor.
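As a minimal sketch of that idea, combined with wget from the other answers and assuming beautifulsoup4 is installed (pip install beautifulsoup4), with example.com standing in for a real URL:

    wget -q -O - https://example.com/ | python3 -c '
    import sys
    from bs4 import BeautifulSoup
    # get_text() drops the markup and keeps only the text content
    print(BeautifulSoup(sys.stdin.read(), "html.parser").get_text(separator="\n"))
    ' > page.txt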
Answer 3:
I know that w3m can be used to render an HTML document and put the text content in a text file, for example: w3m www.google.com > file.txt
For the remainder, I'm sure that wget can be used.
Answer 4:
Look for the Simple HTML DOM parser for PHP on SourceForge. Use it to parse HTML that you have downloaded with cURL. Each DOM element will have a "plaintext" attribute which should give you only the text. I have used this combination very successfully in a lot of applications for quite some time.
Answer 5:
Perl (Practical Extraction and Report Language) is a scripting language that is excellent for this type of work. http://search.cpan.org/ contains a lot of modules with the required functionality.
Answer 6:
Use wget to download the required HTML and then run html2text on the output files.
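A minimal sketch of that two-step approach, again assuming the URLs are listed in urls.txt (an assumed name): wget -i reads the list and -P chooses the download directory, after which html2text converts each saved file.

    # Step 1: download everything listed in urls.txt into pages/
    wget -q -i urls.txt -P pages/
    # Step 2: convert each downloaded page to plain text alongside it
    for f in pages/*; do
        html2text "$f" > "$f.txt"
    done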
Source: https://stackoverflow.com/questions/435547/html-downloading-and-text-extraction