HTML downloading and text extraction

不羁的心 提交于 2019-12-04 10:29:46
dsm

wget | html2ascii

Note: html2ascii can also be called html2a or html2text (and I wasn't able to find a proper man page on the net for it).

See also: lynx.

Python Beautiful Soup allows you to build a nice extractor.

I know that w3m can be used to render an html document and put the text content in a textfile w3m www.google.com > file.txt for example.

For the remainder, I'm sure that wget can be used.

Look for the Simple HTML DOM parser for PHP on Sourceforge. Use it to parse HTML that you have downloaded with CURL. Each DOM element will have a "plaintext" attribute which should give you only the text. I was very successful in a lot of applications using this combination for quite some time.

PERL (Practical Extracting and Reporting Language) is a scripting language that is excellent for this type of work. http://search.cpan.org/ contains allot of modules that have the required functionality.

Use wget to download the required html and then run html2text on the output files.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!