Create Great Parser - Extract Relevant Text From HTML/Blogs

挽巷 2020-12-23 12:26

I'm trying to create a generalized HTML parser that works well on blog posts. I want to point my parser at the specific entry's URL and get back clean text of the post it

2 Answers

南笙 (OP)

2020-12-23 12:59
Boy, do I have the perfect solution for you.

Arc90's readability algorithm does exactly this. Given HTML content, it picks out the content of the main blog post text, ignoring headers, footers, navigation, etc.

Here are implementations in:
- JavaScript
- Perl
- PHP
- Python
- Ruby
- C#

~~I'll be releasing a Perl port to CPAN in a couple of days.~~ Done.
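
To give a feel for the kind of heuristic involved, here is a rough Python sketch using only the standard library. It is not Arc90's code or any of the ports listed above, just a simplified illustration under my own assumptions: candidate containers are scored by text length and comma count, then discounted by link density, and the highest-scoring one wins. The names `extract_main_text`, `score`, `SKIP_TAGS`, and `CANDIDATE_TAGS` are hypothetical, and the weights are arbitrary.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as page chrome (assumption of this sketch).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}
# Containers considered as candidate content blocks (also an assumption).
CANDIDATE_TAGS = {"div", "article", "section", "td"}


class _BlockCollector(HTMLParser):
    """Collect text and link-text length for each candidate container."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []      # currently open candidate blocks, innermost last
        self.blocks = []     # candidate blocks that have been closed
        self.skip_depth = 0  # > 0 while inside a SKIP_TAGS element
        self.link_depth = 0  # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "a":
            self.link_depth += 1
        elif tag in CANDIDATE_TAGS:
            self.stack.append({"text": [], "link_chars": 0})

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "a":
            self.link_depth = max(0, self.link_depth - 1)
        elif tag in CANDIDATE_TAGS and self.stack:
            # Assumes reasonably well-nested markup.
            self.blocks.append(self.stack.pop())

    def handle_data(self, data):
        text = data.strip()
        if self.skip_depth or not self.stack or not text:
            return
        block = self.stack[-1]  # text is credited to the innermost container only
        block["text"].append(text)
        if self.link_depth:
            block["link_chars"] += len(text)


def score(block):
    """Longer, comma-rich text scores higher; link-heavy text is discounted."""
    text = " ".join(block["text"])
    if not text:
        return 0.0
    link_density = block["link_chars"] / len(text)
    return (len(text) / 100 + text.count(",")) * (1 - link_density)


def extract_main_text(html):
    """Return the text of the best-scoring candidate block, or '' if none."""
    parser = _BlockCollector()
    parser.feed(html)
    parser.close()
    blocks = parser.blocks + parser.stack  # include any unclosed blocks
    if not blocks:
        return ""
    return " ".join(max(blocks, key=score)["text"])
```

Fetching the URL is left out on purpose; pass in whatever HTML string you already have. For real pages, one of the maintained ports will handle messy markup far better than this toy version.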

Hope this helps!