How to extract text content from HTML like the Read It Later or Instapaper iPhone apps?

Submitted by 六月ゝ 毕业季﹏ on 2019-12-03 00:55:43

Question


I want to extract the main article content from HTML in my iPhone app and show it in a TextView or with Core Text.

The Read It Later and Instapaper iPhone apps have this feature, but after researching on the web, I still can't tell how they do it.

At the moment, I get the text content from the HTML with this code, but it also picks up a lot of unwanted content.

textArticle = [webView stringByEvaluatingJavaScriptFromString:@"document.body.innerText"];

This question asks what I wanted to know, but sadly it is not about iPhone apps:
Instapaper-like algorithm

There is an open source project for this kind of feature, but I am not sure whether I can use it in an iPhone app: https://github.com/jiminoc/goose/wiki

It seems smartr provided an API for this before, but it is not available now: http://smartrmobi.blogspot.com/2011/02/smartr-api-withdrawn-until-further.html

Maybe the easiest way to do this is to get the article content from an XML element, but this is only my guess.
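As a starting point for the guess above, here is a minimal sketch of one common heuristic: collect the text of <p> elements, group it by the nearest enclosing container, and keep the container with the most paragraph text. It is written in Python with only the standard library, purely for illustration. It is not how Read It Later or Instapaper actually work, and the names ArticleExtractor, extract_main_text, SKIP_TAGS and BLOCK_TAGS are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: text inside these tags is never article content.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}
# Assumption: these containers are the candidate "article" blocks.
BLOCK_TAGS = {"article", "section", "main", "div"}


class ArticleExtractor(HTMLParser):
    """Keeps the container that directly encloses the most <p> text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = [[]]      # paragraphs of each open container; index 0 is the page root
        self.skip_depth = 0    # > 0 while inside a SKIP_TAGS element
        self.in_p = False
        self.buf = []          # text pieces of the paragraph being read
        self.best = []         # paragraphs of the best container so far
        self.best_score = 0

    def _flush_paragraph(self):
        # Close the current paragraph (if any) and attach it to the innermost container.
        if self.in_p:
            text = " ".join("".join(self.buf).split())  # collapse whitespace
            if text:
                self.stack[-1].append(text)
            self.in_p = False
            self.buf = []

    def _consider(self, paragraphs):
        # Score = total number of characters in the container's paragraphs.
        score = sum(len(p) for p in paragraphs)
        if score > self.best_score:
            self.best, self.best_score = paragraphs, score

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush_paragraph()
            self.stack.append([])
        elif tag == "p":
            self._flush_paragraph()
            self.in_p = True

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush_paragraph()
            if len(self.stack) > 1:
                self._consider(self.stack.pop())
        elif tag == "p":
            self._flush_paragraph()

    def handle_data(self, data):
        if self.in_p and self.skip_depth == 0:
            self.buf.append(data)

    def finish(self):
        # Score containers that were never closed, including the page root.
        self._flush_paragraph()
        while self.stack:
            self._consider(self.stack.pop())


def extract_main_text(html: str) -> str:
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()
    parser.finish()
    return "\n\n".join(parser.best)
```

On a page with a menu, a navigation bar, and one long post, this should return only the paragraphs of the post. Real pages are messier, so treat the scoring rule as a first approximation rather than a finished extractor.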

I would like to know where to start, so I'd really appreciate any suggestions.

Thanks


Answer 1:


After some research, it seems I can use a web API to extract the text content of a page. This means that after I get the URL, I need to make a web request and then render the result.

This is slower than the JavaScript approach shown above because it has to call a web API, but I guess Read It Later and Instapaper both use this approach.

These are the web APIs I have found so far.

http://viewtext.org/

This API has a very nice feature: it combines multi-page articles into one. I am using this API because of this feature, which the other APIs do not have.

http://fivefilters.org/content-only/

The great thing about this one is that you can buy the script and set it up on your own server.

*UPDATE*

It seems that most apps use the "Readability", "Instapaper", or "Google" mobilizer to extract only the text content from a web page.

Among them, my favorite at the moment is the "Readability" parser, since it doesn't come with advertisements like the Instapaper parser does. (There is nothing wrong with showing ads to cover server costs, though.)

Pocket also provides an article parser, but only for developers who are creating Pocket-integrated apps.




Answer 2:


Use NReadability, together with HtmlAgilityPack.

Reference: How to extract Article Text contents from HTML page like Pocket (Read It Later) or Readability?




Answer 3:


Use Newspaper3k. It's awesome.

News, full-text, and article metadata extraction in Python 3.

https://github.com/codelucas/newspaper



Source: https://stackoverflow.com/questions/5960948/how-to-extract-text-contents-from-html-like-read-it-later-or-instapaper-iphone-a
