Should I use Yahoo-Pipes to scrape the contents of a div?

Submitted by a 夏天 on 2019-12-08 02:11:55

Question


Given:

  • URL http://www.contoso.com/search.php?q={param} returns:

    <html>
      <body>
        {...}
        <div id='foo'>
          <div id='page1'/>
          <div id='page2'/>
          <div id='page3'/>
          <div id='pageN'/>
        </div>
        {...}
      </body>
    </html>

Wanted:

  • The innerHTML of div id='foo' must be fetched by the client (i.e. JavaScript).
    • It will be split into discrete items (i.e. div id='page1' to div id='pageN').
  • API throttling prevents server-side code from pre-fetching the data, so the parsing and manipulation burden must be placed on the client.

Question:

  • Could Yahoo-Pipes help format the data for easier consumption?
    • The lack of a DOM parser gives me pause.
  • Are there any existing pipes that could serve as an example?

Answer 1:


You can use the YQL module, which allows you to fetch arbitrary URLs and then parse them with XPath. A sample YQL query:

select * from html where url="http://finance.yahoo.com/q?s=yhoo" and
  xpath='//div[@id="yfi_headlines"]/div[2]/ul/li/a'
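
As a rough sketch of how a client could issue such a query: the snippet below builds a request URL, assuming the public YQL REST endpoint at query.yahooapis.com and its `format=json` parameter. Yahoo has since retired that service, so treat this as illustrative only. `YQL_ENDPOINT`, `buildYqlUrl`, and the `q=example` search term are hypothetical names chosen for this example; the XPath targets the child divs of div id='foo' from the question.

```typescript
// Assumption: the historical public YQL endpoint (no longer in service).
const YQL_ENDPOINT = "https://query.yahooapis.com/v1/public/yql";

// Hypothetical helper: wraps a page URL and an XPath expression in a YQL
// "select * from html" statement and returns the encoded request URL.
// The XPath is placed inside single quotes, so it must not contain any itself.
function buildYqlUrl(pageUrl: string, xpath: string): string {
  const query = `select * from html where url="${pageUrl}" and xpath='${xpath}'`;
  const params = [
    `q=${encodeURIComponent(query)}`, // the YQL statement itself
    "format=json",                    // ask for JSON instead of the XML default
  ];
  return `${YQL_ENDPOINT}?${params.join("&")}`;
}

// Select every child div of div id="foo" on the page from the question.
const requestUrl = buildYqlUrl(
  "http://www.contoso.com/search.php?q=example",
  '//div[@id="foo"]/div'
);
console.log(requestUrl);
```

A browser client would then request `requestUrl` (for instance with `fetch`) and read the matched elements from the JSON response.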



Answer 2:


Yes, it's doable with Y! Pipes. You only need two modules from the 'Operators' section:

First, "Sub Element", to keep only the content.

Then use the "Regex" module to extract the div content, and retrieve the result as JSON from your site:

Search:

^.*?<div id="foo">(.*?)</div>.*?$

Replace:

$1
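
One caveat with this pattern: the lazy `(.*?)` stops at the first `</div>`, so when div id='foo' contains nested page divs with closing tags, the capture is cut short. The sketch below reproduces the regex in TypeScript (with `[\s\S]` standing in for `.` so newlines match) and contrasts it with a hypothetical depth-counting helper, `extractDivById`. Both function names and the sample markup are assumptions made for illustration, and the helper expects a plain alphanumeric id and reasonably well-formed markup.

```typescript
// The answer's pattern; [\s\S] replaces "." so that newlines are matched too.
const FOO_DIV = /^[\s\S]*?<div id="foo">([\s\S]*?)<\/div>[\s\S]*?$/;

function extractWithRegex(html: string): string {
  return html.replace(FOO_DIV, "$1");
}

// Hypothetical alternative: track <div> nesting depth so that inner
// closing tags do not end the match early.
function extractDivById(html: string, id: string): string | null {
  const open = new RegExp(`<div\\s+id=["']${id}["']\\s*>`, "i").exec(html);
  if (!open) return null;
  const start = open.index + open[0].length;
  const tag = /<div\b[^>]*?(\/?)>|<\/div\s*>/gi;
  tag.lastIndex = start; // scan only the markup after the opening tag
  let depth = 1;
  let m: RegExpExecArray | null;
  while ((m = tag.exec(html)) !== null) {
    if (m[0].charAt(1) === "/") {
      depth -= 1; // closing </div>
      if (depth === 0) return html.slice(start, m.index);
    } else if (m[1] !== "/") {
      depth += 1; // opening <div>; a self-closing <div .../> leaves depth unchanged
    }
  }
  return null; // unbalanced markup
}

const sample =
  '<html><body><div id="foo"><div id="page1">A</div>' +
  '<div id="page2">B</div></div></body></html>';

console.log(extractWithRegex(sample));
// → <div id="page1">A
console.log(extractDivById(sample, "foo"));
// → <div id="page1">A</div><div id="page2">B</div>
```

If the page divs really are self-closing, as in the question's outline, the regex alone is enough; the depth counter only matters once they carry their own closing tags.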



Source: https://stackoverflow.com/questions/1095557/should-i-use-yahoo-pipes-to-scrape-the-contents-of-a-div
