Can RapidMiner extract XPaths from a list of URLs, instead of first saving the HTML pages?

Posted by 99封情书 on 2019-12-03 21:15:58
NaMarPi

Web scraping in RapidMiner without saving the HTML pages to disk is a two-step process:

Step 1: Follow the video at http://vancouverdata.blogspot.com/2011/04/rapidminer-web-crawling-rapid-miner-web.html by Neil McGuigan, with the following difference:

  • Instead of the Crawl Web operator, use the Process Documents from Web operator. There is no option to specify an output directory, because the results are loaded directly into the ExampleSet.

The ExampleSet will contain the links matching the crawling rules.

Step 2: Follow the video at http://vancouverdata.blogspot.com/2011/04/web-scraping-rapidminer-xpath-web.html, but only from 7:40 onward, with the following difference:

  • Put the Extract Information subprocess inside the Process Documents from Web operator created in Step 1.

The ExampleSet will contain the links and the attributes matching the XPath queries.
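
For readers who want to see the same idea outside RapidMiner, here is a minimal Python sketch of the pattern the two steps describe: each page is fetched into memory, the path queries are run on it straight away, and the result is one row per link. It is not RapidMiner code. The names `fetch`, `extract_rows` and `fetch_page` are made up for this illustration. It also assumes the pages are well-formed XML without namespaces, because the standard-library `xml.etree.ElementTree` parser supports only a subset of XPath and rejects ordinary messy HTML; real pages would likely need a more tolerant third-party parser such as lxml.

```python
import urllib.request
import xml.etree.ElementTree as ET


def fetch(url):
    """Download a page and return its body as text; nothing is written to disk."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def extract_rows(urls, queries, fetch_page=fetch):
    """Return one dict per URL: the link plus one value per named path query.

    `queries` maps an attribute name to a path in ElementTree's limited
    XPath subset, for example {"heading": ".//h1"}.
    """
    rows = []
    for url in urls:
        # Parse the page in memory (assumption: well-formed markup).
        root = ET.fromstring(fetch_page(url))
        row = {"link": url}
        for name, path in queries.items():
            node = root.find(path)  # first match only
            if node is None:
                row[name] = None
            else:
                row[name] = "".join(node.itertext()).strip()
        rows.append(row)
    return rows
```

Calling `extract_rows(list_of_urls, {"title": ".//title"})` gives a list of rows that plays the role of the ExampleSet above: a `link` column plus one column per query, with `None` where a query matched nothing.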

I have much the same problem as you, and these posts from RapidMiner's forum may help a little: http://rapid-i.com/rapidforum/index.php/topic,2753.0.html and http://rapid-i.com/rapidforum/index.php?topic=3851.0.html

See ya ;)
