Can RapidMiner extract XPaths from a list of URLs instead of first saving the HTML pages?

Question

I've recently discovered RapidMiner, and I'm very excited about its capabilities. However, I'm still unsure whether the program can help with my specific needs. I want it to scrape XPath matches from a URL list I've generated with another program (which has more options than the 'Crawl Web' operator in RapidMiner).

I've seen the following tutorial from Neil McGuigan: http://vancouverdata.blogspot.com/2011/04/web-scraping-rapidminer-xpath-web.html. But the websites I'm trying to scrape have thousands of pages, and I don't want to store them all on my PC. The web crawler also lacks critical features, so I can't use it for my purposes. Is there a way to make it just read the URLs and scrape the XPath matches from each of them?

I've also looked at other tools for extracting HTML from pages, but I've been unable to figure out how they work (or even how to install them), since I'm not a programmer. RapidMiner, on the other hand, is easy to install and the operator descriptions make sense, but I've been unable to connect the operators in the right order.

I need some input to keep the motivation going. I would like to know which operator I could use instead of 'Process Documents from Files'. I've looked at 'Process Documents from Web', but it doesn't have an input, and it still needs to crawl. Any help is much appreciated.

Looking forward to your replies.

Answer 1:

Web scraping with RapidMiner without saving the HTML pages locally is a two-step process:

Step 1: Follow the video at http://vancouverdata.blogspot.com/2011/04/rapidminer-web-crawling-rapid-miner-web.html by Neil McGuigan, with the following difference:

Instead of the Crawl Web operator, use the Process Documents from Web operator. It has no option to specify an output directory, because the results are loaded into the ExampleSet.

The ExampleSet will contain the links matching the crawling rules.

Step 2: Follow the video at http://vancouverdata.blogspot.com/2011/04/web-scraping-rapidminer-xpath-web.html, but only from 7:40 onward, with the following difference:

Put the Extract Information subprocess inside the Process Documents from Web operator that was created in Step 1.

The ExampleSet will then contain the links and the attributes matching the XPath queries.
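
For anyone who would rather see the same idea outside RapidMiner, here is a minimal Python sketch of "read a list of URLs, apply an XPath to each page in memory, keep only the matches". It is not the RapidMiner process described above, just an illustration of the technique. The names `fetch_page`, `extract_matches` and `SAMPLE_PAGES` are made up for this example, and it assumes the pages are well-formed XHTML, because the standard library's ElementTree supports only a limited XPath subset and rejects sloppy HTML. Real-world pages would probably need a more lenient third-party parser.

```python
import urllib.request
import xml.etree.ElementTree as ET


def fetch_page(url):
    """Download one page and return its text without writing it to disk."""
    with urllib.request.urlopen(url, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_matches(urls, xpath, fetch=fetch_page):
    """Return (url, matched text) rows for every XPath match on every URL."""
    rows = []
    for url in urls:
        # Parse in memory; ElementTree needs well-formed markup (assumption).
        root = ET.fromstring(fetch(url))
        for element in root.findall(xpath):
            rows.append((url, "".join(element.itertext()).strip()))
    return rows


# Hypothetical in-memory pages stand in for real downloads in this demo.
SAMPLE_PAGES = {
    "http://example.com/a": "<html><body><h1>First</h1></body></html>",
    "http://example.com/b": "<html><body><h1>Second</h1><h1>Third</h1></body></html>",
}

rows = extract_matches(SAMPLE_PAGES, ".//h1", fetch=SAMPLE_PAGES.get)
print(rows)
```

To run it against real pages, pass your own list of URLs and leave out the `fetch` argument so that `fetch_page` downloads each one. Nothing is saved locally; only the `(url, text)` rows are kept.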

Answer 2:

I have much the same problem as you, and maybe these posts from the RapidMiner forum will help a little: http://rapid-i.com/rapidforum/index.php/topic,2753.0.html and http://rapid-i.com/rapidforum/index.php?topic=3851.0.html

See ya ;)

Source: https://stackoverflow.com/questions/9045024/can-rapidminer-extract-xpaths-from-a-list-of-urls-instead-of-first-saving-the-h

Tags

xpath

screen-scraping

web-scraping

data-mining

rapidminer