web-crawler

How to get crawled content in Crawljax

Submitted by 放肆的年华 on 2019-12-01 13:07:25
Question: I am crawling a dynamic web page with Crawljax. I can get the current crawl id, status and DOM, but I can't get the website content. Can anyone help me?

CrawljaxConfigurationBuilder builder = CrawljaxConfiguration.builderFor("http://demo.crawljax.com/");
builder.addPlugin(new OnNewStatePlugin() {
    @Override
    public String toString() {
        return "Our example plugin";
    }

    @Override
    public void onNewState(CrawlerContext cc, StateVertex sv) {
        LOG.info("Found a new dom! Here it is:\n{}", cc.getBrowser()…
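A minimal sketch of one way to get at the page content, assuming Crawljax 3.x, where StateVertex exposes getDom() and the embedded browser exposes getStrippedDom(); the class name and variable names here are my own, not the asker's:

import com.crawljax.core.CrawlerContext;
import com.crawljax.core.CrawljaxRunner;
import com.crawljax.core.configuration.CrawljaxConfiguration;
import com.crawljax.core.configuration.CrawljaxConfiguration.CrawljaxConfigurationBuilder;
import com.crawljax.core.plugin.OnNewStatePlugin;
import com.crawljax.core.state.StateVertex;

public class DomDumpExample {
    public static void main(String[] args) throws Exception {
        CrawljaxConfigurationBuilder builder =
                CrawljaxConfiguration.builderFor("http://demo.crawljax.com/");
        builder.addPlugin(new OnNewStatePlugin() {
            @Override
            public void onNewState(CrawlerContext context, StateVertex state) {
                // The HTML of the page as Crawljax captured it for this state
                String dom = state.getDom();
                // Alternatively: context.getBrowser().getStrippedDom()
                System.out.println("New state content:\n" + dom);
            }

            @Override
            public String toString() {
                return "DOM dump plugin";
            }
        });
        new CrawljaxRunner(builder.build()).call();
    }
}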

scrapy crawl [spider-name] fault

Submitted by 旧时模样 on 2019-12-01 12:54:33
Hi guys, I am building a web scraping project using the Scrapy framework and Python. In the spiders folder of my project I have two spiders named spider1 and spider2.

spider1.py
class spider(BaseSpider):
    name = "spider1"
    ........
    ........

spider2.py
class spider(BaseSpider):
    name = "spider2"
    ............
    ...........

settings.py
SPIDER_MODULES = ['project_name.spiders']
NEWSPIDER_MODULE = ['project_name.spiders']
ITEM_PIPELINES = ['project_name.pipelines.spider']

Now when I run the command scrapy crawl spider1 in my project root folder, it calls spider2.py instead of spider1.py. When I delete spider2…
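For reference, a sketch of the conventional layout (file, class, and URL names below are hypothetical): each spider file defines a distinctly named class with a unique name attribute, and scrapy crawl resolves the spider by that name attribute, not by file name. scrapy.Spider is the current base class (older releases used BaseSpider), and NEWSPIDER_MODULE is documented as a plain string rather than a list.

# project_name/spiders/spider1.py
import scrapy

class Spider1(scrapy.Spider):
    name = "spider1"
    start_urls = ["http://example.com/"]

    def parse(self, response):
        yield {"url": response.url}

# project_name/spiders/spider2.py
import scrapy

class Spider2(scrapy.Spider):
    name = "spider2"
    start_urls = ["http://example.org/"]

    def parse(self, response):
        yield {"url": response.url}

# project_name/settings.py
SPIDER_MODULES = ["project_name.spiders"]
NEWSPIDER_MODULE = "project_name.spiders"  # a string, not a list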

How would I download all kinds of file types from a website?

Submitted by 无人久伴 on 2019-12-01 12:16:05
I have the following code in a new class:

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using HtmlAgilityPack;
using System.IO;
using System.Text.RegularExpressions;
using System.Xml.Linq;
using System.Net;
using System.Web;
using System.Threading;
using DannyGeneral;
using GatherLinks;

namespace GatherLinks
{
    class RetrieveWebContent
    {
        HtmlAgilityPack.HtmlDocument doc;
        string imgg;
        int images;

        public RetrieveWebContent()
        {
            images = 0;
        }

        public List<string>…
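A minimal sketch of one way to download every linked resource from a page with HtmlAgilityPack plus WebClient; the class name, the href/src XPath, and the directory handling are my own assumptions, not the asker's code:

using System;
using System.IO;
using System.Net;
using HtmlAgilityPack;

class SimpleDownloader
{
    // Downloads every resource referenced by an href or src attribute on one page.
    public static void DownloadAll(string pageUrl, string targetDir)
    {
        Directory.CreateDirectory(targetDir);
        HtmlDocument doc = new HtmlWeb().Load(pageUrl);
        var nodes = doc.DocumentNode.SelectNodes("//a[@href] | //img[@src]");
        if (nodes == null) return; // nothing to download

        using (var client = new WebClient())
        {
            foreach (HtmlNode node in nodes)
            {
                string link = node.GetAttributeValue("href", "");
                if (link.Length == 0) link = node.GetAttributeValue("src", "");
                if (link.Length == 0) continue;

                Uri absolute = new Uri(new Uri(pageUrl), link);   // resolve relative URLs
                string fileName = Path.GetFileName(absolute.LocalPath);
                if (string.IsNullOrEmpty(fileName)) continue;     // skip directory-style URLs

                try
                {
                    client.DownloadFile(absolute, Path.Combine(targetDir, fileName));
                }
                catch (WebException)
                {
                    // skip resources that fail to download
                }
            }
        }
    }
}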

Calling the controller (crawler4j-3.5) inside a loop

Submitted by 泄露秘密 on 2019-12-01 12:11:10
Hi, I am calling the controller inside a for-loop because I have more than 100 URLs. I have them all in a list, and I iterate over it and crawl each page. I also pass each URL to setCustomData, because the crawl should not leave that domain.

for (Iterator<String> iterator = ifList.listIterator(); iterator.hasNext();) {
    String str = iterator.next();
    System.out.println("cheking" + str);
    CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
    controller.setCustomData(str);
    controller.addSeed(str);
    controller.startNonBlocking(BasicCrawler.class, numberOfCrawlers);
    controller…
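A sketch of the usual crawler4j pattern instead of one controller per URL: a single controller gets every URL as a seed, and the "do not leave the domain" rule lives in the crawler's shouldVisit. The class name and the ALLOWED_DOMAINS set are assumptions; the shouldVisit(WebURL) signature is the crawler4j 3.5 one.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class DomainBoundCrawler extends WebCrawler {

    // Domains the crawl is allowed to stay on (fill in from the seed list).
    private static final Set<String> ALLOWED_DOMAINS =
            new HashSet<>(Arrays.asList("example.com", "example.org"));

    @Override
    public boolean shouldVisit(WebURL url) {
        // Only follow links whose domain is one of the seed domains.
        return ALLOWED_DOMAINS.contains(url.getDomain());
    }
}

With this in place the loop can collapse to a single controller: add every entry of ifList as a seed with controller.addSeed(str) and call controller.start(DomainBoundCrawler.class, numberOfCrawlers) once.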

How to automatically retrieve the URLs that AJAX calls go to?

Submitted by 三世轮回 on 2019-12-01 11:44:53
Question: The aim is to program a CrawlSpider able to:

1) Retrieve the URLs of the links in the table on this page: http://cordis.europa.eu/fp7/security/projects_en.html
2) Follow the AJAX call from all those URLs to find the final ("AJAX") URLs containing the data I want to scrape
3) Scrape the final pages identified by the AJAX URLs.

So far, I have written two spiders in Scrapy:

1) The first one retrieves the URLs from the links on the start page. Here is the code:

from scrapy…
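One way to chain all three steps in a single spider is to hand each response off to the next callback. This is only a sketch: the table XPath is a guess, and ajax_url_for() is a hypothetical placeholder for whatever URL pattern the site's AJAX requests actually use (visible in the browser's network tab).

import scrapy

class ProjectsSpider(scrapy.Spider):
    name = "cordis_projects"
    start_urls = ["http://cordis.europa.eu/fp7/security/projects_en.html"]

    def parse(self, response):
        # 1) links sitting in the table on the start page
        for href in response.xpath("//table//a/@href").extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_project)

    def parse_project(self, response):
        # 2) jump to the AJAX endpoint that actually returns the data
        yield scrapy.Request(self.ajax_url_for(response.url), callback=self.parse_data)

    def parse_data(self, response):
        # 3) scrape the final AJAX response
        yield {"source": response.url, "length": len(response.body)}

    def ajax_url_for(self, project_url):
        # Placeholder: map a project page URL to its AJAX data URL once the
        # pattern has been identified in the developer tools.
        raise NotImplementedError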

Extract links for certain section only from blogspot using BeautifulSoup

Submitted by 谁都会走 on 2019-12-01 11:04:32
Question: I am trying to extract links from only a certain section of a Blogspot page, but the output shows that the code extracts every link on the page. Here is the code:

import urlparse
import urllib
from bs4 import BeautifulSoup

url = "http://ellywonderland.blogspot.com/"
urls = [url]
visited = [url]

while len(urls) > 0:
    try:
        htmltext = urllib.urlopen(urls[0]).read()
    except:
        print urls[0]
    soup = BeautifulSoup(htmltext)
    urls.pop(0)
    print len(urls)
    for tags in soup.find_all(attrs={'class': "post-title…
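A minimal sketch of restricting the extraction to the post-title section (written for Python 3 and current BeautifulSoup, so urllib.request replaces the urllib/urlparse calls in the question): only anchors nested inside elements carrying the post-title class are collected, instead of every link on the page.

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://ellywonderland.blogspot.com/"
soup = BeautifulSoup(urlopen(url).read(), "html.parser")

post_links = []
for heading in soup.find_all(attrs={"class": "post-title"}):
    # collect only the anchors inside the post-title elements
    for a in heading.find_all("a", href=True):
        post_links.append(a["href"])

print(len(post_links))
for link in post_links:
    print(link)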

Webcrawler in Go

Submitted by 廉价感情. on 2019-12-01 10:34:04
I'm trying to build a web crawler in Go where I would like to specify the maximum number of concurrent workers. They should all keep working as long as there are links to explore in the queue. When the queue has fewer elements than workers, workers should shut down, but resume if more links are found. The code I have tried is:

const max_workers = 6

// simulating links with int
func crawl(wg *sync.WaitGroup, queue chan int) {
    for element := range queue {
        wg.Done() // why is defer here causing a deadlock?
        fmt.Println("adding 2 new elements ")
        if element%2 == 0 {
            wg.Add(2)
            queue <- (element*100 + 11)…
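A sketch of one pattern that avoids the deadlock (names and the buffer size are my own choices; links are simulated with ints as in the question): the WaitGroup counts queued items rather than workers, each worker calls wg.Add before pushing newly found links and wg.Done only after it has finished an item, and main closes the queue once the count reaches zero so every worker's range loop can exit.

package main

import (
    "fmt"
    "sync"
)

const maxWorkers = 6

func worker(id int, queue chan int, wg *sync.WaitGroup) {
    for element := range queue {
        fmt.Printf("worker %d crawling %d\n", id, element)
        if element%2 == 0 && element < 1000 {
            wg.Add(2) // register the new "links" before queueing them
            queue <- element*100 + 11
            queue <- element*100 + 33
        }
        wg.Done() // this element is fully processed, including any re-queueing
    }
}

func main() {
    // Buffered so workers can re-queue new links without blocking each other.
    queue := make(chan int, 1000)

    var wg sync.WaitGroup
    seeds := []int{2, 4, 6}
    wg.Add(len(seeds))
    for _, s := range seeds {
        queue <- s
    }

    for i := 0; i < maxWorkers; i++ {
        go worker(i, queue, &wg)
    }

    wg.Wait()    // zero items left: everything queued has been processed
    close(queue) // lets every worker's range loop terminate
}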

How to write Python Scrapy code for extracting the URLs present in the sitemap of a site

Submitted by ぐ巨炮叔叔 on 2019-12-01 10:00:38
Question: I'm trying to use this code to get the list of URLs in a sitemap. When I run it, I see no results on the screen. Could anyone tell me what the problem is, or suggest a better approach with a good example? Thanks in advance.

class MySpider(SitemapSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]

    def parse(self, response):
        print response.url
        return Request(response.url, callback=self.parse_sitemap_url)

    def parse_sitemap_url(self, response):
        # do stuff with…
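A minimal sketch (class and item names are mine): SitemapSpider already downloads sitemap.xml, expands nested sitemaps, and calls the callback once per page URL it finds, so the URLs can simply be logged or yielded from parse(). Re-requesting response.url from inside parse(), as in the code above, is normally dropped by Scrapy's duplicate filter, which is one likely reason nothing shows up.

from scrapy.spiders import SitemapSpider

class SitemapUrlSpider(SitemapSpider):
    name = "xyz_sitemap"
    allowed_domains = ["xyz.nl"]
    sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]

    def parse(self, response):
        # called once for every URL listed in the sitemap
        self.logger.info("sitemap URL: %s", response.url)
        yield {"url": response.url}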
