web-crawler

How to get crawled content in Crawljax

Submitted by 放肆的年华 on 2019-12-01 13:07:25
Question: I am crawling a dynamic web page with Crawljax. I can get the current crawl id, status and DOM, but I can't get the website content. Can anyone help me?

CrawljaxConfigurationBuilder builder = CrawljaxConfiguration.builderFor("http://demo.crawljax.com/");
builder.addPlugin(new OnNewStatePlugin() {
    @Override
    public String toString() {
        return "Our example plugin";
    }

    @Override
    public void onNewState(CrawlerContext cc, StateVertex sv) {
        LOG.info("Found a new dom! Here it is:\n{}", cc.getBrowser()…
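A minimal sketch of one way to get at the page content, assuming Crawljax 3.x, where StateVertex exposes getDom() and the embedded browser exposes getStrippedDom(); the class name and variable names here are my own, not the asker's:

import com.crawljax.core.CrawlerContext;
import com.crawljax.core.CrawljaxRunner;
import com.crawljax.core.configuration.CrawljaxConfiguration;
import com.crawljax.core.configuration.CrawljaxConfiguration.CrawljaxConfigurationBuilder;
import com.crawljax.core.plugin.OnNewStatePlugin;
import com.crawljax.core.state.StateVertex;

public class DomDumpExample {
    public static void main(String[] args) throws Exception {
        CrawljaxConfigurationBuilder builder =
                CrawljaxConfiguration.builderFor("http://demo.crawljax.com/");
        builder.addPlugin(new OnNewStatePlugin() {
            @Override
            public void onNewState(CrawlerContext context, StateVertex state) {
                // The HTML of the page as Crawljax captured it for this state
                String dom = state.getDom();
                // Alternatively: context.getBrowser().getStrippedDom()
                System.out.println("New state content:\n" + dom);
            }

            @Override
            public String toString() {
                return "DOM dump plugin";
            }
        });
        new CrawljaxRunner(builder.build()).call();
    }
}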

scrapy crawl [spider-name] fault

Submitted by 旧时模样 on 2019-12-01 12:54:33
Hi guys, I am building a web scraping project using the Scrapy framework and Python. In the spiders folder of my project I have two spiders named spider1 and spider2.

spider1.py
class spider(BaseSpider):
    name = "spider1"
    ........
    ........

spider2.py
class spider(BaseSpider):
    name = "spider2"
    ............
    ...........

settings.py
SPIDER_MODULES = ['project_name.spiders']
NEWSPIDER_MODULE = ['project_name.spiders']
ITEM_PIPELINES = ['project_name.pipelines.spider']

Now when I run the command scrapy crawl spider1 in my project root folder, it calls spider2.py instead of spider1.py. When I delete spider2…
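For reference, a sketch of the conventional layout (file, class, and URL names below are hypothetical): each spider file defines a distinctly named class with a unique name attribute, and scrapy crawl resolves the spider by that name attribute, not by file name. scrapy.Spider is the current base class (older releases used BaseSpider), and NEWSPIDER_MODULE is documented as a plain string rather than a list.

# project_name/spiders/spider1.py
import scrapy

class Spider1(scrapy.Spider):
    name = "spider1"
    start_urls = ["http://example.com/"]

    def parse(self, response):
        yield {"url": response.url}

# project_name/spiders/spider2.py
import scrapy

class Spider2(scrapy.Spider):
    name = "spider2"
    start_urls = ["http://example.org/"]

    def parse(self, response):
        yield {"url": response.url}

# project_name/settings.py
SPIDER_MODULES = ["project_name.spiders"]
NEWSPIDER_MODULE = "project_name.spiders"  # a string, not a list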

How would I download all kinds of file types from a website?

Submitted by 无人久伴 on 2019-12-01 12:16:05
I have the following code in a new class:

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using HtmlAgilityPack;
using System.IO;
using System.Text.RegularExpressions;
using System.Xml.Linq;
using System.Net;
using System.Web;
using System.Threading;
using DannyGeneral;
using GatherLinks;

namespace GatherLinks
{
    class RetrieveWebContent
    {
        HtmlAgilityPack.HtmlDocument doc;
        string imgg;
        int images;

        public RetrieveWebContent()
        {
            images = 0;
        }

        public List<string>…
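A minimal sketch of one way to download every linked resource from a page with HtmlAgilityPack plus WebClient; the class name, the href/src XPath, and the directory handling are my own assumptions, not the asker's code:

using System;
using System.IO;
using System.Net;
using HtmlAgilityPack;

class SimpleDownloader
{
    // Downloads every resource referenced by an href or src attribute on one page.
    public static void DownloadAll(string pageUrl, string targetDir)
    {
        Directory.CreateDirectory(targetDir);
        HtmlDocument doc = new HtmlWeb().Load(pageUrl);
        var nodes = doc.DocumentNode.SelectNodes("//a[@href] | //img[@src]");
        if (nodes == null) return; // nothing to download

        using (var client = new WebClient())
        {
            foreach (HtmlNode node in nodes)
            {
                string link = node.GetAttributeValue("href", "");
                if (link.Length == 0) link = node.GetAttributeValue("src", "");
                if (link.Length == 0) continue;

                Uri absolute = new Uri(new Uri(pageUrl), link);   // resolve relative URLs
                string fileName = Path.GetFileName(absolute.LocalPath);
                if (string.IsNullOrEmpty(fileName)) continue;     // skip directory-style URLs

                try
                {
                    client.DownloadFile(absolute, Path.Combine(targetDir, fileName));
                }
                catch (WebException)
                {
                    // skip resources that fail to download
                }
            }
        }
    }
}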

Calling the controller (crawler4j-3.5) inside a loop

Submitted by 泄露秘密 on 2019-12-01 12:11:10
Hi, I am calling the controller inside a for-loop because I have more than 100 URLs. I have them all in a list, and I iterate over it and crawl each page. I also pass each URL to setCustomData, because the crawl should not leave that domain.

for (Iterator<String> iterator = ifList.listIterator(); iterator.hasNext();) {
    String str = iterator.next();
    System.out.println("cheking" + str);
    CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
    controller.setCustomData(str);
    controller.addSeed(str);
    controller.startNonBlocking(BasicCrawler.class, numberOfCrawlers);
    controller…
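A sketch of the usual crawler4j pattern instead of one controller per URL: a single controller gets every URL as a seed, and the "do not leave the domain" rule lives in the crawler's shouldVisit. The class name and the ALLOWED_DOMAINS set are assumptions; the shouldVisit(WebURL) signature is the crawler4j 3.5 one.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class DomainBoundCrawler extends WebCrawler {

    // Domains the crawl is allowed to stay on (fill in from the seed list).
    private static final Set<String> ALLOWED_DOMAINS =
            new HashSet<>(Arrays.asList("example.com", "example.org"));

    @Override
    public boolean shouldVisit(WebURL url) {
        // Only follow links whose domain is one of the seed domains.
        return ALLOWED_DOMAINS.contains(url.getDomain());
    }
}

With this in place the loop can collapse to a single controller: add every entry of ifList as a seed with controller.addSeed(str) and call controller.start(DomainBoundCrawler.class, numberOfCrawlers) once.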

How to automatically retrieve the URLs that AJAX calls go to?

Submitted by 三世轮回 on 2019-12-01 11:44:53
Question: The aim is to program a CrawlSpider able to:

1) Retrieve the URLs of the links in the table on this page: http://cordis.europa.eu/fp7/security/projects_en.html
2) Follow the AJAX call from all those URLs to find the final ("AJAX") URLs containing the data I want to scrape
3) Scrape the final pages identified by the AJAX URLs.

So far, I have written two spiders in Scrapy:

1) The first one retrieves the URLs from the links on the start page. Here is the code:

from scrapy…
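One way to chain all three steps in a single spider is to hand each response off to the next callback. This is only a sketch: the table XPath is a guess, and ajax_url_for() is a hypothetical placeholder for whatever URL pattern the site's AJAX requests actually use (visible in the browser's network tab).

import scrapy

class ProjectsSpider(scrapy.Spider):
    name = "cordis_projects"
    start_urls = ["http://cordis.europa.eu/fp7/security/projects_en.html"]

    def parse(self, response):
        # 1) links sitting in the table on the start page
        for href in response.xpath("//table//a/@href").extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_project)

    def parse_project(self, response):
        # 2) jump to the AJAX endpoint that actually returns the data
        yield scrapy.Request(self.ajax_url_for(response.url), callback=self.parse_data)

    def parse_data(self, response):
        # 3) scrape the final AJAX response
        yield {"source": response.url, "length": len(response.body)}

    def ajax_url_for(self, project_url):
        # Placeholder: map a project page URL to its AJAX data URL once the
        # pattern has been identified in the developer tools.
        raise NotImplementedError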

Extract links for certain section only from blogspot using BeautifulSoup

Submitted by 谁都会走 on 2019-12-01 11:04:32
Question: I am trying to extract links from only a certain section of a Blogspot page, but the output shows that the code extracts every link on the page. Here is the code:

import urlparse
import urllib
from bs4 import BeautifulSoup

url = "http://ellywonderland.blogspot.com/"
urls = [url]
visited = [url]

while len(urls) > 0:
    try:
        htmltext = urllib.urlopen(urls[0]).read()
    except:
        print urls[0]
    soup = BeautifulSoup(htmltext)
    urls.pop(0)
    print len(urls)
    for tags in soup.find_all(attrs={'class': "post-title…
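A minimal sketch of restricting the extraction to the post-title section (written for Python 3 and current BeautifulSoup, so urllib.request replaces the urllib/urlparse calls in the question): only anchors nested inside elements carrying the post-title class are collected, instead of every link on the page.

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://ellywonderland.blogspot.com/"
soup = BeautifulSoup(urlopen(url).read(), "html.parser")

post_links = []
for heading in soup.find_all(attrs={"class": "post-title"}):
    # collect only the anchors inside the post-title elements
    for a in heading.find_all("a", href=True):
        post_links.append(a["href"])

print(len(post_links))
for link in post_links:
    print(link)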

Webcrawler in Go

Submitted by 廉价感情. on 2019-12-01 10:34:04
I'm trying to build a web crawler in Go where I would like to specify the maximum number of concurrent workers. They should all keep working as long as there are links to explore in the queue. When the queue has fewer elements than workers, workers should shut down, but resume if more links are found. The code I have tried is:

const max_workers = 6

// simulating links with int
func crawl(wg *sync.WaitGroup, queue chan int) {
    for element := range queue {
        wg.Done() // why is defer here causing a deadlock?
        fmt.Println("adding 2 new elements ")
        if element%2 == 0 {
            wg.Add(2)
            queue <- (element*100 + 11)…
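A sketch of one pattern that avoids the deadlock (names and the buffer size are my own choices; links are simulated with ints as in the question): the WaitGroup counts queued items rather than workers, each worker calls wg.Add before pushing newly found links and wg.Done only after it has finished an item, and main closes the queue once the count reaches zero so every worker's range loop can exit.

package main

import (
    "fmt"
    "sync"
)

const maxWorkers = 6

func worker(id int, queue chan int, wg *sync.WaitGroup) {
    for element := range queue {
        fmt.Printf("worker %d crawling %d\n", id, element)
        if element%2 == 0 && element < 1000 {
            wg.Add(2) // register the new "links" before queueing them
            queue <- element*100 + 11
            queue <- element*100 + 33
        }
        wg.Done() // this element is fully processed, including any re-queueing
    }
}

func main() {
    // Buffered so workers can re-queue new links without blocking each other.
    queue := make(chan int, 1000)

    var wg sync.WaitGroup
    seeds := []int{2, 4, 6}
    wg.Add(len(seeds))
    for _, s := range seeds {
        queue <- s
    }

    for i := 0; i < maxWorkers; i++ {
        go worker(i, queue, &wg)
    }

    wg.Wait()    // zero items left: everything queued has been processed
    close(queue) // lets every worker's range loop terminate
}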

How to write Python Scrapy code for extracting the URLs present in the sitemap of a site

Submitted by ぐ巨炮叔叔 on 2019-12-01 10:00:38
Question: I'm trying to use this code to get the list of URLs in a sitemap. When I run it, I see no results on the screen. Could anyone tell me what the problem is, or suggest a better approach with a good example? Thanks in advance.

class MySpider(SitemapSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]

    def parse(self, response):
        print response.url
        return Request(response.url, callback=self.parse_sitemap_url)

    def parse_sitemap_url(self, response):
        # do stuff with…
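A minimal sketch (class and item names are mine): SitemapSpider already downloads sitemap.xml, expands nested sitemaps, and calls the callback once per page URL it finds, so the URLs can simply be logged or yielded from parse(). Re-requesting response.url from inside parse(), as in the code above, is normally dropped by Scrapy's duplicate filter, which is one likely reason nothing shows up.

from scrapy.spiders import SitemapSpider

class SitemapUrlSpider(SitemapSpider):
    name = "xyz_sitemap"
    allowed_domains = ["xyz.nl"]
    sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]

    def parse(self, response):
        # called once for every URL listed in the sitemap
        self.logger.info("sitemap URL: %s", response.url)
        yield {"url": response.url}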
