web-crawler

php convert all links to absolute urls

有些话、适合烂在心里 submitted on 2019-12-10 14:13:59
Question: I am writing a website crawler in PHP and I already have code that can extract all links from a site. One problem: sites use a combination of absolute and relative URLs. Examples (http replaced with hxxp as I can't post hyperlinks): hxxp://site.com/ site.com site.com/index.php hxxp://site.com/hello/index.php /hello/index.php hxxp://site2.com/index.php site2.com/index.php I have no control over the links (whether they are absolute or relative), but I do need to follow them. I need to convert all these
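The excerpt is cut off, but the underlying task is standard URL resolution: resolve each extracted link against the URL of the page it was found on. The question is about PHP; purely as a language-neutral illustration, here is a minimal Python sketch using urllib.parse.urljoin, with made-up base and link values. Host-only strings such as site.com would still need special-casing, since without a scheme or leading slash they resolve as relative paths.

```python
from urllib.parse import urljoin

def to_absolute(base_url, link):
    """Resolve a possibly-relative link against the page it was found on."""
    return urljoin(base_url, link)

base = "http://site.com/hello/index.php"
for link in ["/hello/index.php", "index.php", "http://site2.com/index.php"]:
    print(to_absolute(base, link))
# http://site.com/hello/index.php
# http://site.com/hello/index.php
# http://site2.com/index.php
```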

Scrapy 0.22: An error occurred while connecting: <class 'twisted.internet.error.ConnectionLost'>

醉酒当歌 submitted on 2019-12-10 12:21:13
Question: Good morning, I get a connection error while executing one of my spiders: 2014-02-28 10:21:00+0400 [butik] DEBUG: Retrying <GET http://www.butik.ru/> (failed 1 times): An error occurred while connecting: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion: Connection lost.]. Afterwards the spider shuts down. All other spiders with a similar structure are running smoothly, but this
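The excerpt ends before any diagnosis. When one site resets connections while sibling spiders with the same structure run fine, a common first step is to send a browser-like User-Agent and loosen the retry/timeout settings, since some servers drop connections from the default Scrapy identifier. A sketch of the relevant settings follows; the values are illustrative, not a confirmed fix for this particular site.

```python
# settings.py -- illustrative values only
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; rv:27.0) Gecko/20100101 Firefox/27.0"

RETRY_ENABLED = True
RETRY_TIMES = 5        # default is 2; give a flaky host a few more attempts
DOWNLOAD_TIMEOUT = 30  # seconds before a request is considered failed
DOWNLOAD_DELAY = 1.0   # slow down in case the server drops rapid connections
```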

Crawling websites which ask for authentication

允我心安 submitted on 2019-12-10 12:18:54
Question: I followed this https://wiki.apache.org/nutch/HttpAuthenticationSchemes link for crawling a few websites by providing a username and password. Workaround: i) I have set the auth configuration in the httpclient-auth.xml file: <auth-configuration> <credentials username="xyz" password="xyz"> <default realm="domain" /> <authscope host="www.gmail.com" port="80"/> </credentials> </auth-configuration> ii) Defined the httpclient property in both nutch-site.xml and nutch-default.xml <property> <name>plugin.includes<

Implementing Threads Into Java Web Crawler

梦想的初衷 submitted on 2019-12-10 12:06:24
Question: Here is the original web crawler which I wrote (just for reference): https://github.com/domshahbazi/java-webcrawler/tree/master This is a simple web crawler which visits a given initial web page, scrapes all the links from the page and adds them to a queue (LinkedList), where they are then popped off one by one and visited, and the cycle starts again. To speed up my program, and for learning, I tried to implement it using threads so I could have many threads operating at once, indexing
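The question is about Java, but the moving parts of a multithreaded crawler are the same in any language: a shared frontier queue, a visited set guarded against concurrent access, and a pool of workers. A minimal Python sketch of that structure, where fetch and extract_links are placeholders rather than anything from the original code:

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from queue import Empty, Queue

frontier = Queue()               # URLs waiting to be crawled (thread-safe)
visited = set()                  # URLs already claimed by some worker
visited_lock = threading.Lock()  # protects the visited set

def fetch(url):
    return ""    # placeholder: download the page here

def extract_links(html):
    return []    # placeholder: parse <a href="..."> values here

def worker():
    while True:
        try:
            url = frontier.get(timeout=5)   # exit once the frontier stays empty
        except Empty:
            return
        with visited_lock:                  # only one thread may claim a URL
            if url in visited:
                continue
            visited.add(url)
        for link in extract_links(fetch(url)):
            frontier.put(link)              # newly discovered links go back on the queue

def crawl(seed, workers=8):
    frontier.put(seed)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(workers):
            pool.submit(worker)

# crawl("http://example.com/")
```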

scrapy crawler to pass multiple item classes to pipeline

為{幸葍}努か submitted on 2019-12-10 12:02:40
Question: Hi, I am very new to Python and Scrapy; this is my first code and I can't solve a problem that looks pretty basic. I have the crawler set up to do two things: 1- Find all pagination URLs, visit them and get some data from each page; 2- Get all links listed on the results pages, visit them and crawl each location's data. I decide how each item is parsed using rules with callbacks. I created two classes inside items.py, one for each parser. The second rule is processed perfectly, but the first
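The excerpt cuts off before the actual error, but the setup it describes, two item classes feeding the pipeline, is usually handled by branching on the item type inside process_item. A minimal sketch; the class and field names are made up rather than taken from the question:

```python
# items.py -- one item class per parser (names are illustrative)
from scrapy import Field, Item

class PageItem(Item):
    page_url = Field()

class LocationItem(Item):
    name = Field()
    address = Field()

# pipelines.py -- a single pipeline that handles both item types
class MultiItemPipeline:
    def process_item(self, item, spider):
        if isinstance(item, PageItem):
            pass    # store/export pagination data here
        elif isinstance(item, LocationItem):
            pass    # store/export location data here
        return item
```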

Scrapy not crawling subsequent pages in order

若如初见. submitted on 2019-12-10 11:57:01
Question: I am writing a crawler to get the names of items from a website. The website has 25 items per page and multiple pages (200 for some item types). Here is the code: from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.selector import HtmlXPathSelector from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from lonelyplanet.items import LonelyplanetItem class LonelyplanetSpider(CrawlSpider): name = "lonelyplanetItemName_spider" allowed_domains = ["lonelyplanet.com"]
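The excerpt ends before the spider's rules and callbacks, but the usual explanation for this symptom is that Scrapy downloads pages concurrently, so responses (and therefore scraped items) arrive in whatever order the requests finish. Two common workarounds are setting CONCURRENT_REQUESTS = 1 (strictly sequential, but slow) or carrying a page index along with each request and re-sorting afterwards. A sketch of the second idea, with made-up selectors and URLs rather than the asker's actual spider:

```python
import scrapy

class ItemNameSpider(scrapy.Spider):
    name = "itemname"
    start_urls = ["http://example.com/items?page=1"]

    def parse(self, response):
        page = response.meta.get("page", 1)
        # Tag every item with its page and position so the order can be restored later.
        for position, title in enumerate(response.css("h2.item::text").getall()):
            yield {"page": page, "position": position, "name": title}
        next_url = response.css("a.next::attr(href)").get()
        if next_url:
            yield scrapy.Request(response.urljoin(next_url),
                                 callback=self.parse,
                                 meta={"page": page + 1})
```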

Python requests error 10060

北战南征 submitted on 2019-12-10 11:43:34
Question: I have a script that crawls a website. Until today it ran perfectly; however, it does not do so now. It gives me the following error: Connection Aborted Error(10060 'A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond' I have been looking into answers and settings but I cannot figure out how to fix this... In IE I am not using any proxy (Connections -> LAN
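The excerpt is cut off, but error 10060 is the Windows connect-timeout code: the TCP connection never completes, which usually points at a proxy/firewall change or a server that has started blocking the client. On the requests side, the usual hardening is an explicit timeout, automatic retries, and a browser-like User-Agent. A sketch under those assumptions; the URL and header values are placeholders:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

# Retry a few times with exponential backoff before giving up.
retries = Retry(total=3, backoff_factor=1, status_forcelist=[502, 503, 504])
session.mount("http://", HTTPAdapter(max_retries=retries))
session.mount("https://", HTTPAdapter(max_retries=retries))

# (connect timeout, read timeout) in seconds; without a timeout a dead host hangs forever.
response = session.get("http://example.com/", timeout=(10, 30))
print(response.status_code)
```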

Asynchronous crawling F#

♀尐吖头ヾ submitted on 2019-12-10 11:37:40
Question: When crawling web pages I need to be careful not to make too many requests to the same domain; for example, I want to put 1 s between requests. From what I understand it is the time between requests that matters. So to speed things up I want to use async workflows in F#, the idea being to issue requests at 1-second intervals while avoiding blocking while waiting for the response. let getHtmlPrimitiveAsyncTimer (uri : System.Uri) (timer:int) = async{ let req = (WebRequest.Create
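The question itself is about F# async workflows; purely as an illustration of the same scheduling idea, start a new request every second without waiting for the previous response, here is a Python asyncio sketch (the URLs are placeholders and the downloader is deliberately simplistic):

```python
import asyncio
import urllib.request

async def fetch(url):
    # Run the blocking download in a worker thread so the event loop stays free.
    return await asyncio.to_thread(
        lambda: urllib.request.urlopen(url, timeout=30).read()
    )

async def crawl(urls, delay=1.0):
    tasks = []
    for url in urls:
        tasks.append(asyncio.create_task(fetch(url)))  # start the request now
        await asyncio.sleep(delay)                     # wait 1 s before starting the next
    return await asyncio.gather(*tasks)                # responses are collected as they finish

# asyncio.run(crawl(["http://example.com/", "http://example.org/"]))
```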

php crawler detection

大兔子大兔子 submitted on 2019-12-10 10:56:16
Question: I'm trying to write a sitemap.php which acts differently depending on who is looking. I want to redirect crawlers to my sitemap.xml, as that will be the most up-to-date page and will contain all the info they need, but I want my regular readers to be shown an HTML sitemap on the PHP page. This will all be controlled from within the PHP header, and I've found this code on the web which by the looks of it should work, but it isn't. Can anyone help crack this for me? function getIsCrawler($userAgent)
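The PHP function is cut off, but the underlying technique is plain User-Agent sniffing: check the request's User-Agent string against a list of known bot identifiers and branch on the result. A minimal Python sketch of that logic; the pattern list is illustrative, not exhaustive, and bots can always fake their User-Agent:

```python
import re

# Fragments that commonly appear in crawler User-Agent strings (illustrative only).
CRAWLER_RE = re.compile(r"googlebot|bingbot|slurp|duckduckbot|baiduspider|yandex", re.IGNORECASE)

def is_crawler(user_agent):
    """Return True if the User-Agent string looks like a known crawler."""
    return bool(user_agent) and CRAWLER_RE.search(user_agent) is not None

# In sitemap.php this result would drive the branch: redirect crawlers to
# sitemap.xml, render the HTML sitemap for everyone else.
print(is_crawler("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))  # True
print(is_crawler("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/115.0"))                   # False
```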

Syntax error, insert “… VariableDeclaratorId” to complete FormalParameterList

折月煮酒 submitted on 2019-12-10 10:06:31
Question: I am facing some issues with this code: import edu.uci.ics.crawler4j.crawler.CrawlConfig; import edu.uci.ics.crawler4j.crawler.CrawlController; import edu.uci.ics.crawler4j.fetcher.PageFetcher; import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig; import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer; public class Controller { String crawlStorageFolder = "/data/crawl/root"; int numberOfCrawlers = 7; CrawlConfig config = new CrawlConfig(); config.setCrawlStorageFolder(crawlStorageFolder); /