web-crawler

php convert all links to absolute urls

有些话、适合烂在心里 submitted on 2019-12-10 14:13:59
Question: I am writing a website crawler in PHP and I already have code that can extract all links from a site. One problem: sites use a combination of absolute and relative URLs. Examples (http replaced with hxxp as I can't post hyperlinks): hxxp://site.com/ site.com site.com/index.php hxxp://site.com/hello/index.php /hello/index.php hxxp://site2.com/index.php site2.com/index.php I have no control over the links (whether they are absolute or relative), but I do need to follow them. I need to convert all these
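The excerpt is cut off, but the underlying task is standard URL resolution: resolve each extracted link against the URL of the page it was found on. The question is about PHP; purely as a language-neutral illustration, here is a minimal Python sketch using urllib.parse.urljoin, with made-up base and link values. Host-only strings such as site.com would still need special-casing, since without a scheme or leading slash they resolve as relative paths.

```python
from urllib.parse import urljoin

def to_absolute(base_url, link):
    """Resolve a possibly-relative link against the page it was found on."""
    return urljoin(base_url, link)

base = "http://site.com/hello/index.php"
for link in ["/hello/index.php", "index.php", "http://site2.com/index.php"]:
    print(to_absolute(base, link))
# http://site.com/hello/index.php
# http://site.com/hello/index.php
# http://site2.com/index.php
```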

Scrapy 0.22: An error occurred while connecting: <class 'twisted.internet.error.ConnectionLost'>

醉酒当歌 submitted on 2019-12-10 12:21:13
Question: Good morning, I get a connection error while executing one of my spiders: 2014-02-28 10:21:00+0400 [butik] DEBUG: Retrying <GET http://www.butik.ru/> (failed 1 times): An error occurred while connecting: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion: Connection lost.]. Afterwards the spider shuts down. All other spiders with a similar structure are running smoothly, but this
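The excerpt ends before any diagnosis. When one site resets connections while sibling spiders with the same structure run fine, a common first step is to send a browser-like User-Agent and loosen the retry/timeout settings, since some servers drop connections from the default Scrapy identifier. A sketch of the relevant settings follows; the values are illustrative, not a confirmed fix for this particular site.

```python
# settings.py -- illustrative values only
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; rv:27.0) Gecko/20100101 Firefox/27.0"

RETRY_ENABLED = True
RETRY_TIMES = 5        # default is 2; give a flaky host a few more attempts
DOWNLOAD_TIMEOUT = 30  # seconds before a request is considered failed
DOWNLOAD_DELAY = 1.0   # slow down in case the server drops rapid connections
```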

Crawling websites which ask for authentication

允我心安 submitted on 2019-12-10 12:18:54
Question: I followed this https://wiki.apache.org/nutch/HttpAuthenticationSchemes link for crawling a few websites by providing a username and password. Workaround: i) I have set the auth configuration in the httpclient-auth.xml file: <auth-configuration> <credentials username="xyz" password="xyz"> <default realm="domain" /> <authscope host="www.gmail.com" port="80"/> </credentials> </auth-configuration> ii) Defined the httpclient property in both nutch-site.xml and nutch-default.xml <property> <name>plugin.includes<

Implementing Threads Into Java Web Crawler

梦想的初衷 submitted on 2019-12-10 12:06:24
Question: Here is the original web crawler which I wrote (just for reference): https://github.com/domshahbazi/java-webcrawler/tree/master This is a simple web crawler which visits a given initial web page, scrapes all the links from the page and adds them to a queue (LinkedList), where they are then popped off one by one and visited, and the cycle starts again. To speed up my program, and for learning, I tried to implement it using threads so I could have many threads operating at once, indexing
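The question is about Java, but the moving parts of a multithreaded crawler are the same in any language: a shared frontier queue, a visited set guarded against concurrent access, and a pool of workers. A minimal Python sketch of that structure, where fetch and extract_links are placeholders rather than anything from the original code:

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from queue import Empty, Queue

frontier = Queue()               # URLs waiting to be crawled (thread-safe)
visited = set()                  # URLs already claimed by some worker
visited_lock = threading.Lock()  # protects the visited set

def fetch(url):
    return ""    # placeholder: download the page here

def extract_links(html):
    return []    # placeholder: parse <a href="..."> values here

def worker():
    while True:
        try:
            url = frontier.get(timeout=5)   # exit once the frontier stays empty
        except Empty:
            return
        with visited_lock:                  # only one thread may claim a URL
            if url in visited:
                continue
            visited.add(url)
        for link in extract_links(fetch(url)):
            frontier.put(link)              # newly discovered links go back on the queue

def crawl(seed, workers=8):
    frontier.put(seed)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(workers):
            pool.submit(worker)

# crawl("http://example.com/")
```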

scrapy crawler to pass multiple item classes to pipeline

為{幸葍}努か submitted on 2019-12-10 12:02:40
Question: Hi, I am very new to Python and Scrapy; this is my first code and I can't solve a problem that looks pretty basic. I have the crawler set up to do two things: 1- Find all pagination URLs, visit them and get some data from each page; 2- Get all links listed on the results pages, visit them and crawl each location's data. I decide how each item is parsed using rules with callbacks. I created two classes inside items.py, one for each parser. The second rule is processed perfectly, but the first
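The excerpt cuts off before the actual error, but the setup it describes, two item classes feeding the pipeline, is usually handled by branching on the item type inside process_item. A minimal sketch; the class and field names are made up rather than taken from the question:

```python
# items.py -- one item class per parser (names are illustrative)
from scrapy import Field, Item

class PageItem(Item):
    page_url = Field()

class LocationItem(Item):
    name = Field()
    address = Field()

# pipelines.py -- a single pipeline that handles both item types
class MultiItemPipeline:
    def process_item(self, item, spider):
        if isinstance(item, PageItem):
            pass    # store/export pagination data here
        elif isinstance(item, LocationItem):
            pass    # store/export location data here
        return item
```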

Scrapy not crawling subsequent pages in order

若如初见. submitted on 2019-12-10 11:57:01
Question: I am writing a crawler to get the names of items from a website. The website has 25 items per page and multiple pages (200 for some item types). Here is the code: from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.selector import HtmlXPathSelector from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from lonelyplanet.items import LonelyplanetItem class LonelyplanetSpider(CrawlSpider): name = "lonelyplanetItemName_spider" allowed_domains = ["lonelyplanet.com"]
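The excerpt ends before the spider's rules and callbacks, but the usual explanation for this symptom is that Scrapy downloads pages concurrently, so responses (and therefore scraped items) arrive in whatever order the requests finish. Two common workarounds are setting CONCURRENT_REQUESTS = 1 (strictly sequential, but slow) or carrying a page index along with each request and re-sorting afterwards. A sketch of the second idea, with made-up selectors and URLs rather than the asker's actual spider:

```python
import scrapy

class ItemNameSpider(scrapy.Spider):
    name = "itemname"
    start_urls = ["http://example.com/items?page=1"]

    def parse(self, response):
        page = response.meta.get("page", 1)
        # Tag every item with its page and position so the order can be restored later.
        for position, title in enumerate(response.css("h2.item::text").getall()):
            yield {"page": page, "position": position, "name": title}
        next_url = response.css("a.next::attr(href)").get()
        if next_url:
            yield scrapy.Request(response.urljoin(next_url),
                                 callback=self.parse,
                                 meta={"page": page + 1})
```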

Python requests error 10060

北战南征 submitted on 2019-12-10 11:43:34
Question: I have a script that crawls a website. Until today it ran perfectly; however, it does not do so now. It gives me the following error: Connection Aborted Error(10060 'A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond' I have been looking into answers and settings but I cannot figure out how to fix this... In IE I am not using any proxy (Connections -> LAN
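The excerpt is cut off, but error 10060 is the Windows connect-timeout code: the TCP connection never completes, which usually points at a proxy/firewall change or a server that has started blocking the client. On the requests side, the usual hardening is an explicit timeout, automatic retries, and a browser-like User-Agent. A sketch under those assumptions; the URL and header values are placeholders:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

# Retry a few times with exponential backoff before giving up.
retries = Retry(total=3, backoff_factor=1, status_forcelist=[502, 503, 504])
session.mount("http://", HTTPAdapter(max_retries=retries))
session.mount("https://", HTTPAdapter(max_retries=retries))

# (connect timeout, read timeout) in seconds; without a timeout a dead host hangs forever.
response = session.get("http://example.com/", timeout=(10, 30))
print(response.status_code)
```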

Asynchronous crawling F#

♀尐吖头ヾ submitted on 2019-12-10 11:37:40
Question: When crawling web pages I need to be careful not to make too many requests to the same domain; for example, I want to put 1 s between requests. From what I understand it is the time between requests that matters. So to speed things up I want to use async workflows in F#, the idea being to issue requests at 1-second intervals while avoiding blocking while waiting for the response. let getHtmlPrimitiveAsyncTimer (uri : System.Uri) (timer:int) = async{ let req = (WebRequest.Create
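The question itself is about F# async workflows; purely as an illustration of the same scheduling idea, start a new request every second without waiting for the previous response, here is a Python asyncio sketch (the URLs are placeholders and the downloader is deliberately simplistic):

```python
import asyncio
import urllib.request

async def fetch(url):
    # Run the blocking download in a worker thread so the event loop stays free.
    return await asyncio.to_thread(
        lambda: urllib.request.urlopen(url, timeout=30).read()
    )

async def crawl(urls, delay=1.0):
    tasks = []
    for url in urls:
        tasks.append(asyncio.create_task(fetch(url)))  # start the request now
        await asyncio.sleep(delay)                     # wait 1 s before starting the next
    return await asyncio.gather(*tasks)                # responses are collected as they finish

# asyncio.run(crawl(["http://example.com/", "http://example.org/"]))
```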

php crawler detection

大兔子大兔子 submitted on 2019-12-10 10:56:16
Question: I'm trying to write a sitemap.php which acts differently depending on who is looking. I want to redirect crawlers to my sitemap.xml, as that will be the most up-to-date page and will contain all the info they need, but I want my regular readers to be shown an HTML sitemap on the PHP page. This will all be controlled from within the PHP header, and I've found this code on the web which by the looks of it should work, but it isn't. Can anyone help crack this for me? function getIsCrawler($userAgent)
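The PHP function is cut off, but the underlying technique is plain User-Agent sniffing: check the request's User-Agent string against a list of known bot identifiers and branch on the result. A minimal Python sketch of that logic; the pattern list is illustrative, not exhaustive, and bots can always fake their User-Agent:

```python
import re

# Fragments that commonly appear in crawler User-Agent strings (illustrative only).
CRAWLER_RE = re.compile(r"googlebot|bingbot|slurp|duckduckbot|baiduspider|yandex", re.IGNORECASE)

def is_crawler(user_agent):
    """Return True if the User-Agent string looks like a known crawler."""
    return bool(user_agent) and CRAWLER_RE.search(user_agent) is not None

# In sitemap.php this result would drive the branch: redirect crawlers to
# sitemap.xml, render the HTML sitemap for everyone else.
print(is_crawler("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))  # True
print(is_crawler("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/115.0"))                   # False
```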

Syntax error, insert “… VariableDeclaratorId” to complete FormalParameterList

折月煮酒 submitted on 2019-12-10 10:06:31
Question: I am facing some issues with this code: import edu.uci.ics.crawler4j.crawler.CrawlConfig; import edu.uci.ics.crawler4j.crawler.CrawlController; import edu.uci.ics.crawler4j.fetcher.PageFetcher; import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig; import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer; public class Controller { String crawlStorageFolder = "/data/crawl/root"; int numberOfCrawlers = 7; CrawlConfig config = new CrawlConfig(); config.setCrawlStorageFolder(crawlStorageFolder); /