web-crawler

Scrapy - Spider crawls duplicate URLs

天涯浪子 submitted on 2019-12-31 02:42:07
Question: I'm crawling a search results page and scraping title and link information from it. Since it is a search page, I also have links to the next pages, which I have allowed in the SgmlLinkExtractor. The problem is this: on page 1, the spider finds the links to page 2 and page 3 and crawls them perfectly. But when it crawls page 2, that page again has links to page 1 (the previous page) and page 3 (the next page). So it crawls page 1 again with page 2 as the referrer, and so on…
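Scrapy's built-in duplicate filter normally handles exactly this back-link pattern: requests are fingerprinted by URL and dropped if already seen, unless dont_filter=True is set. A minimal sketch of a pagination-following spider, assuming a hypothetical search URL and using the modern LinkExtractor (SgmlLinkExtractor has long been deprecated):

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class SearchSpider(CrawlSpider):
        name = "search"
        start_urls = ["https://example.com/search?page=1"]  # hypothetical URL

        rules = (
            # Follow pagination links; Scrapy's default dupefilter drops
            # requests it has already seen, so page 2's link back to
            # page 1 is not re-crawled.
            Rule(LinkExtractor(allow=r"search\?page=\d+"),
                 callback="parse_item", follow=True),
        )

        def parse_item(self, response):
            yield {"title": response.css("h1::text").get(),
                   "url": response.url}

If pages are still revisited, the usual culprit is a URL that differs slightly on each visit (extra query parameters, session IDs), which changes the request fingerprint.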

Is an iframe crawled by Google?

試著忘記壹切 submitted on 2019-12-31 00:36:34
Question: I have an iframe whose source comes from a servlet response; will the content of the iframe be crawled? Answer 1: Google does crawl framed content now; it is just not clear yet how much link equity is passed through the links. http://www.seroundtable.com/google-iframe-link-14558.html http://www.rimmkaufman.com/blog/do-search-engines-follow-links-in-iframes/31012012/ What Google surely still does not do is associate the framed content with the parent page, so your PageRank will not be influenced. Answer 2: No, I…

Is it legal to crawl Amazon? [closed]

混江龙づ霸主 submitted on 2019-12-30 18:25:13
Question: Closed. This question is off-topic and is not currently accepting answers. Closed 7 years ago. I want to get specific information from Amazon, such as product name and description. Is it legal to crawl Amazon? Or does Amazon provide any API, paid or unpaid, for getting its data? Answer 1: Amazon's "Product Advertising API" allows this. You should read the license agreement closely, as it is highly restrictive…

Web scraping across multiple pages without even knowing the last page number

試著忘記壹切 submitted on 2019-12-30 11:53:15
Question: Running my code against a site to crawl the titles of different tutorials spread across several pages, I found it working flawlessly. I then tried to write code that does not depend on the last page number in the URL but instead on the status code, stopping once it shows http.status<>200. The code I'm pasting below works impeccably in this case. However, trouble comes up when I try another URL to see whether the loop breaks automatically: the code fetches all the results but does not break. What…
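A minimal sketch of the status-code approach described above (not the asker's code; the URL pattern is hypothetical). Many sites keep returning 200 with an empty result list past the last page, which would explain a loop that never breaks; stopping on an empty page is a useful second guard:

    import requests

    base = "https://example.com/tutorials?page={}"  # hypothetical pattern
    page = 1
    while True:
        resp = requests.get(base.format(page))
        if resp.status_code != 200:
            break  # server signalled the end of the pagination
        # Guard against sites that answer 200 forever: also stop when the
        # page yields no new items (parsing helper omitted here).
        # if not extract_titles(resp.text): break
        page += 1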

Mysterious Rails error with almost no trace

你离开我真会死。 submitted on 2019-12-30 08:30:09
Question: We're having a strange problem with one crawler. Occasionally it throws a Rails FATAL error on some request, but the trace is very limited and looks something like this: [2014-07-01 18:16:37] FATAL Rails : ArgumentError (invalid %-encoding (c ^ FK+ 9u$_ t Kl ΥE! =k \ ̕* ߚ>c+<O یo ʘ> C R! 2 D (5 x q#!` 4 p |8 I E :+ H^9`^ # Vo{ > =[z )): lib/locale_middleware.rb:14:in `call' The crawler's user-agent is Mozilla/5.0 (compatible; EasouSpider; +http://www.easou.com/search/spider.html). We can ask…

How to totally ignore the 'debugger' statement in Chrome?

时光总嘲笑我的痴心妄想 submitted on 2019-12-30 08:13:18
Question: 'Never pause here' does not work after I continue: execution is still paused. Answer 1: To totally ignore all breakpoints in Chrome, do the following:
1. Open the Chrome browser.
2. Press F12 (Inspect) or right-click the page and select Inspect.
3. In the Sources panel, press Ctrl+F8 to deactivate all breakpoints (alternatively, click the deactivate-breakpoints button in the top-right corner).
All breakpoints and debugger statements will then be deactivated. Source: https://stackoverflow.com/questions/45767855/how-to-totally-ignore

Yield multiple items using Scrapy

走远了吗. submitted on 2019-12-30 07:22:08
Question: I'm scraping data from the following URL: http://www.indexmundi.com/commodities/?commodity=gasoline There are two sections that contain a price: "Gulf Coast Gasoline Futures End of Day Settlement Price" and "Gasoline Daily Price". I want to scrape the data from both sections as two different items. Here is the code I've written:

    if dailyPrice:
        item['description'] = u''.join(dailyPrice.xpath(".//h1/text()").extract())
        item['price'] = u''.join(dailyPrice.xpath(".//span/text()").extract())
        item[…
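A minimal sketch of the usual fix: build and yield a separate item for each price section inside the same callback, instead of reusing one item. The selectors below are hypothetical stand-ins, not the page's real markup:

    import scrapy

    class PriceSpider(scrapy.Spider):
        name = "indexmundi"
        start_urls = [
            "http://www.indexmundi.com/commodities/?commodity=gasoline"
        ]

        def parse(self, response):
            # One iteration per price section; each yields its own item.
            for section in response.xpath("//div[h1 and span]"):
                yield {
                    "description": section.xpath(".//h1/text()").get(),
                    "price": section.xpath(".//span/text()").get(),
                }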

Running multiple spiders in Scrapy

霸气de小男生 submitted on 2019-12-30 06:21:49
Question: Suppose that in Scrapy I have two URLs that contain different HTML. I want to write two individual spiders, one for each, and run both spiders at once. Is it possible in Scrapy to run multiple spiders at once? And after writing multiple spiders, how can I schedule them to run every 6 hours (like cron jobs)? I have no idea how to do the above; can you suggest how, with an example? Thanks in advance. Answer 1: It would probably be easiest to just run…
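A minimal sketch of one common approach (an assumption about where the truncated answer was heading): Scrapy's CrawlerProcess can run several spiders in one process, and the 6-hour schedule is usually left to cron rather than to Scrapy itself. The spider classes and module paths here are hypothetical:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    from myproject.spiders.site_a import SiteASpider  # hypothetical
    from myproject.spiders.site_b import SiteBSpider  # hypothetical

    process = CrawlerProcess(get_project_settings())
    process.crawl(SiteASpider)
    process.crawl(SiteBSpider)
    process.start()  # blocks until both spiders finish

For the scheduling half, a crontab entry such as 0 */6 * * * can invoke the script every six hours.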

Is the User-Agent line in robots.txt an exact match or a substring match?

瘦欲@ submitted on 2019-12-29 08:49:14
Question: When a crawler reads the User-Agent line of a robots.txt file, does it try to match the value exactly against its own user-agent, or as a substring of its user-agent? Nothing I have read answers this question explicitly. According to another StackOverflow thread it is an exact match. However, the RFC draft makes me believe it is a substring match. For example, User-Agent: Google would match "Googlebot" and "Googlebot-News". Here is the relevant quotation from…
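A minimal sketch of the substring interpretation the question describes (an illustration of the RFC-draft reading, not a claim about how any particular crawler behaves):

    def ua_line_matches(robots_token: str, crawler_name: str) -> bool:
        # '*' matches any crawler; otherwise match case-insensitively
        # as a substring, so 'Google' matches 'Googlebot'.
        if robots_token == "*":
            return True
        return robots_token.lower() in crawler_name.lower()

    assert ua_line_matches("Google", "Googlebot")
    assert ua_line_matches("Google", "Googlebot-News")
    assert not ua_line_matches("Bing", "Googlebot")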

Find text inside a JavaScript tag using PHP Simple HTML DOM Parser

血红的双手。 submitted on 2019-12-29 08:23:09
Question: I'm trying to find a piece of text that changes regularly inside a JavaScript tag:

    <script type="text/javascript">
    jwplayer("mediaplayer").setup({
        flashplayer: "player.swf",
        file: "filename",
        provider: "rtmp",
        streamer: "rtmp://192.168.1.1/file?wmsAuthSign=RANDOM-114-Character==",
        height: 500,
        width: 500,
    });
    </script>

How can I get RANDOM-114-Character (or the full value of the 'streamer' flashvar) using PHP Simple HTML DOM Parser? I just have no idea how to do this. Answer 1: You can do it with a regular expression, e.g. the following sketch (the exact pattern is an assumption):

    // Capture everything between the quotes after 'streamer:'.
    preg_match('/streamer\s*:\s*"([^"]+)"/', $html, $matches);
    // $matches[1] now holds the full streamer value, e.g.
    // rtmp://192.168.1.1/file?wmsAuthSign=RANDOM-114-Character==