web-crawler

org.jsoup.HttpStatusException: HTTP error fetching URL. Status=503 (google scholar ban?)

别等时光非礼了梦想. Submitted on 2020-01-02 13:56:13
Question: I am working on a crawler and have to extract data from 200-300 links on Google Scholar. I have a working parser that gets data from the pages (each page shows 1-10 people profiles as the result of my query; I extract the proper links, go to the next page, and repeat). During a run of my program I hit the error above: org.jsoup.HttpStatusException: HTTP error fetching URL. Status=503, URL=https://ipv4.google.com/sorry/IndexRedirect?continue=https://scholar.google.pl/citations%3Fmauthors
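The 503 together with the ipv4.google.com/sorry redirect is Google Scholar's rate-limiting page, so the practical fix is to slow the crawler down and retry with backoff rather than to change the Jsoup parsing. A minimal sketch of that throttling pattern, written in Python for brevity (the User-Agent string, delays, and retry counts are illustrative assumptions, not part of the original Jsoup code):

import random
import time

import requests

def fetch_with_backoff(url, max_retries=5, base_delay=10):
    """Fetch a page, backing off when the server answers 429/503."""
    headers = {"User-Agent": "Mozilla/5.0 (compatible; my-crawler)"}  # placeholder UA
    for attempt in range(max_retries):
        resp = requests.get(url, headers=headers, timeout=30)
        if resp.status_code not in (429, 503):
            resp.raise_for_status()
            return resp.text
        # Exponential backoff with jitter before retrying a blocked request.
        time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 3))
    raise RuntimeError("still blocked after %d attempts: %s" % (max_retries, url))

# Crawl the 200-300 profile links slowly enough to avoid the /sorry/ page.
# for link in profile_links:
#     html = fetch_with_backoff(link)
#     time.sleep(random.uniform(5, 15))  # pause between profiles

The same idea carries over to Jsoup directly: catch HttpStatusException, sleep, and retry with an increasing delay.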

Avoid bad requests due to relative urls

淺唱寂寞╮ Submitted on 2020-01-02 08:52:28
Question: I am trying to crawl a website using Scrapy, and the URLs of every page I want to scrape are all written using a relative path of this kind: <!-- on page https://www.domain-name.com/en/somelist.html (no <base> in the <head>) --> <a href="../../en/item-to-scrap.html">Link</a> Now, in my browser, these links work, and you get to URLs like https://www.domain-name.com/en/item-to-scrap.html (despite the relative path going back up twice in the hierarchy instead of once). But my CrawlSpider does not
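Browsers resolve ../../en/item-to-scrap.html against the page URL (clamping at the site root, which is why going up twice still lands on /en/), and Scrapy can do the same with response.urljoin or response.follow. A short sketch assuming a plain Spider; the spider name, start URL, and selectors are placeholders:

import scrapy

class ItemSpider(scrapy.Spider):
    name = "items"
    start_urls = ["https://www.domain-name.com/en/somelist.html"]

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            # urljoin resolves "../../en/item-to-scrap.html" against response.url
            # exactly the way the browser does, clamping at the domain root.
            yield scrapy.Request(response.urljoin(href), callback=self.parse_item)
            # equivalent shortcut: yield response.follow(href, callback=self.parse_item)

    def parse_item(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}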

How to get hover (AJAX) data with a PHP crawler

两盒软妹~` Submitted on 2020-01-02 08:30:10
Question: I am crawling one website's data. I am able to get the whole content of a page, but some data on the page appears only after hovering over certain icons and is shown as tooltips, and I need that data as well. Is this possible with any crawler? I am using PHP and simplehtmldom for parsing/crawling the page. Answer 1: Hover data can't be obtained by ordinary crawlers. Crawlers fetch the web page and get its whole data (the HTML page source), which is the view you get as soon as you hit the URL. Hover needs a mouse-movement action over an HTML attribute on
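Since simplehtmldom only ever sees the static HTML source, the usual workaround for tooltip text that arrives via AJAX is to open the browser's developer tools, watch the Network tab while hovering, and then request that AJAX endpoint directly from the crawler. A sketch of the idea in Python (the endpoint URL, parameter name, and JSON shape are all hypothetical; the same request can be made from PHP with cURL):

import requests

# Hypothetical endpoint: find the real one in the Network tab while hovering.
TOOLTIP_ENDPOINT = "https://example.com/ajax/tooltip"

def fetch_tooltip(item_id):
    """Ask the tooltip endpoint directly instead of simulating the hover."""
    resp = requests.get(
        TOOLTIP_ENDPOINT,
        params={"id": item_id},
        headers={"X-Requested-With": "XMLHttpRequest"},  # many endpoints expect this
        timeout=15,
    )
    resp.raise_for_status()
    return resp.json()

print(fetch_tooltip("12345"))

If the tooltip text is already embedded in the page (for example in a title or data-* attribute), no extra request is needed; it can be read straight out of the parsed HTML.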

Is there a way to download part of a webpage, rather than the whole HTML body, programmatically?

半腔热情 Submitted on 2020-01-02 08:24:24
Question: We only want a particular element from the HTML document at nytimes.com/technology. This page contains many articles, but we only want the article's title, which is in a . If we use wget, cURL, or any other tool, or some package like requests in Python, the whole HTML document is returned. Can we limit the returned data to a specific element, such as the 's? Answer 1: The HTTP protocol knows nothing about HTML or DOM. Using HTTP you can fetch partial documents from supporting web servers using the
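HTTP itself only supports byte-range requests (and only on servers that honour them); it has no notion of "give me this element", so the element still has to be extracted client-side after fetching the page or its first N bytes. A sketch of both halves in Python, where the byte window and the CSS selectors are arbitrary assumptions and the URL is the one given in the question:

import requests
from bs4 import BeautifulSoup

url = "https://www.nytimes.com/technology"

# Byte-range request: only useful if the server advertises Accept-Ranges;
# 206 means partial content was served, 200 means the Range header was ignored.
resp = requests.get(url, headers={"Range": "bytes=0-20000"}, timeout=30)
print(resp.status_code)

# Whatever came back, the wanted element is still extracted locally.
soup = BeautifulSoup(resp.text, "html.parser")
titles = [h.get_text(strip=True) for h in soup.select("h2, h3")]
print(titles[:10])

If the site exposes an API or an RSS feed, that is usually a far smaller download than any HTML-based approach.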

Find all the web pages in a domain and its subdomains

浪尽此生 Submitted on 2020-01-02 08:04:09
Question: I am looking for a way to find all the web pages and subdomains in a domain. For example, in the uoregon.edu domain, I would like to find all the web pages in this domain and in all its subdomains (e.g., cs.uoregon.edu). I have been looking at Nutch, and I think it can do the job. But it seems that Nutch downloads entire web pages and indexes them for later search, whereas I want a crawler that only scans a web page for URLs that belong to the same domain. Furthermore, it seems that Nutch
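A crawler that only collects URLs can be kept very small: follow links whose host ends with the target domain (which also covers subdomains like cs.uoregon.edu) and record them without indexing any content. A breadth-first sketch in Python; the start URL, page limit, and lack of politeness delays are simplifications:

from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

DOMAIN = "uoregon.edu"
START = "https://www.uoregon.edu/"

def in_domain(url):
    host = urlparse(url).hostname or ""
    return host == DOMAIN or host.endswith("." + DOMAIN)  # accepts cs.uoregon.edu etc.

def crawl(start, limit=500):
    """Breadth-first crawl that records same-domain URLs instead of indexing pages."""
    seen, queue = {start}, deque([start])
    while queue and len(seen) < limit:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if in_domain(link) and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

print(len(crawl(START)))

A real run should also respect robots.txt and pause between requests; Scrapy's allowed_domains attribute gives the same domain restriction out of the box.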

Running multiple spiders using scrapyd

会有一股神秘感。 Submitted on 2020-01-02 07:24:07
Question: I had multiple spiders in my project, so I decided to run them by uploading the project to a scrapyd server. I uploaded my project successfully, and I can see all the spiders when I run the command curl http://localhost:6800/listspiders.json?project=myproject When I run the following command curl http://localhost:6800/schedule.json -d project=myproject -d spider=spider2 only one spider runs, because only one spider is given, but I want to run multiple spiders here, so the following command is right for
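scrapyd's schedule.json endpoint starts one spider per call, so the usual approach is simply to call it once per spider, either from a shell loop or from a small script. A sketch in Python that schedules every spider listed for the project:

import requests

SCRAPYD = "http://localhost:6800"
PROJECT = "myproject"

# Fetch the spider names from scrapyd instead of hard-coding them.
spiders = requests.get(
    SCRAPYD + "/listspiders.json", params={"project": PROJECT}
).json()["spiders"]

for name in spiders:
    # One schedule.json call per spider, mirroring "curl ... -d spider=<name>".
    resp = requests.post(
        SCRAPYD + "/schedule.json", data={"project": PROJECT, "spider": name}
    )
    print(name, resp.json())  # e.g. {"status": "ok", "jobid": "..."}

scrapyd queues the jobs and runs them according to its max_proc settings, so scheduling them all at once is fine.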

How to set Robots.txt or Apache to allow crawlers only at certain hours?

浪子不回头ぞ Submitted on 2020-01-02 05:05:10
Question: As traffic is distributed unevenly over 24 hours, I would like to disallow crawlers during peak hours and allow them during non-busy hours. Is there a method to achieve this? Edit: thanks for all the good advice. This is another solution we found: 2bits.com has an article on setting up an IPTables firewall to limit the number of connections from certain IP addresses. The article describes the IPTables setting: Using connlimit. In newer Linux kernels, there is a connlimit module for iptables. It can be used
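Besides the iptables connlimit approach quoted above, a low-tech alternative is a cron job that swaps robots.txt between a permissive and a Disallow-all version depending on the hour; since crawlers cache robots.txt and fetch it on their own schedule, this is only best-effort. A sketch in Python where the paths and the peak window are assumptions:

#!/usr/bin/env python3
"""Swap robots.txt by time of day; intended to be run hourly from cron."""
import shutil
from datetime import datetime

DOCROOT = "/var/www/html"        # adjust to the real document root
PEAK_HOURS = range(8, 20)        # 08:00-19:59 treated as peak; pick your own window

ALLOW = DOCROOT + "/robots.allow.txt"   # e.g. "User-agent: *\nDisallow:"
BLOCK = DOCROOT + "/robots.block.txt"   # e.g. "User-agent: *\nDisallow: /"

src = BLOCK if datetime.now().hour in PEAK_HOURS else ALLOW
shutil.copyfile(src, DOCROOT + "/robots.txt")

Only the firewall approach actually enforces the limit; robots.txt and crawl-delay are requests that well-behaved bots may or may not honour promptly.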

Scrapy - Crawl and Scrape a website

让人想犯罪 __ Submitted on 2020-01-01 18:54:15
Question: As a part of learning to use Scrapy, I have tried to crawl Amazon, and there is a problem while scraping data. The output of my code is as follows: 2013-02-25 12:47:21+0530 [scanon] DEBUG: Scraped from <200 http://www.amazon.com/s/ref=sr_pg_2?ie=UTF8&page=2&qid=1361774681&rh=n%3A283155> {'link': [u'http://www.amazon.com/ObamaCare-Survival-Guide-Nick-Tate/dp/0893348627/ref=sr_1_13?s=books&ie=UTF8&qid=1361774694&sr=1-13', u'http://www.amazon.com/MELT-Method-Breakthrough-Self-Treatment-Eliminate
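The truncated log suggests the spider yields a single item whose link field is a list of every matching URL on the page; a common fix is to loop over each search result and yield one item per product. A sketch under that assumption (the CSS selectors are guesses, since Amazon's markup changes frequently and the original spider code isn't shown):

import scrapy

class ScanonSpider(scrapy.Spider):
    name = "scanon"
    start_urls = [
        "http://www.amazon.com/s/ref=sr_pg_2?ie=UTF8&page=2&qid=1361774681&rh=n%3A283155"
    ]

    def parse(self, response):
        # One item per result row, instead of one item holding a list of every link.
        for result in response.css("div.s-result-item"):
            yield {
                "title": result.css("h2 ::text").get(),
                "link": result.css("a::attr(href)").get(),
            }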

Android GUI crawler

帅比萌擦擦* Submitted on 2020-01-01 17:57:55
Question: Does anyone know a good tool for crawling the GUI of an Android app? I found this but couldn't figure out how to run it... Answer 1: Personally, I don't think it would be too hard to make a simple GUI crawler using MonkeyRunner and AndroidViewClient. You may also want to look into uiautomator and UI Testing. Good is a relative term. I have not used Robotium, but it is mentioned in these circles a lot. EDIT: Added an example based on a comment request. Using MonkeyRunner and AndroidViewClient you can make
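Both MonkeyRunner and AndroidViewClient are scripted in Python, and a very small GUI crawler just dumps the current view hierarchy, touches something, and repeats. A sketch based on AndroidViewClient's documented usage; treat the exact calls and the view being touched as assumptions and check the library's own examples:

#! /usr/bin/env python
# Requires a device/emulator reachable over adb and the AndroidViewClient package.
from com.dtmilano.android.viewclient import ViewClient

device, serialno = ViewClient.connectToDeviceOrExit()
vc = ViewClient(device, serialno)

# Dump the current screen's view hierarchy: the starting point for a GUI crawl.
for view in vc.dump():
    print(view.getClass(), view.getId(), view.getText())

# Touching a view and dumping again is how the crawl advances screen by screen, e.g.:
# vc.findViewWithText("OK").touch()
# vc.dump()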

How to control the order of yield in Scrapy

牧云@^-^@ Submitted on 2020-01-01 12:01:37
Question: Help! I'm reading the following Scrapy code and the crawler's result. I want to crawl some data from http://china.fathom.info/data/data.json, and only Scrapy is allowed. But I don't know how to control the order of yield. I expect to process all the parse_member requests in the loop and then return the group_item, but it seems that yield item is always executed before yield request. start_urls = [ "http://china.fathom.info/data/data.json" ] def parse(self, response): groups = json.loads(response.body)
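Scrapy schedules requests asynchronously, so the order in which things are yielded inside parse is not the order in which they are processed; the usual pattern is not to yield the group item from parse at all, but to pass it along in request.meta and yield it only from the last member callback. A sketch of that chaining pattern; the JSON field names and member URLs are assumptions, not taken from the original data:

import json
import scrapy

class FathomSpider(scrapy.Spider):
    name = "fathom"
    start_urls = ["http://china.fathom.info/data/data.json"]

    def parse(self, response):
        data = json.loads(response.text)
        for group in data.get("groups", []):            # field names are assumed
            group_item = {"group": group.get("name"), "members": []}
            member_urls = [m["url"] for m in group.get("members", [])]
            if not member_urls:
                yield group_item
                continue
            # Do not yield group_item here; hand it to the first member request.
            yield scrapy.Request(
                member_urls[0],
                callback=self.parse_member,
                meta={"item": group_item, "pending": member_urls[1:]},
            )

    def parse_member(self, response):
        item = response.meta["item"]
        item["members"].append(response.css("title::text").get())
        pending = response.meta["pending"]
        if pending:
            # Chain the next member request, carrying the same item along.
            yield scrapy.Request(
                pending[0],
                callback=self.parse_member,
                meta={"item": item, "pending": pending[1:]},
            )
        else:
            # All member pages done; only now emit the completed group item.
            yield item

Request priorities (scrapy.Request(..., priority=n)) can also nudge ordering, but chaining through meta is the reliable way to guarantee the item is emitted last.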