web-crawler

org.jsoup.HttpStatusException: HTTP error fetching URL. Status=503 (google scholar ban?)

别等时光非礼了梦想. Submitted on 2020-01-02 13:56:13
Question: I am working on a crawler and have to extract data from 200-300 links on Google Scholar. I have a working parser that gets data from the pages (each page shows 1-10 people profiles as the result of my query; I extract the proper links, go to the next page, and repeat). During a run of my program I hit the error above: org.jsoup.HttpStatusException: HTTP error fetching URL. Status=503, URL=https://ipv4.google.com/sorry/IndexRedirect?continue=https://scholar.google.pl/citations%3Fmauthors
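The 503 together with the ipv4.google.com/sorry redirect is Google Scholar's rate-limiting page, so the practical fix is to slow the crawler down and retry with backoff rather than to change the Jsoup parsing. A minimal sketch of that throttling pattern, written in Python for brevity (the User-Agent string, delays, and retry counts are illustrative assumptions, not part of the original Jsoup code):

import random
import time

import requests

def fetch_with_backoff(url, max_retries=5, base_delay=10):
    """Fetch a page, backing off when the server answers 429/503."""
    headers = {"User-Agent": "Mozilla/5.0 (compatible; my-crawler)"}  # placeholder UA
    for attempt in range(max_retries):
        resp = requests.get(url, headers=headers, timeout=30)
        if resp.status_code not in (429, 503):
            resp.raise_for_status()
            return resp.text
        # Exponential backoff with jitter before retrying a blocked request.
        time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 3))
    raise RuntimeError("still blocked after %d attempts: %s" % (max_retries, url))

# Crawl the 200-300 profile links slowly enough to avoid the /sorry/ page.
# for link in profile_links:
#     html = fetch_with_backoff(link)
#     time.sleep(random.uniform(5, 15))  # pause between profiles

The same idea carries over to Jsoup directly: catch HttpStatusException, sleep, and retry with an increasing delay.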

Avoid bad requests due to relative urls

淺唱寂寞╮ Submitted on 2020-01-02 08:52:28
Question: I am trying to crawl a website using Scrapy, and the URLs of every page I want to scrape are all written using a relative path of this kind: <!-- on page https://www.domain-name.com/en/somelist.html (no <base> in the <head>) --> <a href="../../en/item-to-scrap.html">Link</a> Now, in my browser, these links work, and you get to URLs like https://www.domain-name.com/en/item-to-scrap.html (despite the relative path going back up twice in the hierarchy instead of once). But my CrawlSpider does not
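Browsers resolve ../../en/item-to-scrap.html against the page URL (clamping at the site root, which is why going up twice still lands on /en/), and Scrapy can do the same with response.urljoin or response.follow. A short sketch assuming a plain Spider; the spider name, start URL, and selectors are placeholders:

import scrapy

class ItemSpider(scrapy.Spider):
    name = "items"
    start_urls = ["https://www.domain-name.com/en/somelist.html"]

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            # urljoin resolves "../../en/item-to-scrap.html" against response.url
            # exactly the way the browser does, clamping at the domain root.
            yield scrapy.Request(response.urljoin(href), callback=self.parse_item)
            # equivalent shortcut: yield response.follow(href, callback=self.parse_item)

    def parse_item(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}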

How to get hover (AJAX) data with a PHP crawler

两盒软妹~` Submitted on 2020-01-02 08:30:10
Question: I am crawling one website's data. I am able to get the whole content of a page, but some data on the page appears only after hovering over certain icons and is shown as tooltips, and I need that data as well. Is this possible with any crawler? I am using PHP and simplehtmldom for parsing/crawling the page. Answer 1: Hover data can't be obtained by ordinary crawlers. Crawlers fetch the web page and get its whole data (the HTML page source), which is the view you get as soon as you hit the URL. Hover needs a mouse-movement action over an HTML attribute on
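Since simplehtmldom only ever sees the static HTML source, the usual workaround for tooltip text that arrives via AJAX is to open the browser's developer tools, watch the Network tab while hovering, and then request that AJAX endpoint directly from the crawler. A sketch of the idea in Python (the endpoint URL, parameter name, and JSON shape are all hypothetical; the same request can be made from PHP with cURL):

import requests

# Hypothetical endpoint: find the real one in the Network tab while hovering.
TOOLTIP_ENDPOINT = "https://example.com/ajax/tooltip"

def fetch_tooltip(item_id):
    """Ask the tooltip endpoint directly instead of simulating the hover."""
    resp = requests.get(
        TOOLTIP_ENDPOINT,
        params={"id": item_id},
        headers={"X-Requested-With": "XMLHttpRequest"},  # many endpoints expect this
        timeout=15,
    )
    resp.raise_for_status()
    return resp.json()

print(fetch_tooltip("12345"))

If the tooltip text is already embedded in the page (for example in a title or data-* attribute), no extra request is needed; it can be read straight out of the parsed HTML.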

Is there a way to download part of a webpage, rather than the whole HTML body, programmatically?

半腔热情 Submitted on 2020-01-02 08:24:24
Question: We only want a particular element from the HTML document at nytimes.com/technology. This page contains many articles, but we only want the article's title, which is in a . If we use wget, cURL, or any other tool, or some package like requests in Python, the whole HTML document is returned. Can we limit the returned data to a specific element, such as the 's? Answer 1: The HTTP protocol knows nothing about HTML or DOM. Using HTTP you can fetch partial documents from supporting web servers using the
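HTTP itself only supports byte-range requests (and only on servers that honour them); it has no notion of "give me this element", so the element still has to be extracted client-side after fetching the page or its first N bytes. A sketch of both halves in Python, where the byte window and the CSS selectors are arbitrary assumptions and the URL is the one given in the question:

import requests
from bs4 import BeautifulSoup

url = "https://www.nytimes.com/technology"

# Byte-range request: only useful if the server advertises Accept-Ranges;
# 206 means partial content was served, 200 means the Range header was ignored.
resp = requests.get(url, headers={"Range": "bytes=0-20000"}, timeout=30)
print(resp.status_code)

# Whatever came back, the wanted element is still extracted locally.
soup = BeautifulSoup(resp.text, "html.parser")
titles = [h.get_text(strip=True) for h in soup.select("h2, h3")]
print(titles[:10])

If the site exposes an API or an RSS feed, that is usually a far smaller download than any HTML-based approach.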

Find all the web pages in a domain and its subdomains

浪尽此生 Submitted on 2020-01-02 08:04:09
Question: I am looking for a way to find all the web pages and subdomains in a domain. For example, in the uoregon.edu domain, I would like to find all the web pages in this domain and in all its subdomains (e.g., cs.uoregon.edu). I have been looking at Nutch, and I think it can do the job. But it seems that Nutch downloads entire web pages and indexes them for later search, whereas I want a crawler that only scans a web page for URLs that belong to the same domain. Furthermore, it seems that Nutch
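A crawler that only collects URLs can be kept very small: follow links whose host ends with the target domain (which also covers subdomains like cs.uoregon.edu) and record them without indexing any content. A breadth-first sketch in Python; the start URL, page limit, and lack of politeness delays are simplifications:

from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

DOMAIN = "uoregon.edu"
START = "https://www.uoregon.edu/"

def in_domain(url):
    host = urlparse(url).hostname or ""
    return host == DOMAIN or host.endswith("." + DOMAIN)  # accepts cs.uoregon.edu etc.

def crawl(start, limit=500):
    """Breadth-first crawl that records same-domain URLs instead of indexing pages."""
    seen, queue = {start}, deque([start])
    while queue and len(seen) < limit:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if in_domain(link) and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

print(len(crawl(START)))

A real run should also respect robots.txt and pause between requests; Scrapy's allowed_domains attribute gives the same domain restriction out of the box.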

Running multiple spiders using scrapyd

会有一股神秘感。 Submitted on 2020-01-02 07:24:07
Question: I had multiple spiders in my project, so I decided to run them by uploading the project to a scrapyd server. I uploaded my project successfully, and I can see all the spiders when I run the command curl http://localhost:6800/listspiders.json?project=myproject When I run the following command curl http://localhost:6800/schedule.json -d project=myproject -d spider=spider2 only one spider runs, because only one spider is given, but I want to run multiple spiders here, so the following command is right for
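scrapyd's schedule.json endpoint starts one spider per call, so the usual approach is simply to call it once per spider, either from a shell loop or from a small script. A sketch in Python that schedules every spider listed for the project:

import requests

SCRAPYD = "http://localhost:6800"
PROJECT = "myproject"

# Fetch the spider names from scrapyd instead of hard-coding them.
spiders = requests.get(
    SCRAPYD + "/listspiders.json", params={"project": PROJECT}
).json()["spiders"]

for name in spiders:
    # One schedule.json call per spider, mirroring "curl ... -d spider=<name>".
    resp = requests.post(
        SCRAPYD + "/schedule.json", data={"project": PROJECT, "spider": name}
    )
    print(name, resp.json())  # e.g. {"status": "ok", "jobid": "..."}

scrapyd queues the jobs and runs them according to its max_proc settings, so scheduling them all at once is fine.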

How to set Robots.txt or Apache to allow crawlers only at certain hours?

浪子不回头ぞ Submitted on 2020-01-02 05:05:10
Question: As traffic is distributed unevenly over 24 hours, I would like to disallow crawlers during peak hours and allow them during non-busy hours. Is there a method to achieve this? Edit: thanks for all the good advice. This is another solution we found: 2bits.com has an article on setting up an IPTables firewall to limit the number of connections from certain IP addresses. The article describes the IPTables setting: Using connlimit. In newer Linux kernels, there is a connlimit module for iptables. It can be used
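Besides the iptables connlimit approach quoted above, a low-tech alternative is a cron job that swaps robots.txt between a permissive and a Disallow-all version depending on the hour; since crawlers cache robots.txt and fetch it on their own schedule, this is only best-effort. A sketch in Python where the paths and the peak window are assumptions:

#!/usr/bin/env python3
"""Swap robots.txt by time of day; intended to be run hourly from cron."""
import shutil
from datetime import datetime

DOCROOT = "/var/www/html"        # adjust to the real document root
PEAK_HOURS = range(8, 20)        # 08:00-19:59 treated as peak; pick your own window

ALLOW = DOCROOT + "/robots.allow.txt"   # e.g. "User-agent: *\nDisallow:"
BLOCK = DOCROOT + "/robots.block.txt"   # e.g. "User-agent: *\nDisallow: /"

src = BLOCK if datetime.now().hour in PEAK_HOURS else ALLOW
shutil.copyfile(src, DOCROOT + "/robots.txt")

Only the firewall approach actually enforces the limit; robots.txt and crawl-delay are requests that well-behaved bots may or may not honour promptly.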

Scrapy - Crawl and Scrape a website

让人想犯罪 __ Submitted on 2020-01-01 18:54:15
Question: As a part of learning to use Scrapy, I have tried to crawl Amazon, and there is a problem while scraping data. The output of my code is as follows: 2013-02-25 12:47:21+0530 [scanon] DEBUG: Scraped from <200 http://www.amazon.com/s/ref=sr_pg_2?ie=UTF8&page=2&qid=1361774681&rh=n%3A283155> {'link': [u'http://www.amazon.com/ObamaCare-Survival-Guide-Nick-Tate/dp/0893348627/ref=sr_1_13?s=books&ie=UTF8&qid=1361774694&sr=1-13', u'http://www.amazon.com/MELT-Method-Breakthrough-Self-Treatment-Eliminate
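The truncated log suggests the spider yields a single item whose link field is a list of every matching URL on the page; a common fix is to loop over each search result and yield one item per product. A sketch under that assumption (the CSS selectors are guesses, since Amazon's markup changes frequently and the original spider code isn't shown):

import scrapy

class ScanonSpider(scrapy.Spider):
    name = "scanon"
    start_urls = [
        "http://www.amazon.com/s/ref=sr_pg_2?ie=UTF8&page=2&qid=1361774681&rh=n%3A283155"
    ]

    def parse(self, response):
        # One item per result row, instead of one item holding a list of every link.
        for result in response.css("div.s-result-item"):
            yield {
                "title": result.css("h2 ::text").get(),
                "link": result.css("a::attr(href)").get(),
            }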

Android GUI crawler

帅比萌擦擦* Submitted on 2020-01-01 17:57:55
Question: Does anyone know a good tool for crawling the GUI of an Android app? I found this but couldn't figure out how to run it... Answer 1: Personally, I don't think it would be too hard to make a simple GUI crawler using MonkeyRunner and AndroidViewClient. You may also want to look into uiautomator and UI Testing. Good is a relative term. I have not used Robotium, but it is mentioned in these circles a lot. EDIT: Added an example based on a comment request. Using MonkeyRunner and AndroidViewClient you can make
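Both MonkeyRunner and AndroidViewClient are scripted in Python, and a very small GUI crawler just dumps the current view hierarchy, touches something, and repeats. A sketch based on AndroidViewClient's documented usage; treat the exact calls and the view being touched as assumptions and check the library's own examples:

#! /usr/bin/env python
# Requires a device/emulator reachable over adb and the AndroidViewClient package.
from com.dtmilano.android.viewclient import ViewClient

device, serialno = ViewClient.connectToDeviceOrExit()
vc = ViewClient(device, serialno)

# Dump the current screen's view hierarchy: the starting point for a GUI crawl.
for view in vc.dump():
    print(view.getClass(), view.getId(), view.getText())

# Touching a view and dumping again is how the crawl advances screen by screen, e.g.:
# vc.findViewWithText("OK").touch()
# vc.dump()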

How to control the order of yield in Scrapy

牧云@^-^@ Submitted on 2020-01-01 12:01:37
Question: Help! I'm reading the following Scrapy code and the crawler's result. I want to crawl some data from http://china.fathom.info/data/data.json, and only Scrapy is allowed. But I don't know how to control the order of yield. I expect to process all the parse_member requests in the loop and then return the group_item, but it seems that yield item is always executed before yield request. start_urls = [ "http://china.fathom.info/data/data.json" ] def parse(self, response): groups = json.loads(response.body)
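Scrapy schedules requests asynchronously, so the order in which things are yielded inside parse is not the order in which they are processed; the usual pattern is not to yield the group item from parse at all, but to pass it along in request.meta and yield it only from the last member callback. A sketch of that chaining pattern; the JSON field names and member URLs are assumptions, not taken from the original data:

import json
import scrapy

class FathomSpider(scrapy.Spider):
    name = "fathom"
    start_urls = ["http://china.fathom.info/data/data.json"]

    def parse(self, response):
        data = json.loads(response.text)
        for group in data.get("groups", []):            # field names are assumed
            group_item = {"group": group.get("name"), "members": []}
            member_urls = [m["url"] for m in group.get("members", [])]
            if not member_urls:
                yield group_item
                continue
            # Do not yield group_item here; hand it to the first member request.
            yield scrapy.Request(
                member_urls[0],
                callback=self.parse_member,
                meta={"item": group_item, "pending": member_urls[1:]},
            )

    def parse_member(self, response):
        item = response.meta["item"]
        item["members"].append(response.css("title::text").get())
        pending = response.meta["pending"]
        if pending:
            # Chain the next member request, carrying the same item along.
            yield scrapy.Request(
                pending[0],
                callback=self.parse_member,
                meta={"item": item, "pending": pending[1:]},
            )
        else:
            # All member pages done; only now emit the completed group item.
            yield item

Request priorities (scrapy.Request(..., priority=n)) can also nudge ordering, but chaining through meta is the reliable way to guarantee the item is emitted last.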