web-crawler

How to use a two-level proxy setting in Python?

荒凉一梦 submitted on 2019-11-30 21:13:39
I am working on a web crawler [using Python]. The situation is, for example, that I am behind server-1 and I use a proxy setting to connect to the outside world. So in Python, using a proxy handler, I can fetch the URLs. The thing is, I am building a crawler, so I cannot use only one IP [otherwise I will be blocked]. To solve this, I have a bunch of proxies I want to shuffle through. My question is: this is a two-level proxy setup. To connect to the main server-1 I use one proxy, and after that I want to shuffle through the other proxies. How can I achieve this? Update Sounds like you're looking to connect to proxy A
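A minimal sketch of the proxy-rotation half of the problem, assuming Python 3's urllib and a hypothetical pool of proxy addresses; chaining through the first corporate proxy as well (true two-level proxying) is not something urllib does on its own and usually needs an external chaining tool.

import random
import urllib.request

# Hypothetical proxy pool; replace with your own working proxy addresses.
proxies = ["http://10.0.0.1:8080", "http://10.0.0.2:8080", "http://10.0.0.3:8080"]

def fetch(url):
    proxy = random.choice(proxies)   # pick a (possibly different) proxy per request
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    return opener.open(url, timeout=10).read()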

How to crawl with PHP Goutte and Guzzle if data is loaded by JavaScript?

余生颓废 submitted on 2019-11-30 20:35:00
Many times when crawling we run into problems where content rendered on the page is generated with JavaScript, and therefore scrapy is unable to crawl it (e.g. AJAX requests, jQuery). You want to have a look at phantomjs. There is this PHP implementation: http://jonnnnyw.github.io/php-phantomjs/ if you need it working with PHP, of course. You could read the page and then feed the contents to Guzzle, in order to use the nice functions that Guzzle gives you (like searching for contents, etc...). That would depend on your needs; maybe you can simply use the DOM, like this: How to get

Nutch API advice

不羁的心 submitted on 2019-11-30 20:33:53
I'm working on a project where I need a mature crawler to do some work, and I'm evaluating Nutch for this purpose. My current needs are relatively straightforward: I need a crawler that is able to save the data to disk, and I need it to be able to recrawl only the updated resources of a site and skip the parts that are already crawled. Does anyone have any experience working with the Nutch code directly in Java, not via the command line? I would like to start simple: create a crawler (or similar), minimally configure it and start it, nothing fancy. Is there some example for this, or some

Web scraper for dynamic forms in Python

我的未来我决定 submitted on 2019-11-30 19:22:44
Question: I am trying to fill in the form on this website http://www.marutisuzuki.com/Maruti-Price.aspx. It consists of three drop-down lists. One is the model of the car, the second is the state and the third is the city. The first two are static, and the third, the city, is generated dynamically depending on the value of the state; there is an onclick JavaScript event running which gets the values of the corresponding cities in a state. I am familiar with the mechanize module in Python. I came across several links telling me that I
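A minimal sketch of the mechanize side, with hypothetical form and control names; mechanize does not execute JavaScript, so the dynamically populated city list would still have to be obtained by replaying whatever request the onclick handler makes.

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)
br.open("http://www.marutisuzuki.com/Maruti-Price.aspx")

br.select_form(nr=0)          # assumption: the price form is the first form on the page
br["ddlModel"] = ["Alto"]     # control names and values here are hypothetical
br["ddlState"] = ["Delhi"]
response = br.submit()        # the city drop-down will still be empty without JavaScript
print(response.read()[:200])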

How to scroll down in Python Selenium step by step

a 夏天 submitted on 2019-11-30 19:02:00
Question: Hi guys, I am new to Selenium and Python. I was scraping the pagalguy website. I know how to scroll down to the bottom of the page, but what I need is to scroll down step by step so that Selenium will click all the Read More buttons. I don't know how to scroll down step by step like that, so I hard coded it like the following one: browser.execute_script("window.scrollTo(0,300);") browser.find_element_by_link_text("Read More...").click() browser.execute_script("window.scrollTo(300,600)
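A minimal sketch of scrolling in fixed steps and clicking whatever "Read More..." links are currently present; the step size, the waits and the pre-Selenium-4 locator calls are assumptions.

import time
from selenium import webdriver

browser = webdriver.Firefox()
browser.get("https://www.pagalguy.com/")     # site from the question

last_offset = -1
while True:
    # Click every "Read More..." link currently in the DOM.
    for link in browser.find_elements_by_link_text("Read More..."):
        try:
            link.click()
        except Exception:
            pass                             # link may be stale or covered by another element
    browser.execute_script("window.scrollBy(0, 300);")   # one small step down
    time.sleep(1)                            # give newly loaded content time to appear
    offset = browser.execute_script("return window.pageYOffset;")
    if offset == last_offset:                # no further scrolling possible, stop
        break
    last_offset = offset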

Twisted errors in Scrapy spider

不问归期 submitted on 2019-11-30 16:28:21
When I run the spider from the Scrapy tutorial I get these error messages: File "C:\Python26\lib\site-packages\twisted\internet\base.py", line 374, in fireEvent DeferredList(beforeResults).addCallback(self._continueFiring) File "C:\Python26\lib\site-packages\twisted\internet\defer.py", line 195, in addCallback callbackKeywords=kw) File "C:\Python26\lib\site-packages\twisted\internet\defer.py", line 186, in addCallbacks self._runCallbacks() File "C:\Python26\lib\site-packages\twisted\internet\defer.py", line 328, in _runCallbacks self.result = callback(self.result, *args, **kw) --- <exception

Calling Scrapy from a Python script does not create a JSON output file

浪尽此生 submitted on 2019-11-30 15:58:56
Here's the Python script that I am using to call Scrapy, taken from the answer to "Scrapy crawl from script always blocks script execution after scraping": def stop_reactor(): reactor.stop() dispatcher.connect(stop_reactor, signal=signals.spider_closed) spider = MySpider(start_url='abc') crawler = Crawler(Settings()) crawler.configure() crawler.crawl(spider) crawler.start() log.start() log.msg('Running reactor...') reactor.run() # the script will block here until the spider is closed log.msg('Reactor stopped.') Here's my pipelines.py code: from scrapy import log,signals from scrapy.contrib.exporter import
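A minimal sketch using the newer CrawlerProcess API rather than the manual Crawler/reactor setup in the excerpt; one common reason no JSON file appears is that the project settings (pipelines and feed exporters) were never loaded, which get_project_settings() handles. The FEEDS output path and the spider import are assumptions, and the FEEDS setting applies to recent Scrapy versions.

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders import MySpider       # hypothetical module path for the spider in the excerpt

settings = get_project_settings()            # loads the project's pipelines / feed exporters
settings.set("FEEDS", {"items.json": {"format": "json"}})  # hypothetical output file

process = CrawlerProcess(settings)
process.crawl(MySpider, start_url="abc")     # spider arguments as in the excerpt
process.start()                              # blocks until the spider is closed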

The order of Scrapy crawling URLs with a long start_urls list and URLs yielded from the spider

ε祈祈猫儿з submitted on 2019-11-30 15:35:47
Question: Help! Reading the source code of Scrapy is not easy for me. I have a very long start_urls list; it is about 3,000,000 URLs in a file. So I make the start_urls like this: start_urls = read_urls_from_file(u"XXXX") def read_urls_from_file(file_path): with codecs.open(file_path, u"r", encoding=u"GB18030") as f: for line in f: try: url = line.strip() yield url except: print u"read line:%s from file failed!" % line continue print u"file read finish!" Meanwhile, my spider's callback functions are like
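A minimal sketch that yields requests lazily from the file via start_requests() instead of materialising three million start_urls in memory; note that the order in which Scrapy actually fetches them is governed by its scheduler (roughly last-in-first-out by default), not by the order they are yielded. The spider name is an assumption; the file name and encoding follow the excerpt.

import codecs
import scrapy

class MySpider(scrapy.Spider):
    name = "big_list_spider"                 # hypothetical spider name

    def start_requests(self):
        # Stream the file instead of loading 3,000,000 URLs at once.
        with codecs.open(u"XXXX", u"r", encoding=u"GB18030") as f:
            for line in f:
                url = line.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        pass                                 # the callbacks from the question would go here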

Get the proxy IP address Scrapy is using to crawl

大城市里の小女人 submitted on 2019-11-30 15:31:52
I use Tor to crawl web pages. I started the tor and polipo services and added class ProxyMiddleware(object): # overwrite process request def process_request(self, request, spider): # Set the location of the proxy request.meta['proxy'] = "127.0.0.1:8123" Now, how can I make sure that Scrapy uses a different IP address for requests? You can yield the first request to check your public IP, and compare this to the IP you see when you go to http://checkip.dyndns.org/ without using Tor/VPN. If they are not the same, Scrapy is obviously using a different IP. def start_requests(): yield Request('http:/
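A minimal sketch of that check, written as a spider that requests an IP-echo page through the proxy middleware and logs what the site reports; the spider name and the XPath are assumptions.

import scrapy

class CheckIPSpider(scrapy.Spider):
    name = "check_ip"                        # hypothetical spider name

    def start_requests(self):
        # This request passes through ProxyMiddleware, so the echoed address
        # should be the Tor/polipo exit IP rather than your real one.
        yield scrapy.Request("http://checkip.dyndns.org/", callback=self.check_ip)

    def check_ip(self, response):
        self.logger.info("IP seen by the site: %s",
                         response.xpath("//body/text()").get())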