web-crawler

How to build a Python crawler for websites using OAuth2

ε祈祈猫儿з submitted on 2019-12-04 11:05:34
I'm new to web programming. I want to build a crawler in Python that walks the social graph on Foursquare. I already have a "manually" controlled crawler using the apiv2 library; the main method looks like this:

```python
def main():
    CODE = "******"
    url = "https://foursquare.com/oauth2/authenticate?client_id=****&response_type=code&redirect_uri=****"
    key = "***"
    secret = "****"
    re_uri = "***"
    auth = apiv2.FSAuthenticator(key, secret, re_uri)
    auth.set_token(CODE)
    finder = apiv2.UserFinder(auth)
    # do some requests using the finder
    finder.finde(ANY_USER_ID).mayorships()
    # ...
```

The problem is that at present, …
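The apiv2 wrapper above is specific to the question. Purely as an illustration of the underlying OAuth2 "authorization code" flow such a crawler has to drive, here is a minimal sketch using requests; the endpoints follow Foursquare's documented v2 flow, but treat them (and the placeholder credentials) as assumptions to verify against the current docs.

```python
import requests

CLIENT_ID = "****"        # placeholders, as in the question
CLIENT_SECRET = "****"
REDIRECT_URI = "****"
CODE = "******"           # the code Foursquare sends back to redirect_uri

# Exchange the authorization code for an access token.
token_resp = requests.get(
    "https://foursquare.com/oauth2/access_token",
    params={
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "grant_type": "authorization_code",
        "redirect_uri": REDIRECT_URI,
        "code": CODE,
    },
    timeout=30,
)
access_token = token_resp.json()["access_token"]

# Use the token against the v2 API, e.g. to fetch the authorizing user's profile.
user_resp = requests.get(
    "https://api.foursquare.com/v2/users/self",
    params={"oauth_token": access_token, "v": "20140806"},  # 'v' is the API version date
    timeout=30,
)
print(user_resp.json())
```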

How to control the order of yield in Scrapy

蓝咒 submitted on 2019-12-04 11:00:35
Help! I'm reading the following Scrapy code and the crawler's output. I want to crawl some data from http://china.fathom.info/data/data.json, and only Scrapy is allowed. But I don't know how to control the order of yield. I expected all of the parse_member requests in the loop to be processed first and the group_item to be returned afterwards, but it seems yield item is always executed before yield request.

```python
start_urls = [
    "http://china.fathom.info/data/data.json"
]

def parse(self, response):
    groups = json.loads(response.body)['group_members']
    for i in groups:
        group_item = GroupItem()
        group_item['name'] = groups[i]['name']
        # ...
```
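One common pattern for this (a sketch, not the original poster's code) is to pass the partially filled item along in request.meta and only yield it from the last parse_member callback, so the item is emitted after all member pages have been processed. The member-page URL pattern, JSON field names, and spider name below are assumptions for illustration.

```python
import json
import scrapy

class GroupItem(scrapy.Item):          # minimal stand-in for the question's GroupItem
    name = scrapy.Field()
    members = scrapy.Field()

class GroupSpider(scrapy.Spider):
    name = "china_fathom"              # hypothetical spider name
    start_urls = ["http://china.fathom.info/data/data.json"]

    def parse(self, response):
        groups = json.loads(response.body)["group_members"]
        for i in groups:
            group_item = GroupItem()
            group_item["name"] = groups[i]["name"]
            group_item["members"] = []

            # Assumed structure: each group lists member ids whose pages must be
            # visited before the group item is complete.
            member_ids = list(groups[i].get("members", []))
            if not member_ids:
                yield group_item       # nothing to wait for
                continue

            for member_id in member_ids:
                yield scrapy.Request(
                    f"http://china.fathom.info/members/{member_id}",  # assumed URL pattern
                    callback=self.parse_member,
                    meta={"item": group_item, "pending": member_ids},
                )

    def parse_member(self, response):
        item = response.meta["item"]
        pending = response.meta["pending"]       # list shared by all members of the group
        item["members"].append(response.url)     # collect whatever parse_member extracts
        pending.pop()
        if not pending:
            # Only the last member response yields the finished group item,
            # so it is emitted after all parse_member requests have run.
            yield item
```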

Downloading all PDF files from Google Scholar search results using wget

风流意气都作罢 submitted on 2019-12-04 10:54:51
Question: I'd like to write a simple web spider, or just use wget, to download PDF results from Google Scholar. That would actually be quite a spiffy way to get papers for research. I have read the following pages on Stack Overflow:

- Crawl website using wget and limit total number of crawled links
- How do web spiders differ from Wget's spider?
- Downloading all PDF files from a website
- How to download all files (but not HTML) from a website using wget?

The last page is probably the most inspirational of all.
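Google Scholar actively blocks automated clients, so any wget or script-based approach may be rate-limited or require manual cookie handling. Purely as an illustration of the general idea (download every PDF linked from a results page), a minimal Python sketch might look like the following; the start URL and output directory are placeholders.

```python
import os
import re
from urllib.parse import urljoin

import requests

START_URL = "https://example.org/some-results-page"   # placeholder, not Google Scholar
OUT_DIR = "pdfs"

os.makedirs(OUT_DIR, exist_ok=True)
page = requests.get(START_URL, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)

# Collect absolute URLs of links that end in .pdf.
pdf_links = {
    urljoin(START_URL, href)
    for href in re.findall(r'href="([^"]+)"', page.text)
    if href.lower().endswith(".pdf")
}

for url in pdf_links:
    name = url.rstrip("/").rsplit("/", 1)[-1] or "download.pdf"
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=60)
    with open(os.path.join(OUT_DIR, name), "wb") as fh:
        fh.write(resp.content)
    print("saved", name)
```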

How do I remove a query from a URL?

随声附和 submitted on 2019-12-04 10:29:00
Question: I am using Scrapy to crawl a site which seems to be appending random values to the query string at the end of each URL. This is turning the crawl into a sort of infinite loop. How do I make Scrapy ignore the query-string part of the URLs?

Answer 1: See urlparse (urllib.parse in Python 3). Example code:

```python
from urlparse import urlparse

o = urlparse('http://url.something.com/bla.html?querystring=stuff')
url_without_query_string = o.scheme + "://" + o.netloc + o.path
```

Example output: Python 2.6.1 (r261:67515, Jun 24 …
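A sketch of how this might be applied inside a Scrapy project (assuming Python 3, where the function lives in urllib.parse): strip the query string before requests are scheduled, for example by passing a cleaner through LinkExtractor's process_value hook, so the duplicate filter recognises already-seen pages. The spider name and start URL below are placeholders.

```python
from urllib.parse import urlparse, urlunparse

import scrapy
from scrapy.linkextractors import LinkExtractor

def strip_query(url):
    """Return the URL with its query string and fragment removed."""
    p = urlparse(url)
    return urlunparse((p.scheme, p.netloc, p.path, p.params, "", ""))

class NoQuerySpider(scrapy.Spider):
    name = "noquery"                               # hypothetical
    start_urls = ["http://url.something.com/"]     # placeholder

    def parse(self, response):
        # Extract links with the random query strings dropped.
        for link in LinkExtractor(process_value=strip_query).extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse)
```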

Where is the crawled data stored when running the Nutch crawler?

巧了我就是萌 submitted on 2019-12-04 09:51:59
I am new to Nutch. I need to crawl the web (say, a few hundred web pages), read the crawled data, and do some analysis. I followed https://wiki.apache.org/nutch/NutchTutorial (and integrated Solr, since I may need full-text search later) and ran the crawl using a few URLs as the seed. Now I can't find the text/HTML data on my local machine. Where can I find the data, and what is the best way to read it in text format?

Versions: apache-nutch-1.9, solr-4.10.4

After your crawl is over, you could use the bin/nutch dump command to dump all the URLs fetched in plain HTML format. The …
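Once the pages have been dumped to plain HTML/text files (for example with bin/nutch dump, as the answer above suggests), reading them back for analysis is ordinary file handling. A minimal sketch, assuming the dump was written to a local dump/ directory:

```python
from pathlib import Path

DUMP_DIR = Path("dump")   # wherever bin/nutch dump wrote its output (assumption)

for path in sorted(DUMP_DIR.rglob("*")):
    if not path.is_file():
        continue
    # Decode permissively, since encodings vary across crawled sites.
    text = path.read_text(encoding="utf-8", errors="replace")
    print(path, len(text), "characters")
```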

How to Programmatically take Snapshot of Crawled Webpages (in Ruby)?

蓝咒 submitted on 2019-12-04 09:47:44
Question: What is the best solution for programmatically taking a snapshot of a webpage? The situation is this: I would like to crawl a bunch of webpages and take thumbnail snapshots of them periodically, say once every few months, without having to visit each one manually. I would also like to be able to take JPG/PNG snapshots of websites that might be entirely Flash/Flex, so I'd have to wait until the page has loaded before taking the snapshot somehow. It would be nice if there were no limit to the number of …
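The question asks for Ruby specifically; purely to make the general approach concrete (drive a real browser headlessly, wait for the page to load, then save an image), here is a sketch in Python with Selenium, the language used elsewhere on this page. The wait time, window size, and output path are arbitrary choices.

```python
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def snapshot(url, out_path, load_wait=5):
    """Load `url` in headless Chrome and save a PNG screenshot to `out_path`."""
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--window-size=1280,1024")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        time.sleep(load_wait)          # crude wait for on-load scripts to finish
        driver.save_screenshot(out_path)
    finally:
        driver.quit()

snapshot("https://example.com", "example.png")
```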

Building a web crawler - using WebKit packages

落花浮王杯 submitted on 2019-12-04 09:23:16
I'm trying to build a web crawler. I need two things:

- Convert the HTML into a DOM object.
- Execute existing JavaScript on demand.

The result I expect is a DOM object in which the JavaScript that runs on load has already executed. I also need an option to execute additional JavaScript on demand (for events like onMouseOver, onMouseClick, etc.). First of all, I couldn't find a good documentation source: I searched through the WebKit main page but couldn't find much information for users of the package, and no useful code examples. Also, in some forums I've seen instructions not to use the WebKit …
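WebKit itself is a C++ engine with little end-user documentation, so a common workaround is to drive a full browser engine from a scripting layer instead. As a sketch of that idea (not WebKit-specific), Python with Selenium can return the post-onload DOM and fire additional events on demand; the URL, selector, and event are placeholders.

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")   # placeholder URL

    # DOM after the on-load JavaScript has run.
    dom_html = driver.execute_script("return document.documentElement.outerHTML")

    # Fire an event on demand, e.g. a mouseover on the first link (if present).
    driver.execute_script("""
        var el = document.querySelector('a');
        if (el) {
            el.dispatchEvent(new MouseEvent('mouseover', {bubbles: true}));
        }
    """)
    print(len(dom_html), "characters of rendered DOM")
finally:
    driver.quit()
```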

Is Erlang the right choice for a web crawler?

半腔热情 submitted on 2019-12-04 08:31:59
Question: I am planning to write a web crawler for an NLP project that reads the thread structure of a forum at a fixed interval and parses each thread that has new content. The author, the date, and the content of new posts are extracted via regular expressions, and the result is stored in a database. The language and platform used for the crawler have to meet the following criteria:

- easily scalable to multiple cores and CPUs
- suited for high I/O loads
- fast regular expression matching
- …
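Whatever language is chosen, the extraction step the question describes (pull author, date, and body out of a post with regular expressions and store the result in a database) is small; a sketch in Python, with made-up markup and table names, just to make the pipeline concrete:

```python
import re
import sqlite3

# Hypothetical forum markup; a real forum's HTML will differ.
POST_RE = re.compile(
    r'<div class="author">(?P<author>.*?)</div>\s*'
    r'<div class="date">(?P<date>.*?)</div>\s*'
    r'<div class="body">(?P<body>.*?)</div>',
    re.S,
)

def store_new_posts(html, db_path="forum.db"):
    """Extract posts from a thread page and insert them into SQLite."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts (author TEXT, date TEXT, body TEXT)"
    )
    for m in POST_RE.finditer(html):
        conn.execute(
            "INSERT INTO posts VALUES (?, ?, ?)",
            (m.group("author"), m.group("date"), m.group("body")),
        )
    conn.commit()
    conn.close()
```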

Change IP address dynamically?

試著忘記壹切 submitted on 2019-12-04 07:39:41
Question: Consider this case: I want to crawl websites frequently, but my IP address gets blocked after some days / after hitting a limit. How can I change my IP address dynamically, or are there other ideas?

Answer 1: An approach using Scrapy will make use of two components, RandomProxy and RotateUserAgentMiddleware. Modify DOWNLOADER_MIDDLEWARES as follows; you will have to register the new components in settings.py:

```python
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 90,
    'tutorial …
```
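For the proxy side specifically, the core mechanism in Scrapy is simply setting request.meta['proxy'] in a downloader middleware; RandomProxy is a third-party package that does this from a proxy list. A minimal home-grown sketch of the same idea (the proxy addresses and project path are placeholders):

```python
import random

class RandomProxyMiddleware:
    """Downloader middleware that assigns a random proxy to every request."""

    # Placeholder proxies; in practice these come from a file or a provider.
    PROXIES = [
        "http://111.111.111.111:8080",
        "http://222.222.222.222:3128",
    ]

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(self.PROXIES)

# settings.py (sketch): register it alongside the retry middleware
# DOWNLOADER_MIDDLEWARES = {
#     "scrapy.downloadermiddlewares.retry.RetryMiddleware": 90,
#     "myproject.middlewares.RandomProxyMiddleware": 100,
# }
```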

How to crawl thousands of pages using Scrapy?

时光怂恿深爱的人放手 submitted on 2019-12-04 07:18:03
Question: I'm looking at crawling thousands of pages and need a solution. Every site has its own HTML code; they are all unique sites. No clean data feed or API is available. I'm hoping to load the captured data into some sort of database. Any ideas on how to do this with Scrapy, if possible?

Answer 1: If I had to scrape clean data from thousands of sites, with each site having its own layout, structure, etc., I would implement (and actually have done so in some projects) the following approach: Crawler - a scrapy …
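The answer is cut off above, but the approach it starts to describe (a generic Scrapy crawler that stores raw pages, with per-site parsing handled later) can be sketched roughly as follows; the seed file, database schema, and pipeline path are assumptions.

```python
import scrapy
import sqlite3

class GenericSpider(scrapy.Spider):
    """Fetch every seed URL and hand the raw page to the pipeline for storage."""
    name = "generic"

    def start_requests(self):
        with open("seeds.txt") as fh:            # one URL per line (assumed format)
            for line in fh:
                url = line.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        yield {
            "url": response.url,
            "status": response.status,
            "html": response.text,
        }

class SqlitePipeline:
    """Item pipeline that stores raw pages; enable it via ITEM_PIPELINES."""

    def open_spider(self, spider):
        self.conn = sqlite3.connect("pages.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages (url TEXT, status INTEGER, html TEXT)"
        )

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        self.conn.execute(
            "INSERT INTO pages VALUES (?, ?, ?)",
            (item["url"], item["status"], item["html"]),
        )
        return item
```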