web-crawler

C# web and FTP crawler library

Submitted by 邮差的信 on 2019-12-07 14:49:36
Question: I need a library (hopefully in C#!) that works as a web crawler to access HTTP and FTP files. In principle I'm happy with reading HTML, but I want to extend it to PDF, Word, etc. I'm happy to start with open-source software, or at least with pointers to documentation.

Answer 1: Check out the NCrawler project: a simple and very efficient multithreaded web crawler with pipeline-based processing, written in C#. It contains HTML, Text, PDF, and IFilter document processors and language detection (Google).

Unable to access request.response.meta['redirect_urls'] from Scrapy script

Submitted by 混江龙づ霸主 on 2019-12-07 14:42:44
Question: I am unable to access request.response.meta['redirect_urls'] from my Scrapy script, but I have no problem accessing this information for the same webpage in the Scrapy shell. When I print the keys of request.response.meta I only see download_timeout, depth, download_latency, and download_slot. I am wondering whether this has to do with one of the settings I have modified in my Scrapy script, which contains the following:

settings.set('DEPTH_LIMIT', 4)
settings.set('DOWNLOAD_DELAY', 1)
settings.set('USER
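As a sanity check, 'redirect_urls' only appears in response.meta when RedirectMiddleware actually followed a redirect for that request, so it is safer to read it with a default. A minimal sketch (the spider name and URL are placeholders, not the poster's code):

import scrapy

class RedirectCheckSpider(scrapy.Spider):
    name = 'redirect_check'
    start_urls = ['http://example.com/']

    def parse(self, response):
        # Read the key defensively; it is absent when no redirect happened.
        redirects = response.meta.get('redirect_urls', [])
        if redirects:
            self.logger.info('redirected via %s to %s', redirects, response.url)
        else:
            # Either no redirect occurred or RedirectMiddleware is disabled,
            # so the key simply is not present in response.meta.
            self.logger.info('no redirect recorded for %s', response.url)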

Strange exceptions on production website from HTTP_USER_AGENT Java/1.6.0_17

Submitted by 為{幸葍}努か on 2019-12-07 14:05:19
Question: Today we received some strange exceptions on our production website. They all have the following HTTP_USER_AGENT string: Java/1.6.0_17. I looked it up over at UserAgentString.com, but the info is quite useless. Here's one of the exceptions we're getting (they are all more or less the same):

System.NotSupportedException: The given path's format is not supported.

The path that is being queried: /klacht/Scripts/,data:c,complete:function(a,b,c){c=a.responseText,a.isResolved()&&(a.done

Any way to tell Selenium not to execute JS at some point?

Submitted by 余生颓废 on 2019-12-07 14:00:09
Question: I want to crawl a site that has some content generated by JS. The site runs a JS update of the content every 5 seconds (it requests a new encrypted JS file that I can't parse). My code:

from selenium import webdriver

driver = webdriver.PhantomJS()
driver.set_window_size(1120, 550)
driver.get(url)
trs = driver.find_elements_by_css_selector('.table tbody tr')
print len(trs)
for tr in trs:
    try:
        items.append(tr.text)
    except:
        # because the js updates the content, this tr may be missing
        pass
print len(items)

len(items)
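One workaround to consider: take a single snapshot of the rendered HTML with driver.page_source and parse that snapshot offline, so the periodic JS updates can no longer change elements mid-iteration. A minimal sketch (the URL and the BeautifulSoup dependency are assumptions, not part of the original question):

from selenium import webdriver
from bs4 import BeautifulSoup  # assumed available as the offline parser

url = 'http://example.com/'  # placeholder for the site being crawled

driver = webdriver.PhantomJS()
driver.set_window_size(1120, 550)
driver.get(url)

html = driver.page_source  # one frozen copy of the rendered DOM
driver.quit()

# Parse the frozen snapshot; later JS updates cannot affect it.
soup = BeautifulSoup(html, 'html.parser')
items = [tr.get_text(strip=True) for tr in soup.select('.table tbody tr')]
print(len(items))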

HTML Snapshot for crawler - Understanding how it works

Submitted by 假如想象 on 2019-12-07 12:12:40
Question: I'm reading this article today. To be honest, I'm really interested in point "2. Much of your content is created by a server-side technology such as PHP or ASP.NET". I want to check whether I have understood it correctly :) I create a PHP script (gethtmlsnapshot.php) in which I include the server-side AJAX page (getdata.php) and I escape the parameters (for security). Then I add it at the end of the static HTML page (index-movies.html). Right? Now... 1 - Where do I put that gethtmlsnapshot.php? In other words, I
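The question uses PHP, but a rough Python analogue of the snapshot idea may make the flow clearer; every name below (getdata.php, index-movies.html, the URL) is either taken from the question or assumed for illustration:

import requests

def build_snapshot(static_page_path, data_url):
    # Start from the static HTML page the crawler would otherwise see.
    with open(static_page_path, encoding='utf-8') as f:
        page = f.read()
    # Fetch the same data the client-side AJAX call (getdata.php) would fetch.
    data_html = requests.get(data_url).text
    # Append the rendered data at the end of the static page, as the
    # question describes for index-movies.html, so the crawler gets a
    # complete HTML snapshot instead of an empty AJAX placeholder.
    return page + '\n' + data_html

if __name__ == '__main__':
    snapshot = build_snapshot('index-movies.html', 'http://example.com/getdata.php')
    print(snapshot[:200])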

Scrapy: get websites with the error "DNS lookup failed"

Submitted by 孤街醉人 on 2019-12-07 11:41:58
Question: I'm trying to use Scrapy to get all links on websites where the DNS lookup failed. The problem is that every website without errors is printed by the parse_obj method, but when a URL returns "DNS lookup failed", the parse_obj callback is not called. I want to get every domain with the error "DNS lookup failed"; how can I do that?

Logs:

2016-03-08 12:55:12 [scrapy] INFO: Spider opened
2016-03-08 12:55:12 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03
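One common technique for catching such failures is Scrapy's errback, which is invoked when a request fails before it ever reaches the normal callback. A minimal sketch (the spider name and URLs are placeholders, not the poster's code):

import scrapy
from twisted.internet.error import DNSLookupError

class DnsErrorSpider(scrapy.Spider):
    name = 'dns_errors'
    start_urls = ['http://example.com/']

    def start_requests(self):
        for url in self.start_urls:
            # errback runs when the request fails (DNS error, timeout, ...)
            yield scrapy.Request(url, callback=self.parse_obj,
                                 errback=self.on_error, dont_filter=True)

    def parse_obj(self, response):
        self.logger.info('OK: %s', response.url)

    def on_error(self, failure):
        # Keep only the domains whose DNS lookup failed.
        if failure.check(DNSLookupError):
            request = failure.request
            self.logger.error('DNS lookup failed: %s', request.url)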

Crawling a website that needs authentication

Submitted by 时光怂恿深爱的人放手 on 2019-12-07 11:23:53
Question: How would I write a simple script (in cURL/Python/Ruby/Bash/Perl/Java) that logs in to OkCupid and tallies how many messages I've received each day? The output would be something like:

1/21/2011 1 messages
1/22/2011 0 messages
1/23/2011 2 messages
1/24/2011 1 messages

The main issue is that I have never written a web crawler before. I have no idea how to programmatically log in to a site like OkCupid. How do you make the authentication persist while loading different pages? etc. Once I get
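The question is cut off above, but the "persist authentication across pages" part usually comes down to a cookie-keeping session. A minimal sketch, not a working OkCupid client; the login URL and form field names are assumptions:

import requests

session = requests.Session()

# 1. Log in once; the session stores whatever auth cookies it receives.
login_resp = session.post(
    'https://www.okcupid.com/login',                     # assumed endpoint
    data={'username': 'me', 'password': 'secret'},       # assumed field names
)
login_resp.raise_for_status()

# 2. Subsequent requests reuse the same cookies automatically, so the
#    session stays logged in while you load the messages pages.
inbox = session.get('https://www.okcupid.com/messages')  # assumed URL
print(inbox.status_code)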

Scrapy process.crawl() to export data to JSON

Submitted by 血红的双手。 on 2019-12-07 09:57:05
Question: This might be a subquestion of "Passing arguments to process.crawl in Scrapy python", but the author marked an answer (one that doesn't answer the subquestion I'm asking myself) as satisfactory. Here's my problem: I cannot use

scrapy crawl mySpider -a start_urls(myUrl) -o myData.json

Instead I want/need to use crawlerProcess.crawl(spider). I have already figured out several ways to pass the arguments (and anyway that is answered in the question I linked), but I can't grasp how I am supposed to tell
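For reference, a common way to get the equivalent of "-o myData.json" when crawling from a script is to pass feed-export settings to CrawlerProcess. A minimal sketch; the spider name, import path, and start URL are assumptions:

from scrapy.crawler import CrawlerProcess

from myproject.spiders.my_spider import MySpider  # assumed import path

process = CrawlerProcess(settings={
    'FEED_FORMAT': 'json',    # newer Scrapy versions can use the FEEDS setting instead
    'FEED_URI': 'myData.json',
})

# Keyword arguments are forwarded to the spider, so start_urls can be set here.
process.crawl(MySpider, start_urls=['http://example.com/'])
process.start()  # blocks until the crawl finishes and the feed is written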

Scrape multiple accounts aka multiple logins

Submitted by 匆匆过客 on 2019-12-07 09:42:33
Question: I successfully scrape data for a single account. Now I want to scrape multiple accounts on a single website; multiple accounts need multiple logins, and I want a way to manage login/logout.

Answer 1: You can scrape multiple accounts in parallel using one cookiejar per account session; see the "cookiejar" request meta key at http://doc.scrapy.org/en/latest/topics/downloader-middleware.html?highlight=cookiejar#std:reqmeta-cookiejar To clarify: suppose we have an array of accounts in settings.py: MY
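The answer is cut off above, but a minimal sketch of the cookiejar approach it describes looks roughly like this; the login URL, form field names, and account list are placeholders, not from the original post:

import scrapy

ACCOUNTS = [
    {'user': 'alice', 'pass': 'secret1'},
    {'user': 'bob', 'pass': 'secret2'},
]

class MultiLoginSpider(scrapy.Spider):
    name = 'multi_login'

    def start_requests(self):
        for i, account in enumerate(ACCOUNTS):
            # A distinct cookiejar id keeps each account's session separate.
            yield scrapy.FormRequest(
                'http://example.com/login',
                formdata={'user': account['user'], 'pass': account['pass']},
                meta={'cookiejar': i},
                callback=self.after_login,
            )

    def after_login(self, response):
        # Pass the same cookiejar id along so later requests stay logged in
        # as the same account.
        yield scrapy.Request(
            'http://example.com/dashboard',
            meta={'cookiejar': response.meta['cookiejar']},
            callback=self.parse_dashboard,
        )

    def parse_dashboard(self, response):
        self.logger.info('scraped with cookiejar %s', response.meta['cookiejar'])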

How big is the difference between HTML parsing and web crawling in Python? [closed]

Submitted by |▌冷眼眸甩不掉的悲伤 on 2019-12-07 09:16:09
Question: Closed. This question needs to be more focused and is not currently accepting answers. Closed last year. I need to grab some data from websites for my Django website. Now I am confused about whether I should use Python parsing libraries or web-crawling libraries. Do search-engine libraries also fall into the same category? I want to know how big the difference is between the two, and if I want to
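A minimal sketch of the distinction, using requests and BeautifulSoup (both assumed to be installed, neither named in the question): parsing extracts data from one page you already have, while crawling also follows the links it finds to fetch more pages.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def parse_headings(html):
    """Parsing: pull data out of a single HTML document."""
    soup = BeautifulSoup(html, 'html.parser')
    return [h.get_text(strip=True) for h in soup.find_all('h1')]

def crawl(start_url, max_pages=10):
    """Crawling: fetch pages, parse each one, and follow its links."""
    seen, queue, results = set(), [start_url], {}
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = requests.get(url, timeout=10).text
        results[url] = parse_headings(html)   # reuse the parsing step per page
        soup = BeautifulSoup(html, 'html.parser')
        for a in soup.find_all('a', href=True):
            queue.append(urljoin(url, a['href']))
    return results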