web-crawler

C# web and FTP crawler library

Submitted by 邮差的信 on 2019-12-07 14:49:36
Question: I need a library (hopefully in C#!) that works as a web crawler to access HTTP and FTP files. In principle I'm happy with reading HTML, but I want to extend it to PDF, Word, etc. I'm happy to start with open-source software, or at least with pointers to documentation.

Answer 1: Check out the NCrawler project: a simple and very efficient multithreaded web crawler with pipeline-based processing, written in C#. It contains HTML, Text, PDF, and IFilter document processors and language detection (Google).

Unable to access request.response.meta['redirect_urls'] from Scrapy script

Submitted by 混江龙づ霸主 on 2019-12-07 14:42:44
Question: I am unable to access request.response.meta['redirect_urls'] from my Scrapy script, but I have no problem accessing this information for the same webpage in the Scrapy shell. When I print the keys of request.response.meta I only see download_timeout, depth, download_latency, and download_slot. I am wondering whether this has to do with one of the settings I have modified in my Scrapy script, which contains the following:

settings.set('DEPTH_LIMIT', 4)
settings.set('DOWNLOAD_DELAY', 1)
settings.set('USER
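As a sanity check, 'redirect_urls' only appears in response.meta when RedirectMiddleware actually followed a redirect for that request, so it is safer to read it with a default. A minimal sketch (the spider name and URL are placeholders, not the poster's code):

import scrapy

class RedirectCheckSpider(scrapy.Spider):
    name = 'redirect_check'
    start_urls = ['http://example.com/']

    def parse(self, response):
        # Read the key defensively; it is absent when no redirect happened.
        redirects = response.meta.get('redirect_urls', [])
        if redirects:
            self.logger.info('redirected via %s to %s', redirects, response.url)
        else:
            # Either no redirect occurred or RedirectMiddleware is disabled,
            # so the key simply is not present in response.meta.
            self.logger.info('no redirect recorded for %s', response.url)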

Strange exceptions on production website from HTTP_USER_AGENT Java/1.6.0_17

Submitted by 為{幸葍}努か on 2019-12-07 14:05:19
Question: Today we received some strange exceptions on our production website. They all have the following HTTP_USER_AGENT string: Java/1.6.0_17. I looked it up over at UserAgentString.com, but the info is quite useless. Here's one of the exceptions we're getting (they are all more or less the same):

System.NotSupportedException: The given path's format is not supported.

The path that is being queried: /klacht/Scripts/,data:c,complete:function(a,b,c){c=a.responseText,a.isResolved()&&(a.done

Any way to tell Selenium not to execute JS at some point?

Submitted by 余生颓废 on 2019-12-07 14:00:09
Question: I want to crawl a site that has some content generated by JS. The site runs a JS update of the content every 5 seconds (it requests a new encrypted JS file that I can't parse). My code:

from selenium import webdriver

driver = webdriver.PhantomJS()
driver.set_window_size(1120, 550)
driver.get(url)
trs = driver.find_elements_by_css_selector('.table tbody tr')
print len(trs)
for tr in trs:
    try:
        items.append(tr.text)
    except:
        # because the js updates the content, this tr may be missing
        pass
print len(items)

len(items)
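One workaround to consider: take a single snapshot of the rendered HTML with driver.page_source and parse that snapshot offline, so the periodic JS updates can no longer change elements mid-iteration. A minimal sketch (the URL and the BeautifulSoup dependency are assumptions, not part of the original question):

from selenium import webdriver
from bs4 import BeautifulSoup  # assumed available as the offline parser

url = 'http://example.com/'  # placeholder for the site being crawled

driver = webdriver.PhantomJS()
driver.set_window_size(1120, 550)
driver.get(url)

html = driver.page_source  # one frozen copy of the rendered DOM
driver.quit()

# Parse the frozen snapshot; later JS updates cannot affect it.
soup = BeautifulSoup(html, 'html.parser')
items = [tr.get_text(strip=True) for tr in soup.select('.table tbody tr')]
print(len(items))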

HTML Snapshot for crawler - Understanding how it works

Submitted by 假如想象 on 2019-12-07 12:12:40
Question: I'm reading this article today. To be honest, I'm really interested in point "2. Much of your content is created by a server-side technology such as PHP or ASP.NET". I want to check whether I have understood it correctly :) I create a PHP script (gethtmlsnapshot.php) in which I include the server-side AJAX page (getdata.php) and I escape the parameters (for security). Then I add it at the end of the static HTML page (index-movies.html). Right? Now... 1 - Where do I put that gethtmlsnapshot.php? In other words, I
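The question uses PHP, but a rough Python analogue of the snapshot idea may make the flow clearer; every name below (getdata.php, index-movies.html, the URL) is either taken from the question or assumed for illustration:

import requests

def build_snapshot(static_page_path, data_url):
    # Start from the static HTML page the crawler would otherwise see.
    with open(static_page_path, encoding='utf-8') as f:
        page = f.read()
    # Fetch the same data the client-side AJAX call (getdata.php) would fetch.
    data_html = requests.get(data_url).text
    # Append the rendered data at the end of the static page, as the
    # question describes for index-movies.html, so the crawler gets a
    # complete HTML snapshot instead of an empty AJAX placeholder.
    return page + '\n' + data_html

if __name__ == '__main__':
    snapshot = build_snapshot('index-movies.html', 'http://example.com/getdata.php')
    print(snapshot[:200])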

Scrapy: get websites with the error "DNS lookup failed"

Submitted by 孤街醉人 on 2019-12-07 11:41:58
Question: I'm trying to use Scrapy to get all links on websites where the DNS lookup failed. The problem is that every website without errors is printed by the parse_obj method, but when a URL returns "DNS lookup failed", the parse_obj callback is not called. I want to get every domain with the error "DNS lookup failed"; how can I do that?

Logs:

2016-03-08 12:55:12 [scrapy] INFO: Spider opened
2016-03-08 12:55:12 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03
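One common technique for catching such failures is Scrapy's errback, which is invoked when a request fails before it ever reaches the normal callback. A minimal sketch (the spider name and URLs are placeholders, not the poster's code):

import scrapy
from twisted.internet.error import DNSLookupError

class DnsErrorSpider(scrapy.Spider):
    name = 'dns_errors'
    start_urls = ['http://example.com/']

    def start_requests(self):
        for url in self.start_urls:
            # errback runs when the request fails (DNS error, timeout, ...)
            yield scrapy.Request(url, callback=self.parse_obj,
                                 errback=self.on_error, dont_filter=True)

    def parse_obj(self, response):
        self.logger.info('OK: %s', response.url)

    def on_error(self, failure):
        # Keep only the domains whose DNS lookup failed.
        if failure.check(DNSLookupError):
            request = failure.request
            self.logger.error('DNS lookup failed: %s', request.url)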

Crawling a website that needs authentication

Submitted by 时光怂恿深爱的人放手 on 2019-12-07 11:23:53
Question: How would I write a simple script (in cURL/Python/Ruby/Bash/Perl/Java) that logs in to OkCupid and tallies how many messages I've received each day? The output would be something like:

1/21/2011 1 messages
1/22/2011 0 messages
1/23/2011 2 messages
1/24/2011 1 messages

The main issue is that I have never written a web crawler before. I have no idea how to programmatically log in to a site like OkCupid. How do you make the authentication persist while loading different pages? etc. Once I get
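The question is cut off above, but the "persist authentication across pages" part usually comes down to a cookie-keeping session. A minimal sketch, not a working OkCupid client; the login URL and form field names are assumptions:

import requests

session = requests.Session()

# 1. Log in once; the session stores whatever auth cookies it receives.
login_resp = session.post(
    'https://www.okcupid.com/login',                     # assumed endpoint
    data={'username': 'me', 'password': 'secret'},       # assumed field names
)
login_resp.raise_for_status()

# 2. Subsequent requests reuse the same cookies automatically, so the
#    session stays logged in while you load the messages pages.
inbox = session.get('https://www.okcupid.com/messages')  # assumed URL
print(inbox.status_code)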

Scrapy process.crawl() to export data to JSON

Submitted by 血红的双手。 on 2019-12-07 09:57:05
Question: This might be a subquestion of "Passing arguments to process.crawl in Scrapy python", but the author marked an answer (one that doesn't answer the subquestion I'm asking myself) as satisfactory. Here's my problem: I cannot use

scrapy crawl mySpider -a start_urls(myUrl) -o myData.json

Instead I want/need to use crawlerProcess.crawl(spider). I have already figured out several ways to pass the arguments (and anyway that is answered in the question I linked), but I can't grasp how I am supposed to tell
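For reference, a common way to get the equivalent of "-o myData.json" when crawling from a script is to pass feed-export settings to CrawlerProcess. A minimal sketch; the spider name, import path, and start URL are assumptions:

from scrapy.crawler import CrawlerProcess

from myproject.spiders.my_spider import MySpider  # assumed import path

process = CrawlerProcess(settings={
    'FEED_FORMAT': 'json',    # newer Scrapy versions can use the FEEDS setting instead
    'FEED_URI': 'myData.json',
})

# Keyword arguments are forwarded to the spider, so start_urls can be set here.
process.crawl(MySpider, start_urls=['http://example.com/'])
process.start()  # blocks until the crawl finishes and the feed is written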

Scrape multiple accounts aka multiple logins

Submitted by 匆匆过客 on 2019-12-07 09:42:33
Question: I successfully scrape data for a single account. Now I want to scrape multiple accounts on a single website; multiple accounts need multiple logins, and I want a way to manage login/logout.

Answer 1: You can scrape multiple accounts in parallel using one cookiejar per account session; see the "cookiejar" request meta key at http://doc.scrapy.org/en/latest/topics/downloader-middleware.html?highlight=cookiejar#std:reqmeta-cookiejar To clarify: suppose we have an array of accounts in settings.py: MY
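The answer is cut off above, but a minimal sketch of the cookiejar approach it describes looks roughly like this; the login URL, form field names, and account list are placeholders, not from the original post:

import scrapy

ACCOUNTS = [
    {'user': 'alice', 'pass': 'secret1'},
    {'user': 'bob', 'pass': 'secret2'},
]

class MultiLoginSpider(scrapy.Spider):
    name = 'multi_login'

    def start_requests(self):
        for i, account in enumerate(ACCOUNTS):
            # A distinct cookiejar id keeps each account's session separate.
            yield scrapy.FormRequest(
                'http://example.com/login',
                formdata={'user': account['user'], 'pass': account['pass']},
                meta={'cookiejar': i},
                callback=self.after_login,
            )

    def after_login(self, response):
        # Pass the same cookiejar id along so later requests stay logged in
        # as the same account.
        yield scrapy.Request(
            'http://example.com/dashboard',
            meta={'cookiejar': response.meta['cookiejar']},
            callback=self.parse_dashboard,
        )

    def parse_dashboard(self, response):
        self.logger.info('scraped with cookiejar %s', response.meta['cookiejar'])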

How big is the difference between HTML parsing and web crawling in Python? [closed]

Submitted by |▌冷眼眸甩不掉的悲伤 on 2019-12-07 09:16:09
Question: Closed. This question needs to be more focused and is not currently accepting answers. Closed last year. I need to grab some data from websites for my Django website. Now I am confused about whether I should use Python parsing libraries or web-crawling libraries. Do search-engine libraries also fall into the same category? I want to know how big the difference is between the two, and if I want to
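A minimal sketch of the distinction, using requests and BeautifulSoup (both assumed to be installed, neither named in the question): parsing extracts data from one page you already have, while crawling also follows the links it finds to fetch more pages.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def parse_headings(html):
    """Parsing: pull data out of a single HTML document."""
    soup = BeautifulSoup(html, 'html.parser')
    return [h.get_text(strip=True) for h in soup.find_all('h1')]

def crawl(start_url, max_pages=10):
    """Crawling: fetch pages, parse each one, and follow its links."""
    seen, queue, results = set(), [start_url], {}
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = requests.get(url, timeout=10).text
        results[url] = parse_headings(html)   # reuse the parsing step per page
        soup = BeautifulSoup(html, 'html.parser')
        for a in soup.find_all('a', href=True):
            queue.append(urljoin(url, a['href']))
    return results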