web-crawler

Recommendations for a spidering tool to use with Lucene or Solr? [closed]

可紊 submitted on 2019-11-30 06:42:12
What is a good crawler (spider) to use against HTML and XML documents (local or web-based) that works well in the Lucene / Solr solution space? It could be Java-based but does not have to be. In my opinion, this is a pretty significant hole that is holding back the widespread adoption of Solr. The new DataImportHandler is a good first step for importing structured data, but there is no good document ingestion pipeline for Solr. Nutch does work, but the integration between the Nutch crawler and Solr is somewhat clumsy. I've tried every open-source crawler that I can find, and none of them
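For context, once a crawler has fetched a page, getting it into Solr is just an HTTP POST to the update handler. A minimal sketch in Python, assuming a Solr core named mycore running on localhost and a schema with id, url, and content fields (the core and field names are placeholders for illustration):

```python
import requests
from bs4 import BeautifulSoup

SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update"  # hypothetical core name

def index_page(url):
    # Fetch the page and strip it down to plain text.
    html = requests.get(url, timeout=10).text
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)

    # Post the document to Solr as JSON; commit=true makes it searchable immediately.
    doc = {"id": url, "url": url, "content": text}
    requests.post(SOLR_UPDATE_URL, params={"commit": "true"}, json=[doc], timeout=10)

index_page("http://example.com/")
```

The hard part a real pipeline adds on top of this is link extraction, politeness, and re-crawl scheduling, which is exactly what the dedicated crawlers discussed here provide.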

How do I lock read/write to MySQL tables so that I can select and then insert without other programs reading/writing to the database?

时间秒杀一切 submitted on 2019-11-30 06:35:16
Question: I am running many instances of a web crawler in parallel. Each crawler selects a domain from a table, inserts that URL and a start time into a log table, and then starts crawling the domain. Other parallel crawlers check the log table to see which domains are already being crawled before selecting their own domain to crawl. I need to prevent other crawlers from selecting a domain that has just been selected by another crawler but doesn't have a log entry yet. My best guess at how to do this is
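One common pattern for this is to do the SELECT and the log INSERT inside a single transaction, using SELECT ... FOR UPDATE so that a row being claimed is locked until the log entry is committed. A rough sketch in Python with mysql-connector; the domains and crawl_log table and column names are hypothetical:

```python
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="crawler",
                               password="secret", database="crawl")

def claim_next_domain():
    cur = conn.cursor()
    try:
        conn.start_transaction()
        # Lock one unclaimed row so no other crawler can select it concurrently.
        cur.execute(
            "SELECT id, domain FROM domains "
            "WHERE id NOT IN (SELECT domain_id FROM crawl_log) "
            "LIMIT 1 FOR UPDATE"
        )
        row = cur.fetchone()
        if row is None:
            conn.rollback()
            return None
        domain_id, domain = row
        # The selection and the log entry become visible to other crawlers atomically.
        cur.execute(
            "INSERT INTO crawl_log (domain_id, started_at) VALUES (%s, NOW())",
            (domain_id,),
        )
        conn.commit()
        return domain
    finally:
        cur.close()
```

Because both statements commit together, a competing crawler can never see a domain that is selected but not yet logged.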

Submit data via web form and extract the results

北战南征 submitted on 2019-11-30 06:22:17
Question: My Python level is novice. I have never written a web scraper or crawler. I have written Python code to connect to an API and extract the data that I want. But for some of the extracted data I want to get the gender of the author. I found this web site http://bookblog.net/gender/genie.php but the downside is that there isn't an API available. I was wondering how to write a Python script to submit data to the form on the page and extract the returned data. It would be a great help if I could get some guidance on
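The usual approach when there is no API is to POST the same form fields a browser would send and then parse the HTML that comes back. A sketch with requests and BeautifulSoup; the field names below are guesses for illustration only, and you would need to view the page's HTML source to find the real name attributes of the form inputs and the form's action URL:

```python
import requests
from bs4 import BeautifulSoup

FORM_URL = "http://bookblog.net/gender/genie.php"  # or the form's actual action URL

# Hypothetical field names; inspect the <form> markup on the page for the real ones.
payload = {"text": "a writing sample by the author...", "genre": "fiction"}

resp = requests.post(FORM_URL, data=payload, timeout=10)
soup = BeautifulSoup(resp.text, "html.parser")

# The verdict is somewhere in the returned HTML; how to pull it out depends on the page's markup.
print(soup.get_text(separator=" ", strip=True))
```

Browser developer tools (the network tab) show exactly which fields the real form submits, which is the quickest way to fill in the payload correctly.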

How to use a two-level proxy setting in Python?

ぐ巨炮叔叔 submitted on 2019-11-30 05:35:47
Question: I am working on a web crawler [using Python]. The situation is, for example, that I am behind server-1 and use a proxy setting to connect to the outside world. So in Python, using a proxy handler, I can fetch the URLs. The thing is, I am building a crawler, so I cannot use only one IP [otherwise I will be blocked]. To solve this, I have a bunch of proxies I want to shuffle through. My question is: this is a two-level proxy setup; to connect to main server-1 I use a proxy, and then afterwards I shuffle through the proxies,
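urllib and requests only let you set one proxy per request, so the usual trick for the rotation half is to pass a different proxies mapping on every request, picked at random from your pool. A sketch with requests; true chaining through the corporate proxy and then a second proxy is not something these libraries do natively, so this assumes the rotating proxies are reachable directly (or that the chaining is handled by an external forwarder):

```python
import random
import requests

# Hypothetical pool of outbound proxies to rotate through.
PROXY_POOL = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]

def fetch(url):
    proxy = random.choice(PROXY_POOL)
    # Route both http and https traffic through the randomly chosen proxy.
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)

print(fetch("http://example.com/").status_code)
```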

Lucene crawler (it needs to build a Lucene index)

旧时模样 submitted on 2019-11-30 05:33:00
I am looking for an Apache Lucene web crawler, written in Java if possible or in any other language. The crawler must use Lucene and create a valid Lucene index and document files, which is why Nutch, for example, is eliminated... Does anybody know whether such a web crawler exists and, if the answer is yes, where I can find it? Thanks... What you're asking for is two components: a web crawler and a Lucene-based automated indexer. First, a word of encouragement: been there, done that. I'll tackle both components individually from the point of view of building your own, since I don't believe that you could
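For the indexer half, the core loop is small: open an IndexWriter, turn each fetched page into a Document with stored fields, and add it. A sketch using PyLucene (the Python bindings for Lucene, so the result is a normal Lucene index); the url and content field names are just examples:

```python
import lucene
from java.nio.file import Paths
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field, StringField, TextField
from org.apache.lucene.index import IndexWriter, IndexWriterConfig
from org.apache.lucene.store import FSDirectory

lucene.initVM()

# Open (or create) an on-disk Lucene index in ./crawl-index.
directory = FSDirectory.open(Paths.get("crawl-index"))
writer = IndexWriter(directory, IndexWriterConfig(StandardAnalyzer()))

def index_page(url, text):
    doc = Document()
    doc.add(StringField("url", url, Field.Store.YES))     # exact-match field, stored
    doc.add(TextField("content", text, Field.Store.YES))  # analyzed full-text field
    writer.addDocument(doc)

index_page("http://example.com/", "page text extracted by the crawler...")
writer.commit()
writer.close()
```

The crawler component then reduces to fetching pages, extracting links and plain text, and calling something like index_page for each document it decides to keep.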

How to identify the web crawlers of Google/Yahoo/MSN in PHP?

寵の児 submitted on 2019-11-30 05:32:44
AFAIK, $_SERVER['REMOTE_HOST'] should end up as "google.com" or "yahoo.com", but is that the most reliable method? Is there any other way? You identify search engines by user agent and IP address. More info can be found in "How to identify search engine spiders and webbots". It's also worth noting this list. You shouldn't treat user agents (or even remote hosts) as necessarily definitive, however. A user agent is really nothing more than what the other end tells you it is, and it is of course free to tell you anything. It's trivial to write code that pretends to be Googlebot. In PHP, this means looking
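The verification Google itself recommends is a reverse DNS lookup on the requesting IP followed by a forward lookup to confirm, rather than trusting the user agent alone. The logic, sketched in Python (PHP's gethostbyaddr and gethostbyname give you the same primitives):

```python
import socket

def is_real_googlebot(ip):
    try:
        # Reverse lookup: the IP should resolve to a googlebot.com / google.com host.
        hostname = socket.gethostbyaddr(ip)[0]
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        # Forward-confirm: the claimed hostname must resolve back to the same IP.
        return socket.gethostbyname(hostname) == ip
    except (socket.herror, socket.gaierror):
        return False

print(is_real_googlebot("66.249.66.1"))  # an IP taken from your access logs, for example
```

Checking the user agent first and only doing the DNS round-trip for requests that claim to be a bot keeps the cost of the lookups down.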

Nutch No agents listed in 'http.agent.name'

烈酒焚心 submitted on 2019-11-30 05:08:26
Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
    at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1166)
    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1068)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:135)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:54)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect
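This exception means Nutch refuses to crawl until a crawler name is configured. The usual fix is to set http.agent.name in conf/nutch-site.xml, roughly as below; the value is whatever name you want your crawler to identify itself as:

```xml
<!-- conf/nutch-site.xml -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <!-- any identifying name for your crawler -->
    <value>MyNutchCrawler</value>
  </property>
</configuration>
```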

Automated file download using WebBrowser without a URL

风流意气都作罢 submitted on 2019-11-30 04:05:02
Question: I've been working on a web crawler written in C# using System.Windows.Forms.WebBrowser. I am trying to download a file from a website and save it on a local machine. More importantly, I would like this to be fully automated. The file download can be started by clicking a button that calls a JavaScript function which starts the download, displaying a "Do you want to open or save this file?" dialog. I definitely do not want to be manually clicking "Save as" and typing in the file name. I am aware
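One way to sidestep the dialog entirely, if you can recover the file's URL (for example from the page's JavaScript or from the request the button fires, visible in a proxy or the browser's network tools), is to fetch the file directly over HTTP instead of going through the WebBrowser control. A Python sketch of that technique; C#'s WebClient.DownloadFile is the analogous call, and the URL below is a placeholder:

```python
import requests

def download_file(url, local_path):
    # Stream the response straight to disk; no browser, no "Save as" dialog.
    with requests.get(url, stream=True, timeout=30) as resp:
        resp.raise_for_status()
        with open(local_path, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=8192):
                fh.write(chunk)

# Hypothetical URL; in practice it has to be recovered from the page's JavaScript
# or from the actual request the download button triggers.
download_file("http://example.com/files/report.pdf", "report.pdf")
```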

HttpBrowserCapabilities.Crawler property .NET

自闭症网瘾萝莉.ら submitted on 2019-11-30 03:41:27
Question: How does the HttpBrowserCapabilities.Crawler property (http://msdn.microsoft.com/en-us/library/aa332775(VS.71).aspx) work? I need to detect a partner's custom crawler, and this property is returning false. Where/how can I add his user agent so that this property will return true? Is there any other way short of creating my own user-agent detection mechanism? Answer 1: This is all driven by the default browserCaps declarations that are part of the .NET Framework. To set up this specific crawler, you would
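The browser capabilities system can be extended with a custom .browser file in the web application's App_Browsers folder, so the partner's user agent gets crawler=true without any hand-rolled detection code. A rough sketch; the id, parentID, and userAgent pattern are placeholders for whatever the partner's crawler actually sends:

```xml
<!-- App_Browsers/PartnerBot.browser (placeholder names throughout) -->
<browsers>
  <browser id="PartnerBot" parentID="Mozilla">
    <identification>
      <!-- Regex matched against the incoming User-Agent header. -->
      <userAgent match="PartnerBot" />
    </identification>
    <capabilities>
      <capability name="crawler" value="true" />
    </capabilities>
  </browser>
</browsers>
```

After adding or changing a .browser file, the application needs to recompile (or, for machine-wide definitions, aspnet_regbrowsers must be run) for the new capabilities to take effect.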

Scrapy set depth limit per allowed_domains

有些话、适合烂在心里 submitted on 2019-11-30 02:28:12
I am crawling 6 different allowed_domains and would like to limit the depth of 1 domain. How would I go about limiting the depth of that one domain in Scrapy? Or would it be possible to crawl only one level deep on off-site domains? Scrapy doesn't provide anything like this out of the box. You can set DEPTH_LIMIT per spider, but not per domain. What can we do? Read the code, drink coffee and solve it (order is important). The idea is to disable Scrapy's built-in DepthMiddleware and provide a custom one instead, as sketched below. First, let's define settings: DOMAIN_DEPTHS would be a dictionary with depth limits per domain
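A minimal sketch of that idea: a spider middleware that reads a (hypothetical) DOMAIN_DEPTHS setting, tracks depth per request, and drops requests that exceed the limit for their domain, falling back to DEPTH_LIMIT for everything else. It replaces the stock DepthMiddleware in SPIDER_MIDDLEWARES:

```python
# settings.py (sketch)
# DOMAIN_DEPTHS = {"example.com": 1}      # per-domain limits
# DEPTH_LIMIT = 10                        # fallback for all other domains
# SPIDER_MIDDLEWARES = {
#     "scrapy.spidermiddlewares.depth.DepthMiddleware": None,  # disable the built-in one
#     "myproject.middlewares.DomainDepthMiddleware": 900,
# }

from urllib.parse import urlparse
from scrapy.http import Request


class DomainDepthMiddleware:
    def __init__(self, domain_depths, default_depth):
        self.domain_depths = domain_depths
        self.default_depth = default_depth

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        return cls(settings.get("DOMAIN_DEPTHS", {}), settings.getint("DEPTH_LIMIT", 0))

    def process_spider_output(self, response, result, spider):
        def allowed(request):
            if not isinstance(request, Request):
                return True  # items pass through untouched
            depth = response.meta.get("depth", 0) + 1
            request.meta["depth"] = depth
            limit = self._limit_for(urlparse(request.url).hostname or "")
            return not limit or depth <= limit

        return (r for r in result or () if allowed(r))

    def _limit_for(self, host):
        # Match the host against the configured domains, including subdomains.
        for domain, limit in self.domain_depths.items():
            if host == domain or host.endswith("." + domain):
                return limit
        return self.default_depth
```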