web-crawler

What is a good Web search and web crawling engine for Java?

Submitted by 有些话、适合烂在心里 on 2019-12-06 07:39:05
Question: I am working on an application where I need to integrate a search engine that should also do crawling. Please suggest a good Java-based search engine. Thank you in advance.

Answer 1: Nutch (built on Lucene) is an open-source engine which should satisfy your needs.

Answer 2: In the past I worked with Terrier, a search engine written in Java: Terrier is a highly flexible, efficient, effective, and robust search engine, readily deployable on large-scale collections of documents. Terrier implements state-of-the…

Scrapy InIt self.initialized() — not initializing

Submitted by China☆狼群 on 2019-12-06 07:38:53
I am trying to use Scrapy to log in to a website in the init step, and then, after confirming the login, initialize and start the standard crawl through start_urls. I'm not sure what is going wrong: the login goes through and everything confirms, but parse_item never starts. Any help would be appreciated. I can get as far as "================Successfully logged in=================" but I cannot get to "==========================PARSE ITEM==========================".

from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib…
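For reference, this is the InitSpider login flow the question describes, as a minimal sketch: the URLs, form fields, and the "Logout" success marker below are hypothetical, and it uses the default parse callback on start_urls rather than CrawlSpider rules.

from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest

class LoginSpider(InitSpider):
    name = 'login_example'
    allowed_domains = ['example.com']             # hypothetical site
    login_page = 'http://example.com/login'
    start_urls = ['http://example.com/members/']

    def init_request(self):
        # Runs before the normal crawl: fetch the login page first.
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        # Submit the login form (field names are assumptions).
        return FormRequest.from_response(
            response,
            formdata={'username': 'user', 'password': 'secret'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        if 'Logout' in response.body:
            self.log('================Successfully logged in=================')
            # Hand control back to InitSpider so start_urls get scheduled.
            return self.initialized()
        self.log('Login failed')

    def parse(self, response):
        # Only reached after self.initialized() has been returned above.
        self.log('==========================PARSE ITEM==========================')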

iPhone: How to download a full website?

Submitted by 匆匆过客 on 2019-12-06 07:21:42
Question: What approach would you recommend for downloading a website (one HTML page with all included images) to the iPhone? The question is how to crawl all those tiny bits (JavaScript, images, CSS) and save them locally. It's not about the concrete implementation (I know how to use NSURLRequest and such; I'm looking for a crawl/spider approach). Jailbreaks won't work, since this is intended for an official (App Store) app. Regards, Stefan

Answer 1: Downloading? Or getting the HTML source of the site and…

Ruby+Anemone Web Crawler: regex to match URLs ending in a series of digits

Submitted by 一笑奈何 on 2019-12-06 07:15:34
Question: Suppose I am trying to crawl a website and want to skip pages whose URLs end like so:

http://HIDDENWEBSITE.com/anonimize/index.php?page=press_and_news&subpage=20060117

I am currently using the Anemone gem in Ruby to build the crawler, and I am using the skip_links_like method, but my pattern never seems to match. I am trying to make this as generic as possible, so it isn't dependent on subpage but just on =2105925 (the digits). I have tried /=\d+$/ and /\?.*\d+$/ but neither seems to work. This is similar to…
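A quick sanity check on the pattern itself (shown in Python's re rather than Anemone, since the regex syntax is the same for this case): a trailing-digits anchor does match the example URL as a plain string, which suggests the mismatch lies in what skip_links_like actually tests the pattern against, rather than in the regex.

import re

url = ("http://HIDDENWEBSITE.com/anonimize/index.php"
       "?page=press_and_news&subpage=20060117")

print(bool(re.search(r"=\d+$", url)))     # True: ends with '=' plus digits
print(bool(re.search(r"\?.*\d+$", url)))  # True: query string ending in digits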

Java HTML parser for reading JavaScript-generated content

Submitted by 不打扰是莪最后的温柔 on 2019-12-06 06:31:55
I am using jsoup to read a web page with the following function:

public Document getDocuement(String url) {
    Document doc = null;
    try {
        doc = Jsoup.connect(url).timeout(20 * 1000).userAgent("Mozilla").get();
    } catch (Exception e) {
        return null;
    }
    return doc;
}

But whenever I try to read a web page that contains JavaScript-generated content, jsoup does not read that content; i.e., the actual content of the page is loaded by JavaScript calls, so it is not present in the page source of that link. For example, this blog: http://blog.rapporter.net/search/label/r . Is there a way to get…

How to recursively crawl subpages with Scrapy

Submitted by 一曲冷凌霜 on 2019-12-06 05:58:49
So basically I am trying to crawl a page with a set of categories, scrape the name of each category, follow the sublink associated with each category to a page with a set of subcategories, scrape their names, and then follow each subcategory to its associated page and retrieve text data. At the end I want to output a JSON file formatted somewhat like:

Category 1 name
    Subcategory 1 name
        data from this subcategory's page
    Subcategory n name
        data from this page
Category n name
    Subcategory 1 name
        data from subcategory n's page
    etc.

Eventually I want to be able to use this data with Elasticsearch. I…
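One common way to get that nesting with Scrapy (shown as a rough sketch, not the asker's code) is to pass the partially built item from callback to callback via the request meta dict; all URLs and CSS selectors below are assumptions, and collecting every subcategory into one item before yielding is left out of the sketch.

import scrapy

class CategorySpider(scrapy.Spider):
    name = 'categories'
    start_urls = ['http://example.com/categories']      # hypothetical

    def parse(self, response):
        # One request per category, carrying the category name along in meta.
        for link in response.css('ul.categories a'):     # assumed selector
            item = {'category': link.css('::text').get(),
                    'subcategories': []}
            yield response.follow(link, callback=self.parse_category,
                                  meta={'item': item})

    def parse_category(self, response):
        item = response.meta['item']
        for link in response.css('ul.subcategories a'):  # assumed selector
            sub = {'name': link.css('::text').get(), 'data': None}
            item['subcategories'].append(sub)
            yield response.follow(link, callback=self.parse_subcategory,
                                  meta={'item': item, 'sub': sub})

    def parse_subcategory(self, response):
        sub = response.meta['sub']
        sub['data'] = ' '.join(response.css('p::text').getall())
        # Yields the category item once per subcategory page; deduplicating or
        # waiting until every subcategory is filled is a separate problem.
        yield response.meta['item']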

SharePoint 2010 search cannot crawl MediaWiki site

Submitted by 余生颓废 on 2019-12-06 05:54:22
Question: Using SharePoint 2010 enterprise search, we are trying to crawl our internal MediaWiki-based wiki site. The crawl fails with the error: 'The URL was permanently moved. ( URL redirected to ... )'. Since the wiki site has case-sensitive URLs, when SharePoint 2010 tries to crawl with lower-case URL names, the wiki says 'page does not exist' and redirects with a 301. Has anyone got a solution? Thanks in advance.

Answer 1: By default, all links crawled are converted to lower case by the SharePoint search indexer…

How to read the content of a website?

Submitted by 倖福魔咒の on 2019-12-06 05:19:01
I'm new to web crawling with Python 2.7.

1. Background
I want to collect useful data from AQICN.org, which is a great website offering air-quality data from all over the world. I want to use Python to get all of China's sites' data every hour, but I'm stuck right now.

2. My trouble
Take this page ( http://aqicn.org/city/shenyang/usconsulate/ ) for example. It offers the air-pollution and meteorology parameters of a U.S. Consulate in China. Using code like this, I can't get useful information:

import urllib
from bs4 import BeautifulSoup
import re
import json

html_aqi = urllib.urlopen("http://aqicn.org/city/shenyang/usconsulate/")…
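If the values on that page are filled in by JavaScript after the page loads (which would explain why a plain urllib/BeautifulSoup fetch only sees placeholders), one common workaround is to let a real browser render the page first. A minimal sketch, assuming Selenium and a matching browser driver are installed; the CSS class at the end is an assumption to be checked against the live page:

from selenium import webdriver
from bs4 import BeautifulSoup

url = "http://aqicn.org/city/shenyang/usconsulate/"

driver = webdriver.Firefox()       # or webdriver.Chrome(), etc.
driver.get(url)
html = driver.page_source          # HTML *after* the scripts have run
driver.quit()

soup = BeautifulSoup(html, "html.parser")
value = soup.find("div", class_="aqivalue")   # assumed class name
print(value.get_text() if value else "value not found")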

Scrapy spider difference between Crawled pages and Scraped items

Submitted by  ̄綄美尐妖づ on 2019-12-06 03:52:20
Question: I'm writing a Scrapy CrawlSpider that reads a list of ADs on the first page, takes some info like thumbnails of the listings and the AD URLs, and then yields a request to each of these AD URLs to take their details. It was working and paginating apparently well in the test environment, but today, trying to make a complete run, I noticed this in the log:

Crawled 3852 pages (at 228 pages/min), scraped 256 items (at 15 items/min)

I don't understand the reason for this big difference between crawled pages and scraped…
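As a point of reference, the two numbers count different things: every response Scrapy downloads (listing pages, pagination pages, the AD detail pages, any other followed URL) counts as a crawled page, while only the items actually yielded count as scraped items. A rough sketch of the pattern described above (hypothetical URLs and selectors, not the asker's spider) shows where each counter ticks:

import scrapy

class AdsSpider(scrapy.Spider):
    name = 'ads'
    start_urls = ['http://example.com/ads?page=1']    # hypothetical

    def parse(self, response):
        # Each followed AD link becomes one more *crawled* page...
        for href in response.css('div.ad a::attr(href)').getall():
            yield response.follow(href, callback=self.parse_ad)
        # ...and so does every pagination page, which scrapes nothing itself.
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_ad(self, response):
        # Only this yield increments the *scraped items* counter.
        yield {'url': response.url,
               'title': response.css('h1::text').get()}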

C# web and ftp crawler library

Submitted by 帅比萌擦擦* on 2019-12-06 03:36:05
I need a library (hopefully in C#!) which works as a web crawler to access files over HTTP and FTP. In principle, I'm happy with reading HTML; I want to extend it to PDF, Word, etc. I'd be happy with a starter open-source project, or at least some pointers to documentation.

Answer (Nick Martyshchenko): Check the NCrawler project: "Simple and very efficient multithreaded web crawler with pipeline based processing written in C#. Contains HTML, Text, PDF, and IFilter document processors and language detection (Google). Easy to add pipeline steps to extract, use and alter information." I have developed the…