web-crawler

Nutch crawling stops after injector

拈花ヽ惹草 submitted on 2019-12-25 06:21:04
Question: Here is what my Cygwin screen looks like:

cygpath: can't convert empty path
Injector: starting at 2014-05-15 16:57:50
Injector: crawlDb: -dir/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Patch for HADOOP-7682: Instantiating workaround file system
Injector: total number of urls rejected by filters: 1
Injector: total number of urls injected after normalization and filtering: 0
Injector: Merging injected urls into crawl db.
Injector: overwrite: false
Injector:
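
The log explains the stall: one seed URL was rejected by the URL filters and zero were injected, so the crawl db is empty and every later phase has nothing to do. (The odd path in "crawlDb: -dir/crawldb" may also indicate the -dir argument was swallowed into the path.) The usual first check is that conf/regex-urlfilter.txt ends with an accept rule matching the seed and that the seed file is well formed. A minimal sketch, with example.com standing in for the real seed (hypothetical):

# urls/seed.txt -- one absolute URL per line, scheme included
http://example.com/

# conf/regex-urlfilter.txt -- the final accept rule must match the seed host
+^http://([a-z0-9]*\.)*example.com/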

Web crawling tools which support interacting with target sites before beginning to crawl

[亡魂溺海] submitted on 2019-12-25 05:38:28
Question: I am looking for a crawler which is capable of handling pages with Ajax and of performing certain user interactions with the target site before starting to crawl it (e.g., clicking on certain menu items, filling in some forms, etc.). I tried WebDriver/Selenium (which are really web scraping tools) and now I want to know if there is any crawler available that supports emulating certain user interactions before starting to crawl (in Java or Python or Ruby ...). Thanks. PS - Can
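
Pending a crawler with built-in interaction support, the two tools can be combined: Selenium performs the up-front clicks and form fills, and the rendered (post-Ajax) DOM is then handed to an ordinary link-queue crawler. A minimal Java sketch of that hand-off (the URL and element names are hypothetical):

import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

public class InteractThenCrawl {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("http://example.com/login");              // hypothetical target
            driver.findElement(By.name("user")).sendKeys("me");  // hypothetical field names
            driver.findElement(By.name("pass")).sendKeys("secret");
            driver.findElement(By.name("submit")).click();
            // After the interaction, harvest links from the rendered DOM
            // and hand them to a conventional crawler queue.
            List<WebElement> links = driver.findElements(By.tagName("a"));
            for (WebElement link : links) {
                String href = link.getAttribute("href");
                if (href != null) System.out.println(href);      // enqueue instead of print
            }
        } finally {
            driver.quit();
        }
    }
}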

Extending a basic web crawler to filter status codes and HTML

落花浮王杯 submitted on 2019-12-25 05:23:11
Question: I followed a tutorial on writing a basic web crawler in Java and have got something with basic functionality. At the moment it just retrieves the HTML from the site and prints it to the console. I was hoping to extend it so it can extract specifics like the HTML page title and the HTTP status code. I found this library: http://htmlparser.sourceforge.net/ ... which I think might be able to do the job for me, but could I do it without using an external library? Here's what I have so far:
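
Both items are reachable with the JDK alone: HttpURLConnection reports the status code directly, and a regex can pull a well-formed <title> (fragile on broken markup, which is where a parser library earns its keep). A minimal sketch against a hypothetical URL:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StatusAndTitle {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://example.com/");                // hypothetical target
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        int status = conn.getResponseCode();                     // the HTTP status code
        System.out.println("Status: " + status);

        StringBuilder html = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) html.append(line).append('\n');
        }
        // Good enough for a well-formed <title>; a real parser is more
        // robust against malformed markup.
        Matcher m = Pattern.compile("<title>(.*?)</title>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL).matcher(html);
        if (m.find()) System.out.println("Title: " + m.group(1).trim());
    }
}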

Crawl online directories and parse online PDF documents to extract text in Java

白昼怎懂夜的黑 submitted on 2019-12-25 03:43:02
Question: I need to be able to crawl an online directory such as, for example, http://svn.apache.org/repos/asf/, and whenever a PDF, DOCX, TXT, or ODT file comes up during the crawl, I need to be able to parse it and extract its text. I am using Files.walk in order to crawl around locally on my laptop, and the Apache Tika library to parse text, and it works just fine, but I don't really know how I can do the same in an online directory. Here's the code that goes through my PC and parses the
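
There is no Files.walk for an HTTP directory, but the same traversal falls out of fetching each listing page and following its links: recurse on entries ending in "/", and feed matching files to Tika, whose parseToString can read straight from a URL. A rough sketch (Java 9+ for readAllBytes; assumes the listing is simple <a href> HTML like the SVN browser emits):

import java.io.InputStream;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.tika.Tika;

public class OnlineDirCrawler {
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");
    private static final Tika TIKA = new Tika();

    static void crawl(String dirUrl) throws Exception {
        String listing;
        try (InputStream in = new URL(dirUrl).openStream()) {
            listing = new String(in.readAllBytes());             // Java 9+
        }
        Matcher m = HREF.matcher(listing);
        while (m.find()) {
            String name = m.group(1);
            if (name.startsWith("..") || name.startsWith("http")) continue; // relative links only
            if (name.endsWith("/")) {
                crawl(dirUrl + name);                            // recurse into subdirectory
            } else if (name.matches("(?i).*\\.(pdf|docx|txt|odt)$")) {
                System.out.println(TIKA.parseToString(new URL(dirUrl + name)));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        crawl("http://svn.apache.org/repos/asf/");               // huge tree; cap depth in practice
    }
}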

How to use HTMLAgilityPack to extract HTML data

感情迁移 submitted on 2019-12-25 03:15:55
Question: I am learning to write a web crawler and found some great examples to get me started, but since I am new to this, I have a few questions regarding the coding method. The search result, for example, can be found here: Search Result. When I look at the HTML source for the result I can see the following:

<HR><CENTER><H3>License Information *</H3></CENTER><HR>
<P>
<CENTER> 06/03/2014 </CENTER> <BR>
<B>Name : </B> WILLIAMS AJAYA L <BR>
<B>Address : </B> NEW YORK NY <BR>
<B>Profession : </B> ATHLETIC
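
The question targets HTMLAgilityPack (a .NET parser), but the extraction pattern is library-agnostic: each field label is a <B> element and the value is the bare text node that follows it, so select the labels and read their next siblings. Sketched here in Java with jsoup as a stand-in parser, not the library from the question (the URL is hypothetical):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class LicenseScraper {
    public static void main(String[] args) throws Exception {
        // Hypothetical: load the search-result page shown in the question.
        Document doc = Jsoup.connect("http://example.com/search-result").get();
        // Each label is a <B> element ("Name :", "Address :", ...);
        // the value is the text node immediately after it.
        for (Element label : doc.select("b")) {
            Node next = label.nextSibling();
            if (next instanceof TextNode) {
                System.out.println(label.text().replace(":", "").trim()
                        + " = " + ((TextNode) next).text().trim());
            }
        }
    }
}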

Nutch 2 with Cassandra as storage is not crawling data properly

我怕爱的太早我们不能终老 submitted on 2019-12-25 03:07:43
Question: I am using Nutch 2.x with Cassandra as storage. Currently I am crawling only one website, and the data is getting loaded into Cassandra in byte-code format. When I use the readdb command in Nutch, I do not get any useful crawl data. Below are the details of the different files and the output I am getting:

========== command to run crawler =====================
bin/crawl urls/ crawlDir/ http://localhost:8983/solr/ 3

======================== seed.txt data ==========================
http://www.ft.com

===
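
Seeing raw bytes in Cassandra is expected: Nutch 2.x serializes page fields through Gora, so the keyspace is not meant to be read directly; the human-readable view comes from dumping the web page store with readdb. A hedged sketch of the usual inspection step (option names vary between 2.x builds; run bin/nutch readdb with no arguments to see the exact usage for yours):

bin/nutch readdb -dump dump_dir -crawlId <your_crawl_id>
cat dump_dir/part-r-00000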

Sorting crawled information?

给你一囗甜甜゛ submitted on 2019-12-25 02:45:08
Question: Here is the result of the page I successfully crawled: The problem is that I've only been given numbers! There is no separation. My goal is to separate and sort them. Each of these numbers means something, but let's take the first three: 5553 is the player's rank, 2591 is the player's level, and 1287238956 is the player's experience points. How do I display this information in a format like this (like a table)?

Skill      Rank   Level   Experience
Overall    5553   2591    1287238956

Here is my
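
Since the values arrive in a fixed order (rank, level, experience per skill), the table is just a matter of splitting the response into groups of three and printing fixed-width columns. A minimal sketch, assuming a comma-separated response and a hypothetical skill-name list:

public class HiscoreTable {
    public static void main(String[] args) {
        String raw = "5553,2591,1287238956";   // crawled response (values from the question)
        String[] skills = { "Overall" };       // hypothetical: one label per group of three
        String[] parts = raw.split(",");

        System.out.printf("%-10s %-8s %-8s %-12s%n", "Skill", "Rank", "Level", "Experience");
        for (int i = 0; i + 2 < parts.length; i += 3) {
            System.out.printf("%-10s %-8s %-8s %-12s%n",
                    skills[i / 3], parts[i], parts[i + 1], parts[i + 2]);
        }
    }
}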

Nutch does not crawl multiple sites

喜你入骨 submitted on 2019-12-25 02:35:13
Question: I'm trying to crawl multiple sites using Nutch. My seed.txt looks like this:

http://1.a.b/
http://2.a.b/

and my regex-urlfilter.txt looks like this:

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs
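
The part of regex-urlfilter.txt that usually decides this case is the accept rule at the end, which the excerpt cuts off: if it is the stock single-domain template, at most one of the two seeds survives filtering. A sketch of a tail that admits both hosts (or, more loosely, everything not already rejected):

# accept each seed host explicitly
+^http://1\.a\.b/
+^http://2\.a\.b/

# ...or accept anything that was not rejected above:
# +.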

How can I fetch authenticated data from a school homepage?

依然范特西╮ submitted on 2019-12-25 02:25:18
Question: I want to crawl my authenticated data from my university homepage, and there are no API calls. Therefore, I have to send POST data like id and password to the server, but I cannot log in without clicking the login button. Below is the code of my university homepage:

<form action="./_login.php" method="post" autocomplete = "off" onSubmit="return comp()" name="login" >
<!--<form action="https://hisnet.handong.edu/login/_login.php" method="post" autocomplete = "off" onSubmit="return comp()" name="login" >-->
<!-
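
The onSubmit handler only runs in the browser; the server just receives a POST to _login.php, so a crawler can issue that request itself and keep the session cookie for later authenticated pages. A minimal HttpURLConnection sketch (the parameter names id and password are assumptions; the real names come from the form's <input> tags):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class SchoolLogin {
    public static void main(String[] args) throws Exception {
        // Form action taken from the page source; the field names below are
        // hypothetical -- read them off the real form's <input> elements.
        URL url = new URL("https://hisnet.handong.edu/login/_login.php");
        String body = "id=" + URLEncoder.encode("myid", "UTF-8")
                + "&password=" + URLEncoder.encode("mypassword", "UTF-8");

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes("UTF-8"));
        }
        // The session cookie is what later authenticated requests need.
        System.out.println("Status: " + conn.getResponseCode());
        System.out.println("Set-Cookie: " + conn.getHeaderField("Set-Cookie"));
    }
}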

Design Question for Notification System

╄→гoц情女王★ submitted on 2019-12-25 02:22:26
Question: The original post was made at https://stackoverflow.com/questions/6007097/design-question-for-notification-system. Here is more clarification of the problem: the purpose of the notification system is to get users notified (via email for now) when the content of the site has changed or been updated, or a new posting is made. This could be treated as a notification system where people define a rule or keyword for a 3rd-party site, and the notification system goes out, crawls the 3rd-party site, and creates a search inverted
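
The mention of an inverted index points at the cheap way to run the matching: index the users' keywords rather than the documents, tokenize each newly crawled page once, and look its tokens up, instead of testing every rule against every page. A rough Java sketch of that shape (all names illustrative; single-word keywords for simplicity):

import java.util.*;

public class KeywordNotifier {
    // inverted index: keyword -> users subscribed to it
    private final Map<String, Set<String>> subscribers = new HashMap<>();

    public void addRule(String user, String keyword) {
        subscribers.computeIfAbsent(keyword.toLowerCase(), k -> new HashSet<>()).add(user);
    }

    // Called for each page the crawler fetches; returns the users to notify.
    public Set<String> match(String crawledText) {
        Set<String> toNotify = new HashSet<>();
        for (String token : crawledText.toLowerCase().split("\\W+")) {
            Set<String> users = subscribers.get(token);
            if (users != null) toNotify.addAll(users);
        }
        return toNotify;   // hand off to the email sender
    }

    public static void main(String[] args) {
        KeywordNotifier n = new KeywordNotifier();
        n.addRule("alice@example.com", "nutch");
        System.out.println(n.match("New posting: Nutch 2.x released"));
    }
}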