web-crawler

Nutch crawling stops after injector

拈花ヽ惹草 submitted on 2019-12-25 06:21:04
Question: Here is what my Cygwin screen looks like:

cygpath: can't convert empty path
Injector: starting at 2014-05-15 16:57:50
Injector: crawlDb: -dir/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Patch for HADOOP-7682: Instantiating workaround file system
Injector: total number of urls rejected by filters: 1
Injector: total number of urls injected after normalization and filtering: 0
Injector: Merging injected urls into crawl db.
Injector: overwrite: false
Injector:
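
The log explains the stall: one seed URL was rejected by the URL filters and zero were injected, so the crawl db is empty and every later phase has nothing to do. (The odd path in "crawlDb: -dir/crawldb" may also indicate the -dir argument was swallowed into the path.) The usual first check is that conf/regex-urlfilter.txt ends with an accept rule matching the seed and that the seed file is well formed. A minimal sketch, with example.com standing in for the real seed (hypothetical):

# urls/seed.txt -- one absolute URL per line, scheme included
http://example.com/

# conf/regex-urlfilter.txt -- the final accept rule must match the seed host
+^http://([a-z0-9]*\.)*example.com/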

Web crawling tools which support interacting with target sites before beginning to crawl

[亡魂溺海] submitted on 2019-12-25 05:38:28
Question: I am looking for a crawler which is capable of handling pages with Ajax and of performing certain user interactions with the target site before starting to crawl it (e.g., clicking on certain menu items, filling in some forms, etc.). I tried WebDriver/Selenium (which are really web scraping tools) and now I want to know if there is any crawler available that supports emulating certain user interactions before starting to crawl (in Java or Python or Ruby ...). Thanks. PS - Can
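
Pending a crawler with built-in interaction support, the two tools can be combined: Selenium performs the up-front clicks and form fills, and the rendered (post-Ajax) DOM is then handed to an ordinary link-queue crawler. A minimal Java sketch of that hand-off (the URL and element names are hypothetical):

import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

public class InteractThenCrawl {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("http://example.com/login");              // hypothetical target
            driver.findElement(By.name("user")).sendKeys("me");  // hypothetical field names
            driver.findElement(By.name("pass")).sendKeys("secret");
            driver.findElement(By.name("submit")).click();
            // After the interaction, harvest links from the rendered DOM
            // and hand them to a conventional crawler queue.
            List<WebElement> links = driver.findElements(By.tagName("a"));
            for (WebElement link : links) {
                String href = link.getAttribute("href");
                if (href != null) System.out.println(href);      // enqueue instead of print
            }
        } finally {
            driver.quit();
        }
    }
}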

Extending a basic web crawler to filter status codes and HTML

落花浮王杯 submitted on 2019-12-25 05:23:11
Question: I followed a tutorial on writing a basic web crawler in Java and have got something with basic functionality. At the moment it just retrieves the HTML from the site and prints it to the console. I was hoping to extend it so it can extract specifics like the HTML page title and the HTTP status code. I found this library: http://htmlparser.sourceforge.net/ ... which I think might be able to do the job for me, but could I do it without using an external library? Here's what I have so far:
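
Both items are reachable with the JDK alone: HttpURLConnection reports the status code directly, and a regex can pull a well-formed <title> (fragile on broken markup, which is where a parser library earns its keep). A minimal sketch against a hypothetical URL:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StatusAndTitle {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://example.com/");                // hypothetical target
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        int status = conn.getResponseCode();                     // the HTTP status code
        System.out.println("Status: " + status);

        StringBuilder html = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) html.append(line).append('\n');
        }
        // Good enough for a well-formed <title>; a real parser is more
        // robust against malformed markup.
        Matcher m = Pattern.compile("<title>(.*?)</title>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL).matcher(html);
        if (m.find()) System.out.println("Title: " + m.group(1).trim());
    }
}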

Crawl online directories and parse online PDF documents to extract text in Java

白昼怎懂夜的黑 submitted on 2019-12-25 03:43:02
Question: I need to be able to crawl an online directory such as, for example, http://svn.apache.org/repos/asf/, and whenever a PDF, DOCX, TXT, or ODT file comes up during the crawl, I need to be able to parse it and extract its text. I am using Files.walk in order to crawl around locally on my laptop, and the Apache Tika library to parse text, and it works just fine, but I don't really know how I can do the same in an online directory. Here's the code that goes through my PC and parses the
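
There is no Files.walk for an HTTP directory, but the same traversal falls out of fetching each listing page and following its links: recurse on entries ending in "/", and feed matching files to Tika, whose parseToString can read straight from a URL. A rough sketch (Java 9+ for readAllBytes; assumes the listing is simple <a href> HTML like the SVN browser emits):

import java.io.InputStream;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.tika.Tika;

public class OnlineDirCrawler {
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");
    private static final Tika TIKA = new Tika();

    static void crawl(String dirUrl) throws Exception {
        String listing;
        try (InputStream in = new URL(dirUrl).openStream()) {
            listing = new String(in.readAllBytes());             // Java 9+
        }
        Matcher m = HREF.matcher(listing);
        while (m.find()) {
            String name = m.group(1);
            if (name.startsWith("..") || name.startsWith("http")) continue; // relative links only
            if (name.endsWith("/")) {
                crawl(dirUrl + name);                            // recurse into subdirectory
            } else if (name.matches("(?i).*\\.(pdf|docx|txt|odt)$")) {
                System.out.println(TIKA.parseToString(new URL(dirUrl + name)));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        crawl("http://svn.apache.org/repos/asf/");               // huge tree; cap depth in practice
    }
}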

How to use HTMLAgilityPack to extract HTML data

感情迁移 submitted on 2019-12-25 03:15:55
Question: I am learning to write a web crawler and found some great examples to get me started, but since I am new to this, I have a few questions regarding the coding method. The search result, for example, can be found here: Search Result. When I look at the HTML source for the result I can see the following:

<HR><CENTER><H3>License Information *</H3></CENTER><HR>
<P>
<CENTER> 06/03/2014 </CENTER> <BR>
<B>Name : </B> WILLIAMS AJAYA L <BR>
<B>Address : </B> NEW YORK NY <BR>
<B>Profession : </B> ATHLETIC
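
The question targets HTMLAgilityPack (a .NET parser), but the extraction pattern is library-agnostic: each field label is a <B> element and the value is the bare text node that follows it, so select the labels and read their next siblings. Sketched here in Java with jsoup as a stand-in parser, not the library from the question (the URL is hypothetical):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class LicenseScraper {
    public static void main(String[] args) throws Exception {
        // Hypothetical: load the search-result page shown in the question.
        Document doc = Jsoup.connect("http://example.com/search-result").get();
        // Each label is a <B> element ("Name :", "Address :", ...);
        // the value is the text node immediately after it.
        for (Element label : doc.select("b")) {
            Node next = label.nextSibling();
            if (next instanceof TextNode) {
                System.out.println(label.text().replace(":", "").trim()
                        + " = " + ((TextNode) next).text().trim());
            }
        }
    }
}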

Nutch 2 with Cassandra as storage is not crawling data properly

我怕爱的太早我们不能终老 submitted on 2019-12-25 03:07:43
Question: I am using Nutch 2.x with Cassandra as storage. Currently I am crawling only one website, and the data is getting loaded into Cassandra in byte-code format. When I use the readdb command in Nutch, I do not get any useful crawl data. Below are the details of the different files and the output I am getting:

========== command to run crawler =====================
bin/crawl urls/ crawlDir/ http://localhost:8983/solr/ 3

======================== seed.txt data ==========================
http://www.ft.com

===
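
Seeing raw bytes in Cassandra is expected: Nutch 2.x serializes page fields through Gora, so the keyspace is not meant to be read directly; the human-readable view comes from dumping the web page store with readdb. A hedged sketch of the usual inspection step (option names vary between 2.x builds; run bin/nutch readdb with no arguments to see the exact usage for yours):

bin/nutch readdb -dump dump_dir -crawlId <your_crawl_id>
cat dump_dir/part-r-00000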

Sorting crawled information?

给你一囗甜甜゛ submitted on 2019-12-25 02:45:08
Question: Here is the result of the page I successfully crawled: The problem is that I've only been given numbers! There is no separation. My goal is to separate and sort them. Each of these numbers means something, but let's take the first three: 5553 is the player's rank, 2591 is the player's level, and 1287238956 is the player's experience points. How do I display this information in a format like this (like a table)?

Skill      Rank   Level   Experience
Overall    5553   2591    1287238956

Here is my
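
Since the values arrive in a fixed order (rank, level, experience per skill), the table is just a matter of splitting the response into groups of three and printing fixed-width columns. A minimal sketch, assuming a comma-separated response and a hypothetical skill-name list:

public class HiscoreTable {
    public static void main(String[] args) {
        String raw = "5553,2591,1287238956";   // crawled response (values from the question)
        String[] skills = { "Overall" };       // hypothetical: one label per group of three
        String[] parts = raw.split(",");

        System.out.printf("%-10s %-8s %-8s %-12s%n", "Skill", "Rank", "Level", "Experience");
        for (int i = 0; i + 2 < parts.length; i += 3) {
            System.out.printf("%-10s %-8s %-8s %-12s%n",
                    skills[i / 3], parts[i], parts[i + 1], parts[i + 2]);
        }
    }
}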

Nutch does not crawl multiple sites

喜你入骨 submitted on 2019-12-25 02:35:13
Question: I'm trying to crawl multiple sites using Nutch. My seed.txt looks like this:

http://1.a.b/
http://2.a.b/

and my regex-urlfilter.txt looks like this:

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs
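
The part of regex-urlfilter.txt that usually decides this case is the accept rule at the end, which the excerpt cuts off: if it is the stock single-domain template, at most one of the two seeds survives filtering. A sketch of a tail that admits both hosts (or, more loosely, everything not already rejected):

# accept each seed host explicitly
+^http://1\.a\.b/
+^http://2\.a\.b/

# ...or accept anything that was not rejected above:
# +.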

How can I fetch authenticated data from a school homepage?

依然范特西╮ submitted on 2019-12-25 02:25:18
Question: I want to crawl my authenticated data from my university homepage, and there are no API calls. Therefore, I have to send POST data like id and password to the server, but I cannot log in without clicking the login button. Below is the code of my university homepage:

<form action="./_login.php" method="post" autocomplete = "off" onSubmit="return comp()" name="login" >
<!--<form action="https://hisnet.handong.edu/login/_login.php" method="post" autocomplete = "off" onSubmit="return comp()" name="login" >-->
<!-
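
The onSubmit handler only runs in the browser; the server just receives a POST to _login.php, so a crawler can issue that request itself and keep the session cookie for later authenticated pages. A minimal HttpURLConnection sketch (the parameter names id and password are assumptions; the real names come from the form's <input> tags):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class SchoolLogin {
    public static void main(String[] args) throws Exception {
        // Form action taken from the page source; the field names below are
        // hypothetical -- read them off the real form's <input> elements.
        URL url = new URL("https://hisnet.handong.edu/login/_login.php");
        String body = "id=" + URLEncoder.encode("myid", "UTF-8")
                + "&password=" + URLEncoder.encode("mypassword", "UTF-8");

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes("UTF-8"));
        }
        // The session cookie is what later authenticated requests need.
        System.out.println("Status: " + conn.getResponseCode());
        System.out.println("Set-Cookie: " + conn.getHeaderField("Set-Cookie"));
    }
}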

Design Question for Notification System

╄→гoц情女王★ submitted on 2019-12-25 02:22:26
Question: The original post was made at https://stackoverflow.com/questions/6007097/design-question-for-notification-system. Here is more clarification of the problem: the purpose of the notification system is to get users notified (via email for now) when the content of the site has changed or been updated, or a new posting is made. This could be treated as a notification system where people define a rule or keyword for a 3rd-party site, and the notification system goes out, crawls the 3rd-party site, and creates a search inverted
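
The mention of an inverted index points at the cheap way to run the matching: index the users' keywords rather than the documents, tokenize each newly crawled page once, and look its tokens up, instead of testing every rule against every page. A rough Java sketch of that shape (all names illustrative; single-word keywords for simplicity):

import java.util.*;

public class KeywordNotifier {
    // inverted index: keyword -> users subscribed to it
    private final Map<String, Set<String>> subscribers = new HashMap<>();

    public void addRule(String user, String keyword) {
        subscribers.computeIfAbsent(keyword.toLowerCase(), k -> new HashSet<>()).add(user);
    }

    // Called for each page the crawler fetches; returns the users to notify.
    public Set<String> match(String crawledText) {
        Set<String> toNotify = new HashSet<>();
        for (String token : crawledText.toLowerCase().split("\\W+")) {
            Set<String> users = subscribers.get(token);
            if (users != null) toNotify.addAll(users);
        }
        return toNotify;   // hand off to the email sender
    }

    public static void main(String[] args) {
        KeywordNotifier n = new KeywordNotifier();
        n.addRule("alice@example.com", "nutch");
        System.out.println(n.match("New posting: Nutch 2.x released"));
    }
}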