web-crawler

Nutch 2.2.1 & HBase - Can I create a new property in nutch-site.xml

Question: I want to develop a topical web robot using Nutch 2.2.1, and I want to create a new property holding some topic keywords, like the following:

```xml
<property>
  <name>html.metatitle.keys</name>
  <value>movie,actor,firm</value>
  <description></description>
</property>
```

Answer 1: There are two different solutions available for your problem:

1. Implementing a customized HtmlParseFilter plugin to filter pages based on your desired keywords. For more information about Nutch extension points and writing a customized plugin for…
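Any property added to nutch-site.xml becomes visible to plugin code through Hadoop's Configuration object, which Nutch uses for all of its settings. A minimal sketch of reading the custom key list (the helper class is hypothetical; only the property name comes from the question):

```java
// Minimal sketch (not a full Nutch plugin): Nutch merges nutch-site.xml
// into a Hadoop Configuration, so a custom property such as
// html.metatitle.keys can be read with conf.get(). The helper class is
// hypothetical; only the property name comes from the question.
import org.apache.hadoop.conf.Configuration;

public class MetaTitleKeys {
    public static String[] load(Configuration conf) {
        // Second argument is the default when the property is missing.
        String raw = conf.get("html.metatitle.keys", "");
        return raw.isEmpty() ? new String[0] : raw.split(",");
    }
}
```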

How to use rvest for web crawling correctly?

Question: I am trying to crawl the page http://www.funda.nl/en/koop/leiden/ to get the maximum page number it can show, which is 29. I followed an online tutorial, located where 29 sits in the HTML, and wrote this R code:

```r
url <- read_html("http://www.funda.nl/en/koop/leiden/")
url %>%
  html_nodes("#pagination-number.pagination-last") %>%
  html_attr("data-pagination-page") %>%
  as.numeric()
```

However, what I got is numeric(0). If I remove as.numeric(), I get character(0). How is this done?

Answer 1: I believe that…
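numeric(0) / character(0) means the selector matched no nodes at all. A hedged sketch that instead selects every element carrying the data-pagination-page attribute and takes the maximum; the attribute name comes from the question, but the CSS selector is an assumption, as is the page still serving static (non-JavaScript) pagination markup:

```r
# Sketch: select every element that carries the data-pagination-page
# attribute and take the largest value. The attribute name comes from
# the question; the "[data-pagination-page]" selector is an assumption.
library(rvest)

page <- read_html("http://www.funda.nl/en/koop/leiden/")

pages <- page %>%
  html_nodes("[data-pagination-page]") %>%
  html_attr("data-pagination-page") %>%
  as.numeric()

max(pages, na.rm = TRUE)  # e.g. 29 if the attribute is populated
```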

Python crawler does not work properly

Question: I had just written a Python crawler to download MIDI files from freemidi.org. Looking at the request headers in Chrome, I found that the "Referer" attribute had to be https://freemidi.org/download-20225 (referred to as "download-20225" below) if the download page was https://freemidi.org/getter-20225 (referred to as "getter-20225" below) in order to download the MIDI file properly. I did so in Python, setting the header like this:

```python
headers = {
    'Referer': 'https://freemidi.org/download-20225',
    ...
```
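For reference, the usual way to send a custom Referer with the requests library is to pass a headers dict to the request itself; a minimal sketch (the URLs come from the question, while the User-Agent value and output filename are assumptions):

```python
# Sketch of the described approach with the requests library; the URLs
# come from the question, while the User-Agent value and the output
# filename are assumptions.
import requests

headers = {
    "Referer": "https://freemidi.org/download-20225",
    "User-Agent": "Mozilla/5.0",  # some sites also check this header
}

resp = requests.get("https://freemidi.org/getter-20225", headers=headers)
resp.raise_for_status()

with open("song.mid", "wb") as f:  # hypothetical output filename
    f.write(resp.content)
```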

Log in and submit a form with a web crawler

Question: In my web crawler I pass and submit form data like this:

```php
$client = new Client();
$crawler = $client->request('GET', 'link');
$form = $crawler->filter('.default')->form();
$crawler = $client->submit($form, array(
    'login'    => 'ud',
    'password' => 'pw'
));
```

But if I use var_dump($crawler); I realise that I never get data from the website after login, because the site redirects me and var_dump shows data from the page where I submitted. After logging in, I want to move to a new link and submit a form there: $client-…
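Assuming the Client above is Goutte's (a Symfony BrowserKit client), the cookie jar persists across requests, so after submit() the client can simply request() the next page as the logged-in user; a sketch with placeholder URLs:

```php
<?php
// Sketch, assuming the Client above is Goutte\Client (Symfony BrowserKit
// under the hood); the URLs are placeholders. BrowserKit keeps session
// cookies between calls, so a plain request() after submit() runs as the
// logged-in user.
require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();
$client->followRedirects(true);  // follow the redirect issued after login

$crawler = $client->request('GET', 'https://example.com/login');
$form    = $crawler->filter('.default')->form();
$crawler = $client->submit($form, ['login' => 'ud', 'password' => 'pw']);

// $crawler now holds the post-redirect page; move on to the next form.
$crawler = $client->request('GET', 'https://example.com/next-form');
```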

'str' object has no attribute 'p' using BeautifulSoup

Question: I have been following a tutorial on using BeautifulSoup, but when I try to read the title or even paragraphs (using soup.p) I get this error:

```
Traceback (most recent call last):
  File "*****/Tutorial1.py", line 9, in
    pTag = soup.p
AttributeError: 'str' object has no attribute 'p'
```

I am still very new to Python; sorry to bother if this is too easy an issue, but I will greatly appreciate any help. The code is given below:

```python
import urllib.request
from bs4 import BeautifulSoup

with urllib
```
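The message means soup is bound to a plain str rather than a BeautifulSoup object; a common cause in tutorials is reassigning something like soup = soup.prettify(), which returns a string. Since the question's code is cut off, here is a hedged reconstruction of what it likely intends (placeholder URL):

```python
# Hedged reconstruction of the cut-off tutorial script; the URL is a
# placeholder. soup.p only works while soup is a BeautifulSoup object --
# reassigning e.g. soup = soup.prettify() turns it into a str and raises
# exactly the AttributeError shown above.
import urllib.request
from bs4 import BeautifulSoup

with urllib.request.urlopen("https://example.com") as response:
    soup = BeautifulSoup(response.read(), "html.parser")

print(soup.title)  # the <title> tag
print(soup.p)      # the first <p> tag
```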

I need help making a website crawler using PHP [closed]

Question (closed 7 years ago as ambiguous and overly broad): I really want to make a website crawler that goes to a website, scans it for links, puts the links in a database, and moves on to another website. I found one website but the code was really buggy. If you have…
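For what it's worth, the loop being described is small enough to sketch with PHP's built-in DOMDocument; the start URL, the page cap, and the in-memory arrays standing in for the database are all illustrative assumptions:

```php
<?php
// Minimal sketch of the loop described above: fetch a page, pull out its
// links, record them, and move on. The start URL, the 50-page cap, and
// the in-memory arrays (standing in for a real database) are assumptions.
$queue   = ['https://example.com'];
$visited = [];

while (($url = array_shift($queue)) !== null && count($visited) < 50) {
    if (isset($visited[$url])) {
        continue;
    }
    $visited[$url] = true;                 // a real crawler would INSERT into a DB here

    $html = @file_get_contents($url);
    if ($html === false) {
        continue;
    }

    $dom = new DOMDocument();
    @$dom->loadHTML($html);                // suppress warnings on malformed HTML

    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if (strpos($href, 'http') === 0) { // keep absolute links only
            $queue[] = $href;
        }
    }
}
```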

Web crawler that can interpret JavaScript [closed]

Question (closed 3 years ago as needing more focus): I want to write a web crawler that can interpret JavaScript. Basically, it is a program in Java or PHP that takes a URL as input and outputs the DOM tree, similar to the output in the Firebug HTML window. The best example is Kayak.com, where you cannot see the resulting DOM…
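One Java option that matches this requirement is HtmlUnit, a headless browser that executes a page's JavaScript and then exposes the resulting DOM; a minimal sketch (HtmlUnit 2.x package names; the five-second wait is an arbitrary choice):

```java
// Sketch using HtmlUnit (2.x package names), a headless Java browser
// that runs the page's JavaScript before exposing the DOM; the target
// URL and the five-second wait are illustrative choices.
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class JsCrawler {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            webClient.getOptions().setJavaScriptEnabled(true);
            HtmlPage page = webClient.getPage("https://www.kayak.com");
            // Give asynchronous scripts a moment to mutate the DOM.
            webClient.waitForBackgroundJavaScript(5_000);
            System.out.println(page.asXml());  // the rendered DOM tree
        }
    }
}
```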

Nutch Crawler error: Permission denied

Question: I am trying to run a basic crawl. I got the command from the NutchTutorial (after doing all the presets):

```sh
bin/crawl urls -dir crawl -depth 3 -topN 5
```

I'm running on Windows, so I installed Cygwin64 as a running environment. I don't see any problems when I run bin/nutch from the Nutch home directory, but when I try to run the crawl as above I get the following error:

```
Injector: starting at 2014-11-29 11:31:35
Injector: crawlDb: -dir/crawldb
Injector: urlDir: urls
Injector: Converting injected…
```
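Apart from any Cygwin permission issue, the log itself hints at an argument problem: the injector treated -dir as a path (crawlDb: -dir/crawldb), and the bin/crawl script takes positional arguments rather than the -dir/-depth/-topN flags of the old bin/nutch crawl command. A hedged example of the positional form; the exact signature differs across Nutch versions, so run bin/crawl with no arguments to see yours:

```sh
# Sketch: positional arguments (seed dir, crawl dir, number of rounds)
# instead of flags; this matches recent 1.x bin/crawl scripts, but the
# exact signature varies by Nutch version.
bin/crawl urls crawl 3
```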

Apache Nutch crawler keeps retrieving only a single URL

Question: The INJECT step keeps retrieving only a single URL while trying to crawl CNN. I am on the default config (nutch-site.xml below). What could that be? Shouldn't it be 10 docs according to my value?

```xml
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>crawler1</value>
  </property>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
    <description>Default class for storing data</description>
  </property>
  <property>
    <name>solr.server.url</name>
    …
```
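Worth noting: the inject step only loads the URLs listed in the seed directory, so injecting a single URL usually means the seed file has a single line; topN-style limits govern the later generate/fetch rounds, not inject. A sketch, assuming a Nutch 2.x layout (the file path and URLs are illustrative):

```sh
# Sketch: inject loads exactly the URLs listed in the seed files, so a
# single injected URL usually means a one-line seed file. Paths and
# URLs here are illustrative.
mkdir -p urls
cat > urls/seed.txt <<'EOF'
http://www.cnn.com/
http://edition.cnn.com/
EOF

bin/nutch inject urls   # Nutch 2.x: inject <url_dir>
```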