web-crawler

Nutch 2.2.1 & HBase - Can I create a new property in nutch-site.xml

Question: I want to develop a topical web robot using Nutch 2.2.1, and I want to create a new property holding some topic keywords, like the following:

```xml
<property>
  <name>html.metatitle.keys</name>
  <value>movie,actor,firm</value>
  <description></description>
</property>
```

Answer 1: There are two different solutions available for your problem:

1. Implementing a customized HtmlParseFilter plugin to filter pages based on your desired keywords. For more information about Nutch extension points and writing a customized plugin for…
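Any property added to nutch-site.xml becomes visible to plugin code through Hadoop's Configuration object, which Nutch uses for all of its settings. A minimal sketch of reading the custom key list (the helper class is hypothetical; only the property name comes from the question):

```java
// Minimal sketch (not a full Nutch plugin): Nutch merges nutch-site.xml
// into a Hadoop Configuration, so a custom property such as
// html.metatitle.keys can be read with conf.get(). The helper class is
// hypothetical; only the property name comes from the question.
import org.apache.hadoop.conf.Configuration;

public class MetaTitleKeys {
    public static String[] load(Configuration conf) {
        // Second argument is the default when the property is missing.
        String raw = conf.get("html.metatitle.keys", "");
        return raw.isEmpty() ? new String[0] : raw.split(",");
    }
}
```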

How to use rvest for web crawling correctly?

Question: I am trying to crawl the page http://www.funda.nl/en/koop/leiden/ to get the maximum page number it can show, which is 29. I followed an online tutorial, located where 29 sits in the HTML, and wrote this R code:

```r
url <- read_html("http://www.funda.nl/en/koop/leiden/")
url %>%
  html_nodes("#pagination-number.pagination-last") %>%
  html_attr("data-pagination-page") %>%
  as.numeric()
```

However, what I got is numeric(0). If I remove as.numeric(), I get character(0). How is this done?

Answer 1: I believe that…
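numeric(0) / character(0) means the selector matched no nodes at all. A hedged sketch that instead selects every element carrying the data-pagination-page attribute and takes the maximum; the attribute name comes from the question, but the CSS selector is an assumption, as is the page still serving static (non-JavaScript) pagination markup:

```r
# Sketch: select every element that carries the data-pagination-page
# attribute and take the largest value. The attribute name comes from
# the question; the "[data-pagination-page]" selector is an assumption.
library(rvest)

page <- read_html("http://www.funda.nl/en/koop/leiden/")

pages <- page %>%
  html_nodes("[data-pagination-page]") %>%
  html_attr("data-pagination-page") %>%
  as.numeric()

max(pages, na.rm = TRUE)  # e.g. 29 if the attribute is populated
```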

Python crawler does not work properly

Question: I had just written a Python crawler to download MIDI files from freemidi.org. Looking at the request headers in Chrome, I found that the "Referer" attribute had to be https://freemidi.org/download-20225 (referred to as "download-20225" below) if the download page was https://freemidi.org/getter-20225 (referred to as "getter-20225" below) in order to download the MIDI file properly. I did so in Python, setting the header like this:

```python
headers = {
    'Referer': 'https://freemidi.org/download-20225',
    ...
```
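For reference, the usual way to send a custom Referer with the requests library is to pass a headers dict to the request itself; a minimal sketch (the URLs come from the question, while the User-Agent value and output filename are assumptions):

```python
# Sketch of the described approach with the requests library; the URLs
# come from the question, while the User-Agent value and the output
# filename are assumptions.
import requests

headers = {
    "Referer": "https://freemidi.org/download-20225",
    "User-Agent": "Mozilla/5.0",  # some sites also check this header
}

resp = requests.get("https://freemidi.org/getter-20225", headers=headers)
resp.raise_for_status()

with open("song.mid", "wb") as f:  # hypothetical output filename
    f.write(resp.content)
```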

Log in and submit a form with a web crawler

Question: In my web crawler I pass and submit form data like this:

```php
$client = new Client();
$crawler = $client->request('GET', 'link');
$form = $crawler->filter('.default')->form();
$crawler = $client->submit($form, array(
    'login'    => 'ud',
    'password' => 'pw'
));
```

But if I use var_dump($crawler); I realise that I never get data from the website after login, because the site redirects me and var_dump shows data from the page where I submitted. After logging in, I want to move to a new link and submit a form there: $client-…
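Assuming the Client above is Goutte's (a Symfony BrowserKit client), the cookie jar persists across requests, so after submit() the client can simply request() the next page as the logged-in user; a sketch with placeholder URLs:

```php
<?php
// Sketch, assuming the Client above is Goutte\Client (Symfony BrowserKit
// under the hood); the URLs are placeholders. BrowserKit keeps session
// cookies between calls, so a plain request() after submit() runs as the
// logged-in user.
require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();
$client->followRedirects(true);  // follow the redirect issued after login

$crawler = $client->request('GET', 'https://example.com/login');
$form    = $crawler->filter('.default')->form();
$crawler = $client->submit($form, ['login' => 'ud', 'password' => 'pw']);

// $crawler now holds the post-redirect page; move on to the next form.
$crawler = $client->request('GET', 'https://example.com/next-form');
```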

'str' object has no attribute 'p' using BeautifulSoup

Question: I have been following a tutorial on using BeautifulSoup, but when I try to read the title or even paragraphs (using soup.p) I get this error:

```
Traceback (most recent call last):
  File "*****/Tutorial1.py", line 9, in
    pTag = soup.p
AttributeError: 'str' object has no attribute 'p'
```

I am still very new to Python; sorry to bother if this is too easy an issue, but I will greatly appreciate any help. The code is given below:

```python
import urllib.request
from bs4 import BeautifulSoup

with urllib
```
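The message means soup is bound to a plain str rather than a BeautifulSoup object; a common cause in tutorials is reassigning something like soup = soup.prettify(), which returns a string. Since the question's code is cut off, here is a hedged reconstruction of what it likely intends (placeholder URL):

```python
# Hedged reconstruction of the cut-off tutorial script; the URL is a
# placeholder. soup.p only works while soup is a BeautifulSoup object --
# reassigning e.g. soup = soup.prettify() turns it into a str and raises
# exactly the AttributeError shown above.
import urllib.request
from bs4 import BeautifulSoup

with urllib.request.urlopen("https://example.com") as response:
    soup = BeautifulSoup(response.read(), "html.parser")

print(soup.title)  # the <title> tag
print(soup.p)      # the first <p> tag
```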

I need help making a website crawler using PHP [closed]

Question (closed 7 years ago as ambiguous and overly broad): I really want to make a website crawler that goes to a website, scans it for links, puts the links in a database, and moves on to another website. I found one website but the code was really buggy. If you have…
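For what it's worth, the loop being described is small enough to sketch with PHP's built-in DOMDocument; the start URL, the page cap, and the in-memory arrays standing in for the database are all illustrative assumptions:

```php
<?php
// Minimal sketch of the loop described above: fetch a page, pull out its
// links, record them, and move on. The start URL, the 50-page cap, and
// the in-memory arrays (standing in for a real database) are assumptions.
$queue   = ['https://example.com'];
$visited = [];

while (($url = array_shift($queue)) !== null && count($visited) < 50) {
    if (isset($visited[$url])) {
        continue;
    }
    $visited[$url] = true;                 // a real crawler would INSERT into a DB here

    $html = @file_get_contents($url);
    if ($html === false) {
        continue;
    }

    $dom = new DOMDocument();
    @$dom->loadHTML($html);                // suppress warnings on malformed HTML

    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if (strpos($href, 'http') === 0) { // keep absolute links only
            $queue[] = $href;
        }
    }
}
```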

Web crawler that can interpret JavaScript [closed]

Question (closed 3 years ago as needing more focus): I want to write a web crawler that can interpret JavaScript. Basically, it is a program in Java or PHP that takes a URL as input and outputs the DOM tree, similar to the output in the Firebug HTML window. The best example is Kayak.com, where you cannot see the resulting DOM…
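One Java option that matches this requirement is HtmlUnit, a headless browser that executes a page's JavaScript and then exposes the resulting DOM; a minimal sketch (HtmlUnit 2.x package names; the five-second wait is an arbitrary choice):

```java
// Sketch using HtmlUnit (2.x package names), a headless Java browser
// that runs the page's JavaScript before exposing the DOM; the target
// URL and the five-second wait are illustrative choices.
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class JsCrawler {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            webClient.getOptions().setJavaScriptEnabled(true);
            HtmlPage page = webClient.getPage("https://www.kayak.com");
            // Give asynchronous scripts a moment to mutate the DOM.
            webClient.waitForBackgroundJavaScript(5_000);
            System.out.println(page.asXml());  // the rendered DOM tree
        }
    }
}
```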

Nutch Crawler error: Permission denied

Question: I am trying to run a basic crawl. I got the command from the NutchTutorial (after doing all the presets):

```sh
bin/crawl urls -dir crawl -depth 3 -topN 5
```

I'm running on Windows, so I installed Cygwin64 as a running environment. I don't see any problems when I run bin/nutch from the Nutch home directory, but when I try to run the crawl as above I get the following error:

```
Injector: starting at 2014-11-29 11:31:35
Injector: crawlDb: -dir/crawldb
Injector: urlDir: urls
Injector: Converting injected…
```
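Apart from any Cygwin permission issue, the log itself hints at an argument problem: the injector treated -dir as a path (crawlDb: -dir/crawldb), and the bin/crawl script takes positional arguments rather than the -dir/-depth/-topN flags of the old bin/nutch crawl command. A hedged example of the positional form; the exact signature differs across Nutch versions, so run bin/crawl with no arguments to see yours:

```sh
# Sketch: positional arguments (seed dir, crawl dir, number of rounds)
# instead of flags; this matches recent 1.x bin/crawl scripts, but the
# exact signature varies by Nutch version.
bin/crawl urls crawl 3
```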

Apache Nutch crawler keeps retrieving only a single URL

Question: The INJECT step keeps retrieving only a single URL while trying to crawl CNN. I am on the default config (nutch-site.xml below). What could that be? Shouldn't it be 10 docs according to my value?

```xml
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>crawler1</value>
  </property>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
    <description>Default class for storing data</description>
  </property>
  <property>
    <name>solr.server.url</name>
    …
```
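Worth noting: the inject step only loads the URLs listed in the seed directory, so injecting a single URL usually means the seed file has a single line; topN-style limits govern the later generate/fetch rounds, not inject. A sketch, assuming a Nutch 2.x layout (the file path and URLs are illustrative):

```sh
# Sketch: inject loads exactly the URLs listed in the seed files, so a
# single injected URL usually means a one-line seed file. Paths and
# URLs here are illustrative.
mkdir -p urls
cat > urls/seed.txt <<'EOF'
http://www.cnn.com/
http://edition.cnn.com/
EOF

bin/nutch inject urls   # Nutch 2.x: inject <url_dir>
```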