web-crawler

Scrapy request not passing to callback when 301?

百般思念 submitted on 2019-12-21 22:24:47
Question: I'm trying to update a database full of links to external websites; for some reason, Scrapy skips the callback when the requested URL has moved (301).

    def start_requests(self):
        # ... database stuff
        for x in xrange(0, numrows):
            row = cur.fetchone()
            item = exampleItem()
            item['real_id'] = row[0]
            item['product_id'] = row[1]
            url = "http://www.example.com/a/-" + item['real_id'] + ".htm"
            log.msg("item %d request URL is %s" % (item['product_id'], url), log.INFO)  # shows right
            request =
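In Scrapy, a 301 response is normally consumed by the redirect middleware before it ever reaches the spider callback. A minimal sketch of two ways to let it through (a spider-level handle_httpstatus_list, or a per-request dont_redirect flag); the spider name, URL and item fields here are placeholders, not the asker's actual code:

    # Sketch: let a Scrapy spider see 301 responses instead of having the
    # RedirectMiddleware swallow them. Uses standard Scrapy APIs only.
    import scrapy

    class ProductSpider(scrapy.Spider):
        name = "products"
        # Option 1: treat 301/302 as statuses the callback should handle,
        # so the response is passed through instead of being redirected.
        handle_httpstatus_list = [301, 302]

        def start_requests(self):
            urls = ["http://www.example.com/a/-123.htm"]  # placeholder list
            for url in urls:
                yield scrapy.Request(
                    url,
                    callback=self.parse_product,
                    # Option 2: disable redirect handling for this request only.
                    meta={"dont_redirect": True},
                )

        def parse_product(self, response):
            # 301 responses now arrive here; the new URL is in the Location header.
            if response.status == 301:
                self.logger.info("Moved to %s", response.headers.get("Location"))
            yield {"url": response.url, "status": response.status}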

Nutch crawl: no error, but the result is nothing

家住魔仙堡 submitted on 2019-12-21 21:39:56
Question: I try to crawl some URLs with Nutch 2.1 as follows.

    bin/nutch crawl urls -dir crawl -depth 3 -topN 5

http://wiki.apache.org/nutch/NutchTutorial

There is no error, but the folders below are not created.

    crawl/crawldb
    crawl/linkdb
    crawl/segments

Can anyone help me? I have not resolved this trouble for two days. Thanks a lot! The output is as follows.

    FetcherJob: threads: 10
    FetcherJob: parsing: false
    FetcherJob: resuming: false
    FetcherJob : timelimit set for : -1
    Using queue mode : byHost

Comatose web crawler in R (w/ rvest)

别说谁变了你拦得住时间么 submitted on 2019-12-21 21:32:54
Question: I recently discovered the rvest package in R and decided to try out some web scraping. I wrote a small web crawler as a function so I could pipe it down to clean it up, etc. With a small URL list (e.g. 1-100) the function works fine; however, when a larger list is used, the function hangs at some point. It seems like one of the commands is waiting for a response but does not seem to get one, and does not result in an error.

    urlscrape<-function(url_list) {
      library(rvest)
      library(dplyr)
      assets<-NA
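The question is about R/rvest, but the usual cause of a crawl loop that hangs silently on a long URL list is a request issued without a timeout. Purely as an illustration, a sketch of the same kind of loop in Python with an explicit timeout and per-URL error handling; the URL list and the extracted field are invented for the example:

    # Sketch of a crawl loop that cannot hang indefinitely: every request gets
    # an explicit timeout, and failures are recorded instead of stopping the run.
    import requests
    from bs4 import BeautifulSoup

    def urlscrape(url_list, timeout_seconds=10):
        results = []
        for url in url_list:
            try:
                response = requests.get(url, timeout=timeout_seconds)
                response.raise_for_status()
                soup = BeautifulSoup(response.text, "html.parser")
                # The page <title> is just an illustrative thing to extract.
                results.append((url, soup.title.string if soup.title else None))
            except requests.RequestException as error:
                results.append((url, "failed: %s" % error))
        return results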

Building a web crawler - using Webkit packages

╄→尐↘猪︶ㄣ submitted on 2019-12-21 18:32:19
Question: I'm trying to build a web crawler. I need two things: (1) convert the HTML into a DOM object, and (2) execute existing JavaScript on demand. The result I expect is a DOM object where the JavaScript that runs on load has already executed. I also need an option to execute additional JavaScript on demand (on events like onMouseOver, onMouseClick, etc.). First of all, I couldn't find a good documentation source. I searched through the Webkit main page but couldn't find much information for users of the
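For what the question describes (a DOM after on-load scripts have run, plus on-demand script execution), one commonly used alternative to the Webkit C/C++ packages is to drive a headless browser from Python with Selenium. A hedged sketch, with the URL and the injected script purely illustrative:

    # Sketch: obtain the post-JavaScript DOM and run extra scripts on demand
    # by driving a headless browser with Selenium (not the Webkit C API the
    # question asks about).
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")           # no visible browser window
    driver = webdriver.Chrome(options=options)

    driver.get("http://www.example.com/")        # on-load scripts execute here
    dom_after_load = driver.page_source          # serialized DOM after load

    # Execute an additional script on demand, e.g. to simulate a click.
    driver.execute_script(
        "var a = document.querySelector('a'); if (a) { a.click(); }"
    )
    dom_after_click = driver.page_source

    driver.quit()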

Where is the crawled data stored when running nutch crawler?

不羁岁月 submitted on 2019-12-21 17:58:26
Question: I am new to Nutch. I need to crawl the web (say, a few hundred web pages), read the crawled data, and do some analysis. I followed the link https://wiki.apache.org/nutch/NutchTutorial (and integrated Solr, since I may need to search the text in the future) and ran the crawl using a few URLs as the seed. Now I can't find the text/html data on my local machine. Where can I find the data, and what is the best way to read it in text format? Versions: apache-nutch-1.9, solr-4.10.4

Answer 1: After your

How to best develop web crawlers

北慕城南 submitted on 2019-12-21 05:47:07
Question: I'm used to writing crawlers to compile information, and when I come across a website with the info I need, I start a new crawler specific to that site, using shell scripts most of the time and sometimes PHP. The way I do it is with a simple for loop to iterate over the page list, wget to download each page, and sed, tr, awk or other utilities to clean the page and grab the specific info I need. The whole process takes some time depending on the site, and more to download all the pages. And I often step into an AJAX
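Purely as an illustration of the workflow described above (iterate a page list, download each page, clean it, grab one field), here is roughly what it looks like in Python with requests and BeautifulSoup instead of wget plus sed/tr/awk; the URL pattern and CSS selector are invented for the sketch:

    # Sketch of the "iterate page list, download, clean, grab one field"
    # workflow from the question, done with requests + BeautifulSoup.
    # The URL pattern and selector are illustrative only.
    import csv
    import requests
    from bs4 import BeautifulSoup

    page_urls = ["http://www.example.com/list?page=%d" % n for n in range(1, 11)]

    with open("output.csv", "w", newline="") as handle:
        writer = csv.writer(handle)
        for url in page_urls:
            html = requests.get(url, timeout=10).text
            soup = BeautifulSoup(html, "html.parser")
            for row in soup.select("div.item"):      # illustrative selector
                writer.writerow([url, row.get_text(strip=True)])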

Web Crawler Algorithm: depth?

空扰寡人 submitted on 2019-12-21 05:32:06
Question: I'm working on a crawler and need to understand exactly what is meant by "link depth". Take Nutch, for example: http://wiki.apache.org/nutch/NutchTutorial says depth indicates the link depth from the root page that should be crawled. So, say I have the domain www.domain.com and want to crawl to a depth of, say, 3 -- what do I need to do? If a site could be represented as a binary tree, then it wouldn't be a problem, I think.

Answer 1: Link depth means the number of "hops" a page is away from the root,
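A minimal sketch of that idea: a breadth-first crawl where each page carries its hop count from the seed and outlinks are no longer followed once the configured depth is reached. Whether the seed counts as depth 0 or 1 varies between tools; the sketch treats it as 0, and get_links is an assumed helper that fetches a page and returns its outlinks:

    # Sketch: breadth-first crawl with a "link depth" limit, where depth is
    # the number of hops from the seed page (seed = depth 0 here).
    from collections import deque

    def crawl(seed_url, max_depth, get_links):
        seen = {seed_url}
        queue = deque([(seed_url, 0)])
        while queue:
            url, depth = queue.popleft()
            print("depth %d: %s" % (depth, url))
            if depth == max_depth:
                continue        # do not expand links from pages at max depth
            for link in get_links(url):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))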

Scraping text in h3 and div tags using beautifulSoup, Python

早过忘川 submitted on 2019-12-21 05:01:05
Question: I have no experience with Python, BeautifulSoup, Selenium, etc., but I'm eager to scrape data from a website and store it as a CSV file. A single sample of the data I need is coded as follows (a single row of data).

    <div class="box effect">
      <div class="row">
        <div class="col-lg-10">
          <h3>HEADING</h3>
          <div><i class="fa user"></i>  NAME</div>
          <div><i class="fa phone"></i>  MOBILE</div>
          <div><i class="fa mobile-phone fa-2"></i>   NUMBER</div>
          <div><i class="fa address"></i>   XYZ_ADDRESS</div>
          <div class=
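A hedged sketch of how such blocks could be turned into CSV rows with BeautifulSoup, assuming the page is static HTML (Selenium would only be needed if the boxes are rendered by JavaScript); the input/output file names are placeholders and the field order simply follows the <div> order in the sample:

    # Sketch: pull the heading and each icon-labelled field out of every
    # <div class="box effect"> block and write one CSV row per block.
    import csv
    from bs4 import BeautifulSoup

    with open("page.html", encoding="utf-8") as handle:   # or requests.get(url).text
        soup = BeautifulSoup(handle, "html.parser")

    rows = []
    for box in soup.select("div.box.effect"):
        heading = box.find("h3")
        # Each field in the sample sits in a <div> that starts with an <i> icon,
        # so the icon's parent holds the text (NAME, MOBILE, NUMBER, address, ...).
        fields = [icon.parent.get_text(strip=True) for icon in box.find_all("i")]
        rows.append([heading.get_text(strip=True) if heading else ""] + fields)

    with open("output.csv", "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)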

Equivalent of wget in Python to download website and resources

匆匆过客 submitted on 2019-12-21 04:51:09
Question: The same thing was asked 2.5 years ago in "Downloading a web page and all of its resource files in Python", but it didn't lead to an answer, and the 'please see related topic' reply isn't really asking the same thing. I want to download everything on a page so it can be viewed just from the files. The command

    wget --page-requisites --domains=DOMAIN --no-parent --html-extension --convert-links --restrict-file-names=windows

does exactly what I need. However, we want to be able to tie it in with other
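There is no single call in Python that reproduces that wget invocation, but its core (fetch a page, download the resources it references, rewrite the references to local copies) can be approximated. A rough sketch with requests and BeautifulSoup, handling only img/link/script references and none of the --no-parent or --domains filtering:

    # Rough sketch of the --page-requisites idea in Python: download a page,
    # save its referenced resources locally, and rewrite the references so the
    # page can be viewed offline. Deliberately minimal (no JS, no domain filters).
    import os
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    def mirror(page_url, out_dir="mirror"):
        os.makedirs(out_dir, exist_ok=True)
        soup = BeautifulSoup(requests.get(page_url, timeout=10).text, "html.parser")

        for tag, attr in (("img", "src"), ("link", "href"), ("script", "src")):
            for node in soup.find_all(tag):
                ref = node.get(attr)
                if not ref:
                    continue
                resource_url = urljoin(page_url, ref)
                local_name = os.path.basename(resource_url.split("?")[0]) or "resource"
                try:
                    data = requests.get(resource_url, timeout=10).content
                except requests.RequestException:
                    continue                     # skip resources that fail
                with open(os.path.join(out_dir, local_name), "wb") as handle:
                    handle.write(data)
                node[attr] = local_name          # rewrite link to the local copy

        with open(os.path.join(out_dir, "index.html"), "w", encoding="utf-8") as handle:
            handle.write(str(soup))

    mirror("http://www.example.com/")            # illustrative URL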