web-crawler

Scrapy request not passing to callback when 301?

百般思念 submitted on 2019-12-21 22:24:47
Question: I'm trying to update a database full of links to external websites; for some reason, Scrapy skips the callback when the requested URL has moved (301).

    def start_requests(self):
        # ... database stuff
        for x in xrange(0, numrows):
            row = cur.fetchone()
            item = exampleItem()
            item['real_id'] = row[0]
            item['product_id'] = row[1]
            url = "http://www.example.com/a/-" + item['real_id'] + ".htm"
            log.msg("item %d request URL is %s" % (item['product_id'], url), log.INFO)  # shows right
            request =
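In Scrapy, a 301 response is normally consumed by the redirect middleware before it ever reaches the spider callback. A minimal sketch of two ways to let it through (a spider-level handle_httpstatus_list, or a per-request dont_redirect flag); the spider name, URL and item fields here are placeholders, not the asker's actual code:

    # Sketch: let a Scrapy spider see 301 responses instead of having the
    # RedirectMiddleware swallow them. Uses standard Scrapy APIs only.
    import scrapy

    class ProductSpider(scrapy.Spider):
        name = "products"
        # Option 1: treat 301/302 as statuses the callback should handle,
        # so the response is passed through instead of being redirected.
        handle_httpstatus_list = [301, 302]

        def start_requests(self):
            urls = ["http://www.example.com/a/-123.htm"]  # placeholder list
            for url in urls:
                yield scrapy.Request(
                    url,
                    callback=self.parse_product,
                    # Option 2: disable redirect handling for this request only.
                    meta={"dont_redirect": True},
                )

        def parse_product(self, response):
            # 301 responses now arrive here; the new URL is in the Location header.
            if response.status == 301:
                self.logger.info("Moved to %s", response.headers.get("Location"))
            yield {"url": response.url, "status": response.status}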

Nutch crawl: no error, but the result is nothing

家住魔仙堡 submitted on 2019-12-21 21:39:56
Question: I try to crawl some URLs with Nutch 2.1 as follows.

    bin/nutch crawl urls -dir crawl -depth 3 -topN 5

http://wiki.apache.org/nutch/NutchTutorial

There is no error, but the folders below are not created.

    crawl/crawldb
    crawl/linkdb
    crawl/segments

Can anyone help me? I have not resolved this trouble for two days. Thanks a lot! The output is as follows.

    FetcherJob: threads: 10
    FetcherJob: parsing: false
    FetcherJob: resuming: false
    FetcherJob : timelimit set for : -1
    Using queue mode : byHost

Comatose web crawler in R (w/ rvest)

别说谁变了你拦得住时间么 submitted on 2019-12-21 21:32:54
Question: I recently discovered the rvest package in R and decided to try out some web scraping. I wrote a small web crawler as a function so I could pipe it down to clean it up, etc. With a small URL list (e.g. 1-100) the function works fine; however, when a larger list is used, the function hangs at some point. It seems like one of the commands is waiting for a response but does not seem to get one, and does not result in an error.

    urlscrape<-function(url_list) {
      library(rvest)
      library(dplyr)
      assets<-NA
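The question is about R/rvest, but the usual cause of a crawl loop that hangs silently on a long URL list is a request issued without a timeout. Purely as an illustration, a sketch of the same kind of loop in Python with an explicit timeout and per-URL error handling; the URL list and the extracted field are invented for the example:

    # Sketch of a crawl loop that cannot hang indefinitely: every request gets
    # an explicit timeout, and failures are recorded instead of stopping the run.
    import requests
    from bs4 import BeautifulSoup

    def urlscrape(url_list, timeout_seconds=10):
        results = []
        for url in url_list:
            try:
                response = requests.get(url, timeout=timeout_seconds)
                response.raise_for_status()
                soup = BeautifulSoup(response.text, "html.parser")
                # The page <title> is just an illustrative thing to extract.
                results.append((url, soup.title.string if soup.title else None))
            except requests.RequestException as error:
                results.append((url, "failed: %s" % error))
        return results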

Building a web crawler - using Webkit packages

╄→尐↘猪︶ㄣ submitted on 2019-12-21 18:32:19
Question: I'm trying to build a web crawler. I need two things: (1) convert the HTML into a DOM object, and (2) execute existing JavaScript on demand. The result I expect is a DOM object where the JavaScript that runs on load has already executed. I also need an option to execute additional JavaScript on demand (on events like onMouseOver, onMouseClick, etc.). First of all, I couldn't find a good documentation source. I searched through the Webkit main page but couldn't find much information for users of the
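For what the question describes (a DOM after on-load scripts have run, plus on-demand script execution), one commonly used alternative to the Webkit C/C++ packages is to drive a headless browser from Python with Selenium. A hedged sketch, with the URL and the injected script purely illustrative:

    # Sketch: obtain the post-JavaScript DOM and run extra scripts on demand
    # by driving a headless browser with Selenium (not the Webkit C API the
    # question asks about).
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")           # no visible browser window
    driver = webdriver.Chrome(options=options)

    driver.get("http://www.example.com/")        # on-load scripts execute here
    dom_after_load = driver.page_source          # serialized DOM after load

    # Execute an additional script on demand, e.g. to simulate a click.
    driver.execute_script(
        "var a = document.querySelector('a'); if (a) { a.click(); }"
    )
    dom_after_click = driver.page_source

    driver.quit()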

Where is the crawled data stored when running nutch crawler?

不羁岁月 submitted on 2019-12-21 17:58:26
Question: I am new to Nutch. I need to crawl the web (say, a few hundred web pages), read the crawled data, and do some analysis. I followed the link https://wiki.apache.org/nutch/NutchTutorial (and integrated Solr, since I may need to search the text in the future) and ran the crawl using a few URLs as the seed. Now I can't find the text/html data on my local machine. Where can I find the data, and what is the best way to read it in text format? Versions: apache-nutch-1.9, solr-4.10.4

Answer 1: After your

How to best develop web crawlers

北慕城南 submitted on 2019-12-21 05:47:07
Question: I'm used to writing crawlers to compile information, and when I come across a website with the info I need, I start a new crawler specific to that site, using shell scripts most of the time and sometimes PHP. The way I do it is with a simple for loop to iterate over the page list, wget to download each page, and sed, tr, awk or other utilities to clean the page and grab the specific info I need. The whole process takes some time depending on the site, and more to download all the pages. And I often step into an AJAX
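Purely as an illustration of the workflow described above (iterate a page list, download each page, clean it, grab one field), here is roughly what it looks like in Python with requests and BeautifulSoup instead of wget plus sed/tr/awk; the URL pattern and CSS selector are invented for the sketch:

    # Sketch of the "iterate page list, download, clean, grab one field"
    # workflow from the question, done with requests + BeautifulSoup.
    # The URL pattern and selector are illustrative only.
    import csv
    import requests
    from bs4 import BeautifulSoup

    page_urls = ["http://www.example.com/list?page=%d" % n for n in range(1, 11)]

    with open("output.csv", "w", newline="") as handle:
        writer = csv.writer(handle)
        for url in page_urls:
            html = requests.get(url, timeout=10).text
            soup = BeautifulSoup(html, "html.parser")
            for row in soup.select("div.item"):      # illustrative selector
                writer.writerow([url, row.get_text(strip=True)])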

Web Crawler Algorithm: depth?

空扰寡人 submitted on 2019-12-21 05:32:06
Question: I'm working on a crawler and need to understand exactly what is meant by "link depth". Take Nutch, for example: http://wiki.apache.org/nutch/NutchTutorial says depth indicates the link depth from the root page that should be crawled. So, say I have the domain www.domain.com and want to crawl to a depth of, say, 3 -- what do I need to do? If a site could be represented as a binary tree, then it wouldn't be a problem, I think.

Answer 1: Link depth means the number of "hops" a page is away from the root,
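A minimal sketch of that idea: a breadth-first crawl where each page carries its hop count from the seed and outlinks are no longer followed once the configured depth is reached. Whether the seed counts as depth 0 or 1 varies between tools; the sketch treats it as 0, and get_links is an assumed helper that fetches a page and returns its outlinks:

    # Sketch: breadth-first crawl with a "link depth" limit, where depth is
    # the number of hops from the seed page (seed = depth 0 here).
    from collections import deque

    def crawl(seed_url, max_depth, get_links):
        seen = {seed_url}
        queue = deque([(seed_url, 0)])
        while queue:
            url, depth = queue.popleft()
            print("depth %d: %s" % (depth, url))
            if depth == max_depth:
                continue        # do not expand links from pages at max depth
            for link in get_links(url):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))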

Scraping text in h3 and div tags using beautifulSoup, Python

早过忘川 submitted on 2019-12-21 05:01:05
Question: I have no experience with Python, BeautifulSoup, Selenium, etc., but I'm eager to scrape data from a website and store it as a CSV file. A single sample of the data I need is coded as follows (a single row of data).

    <div class="box effect">
      <div class="row">
        <div class="col-lg-10">
          <h3>HEADING</h3>
          <div><i class="fa user"></i>  NAME</div>
          <div><i class="fa phone"></i>  MOBILE</div>
          <div><i class="fa mobile-phone fa-2"></i>   NUMBER</div>
          <div><i class="fa address"></i>   XYZ_ADDRESS</div>
          <div class=
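A hedged sketch of how such blocks could be turned into CSV rows with BeautifulSoup, assuming the page is static HTML (Selenium would only be needed if the boxes are rendered by JavaScript); the input/output file names are placeholders and the field order simply follows the <div> order in the sample:

    # Sketch: pull the heading and each icon-labelled field out of every
    # <div class="box effect"> block and write one CSV row per block.
    import csv
    from bs4 import BeautifulSoup

    with open("page.html", encoding="utf-8") as handle:   # or requests.get(url).text
        soup = BeautifulSoup(handle, "html.parser")

    rows = []
    for box in soup.select("div.box.effect"):
        heading = box.find("h3")
        # Each field in the sample sits in a <div> that starts with an <i> icon,
        # so the icon's parent holds the text (NAME, MOBILE, NUMBER, address, ...).
        fields = [icon.parent.get_text(strip=True) for icon in box.find_all("i")]
        rows.append([heading.get_text(strip=True) if heading else ""] + fields)

    with open("output.csv", "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)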

Equivalent of wget in Python to download website and resources

匆匆过客 submitted on 2019-12-21 04:51:09
Question: The same thing was asked 2.5 years ago in "Downloading a web page and all of its resource files in Python", but it didn't lead to an answer, and the 'please see related topic' reply isn't really asking the same thing. I want to download everything on a page so it can be viewed just from the files. The command

    wget --page-requisites --domains=DOMAIN --no-parent --html-extension --convert-links --restrict-file-names=windows

does exactly what I need. However, we want to be able to tie it in with other
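There is no single call in Python that reproduces that wget invocation, but its core (fetch a page, download the resources it references, rewrite the references to local copies) can be approximated. A rough sketch with requests and BeautifulSoup, handling only img/link/script references and none of the --no-parent or --domains filtering:

    # Rough sketch of the --page-requisites idea in Python: download a page,
    # save its referenced resources locally, and rewrite the references so the
    # page can be viewed offline. Deliberately minimal (no JS, no domain filters).
    import os
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    def mirror(page_url, out_dir="mirror"):
        os.makedirs(out_dir, exist_ok=True)
        soup = BeautifulSoup(requests.get(page_url, timeout=10).text, "html.parser")

        for tag, attr in (("img", "src"), ("link", "href"), ("script", "src")):
            for node in soup.find_all(tag):
                ref = node.get(attr)
                if not ref:
                    continue
                resource_url = urljoin(page_url, ref)
                local_name = os.path.basename(resource_url.split("?")[0]) or "resource"
                try:
                    data = requests.get(resource_url, timeout=10).content
                except requests.RequestException:
                    continue                     # skip resources that fail
                with open(os.path.join(out_dir, local_name), "wb") as handle:
                    handle.write(data)
                node[attr] = local_name          # rewrite link to the local copy

        with open(os.path.join(out_dir, "index.html"), "w", encoding="utf-8") as handle:
            handle.write(str(soup))

    mirror("http://www.example.com/")            # illustrative URL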