web-crawler

UnicodeError: URL contains non-ASCII characters (Python 2.7)

Submitted by 时光怂恿深爱的人放手 on 2019-12-24 00:33:11
Question: So I've managed to make a crawler, and I'm searching for all links; when I arrive at a product link I run a few finds and take all the product information, but when it arrives at a certain page it gives a UnicodeError :/

    import urllib
    import urlparse
    from itertools import ifilterfalse
    from urllib2 import URLError, HTTPError
    from bs4 import BeautifulSoup

    urls = ["http://www.kiabi.es/"]
    visited = []

    def get_html_text(url):
        try:
            return urllib.urlopen(current_url).read()
        except (URLError,
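A common fix in Python 2 (not part of the original post) is to percent-encode the non-ASCII parts of a URL before handing it to urllib.urlopen. A minimal sketch, assuming the URL is a unicode string and the hostname itself is ASCII; to_ascii_url is a hypothetical helper name:

    import urllib
    import urlparse

    def to_ascii_url(url):
        # Percent-encode the path and query so urlopen() only ever sees ASCII.
        scheme, netloc, path, query, fragment = urlparse.urlsplit(url)
        path = urllib.quote(path.encode("utf-8"), safe="/%")
        query = urllib.quote(query.encode("utf-8"), safe="=&%")
        return urlparse.urlunsplit((scheme, netloc, path, query, fragment))

The crawler would then call urllib.urlopen(to_ascii_url(current_url)) instead of passing the raw URL.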

Element not found in the cache - perhaps the page has changed since it was looked up in Selenium Ruby web driver?

Submitted by 故事扮演 on 2019-12-23 22:33:13
Question: I am trying to write a crawler that crawls all links from the loaded page and logs all request and response headers, along with the response body, in some file, say XML or txt. I am opening all links from the first loaded page in a new browser window so I won't get this error:

    Element not found in the cache - perhaps the page has changed since it was looked up

I want to know what the alternate way could be to make requests and receive responses from all links, and then locate input elements and submit buttons
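One common way to avoid the stale-element problem altogether (a sketch, not from the original thread, and written in Python rather than the Ruby bindings the question uses) is to copy all hrefs into plain strings first and then fetch each URL with an ordinary HTTP client, so no WebElement is ever reused after the page changes:

    import requests
    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get("http://example.com/")  # placeholder start page

    # Copy the href attributes immediately; after this, no WebElement is touched again.
    # (Newer Selenium versions use driver.find_elements(By.TAG_NAME, "a") instead.)
    links = [a.get_attribute("href") for a in driver.find_elements_by_tag_name("a")]
    driver.quit()

    with open("headers.log", "w") as log:
        for url in links:
            if not url:
                continue
            resp = requests.get(url)
            log.write("URL: %s\n" % url)
            log.write("Request headers: %s\n" % resp.request.headers)
            log.write("Response headers: %s\n" % resp.headers)
            log.write("Body length: %d\n\n" % len(resp.content))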

Scrapy LinkExtractor - Limit the number of pages crawled per URL

Submitted by 天涯浪子 on 2019-12-23 20:24:20
Question: I am trying to limit the number of pages crawled per URL in a Scrapy CrawlSpider. I have a list of start_urls and I want to set a limit on the number of pages being crawled for each URL. Once the limit is reached, the spider should move on to the next start_url. I know there is the DEPTH_LIMIT setting, but this is not what I am looking for. Any help will be useful. Here is the code I currently have:

    class MySpider(CrawlSpider):
        name = 'test'
        allowed_domains = domainvarwebsite
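One possible approach (a sketch, not from the original question) is to keep a per-domain counter and drop further requests for a domain once its quota is reached, using the Rule's process_request hook. The limit, spider name, and URLs below are placeholders, and the import paths are the ones for Scrapy >= 1.0:

    from urlparse import urlparse  # urllib.parse on Python 3
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    PAGES_PER_DOMAIN = 20  # assumed limit

    class LimitedSpider(CrawlSpider):
        name = 'limited'
        start_urls = ['http://example.com/', 'http://example.org/']
        page_counts = {}

        rules = (
            Rule(LinkExtractor(), callback='parse_item', follow=True,
                 process_request='cap_per_domain'),
        )

        def cap_per_domain(self, request, response=None):
            # Returning None drops the request; otherwise count it and let it through.
            domain = urlparse(request.url).netloc
            if self.page_counts.get(domain, 0) >= PAGES_PER_DOMAIN:
                return None
            self.page_counts[domain] = self.page_counts.get(domain, 0) + 1
            return request

        def parse_item(self, response):
            yield {'url': response.url}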

Crawling dynamic content with scrapy

Submitted by 爷,独闯天下 on 2019-12-23 20:16:17
Question: I am trying to get the latest reviews from the Google Play store. I'm following this question for getting the latest reviews here. The method specified in the above link's answer works fine in the scrapy shell, but when I try it in my crawler it gets completely ignored. Code snippet:

    import re
    import sys
    import time
    import urllib
    import urlparse

    from scrapy import Spider
    from scrapy.spider import BaseSpider
    from scrapy.http import Request, FormRequest
    from scrapy.contrib.spiders import CrawlSpider, Rule
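The linked method is not reproduced in the snippet, so as a generic illustration only: issuing a POST from a spider callback with FormRequest usually looks like the sketch below. The endpoint, form fields, and spider name are placeholders, not the ones from the linked answer; the comments point at the usual reasons such a request is "completely ignored":

    from scrapy import Spider, FormRequest

    class ReviewSpider(Spider):
        name = 'reviews'
        start_urls = ['https://play.google.com/store/apps/details?id=com.example.app']

        def parse(self, response):
            # The FormRequest must be yielded (or returned); just constructing it does nothing.
            yield FormRequest(
                url='https://play.google.com/store/getreviews',      # placeholder endpoint
                formdata={'id': 'com.example.app', 'pagenum': '0'},  # placeholder fields
                callback=self.parse_reviews,
                dont_filter=True,  # keeps the offsite/duplicate filters from silently dropping it
            )

        def parse_reviews(self, response):
            self.logger.info('Got %d bytes of review data', len(response.body))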

How to assign specific sitemaps for specific crawler-bots in robots.txt?

Submitted by 时光毁灭记忆、已成空白 on 2019-12-23 17:14:57
Question: Since some crawlers don't like the sitemap versions made for Google, I made different sitemaps. There is an option to put Sitemap: http://example.com/sitemap.xml into robots.txt, but is it possible to put it kind of like this:

    User-agent: *
    Sitemap: http://example.com/sitemap.xml

    User-agent: googlebot
    Sitemap: http://example.com/sitemap-for-google.xml

I couldn't find any resource on this topic, and robots.txt is not something I want to joke around with.

Answer 1: This is not possible in robots.txt
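For context (the answer above is cut off): the Sitemap field in robots.txt is not scoped to a User-agent group; every Sitemap line applies to all crawlers. A robots.txt that lists both files therefore simply offers both to every bot, roughly like this:

    User-agent: *
    Disallow:

    Sitemap: http://example.com/sitemap.xml
    Sitemap: http://example.com/sitemap-for-google.xml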

Using Nutch, how to crawl the dynamic content of web pages that are using ajax?

Submitted by 馋奶兔 on 2019-12-23 15:46:14
Question: I am using Apache Nutch 1.10 to crawl web pages and extract the contents of each page. Some of the links contain dynamic content which is loaded by ajax calls, and Nutch is not able to crawl and extract that dynamic content. How can I solve this? Is there any solution? If yes, please help me with your answers. Thanks in advance.

Answer 1: Most web crawler libraries do not offer javascript rendering out of the box. You usually have to plug in another library or product that offers
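As a generic illustration of the "plug in a javascript renderer" idea (not Nutch-specific and not from the truncated answer), a headless browser can fetch a page after its ajax calls have run and hand the rendered HTML to the extraction step. A minimal Python sketch with Selenium; the URL and the fixed wait are placeholders:

    import time
    from selenium import webdriver

    options = webdriver.FirefoxOptions()
    options.add_argument('-headless')
    driver = webdriver.Firefox(options=options)

    driver.get('http://example.com/page-with-ajax')  # placeholder URL
    time.sleep(5)  # crude fixed wait; a real crawler would wait for a specific element
    rendered_html = driver.page_source  # HTML after the page's scripts have run
    driver.quit()

    print(len(rendered_html))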

Does Google's crawler index asynchronously loaded elements?

Submitted by 自古美人都是妖i on 2019-12-23 13:39:46
Question: I've built a widget for websites which is asynchronously loaded after the page is loaded:

    <html>
    <head>...</head>
    <body>
    <div>...</div>
    <script type="text/javascript">
    (function(){
        var ns = document.createElement("script");
        ns.type = "text/javascript";
        ns.async = true;
        ns.src = "http://mydomain.com/myjavascript.js";
        var s = document.getElementsByTagName("script")[0];
        s.parentNode.insertBefore(ns, s);
    })();
    </script>
    </body>
    </html>

Is there any way to notify Google's crawler to index the

Extract Span tag data using Jsoup

Submitted by 懵懂的女人 on 2019-12-23 11:54:22
Question: I am trying to extract specific content from HTML using Jsoup. Below is the sample HTML content:

    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    </head>
    <body class="">
    <div class="shop-section line bmargin10 tmargin10">
      <div class="price-section fksk-price-section unit">
        <div class="price-table">
          <div class="line" itemprop="offers" itemscope="" itemtype="http://schema.org/Offer">
            <div class="price-save">
              <span class="label-td"><span class="label fksk-label">Price :</span></span>
            </div>

How to exclude part of a web page from google's indexing?

Submitted by 馋奶兔 on 2019-12-23 08:07:32
Question: There is a way to exclude complete page(s) from Google's indexing, but is there a way to specifically exclude certain part(s) of a web page from Google's crawling? For example, to exclude the side-bar, which usually contains unrelated content?

Answer 1: You can put the part of the page that you want to hide from Googlebot inside an IFRAME tag and block indexing of the included file from the robots.txt file. Add the iframe to include the side-bar in your page:

    <iframe src ="sidebar.asp" width="100%"
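The answer's second step (blocking the included file) is not shown in the truncated snippet; in robots.txt it would look roughly like the sketch below, reusing the sidebar.asp name from the answer:

    User-agent: *
    Disallow: /sidebar.asp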

Increase number of threads in crawler

Submitted by 99封情书 on 2019-12-23 05:42:09
Question: This is the code taken from http://code.google.com/p/crawler4j/ and the name of this file is MyCrawler.java:

    public class MyCrawler extends WebCrawler {

        Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
                + "|png|tiff?|mid|mp2|mp3|mp4"
                + "|wav|avi|mov|mpeg|ram|m4v|pdf"
                + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

        /*
         * You should implement this function to specify
         * whether the given URL should be visited or not.
         */
        public boolean shouldVisit(WebURL url) {
            String href = url.getURL()