scrapy

scrapy text encoding

无人久伴 submitted on 2019-12-17 17:29:10
Question: Here is my spider: from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from scrapy.selector import HtmlXPathSelector from vrisko.items import VriskoItem class vriskoSpider(CrawlSpider): name = 'vrisko' allowed_domains = ['vrisko.gr'] start_urls = ['http://www.vrisko.gr/search/%CE%B3%CE%B9%CE%B1%CF%84%CF%81%CE%BF%CF%82/%CE%BA%CE%BF%CF%81%CE%B4%CE%B5%CE%BB%CE%B9%CE%BF'] rules = (Rule(SgmlLinkExtractor(allow=('\?page=\d')),'parse
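The percent-encoded start URL above is UTF-8-encoded Greek text; "text encoding" questions like this one usually come down to the URL and page encodings. The standard library can round-trip such encodings, as this small sketch shows:

```python
from urllib.parse import quote, unquote

# One path segment from the start URL above (percent-encoded UTF-8 Greek).
encoded = "%CE%B3%CE%B9%CE%B1%CF%84%CF%81%CE%BF%CF%82"

decoded = unquote(encoded)   # decodes percent-escapes as UTF-8 by default
print(decoded)               # → γιατρος ("doctor")

# Re-encoding the decoded text reproduces the original URL segment.
print(quote(decoded) == encoded)  # → True
```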

python install lxml on mac os 10.10.1

你离开我真会死。 submitted on 2019-12-17 16:35:45
Question: I bought a new MacBook and I am new to Mac OS. I read a lot on the internet about how to install Scrapy and did everything, but I have a problem installing lxml. I ran pip install lxml in the terminal, a lot of stuff started downloading and much text was printed, but I got this error message in red: 1 error generated. error: command '/usr/bin/clang' failed with exit status 1 ---------------------------------------- Cleaning up... Command

Scrapy Python Set up User Agent

被刻印的时光 ゝ 提交于 submitted on 2019-12-17 15:53:42
Question: I tried to override the user agent of my CrawlSpider by adding an extra line to the project configuration file. Here is the code: [settings] default = myproject.settings USER_AGENT = "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36" [deploy] #url = http://localhost:6800/ project = myproject But when I run the crawler against my own site, I notice the spider did not pick up my customized user agent but the default one "Scrapy/0.18.2 (
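The file shown above is scrapy.cfg, which only tells Scrapy (and scrapyd) where the settings module lives; settings placed there are ignored. Per the Scrapy documentation, USER_AGENT belongs in the project's settings.py module instead:

```python
# myproject/settings.py  ("myproject" matches the package named in scrapy.cfg)
# USER_AGENT set here is picked up by the default UserAgentMiddleware.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"
)
```

This is a settings fragment rather than a runnable program; the same value can also be overridden per spider via a `custom_settings` dict.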

Notes on a Scrapy installation error on Mac OS

为君一笑 submitted on 2019-12-17 15:39:46
The situation: installing Scrapy for Python 3 with pip, using the command python3 -m pip install scrapy -i http://pypi.douban.com/simple --trusted-host pypi.douban.com, failed with an error that pointed at one keyword, "gcc": error: command 'gcc' failed with exit status 1. I tried the following: running (assuming Homebrew is already installed) brew search gcc5 in the terminal, which reported: Error: You have not agreed to the Xcode license. Please resolve this by running: sudo xcodebuild -license accept. Agree to the Xcode license? I opened Xcode on the machine, and sure enough it required accepting the license agreement before it could be used. After clicking accept and retrying the pip install, it completed successfully; there was no need to install or upgrade gcc as various online tutorials suggest. If you hit gcc errors with pip on a Mac, try opening Xcode and check whether it is a license-agreement problem. Source: CSDN. Author: iCheer-xu. Link: https://blog.csdn.net/qq_36071963/article/details/103578696

Get document DOCTYPE with BeautifulSoup

喜欢而已 submitted on 2019-12-17 14:01:58
Question: I've just started tinkering with Scrapy in conjunction with BeautifulSoup, and I'm wondering if I'm missing something very obvious, but I can't seem to figure out how to get the doctype of a returned HTML document from the resulting soup object. Given the following HTML: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"> <html lang="en"> <head> <meta charset=utf-8 /> <meta name="viewport" content="width=620" /> <title>HTML5 Demos and Examples</title>
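With BeautifulSoup, the doctype shows up as a bs4.element.Doctype node among soup.contents. The same information can also be captured with the standard library alone; a minimal sketch using html.parser (class and variable names here are illustrative):

```python
from html.parser import HTMLParser

class DoctypeSniffer(HTMLParser):
    """Records the <!DOCTYPE ...> declaration of a document, if any."""
    def __init__(self):
        super().__init__()
        self.doctype = None

    def handle_decl(self, decl):
        # Called once for <! ... > declarations; decl is the inner text.
        self.doctype = decl

html = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
        '"http://www.w3.org/TR/html4/strict.dtd"><html lang="en"></html>')
sniffer = DoctypeSniffer()
sniffer.feed(html)
print(sniffer.doctype)  # DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" ...
```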

Python Scrapy: Convert relative paths to absolute paths

自作多情 submitted on 2019-12-17 09:35:39
Question: I have amended the code based on solutions offered below by the great folks here; I get the error shown below the code. from scrapy.spider import BaseSpider from scrapy.selector import HtmlXPathSelector from scrapy.utils.response import get_base_url from scrapy.utils.url import urljoin_rfc from dmoz2.items import DmozItem class DmozSpider(BaseSpider): name = "namastecopy2" allowed_domains = ["namastefoods.com"] start_urls = [ "http://www.namastefoods.com/products/cgi-bin/products.cgi
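Note that scrapy.utils.url.urljoin_rfc was removed in later Scrapy releases; modern Scrapy offers response.urljoin(href) directly, which delegates to the standard library's urllib.parse.urljoin. A stdlib-only sketch (the relative path is a made-up example):

```python
from urllib.parse import urljoin

# Base URL taken from the spider above; the relative href is hypothetical.
base = "http://www.namastefoods.com/products/cgi-bin/products.cgi"
href = "../../images/logo.gif"

# urljoin resolves the relative path against the base per RFC 3986.
print(urljoin(base, href))  # → http://www.namastefoods.com/images/logo.gif
```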

Python crawling with the Scrapy framework: Request

白昼怎懂夜的黑 submitted on 2019-12-17 08:39:31
1. Request — Scrapy's HTTP request object.
1.1 The Request constructor. Ctrl+clicking in an IDE shows the source of scrapy.Request:
class Request(object_ref): def __init__(self, url, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding='utf-8', priority=0, dont_filter=False, errback=None, flags=None):
From the signature we can read off its parameters:
1. url: the URL of the page to request (the only required argument)
2. callback: the page-parsing callback (defaults to the spider's parse method)
3. method: the HTTP method, 'GET' by default
4. headers: a dict of request headers; a header whose value is None is not sent at all
5. body: the request body, as bytes or str
6. cookies: a dict of cookie values
7. meta: a dict of metadata that Scrapy copies onto the matching response, commonly used to pass data between callbacks
8. encoding: the encoding used for the url and body, 'utf-8' by default
9. priority: the request priority, 0 by default
10. dont_filter: False by default, so repeated requests to the same URL are dropped by the duplicate filter; set it to True to bypass the filter
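The meta parameter is easiest to understand as a dict that travels with the request and reappears on the response. A toy sketch of that hand-off, using plain dicts rather than Scrapy's actual classes (all names here are illustrative):

```python
# Plain-Python model of how request.meta reaches the next callback.
def make_request(url, callback, meta=None):
    return {"url": url, "callback": callback, "meta": meta or {}}

def fake_download(request):
    # Scrapy's downloader performs this hand-off: response.meta is request.meta.
    return {"url": request["url"], "meta": request["meta"]}

def parse(response):
    # The first callback stashes data for the second one via meta.
    req = make_request("http://example.com/detail", parse_detail,
                       meta={"category": "doctors"})
    return fake_download(req)

def parse_detail(response):
    # The second callback reads back what the first one stored.
    return response["meta"]["category"]

resp = parse({"url": "http://example.com", "meta": {}})
print(parse_detail(resp))  # → doctors
```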

Scrapy: Follow link to get additional Item data?

寵の児 submitted on 2019-12-17 07:11:46
Question: I don't have a specific code issue; I'm just not sure how to approach the following problem logistically with the Scrapy framework: The structure of the data I want to scrape is typically a table row for each item. Straightforward enough, right? Ultimately I want to scrape the Title, Due Date, and Details for each row. Title and Due Date are immediately available on the page... BUT the Details themselves aren't in the table -- but rather, a link to the page containing the details (if that
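The usual Scrapy answer is: the list-page callback builds a partial item, then follows the detail link and lets a second callback fill in the remaining field (in real Scrapy, by yielding a Request with the partial item in meta or cb_kwargs). A plain-Python sketch of that flow, with made-up data and no Scrapy import:

```python
# Two-callback pattern: parse_row extracts what the list page offers,
# parse_details completes the item from the linked page.
def parse_row(row, fetch):
    item = {"title": row["title"], "due_date": row["due_date"]}
    # "Follow" the link; in Scrapy the partial item rides along in meta.
    detail_page = fetch(row["details_url"])
    return parse_details(detail_page, item)

def parse_details(page, item):
    item["details"] = page["text"]
    return item

# Hypothetical stand-ins for a table row and a downloaded detail page:
row = {"title": "RFP 42", "due_date": "2013-01-01", "details_url": "/d/42"}
fetch = lambda url: {"text": "full details for " + url}
print(parse_row(row, fetch))
```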

Access django models inside of Scrapy

限于喜欢 submitted on 2019-12-17 07:00:51
Question: Is it possible to access my Django models inside a Scrapy pipeline, so that I can save my scraped data straight to my model? I've seen this, but I don't really get how to set it up. Answer 1: If anyone else is having the same problem, this is how I solved it. I added this to my Scrapy settings.py file: def setup_django_env(path): import imp, os from django.core.management import setup_environ f, filename, desc = imp.find_module('settings', [path]) project = imp.load_module('settings', f,
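Both imp and setup_environ in the snippet above are long deprecated; since Django 1.7 the supported approach is to point DJANGO_SETTINGS_MODULE at your settings and call django.setup() once, e.g. at the top of Scrapy's settings.py or pipelines.py. A sketch, with "myproject" as a placeholder for your actual Django project package (the django calls are commented out so the fragment stands alone):

```python
import os

# Tell Django where its settings live ("myproject" is a placeholder).
os.environ["DJANGO_SETTINGS_MODULE"] = "myproject.settings"

# import django
# django.setup()  # after this, `from myapp.models import MyModel`
#                 # works inside any Scrapy pipeline, and items can be
#                 # saved with MyModel.objects.create(**item)
```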

Click a Button in Scrapy

非 Y 不嫁゛ submitted on 2019-12-17 06:34:49
Question: I'm using Scrapy to crawl a webpage. Some of the information I need only pops up when you click a certain button (and of course appears in the HTML code after clicking). I found out that Scrapy can handle forms (like logins), as shown here. But the problem is that there is no form to fill out, so that's not exactly what I need. How can I simply click a button, which then shows the information I need? Do I have to use an external library like mechanize or lxml? Answer 1: Scrapy cannot interpret