scrapy

scrapy text encoding

无人久伴 submitted on 2019-12-17 17:29:10
Question: Here is my spider: from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from scrapy.selector import HtmlXPathSelector from vrisko.items import VriskoItem class vriskoSpider(CrawlSpider): name = 'vrisko' allowed_domains = ['vrisko.gr'] start_urls = ['http://www.vrisko.gr/search/%CE%B3%CE%B9%CE%B1%CF%84%CF%81%CE%BF%CF%82/%CE%BA%CE%BF%CF%81%CE%B4%CE%B5%CE%BB%CE%B9%CE%BF'] rules = (Rule(SgmlLinkExtractor(allow=('\?page=\d')),'parse
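The percent-encoded start URL above is UTF-8-encoded Greek text; "text encoding" questions like this one usually come down to the URL and page encodings. The standard library can round-trip such encodings, as this small sketch shows:

```python
from urllib.parse import quote, unquote

# One path segment from the start URL above (percent-encoded UTF-8 Greek).
encoded = "%CE%B3%CE%B9%CE%B1%CF%84%CF%81%CE%BF%CF%82"

decoded = unquote(encoded)   # decodes percent-escapes as UTF-8 by default
print(decoded)               # → γιατρος ("doctor")

# Re-encoding the decoded text reproduces the original URL segment.
print(quote(decoded) == encoded)  # → True
```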

python install lxml on mac os 10.10.1

你离开我真会死。 submitted on 2019-12-17 16:35:45
Question: I bought a new MacBook and I am new to Mac OS. I read a lot on the internet about how to install Scrapy and did everything, but I have a problem installing lxml. I ran pip install lxml in the terminal, a lot of stuff started downloading and much text was printed, but I got this error message in red: 1 error generated. error: command '/usr/bin/clang' failed with exit status 1 ---------------------------------------- Cleaning up... Command

Scrapy Python Set up User Agent

被刻印的时光 ゝ 提交于 submitted on 2019-12-17 15:53:42
Question: I tried to override the user agent of my CrawlSpider by adding an extra line to the project configuration file. Here is the code: [settings] default = myproject.settings USER_AGENT = "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36" [deploy] #url = http://localhost:6800/ project = myproject But when I run the crawler against my own site, I notice the spider did not pick up my customized user agent but the default one "Scrapy/0.18.2 (
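The file shown above is scrapy.cfg, which only tells Scrapy (and scrapyd) where the settings module lives; settings placed there are ignored. Per the Scrapy documentation, USER_AGENT belongs in the project's settings.py module instead:

```python
# myproject/settings.py  ("myproject" matches the package named in scrapy.cfg)
# USER_AGENT set here is picked up by the default UserAgentMiddleware.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"
)
```

This is a settings fragment rather than a runnable program; the same value can also be overridden per spider via a `custom_settings` dict.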

Notes on a Scrapy installation error on Mac OS

为君一笑 submitted on 2019-12-17 15:39:46
The situation: installing Scrapy for Python 3 with pip, using the command python3 -m pip install scrapy -i http://pypi.douban.com/simple --trusted-host pypi.douban.com, failed with an error that pointed at one keyword, "gcc": error: command 'gcc' failed with exit status 1. I tried the following: running (assuming Homebrew is already installed) brew search gcc5 in the terminal, which reported: Error: You have not agreed to the Xcode license. Please resolve this by running: sudo xcodebuild -license accept. Agree to the Xcode license? I opened Xcode on the machine, and sure enough it required accepting the license agreement before it could be used. After clicking accept and retrying the pip install, it completed successfully; there was no need to install or upgrade gcc as various online tutorials suggest. If you hit gcc errors with pip on a Mac, try opening Xcode and check whether it is a license-agreement problem. Source: CSDN. Author: iCheer-xu. Link: https://blog.csdn.net/qq_36071963/article/details/103578696

Get document DOCTYPE with BeautifulSoup

喜欢而已 submitted on 2019-12-17 14:01:58
Question: I've just started tinkering with Scrapy in conjunction with BeautifulSoup, and I'm wondering if I'm missing something very obvious, but I can't seem to figure out how to get the doctype of a returned HTML document from the resulting soup object. Given the following HTML: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"> <html lang="en"> <head> <meta charset=utf-8 /> <meta name="viewport" content="width=620" /> <title>HTML5 Demos and Examples</title>
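With BeautifulSoup, the doctype shows up as a bs4.element.Doctype node among soup.contents. The same information can also be captured with the standard library alone; a minimal sketch using html.parser (class and variable names here are illustrative):

```python
from html.parser import HTMLParser

class DoctypeSniffer(HTMLParser):
    """Records the <!DOCTYPE ...> declaration of a document, if any."""
    def __init__(self):
        super().__init__()
        self.doctype = None

    def handle_decl(self, decl):
        # Called once for <! ... > declarations; decl is the inner text.
        self.doctype = decl

html = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
        '"http://www.w3.org/TR/html4/strict.dtd"><html lang="en"></html>')
sniffer = DoctypeSniffer()
sniffer.feed(html)
print(sniffer.doctype)  # DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" ...
```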

Python Scrapy: Convert relative paths to absolute paths

自作多情 submitted on 2019-12-17 09:35:39
Question: I have amended the code based on solutions offered below by the great folks here; I get the error shown below the code. from scrapy.spider import BaseSpider from scrapy.selector import HtmlXPathSelector from scrapy.utils.response import get_base_url from scrapy.utils.url import urljoin_rfc from dmoz2.items import DmozItem class DmozSpider(BaseSpider): name = "namastecopy2" allowed_domains = ["namastefoods.com"] start_urls = [ "http://www.namastefoods.com/products/cgi-bin/products.cgi
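Note that scrapy.utils.url.urljoin_rfc was removed in later Scrapy releases; modern Scrapy offers response.urljoin(href) directly, which delegates to the standard library's urllib.parse.urljoin. A stdlib-only sketch (the relative path is a made-up example):

```python
from urllib.parse import urljoin

# Base URL taken from the spider above; the relative href is hypothetical.
base = "http://www.namastefoods.com/products/cgi-bin/products.cgi"
href = "../../images/logo.gif"

# urljoin resolves the relative path against the base per RFC 3986.
print(urljoin(base, href))  # → http://www.namastefoods.com/images/logo.gif
```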

Python crawling with the Scrapy framework: Request

白昼怎懂夜的黑 submitted on 2019-12-17 08:39:31
1. Request — Scrapy's HTTP request object.
1.1 The Request constructor. Ctrl+clicking in an IDE shows the source of scrapy.Request:
class Request(object_ref): def __init__(self, url, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding='utf-8', priority=0, dont_filter=False, errback=None, flags=None):
From the signature we can read off its parameters:
1. url: the URL of the page to request (the only required argument)
2. callback: the page-parsing callback (defaults to the spider's parse method)
3. method: the HTTP method, 'GET' by default
4. headers: a dict of request headers; a header whose value is None is not sent at all
5. body: the request body, as bytes or str
6. cookies: a dict of cookie values
7. meta: a dict of metadata that Scrapy copies onto the matching response, commonly used to pass data between callbacks
8. encoding: the encoding used for the url and body, 'utf-8' by default
9. priority: the request priority, 0 by default
10. dont_filter: False by default, so repeated requests to the same URL are dropped by the duplicate filter; set it to True to bypass the filter
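The meta parameter is easiest to understand as a dict that travels with the request and reappears on the response. A toy sketch of that hand-off, using plain dicts rather than Scrapy's actual classes (all names here are illustrative):

```python
# Plain-Python model of how request.meta reaches the next callback.
def make_request(url, callback, meta=None):
    return {"url": url, "callback": callback, "meta": meta or {}}

def fake_download(request):
    # Scrapy's downloader performs this hand-off: response.meta is request.meta.
    return {"url": request["url"], "meta": request["meta"]}

def parse(response):
    # The first callback stashes data for the second one via meta.
    req = make_request("http://example.com/detail", parse_detail,
                       meta={"category": "doctors"})
    return fake_download(req)

def parse_detail(response):
    # The second callback reads back what the first one stored.
    return response["meta"]["category"]

resp = parse({"url": "http://example.com", "meta": {}})
print(parse_detail(resp))  # → doctors
```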

Scrapy: Follow link to get additional Item data?

寵の児 submitted on 2019-12-17 07:11:46
Question: I don't have a specific code issue; I'm just not sure how to approach the following problem logistically with the Scrapy framework: The structure of the data I want to scrape is typically a table row for each item. Straightforward enough, right? Ultimately I want to scrape the Title, Due Date, and Details for each row. Title and Due Date are immediately available on the page... BUT the Details themselves aren't in the table -- but rather, a link to the page containing the details (if that
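The usual Scrapy answer is: the list-page callback builds a partial item, then follows the detail link and lets a second callback fill in the remaining field (in real Scrapy, by yielding a Request with the partial item in meta or cb_kwargs). A plain-Python sketch of that flow, with made-up data and no Scrapy import:

```python
# Two-callback pattern: parse_row extracts what the list page offers,
# parse_details completes the item from the linked page.
def parse_row(row, fetch):
    item = {"title": row["title"], "due_date": row["due_date"]}
    # "Follow" the link; in Scrapy the partial item rides along in meta.
    detail_page = fetch(row["details_url"])
    return parse_details(detail_page, item)

def parse_details(page, item):
    item["details"] = page["text"]
    return item

# Hypothetical stand-ins for a table row and a downloaded detail page:
row = {"title": "RFP 42", "due_date": "2013-01-01", "details_url": "/d/42"}
fetch = lambda url: {"text": "full details for " + url}
print(parse_row(row, fetch))
```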

Access django models inside of Scrapy

限于喜欢 submitted on 2019-12-17 07:00:51
Question: Is it possible to access my Django models inside a Scrapy pipeline, so that I can save my scraped data straight to my model? I've seen this, but I don't really get how to set it up. Answer 1: If anyone else is having the same problem, this is how I solved it. I added this to my Scrapy settings.py file: def setup_django_env(path): import imp, os from django.core.management import setup_environ f, filename, desc = imp.find_module('settings', [path]) project = imp.load_module('settings', f,
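Both imp and setup_environ in the snippet above are long deprecated; since Django 1.7 the supported approach is to point DJANGO_SETTINGS_MODULE at your settings and call django.setup() once, e.g. at the top of Scrapy's settings.py or pipelines.py. A sketch, with "myproject" as a placeholder for your actual Django project package (the django calls are commented out so the fragment stands alone):

```python
import os

# Tell Django where its settings live ("myproject" is a placeholder).
os.environ["DJANGO_SETTINGS_MODULE"] = "myproject.settings"

# import django
# django.setup()  # after this, `from myapp.models import MyModel`
#                 # works inside any Scrapy pipeline, and items can be
#                 # saved with MyModel.objects.create(**item)
```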

Click a Button in Scrapy

非 Y 不嫁゛ submitted on 2019-12-17 06:34:49
Question: I'm using Scrapy to crawl a webpage. Some of the information I need only pops up when you click a certain button (and of course appears in the HTML code after clicking). I found out that Scrapy can handle forms (like logins), as shown here. But the problem is that there is no form to fill out, so that's not exactly what I need. How can I simply click a button, which then shows the information I need? Do I have to use an external library like mechanize or lxml? Answer 1: Scrapy cannot interpret