scrapy

Installing Scrapy's various dependencies in a Windows virtualenv

跟風遠走 submitted on 2019-12-29 19:00:32
Scrapy's OpenSSL and libxml dependencies are all kinds of trouble on Windows. First resolve the OpenSSL dependency by following the official Scrapy installation guide, then install Windows builds of a few more packages based on the errors that pip install Scrapy reports, downloading them from each package's website or from PyPI. In short, Scrapy itself is not big, but it depends on a lot.

Installation summary: Scrapy depends on the following packages (transitive dependencies not listed):

    Scrapy==0.16.5
    Twisted==13.1.0
    lxml==3.0.1
    pyOpenSSL==0.13
    w3lib==1.3
    zope.interface==4.0.5

Because the install targets a virtualenv, and lxml is only distributed as an .exe installer that offers no choice of which Python environment to install into, install it into the main Python first, then copy the two lxml entries from site-packages into the virtualenv's corresponding directory to finish the install. Others, such as pyOpenSSL, provide an .msi that lets you choose the target Python environment, which is much nicer.

There is also the last-resort method: download the source and install it with easy_install, passing a compiler option such as --compiler=mingw32. That will do for now.

Update 2013-07-12: it turns out that besides double-clicking them, the .exe binary installers can also be run from cmd.
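For reference, a minimal sanity check (a sketch, assuming the copy into the virtualenv succeeded), run from inside the activated environment to confirm that the binary dependencies resolve:

    # Each import fails loudly if the corresponding Windows build was not
    # installed or copied into this environment correctly.
    import lxml.etree
    import OpenSSL
    import twisted
    import zope.interface

    print(lxml.etree.__file__)  # should point into the virtualenv's site-packages
    print(twisted.version)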

Python data scraping with Scrapy

戏子无情 submitted on 2019-12-29 13:31:27
Question: I want to scrape data from a website that has text fields, buttons, etc., and my requirement is to fill in the text fields and submit the form to get the results, then scrape the data points from the results page. I want to know whether Scrapy has this feature, or whether anyone can recommend a Python library to accomplish this task. (edited) I want to scrape the data from the following website: http://a836-acris.nyc.gov/DS/DocumentSearch/DocumentType My requirement is to select the values from
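Scrapy does support this via FormRequest. A minimal sketch (the form field names below are hypothetical, not the site's real ones):

    import scrapy

    class AcrisSpider(scrapy.Spider):
        name = "acris"
        start_urls = ["http://a836-acris.nyc.gov/DS/DocumentSearch/DocumentType"]

        def parse(self, response):
            # from_response pre-fills the fields it finds in the page's <form>
            # and lets formdata override the ones you care about.
            yield scrapy.FormRequest.from_response(
                response,
                formdata={"doc_type": "DEED"},  # hypothetical field/value
                callback=self.parse_results,
            )

        def parse_results(self, response):
            for row in response.css("table tr"):
                yield {"cells": row.css("td::text").extract()}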

Scrapy upload file

半腔热情 submitted on 2019-12-29 09:19:10
Question: I am making a form request to a website using Scrapy. The form requires uploading a PDF file; how can I do that in Scrapy? I am trying something like:

    FormRequest(url, callback=self.parseSearchResponse, method="POST",
                formdata={'filename': 'abc.xyz', 'file': 'path to file/abc.xyz'})

Answer 1: At this very moment Scrapy has no built-in support for uploading files. File uploading via forms in HTTP was specified in RFC 1867. According to the spec, an HTTP request with Content-Type: multipart/form-data is
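One workaround is to build the multipart/form-data body by hand and send it with a plain Request. A rough sketch of that RFC 1867-style encoding (not a Scrapy API, just manual body construction):

    import uuid
    from scrapy import Request

    def multipart_request(url, fields, files, callback):
        # fields maps field name -> value; files maps field name -> (filename, data)
        boundary = uuid.uuid4().hex
        lines = []
        for name, value in fields.items():
            lines += ["--" + boundary,
                      'Content-Disposition: form-data; name="%s"' % name,
                      "", value]
        for name, (filename, data) in files.items():
            lines += ["--" + boundary,
                      'Content-Disposition: form-data; name="%s"; filename="%s"'
                      % (name, filename),
                      "Content-Type: application/octet-stream",
                      "", data]
        lines += ["--" + boundary + "--", ""]
        return Request(url, method="POST", body="\r\n".join(lines),
                       headers={"Content-Type":
                                "multipart/form-data; boundary=" + boundary},
                       callback=callback)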

Scrapy import module items error

和自甴很熟 submitted on 2019-12-29 08:19:30
Question: My project structure:

    kmss/
    ├── kmss
    │   ├── __init__.py
    │   ├── items.py
    │   ├── pipelines.py
    │   ├── settings.py
    │   └── spiders
    │       ├── __init__.py
    │       └── first.py
    ├── README.rst
    ├── scrapy.cfg
    └── setup.py

I am running it on a Mac, and my project folder is created at /user/username/kmss. Within items.py I have a class named "KmssItem". When I run first.py (my spider), I have to import from items.py, which is at a higher level. I am having a problem with the following line
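For reference, the usual fix (a sketch, assuming the spider is started with scrapy crawl from the directory containing scrapy.cfg) is to import through the package path rather than as a bare module:

    # kmss/spiders/first.py
    from kmss.items import KmssItem

Running the spider file directly (python first.py) breaks this import, because the kmss package is then not on sys.path.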

Scrapy project can't find django.core.management

半世苍凉 submitted on 2019-12-29 08:09:05
Question: I'm trying to follow the method here to scrape data from the web and simultaneously save it directly to my Django database using Scrapy's item pipeline. However, when I try to run scrapy crawl spidername, I get the error: ImportError: No module named django.core.management. At first I thought it was because my Scrapy project was outside of my Django project folder, but even after I moved the whole project into my Django project folder I kept getting the same error. If I open a
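This particular error usually means the Python that runs Scrapy cannot import Django at all, for example because Scrapy and Django live in different virtualenvs. Once both are importable from the same interpreter, a common bootstrap at the top of the Scrapy project's settings.py looks like this (a sketch; the path and mysite.settings are placeholders for your Django project):

    import os
    import sys

    # Make the Django project importable from the Scrapy process, then point
    # Django at its settings module before any django.* import runs.
    sys.path.append("/path/to/django/project")  # placeholder
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "mysite.settings")  # placeholder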

How to make Selenium scripts faster?

送分小仙女□ submitted on 2019-12-29 06:36:23
Question: I use Python, Selenium, and Scrapy to crawl a website, but my script is very slow: Crawled 1 pages (at 1 pages/min). I use CSS selectors instead of XPath to optimise the time, and I changed the middleware:

    'tutorial.middlewares.MyCustomDownloaderMiddleware': 543,

Is Selenium just too slow, or should I change something in the settings? My code:

    def start_requests(self):
        yield Request(self.start_urls, callback=self.parse)

    def parse(self, response):
        display = Display(visible=0, size=(800, 600))
        display.start(
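Starting a virtual display (and, typically, a browser) inside every parse() call is usually the real cost. A sketch of the common fix, creating them once per spider and tearing them down at the end (assuming pyvirtualdisplay and Firefox, as in the question):

    from pyvirtualdisplay import Display
    from selenium import webdriver
    import scrapy

    class FastSpider(scrapy.Spider):
        name = "fast"
        start_urls = ["http://example.com"]  # placeholder

        def __init__(self, *args, **kwargs):
            super(FastSpider, self).__init__(*args, **kwargs)
            # One display and one browser for the whole crawl, not one per page.
            self.display = Display(visible=0, size=(800, 600))
            self.display.start()
            self.driver = webdriver.Firefox()

        def parse(self, response):
            self.driver.get(response.url)
            # ... extract with self.driver here ...

        def closed(self, reason):
            self.driver.quit()
            self.display.stop()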

Using loginform with scrapy

此生再无相见时 submitted on 2019-12-29 04:58:04
Question: The Scrapy framework (https://github.com/scrapy/scrapy) provides a library for logging into websites that require authentication, https://github.com/scrapy/loginform. I have looked through the docs for both, but I cannot figure out how to get Scrapy to call loginform before running. The login works fine with just loginform. Thanks. Answer 1: loginform is just a library, totally decoupled from Scrapy. You have to write the code to plug it into the spider you want, probably
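A sketch of that wiring, based on loginform's fill_login_form helper (it returns the form data, the submit URL, and the HTTP method extracted from the login page):

    from loginform import fill_login_form
    import scrapy

    class LoginSpider(scrapy.Spider):
        name = "login"
        start_urls = ["http://example.com/login"]  # placeholder

        def parse(self, response):
            # loginform inspects the HTML and fills in the credential fields.
            data, url, method = fill_login_form(
                response.url, response.text, "myuser", "mypass")
            yield scrapy.FormRequest(url, formdata=dict(data),
                                     method=method, callback=self.after_login)

        def after_login(self, response):
            pass  # scrape authenticated pages from here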

Pagination using scrapy

前提是你 submitted on 2019-12-29 04:28:12
Question: I'm trying to crawl this website: http://www.aido.com/eshop/cl_2-c_189-p_185/stationery/pens.html I can get all the products on this page, but how do I issue the request for the "View More" link at the bottom of the page? My code so far is:

    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths='//li[@class="normalLeft"]/div/a', unique=True)),
        Rule(SgmlLinkExtractor(restrict_xpaths='//div[@id="topParentChilds"]/div/div[@class="clm2"]/a', unique=True)),
        Rule(SgmlLinkExtractor(restrict_xpaths='//p[@class
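"View More" buttons are often wired to JavaScript, so a link extractor may not see a plain href. A sketch of the usual fallback, issuing the follow-up request manually from a parse callback (the XPath below is an assumption about the page, not its verified markup):

    import scrapy

    class PensSpider(scrapy.Spider):
        name = "pens"
        start_urls = ["http://www.aido.com/eshop/cl_2-c_189-p_185/stationery/pens.html"]

        def parse(self, response):
            # ... extract the products on the current page ...
            more = response.xpath('//a[contains(text(), "View More")]/@href').extract_first()
            if more:
                yield scrapy.Request(response.urljoin(more), callback=self.parse)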

scrapy log handler

删除回忆录丶 submitted on 2019-12-28 16:13:08
Question: I seek your help with the following two questions. First, how do I set the handler for the different log levels, as in Python? Currently, I have

    STATS_ENABLED = True
    STATS_DUMP = True
    LOG_FILE = 'crawl.log'

But the debug messages generated by Scrapy are also added to the log file. Those are very long, and ideally I would like the DEBUG-level messages to be left on standard error and the INFO messages to be dumped to my LOG_FILE. Secondly, the docs say: The logging service must be explicitly started
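A sketch of one way to split levels with the standard logging module (assuming a recent Scrapy, which routes its messages through stdlib logging; where exactly you install the handlers depends on your setup):

    import logging

    root = logging.getLogger()
    root.setLevel(logging.DEBUG)

    # INFO and above go to the file...
    file_handler = logging.FileHandler("crawl.log")
    file_handler.setLevel(logging.INFO)
    root.addHandler(file_handler)

    # ...while everything, DEBUG included, stays on stderr.
    stream_handler = logging.StreamHandler()
    stream_handler.setLevel(logging.DEBUG)
    root.addHandler(stream_handler)

With handlers installed this way, you would typically also drop LOG_FILE from the settings so Scrapy does not attach its own file handler on top.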

Parsing HTML with XPath, Python and Scrapy

给你一囗甜甜゛ submitted on 2019-12-28 06:50:08
Question: I am writing a Scrapy program to extract data. This is the URL, and I want to scrape the 20111028013117 (code) information. I took the XPath from the Firefox add-on XPather. This is the path:

    /html/body/p/table/tbody/tr/td/table[2]/tbody/tr[1]/td/table[3]/tbody/tr/td[2]/table[1]/tbody/tr/td/table/tbody/tr/td[2]/table[3]/tbody/tr/td/table/tbody/tr[2]/td[2]

When I try to execute this:

    try:
        temp_list = hxs.select("/html/body/p/table/tbody/tr/td/table[2]/tbody/tr[1]/td/table[3]/tbody/tr/td[2]
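A note on why such paths often fail: Firefox inserts <tbody> elements into tables when it builds the DOM, but the raw HTML that Scrapy downloads usually contains none, so every /tbody/ step matches nothing. A sketch of a shorter, content-anchored alternative using the old HtmlXPathSelector API from the question (the anchor text is an assumption):

    # Anchor on the cell's content instead of its absolute position,
    # and avoid the browser-injected tbody steps entirely.
    temp_list = hxs.select('//td[contains(text(), "20111028013117")]/text()').extract()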