scrapy

One spider with two different URLs and two parse methods using Scrapy

不羁岁月 submitted on 2019-12-25 00:14:36

Question: Hi, I have two different domains, each needing a different approach, running in one spider. I have tried the code below but nothing works; any ideas, please? class SalesitemSpiderSpider(scrapy.Spider): name = 'salesitem_spider' allowed_domains = ['www2.hm.com', 'www.forever21.com'] url = ['https://www.forever21.com/eu/shop/Catalog/GetProducts', 'https://www2.hm.com/en_us/sale/shopbyproductladies/view-all.html?sort=stock&image-size=small&image=stillLife&offset=0&page-size=20'] #Json Payload code here def start
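One common pattern for a two-site spider is to keep both start URLs but route each one to its own callback. A minimal sketch of the dispatch logic, assuming per-site callbacks named parse_forever21 and parse_hm (names invented here, not from the question):

```python
from urllib.parse import urlparse

# Hypothetical mapping from domain to callback name; inside the spider,
# start_requests() would yield
#   scrapy.Request(url, callback=getattr(self, callback_for(url)))
CALLBACKS = {
    "www.forever21.com": "parse_forever21",
    "www2.hm.com": "parse_hm",
}

def callback_for(url):
    """Pick the parse callback for a start URL by its domain."""
    return CALLBACKS[urlparse(url).netloc]
```

This keeps the two scraping approaches in separate methods instead of branching inside a single parse().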

Scrapy linkextractor ignores parameters behind the sign # and thus will not follow the link

≡放荡痞女 submitted on 2019-12-24 23:55:21

Question: I am trying to crawl a website with Scrapy where the pagination sits behind the "#" sign. This somehow makes Scrapy ignore everything after that character, so it only ever sees the first page, e.g.: http://www.rolex.de/de/watches/find-rolex.html#g=1&p=2 If you enter a question mark manually instead, the site will load the page: http://www.rolex.de/de/watches/find-rolex.html?p=2 The stats from Scrapy tell me it fetched the first page: DEBUG: Crawled (200) http://www.rolex.de/de/watches/datejust
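This is expected HTTP behavior: the fragment after "#" is never sent to the server, so Scrapy (like any client) requests the same URL for every "page". Since this site apparently also accepts the parameters as a query string, one workaround is to rewrite the URLs before requesting them; a sketch, assuming the "?" form really serves the intended page:

```python
from urllib.parse import urlsplit, urlunsplit

def fragment_to_query(url):
    """Move a URL's #fragment into its query string.

    '...find-rolex.html#g=1&p=2' and '...find-rolex.html' are the same
    HTTP request; rewriting to '...find-rolex.html?g=1&p=2' makes each
    page a distinct, crawlable URL.
    """
    scheme, netloc, path, query, fragment = urlsplit(url)
    if fragment and not query:
        query, fragment = fragment, ""
    return urlunsplit((scheme, netloc, path, query, fragment))
```

If the site only renders those pages via JavaScript, a browser-driven approach (e.g. Selenium or Splash) is needed instead.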

No module named CFFI from FFI

淺唱寂寞╮ submitted on 2019-12-24 23:34:18

Question: I am very new to Python. I installed Scrapy and it installed properly, but when I try to run it using the scrapy command it says "No module named CFFI from FFI". Any help, please? Answer 1: Run the Python IDLE shell and type import cffi. If the import succeeds, CFFI is actually installed and the issue is something else. The easiest way to install Python modules is pip install cffi. If you already have it installed, upgrade it with the pip command. Source: https://stackoverflow.com/questions
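The same check the answer suggests can be done without opening IDLE; a small sketch:

```python
import importlib.util

def cffi_installed():
    """Return True when the cffi module can be located on this
    interpreter's path (i.e. it is installed for *this* Python)."""
    return importlib.util.find_spec("cffi") is not None
```

A False result while pip reports cffi as installed usually means pip and the scrapy command are bound to different Python interpreters.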

python scrapy how to use BaseDupeFilter

一曲冷凌霜 submitted on 2019-12-24 23:13:13

Question: I have a website with many pages like this: mywebsite/?page=1 mywebsite/?page=2 ... ... ... Each page has links to players, and clicking any link takes you to that player's page. Users can add players, so I end up with this situation: Player1 has a link on page=1 and Player10 has a link on page=2; an hour later, because users have added new players, Player1 has a link on page=3, Player10 has a link on page=4, and the new players like
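Because the listings shift as players are added, URL-based deduplication (what Scrapy's default dupefilter, a BaseDupeFilter subclass, does) uses the wrong key here. One approach is to request listing pages with dont_filter=True so they are always recrawled, and deduplicate player pages by a stable player id instead; a sketch, under the assumption that player URLs contain a /player/<id> segment:

```python
import re

# Hypothetical URL pattern; adapt to the real player-page URLs.
PLAYER_RE = re.compile(r"/player/(\d+)")

def should_visit(url, seen_ids):
    """Return True for listing pages and unseen player pages;
    record each player id the first time it is seen."""
    match = PLAYER_RE.search(url)
    if match is None:          # not a player page: let Scrapy decide
        return True
    player_id = match.group(1)
    if player_id in seen_ids:
        return False
    seen_ids.add(player_id)
    return True
```

The same idea can be packaged as a custom dupefilter (subclassing BaseDupeFilter and overriding request_seen) if filtering should happen in the scheduler rather than in the spider.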

The Python Road: Installing Scrapy

十年热恋 submitted on 2019-12-24 22:20:34

Installing the Scrapy framework. Environment: Anaconda; version: conda 4.8.0. Running pip install scrapy or conda install scrapy directly may fail, because some third-party packages need to be installed first. In my tests (I use the Anaconda Python distribution), it is enough to install the Twisted module first; after that, conda install scrapy succeeds. 1. Download Twisted from https://www.lfd.uci.edu/~gohlke/pythonlibs/#genshi Find Twisted and pick the build matching your Python version and OS bitness; the download is a .whl file, which is installed with pip (conda cannot install wheels): pip install Twisted-19.10.0-cp38-cp38-win_amd64.whl 2. Install Scrapy with the command conda install scrapy 3. Test: if import scrapy raises no error, the installation succeeded! PS: This method is simple and direct, which suits beginners well, since a pile of errors is easy to find discouraging. This also serves as a study note. Source: CSDN. Author: 包子加入侵. Link: https://blog.csdn.net/Baozijiaruqing/article/details/103688872
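Step 3 above can be scripted so the check reports a version instead of crashing when Scrapy is absent; a small sketch:

```python
import importlib

def scrapy_version():
    """Return Scrapy's version string, or None if it is not installed
    for the current interpreter."""
    try:
        return importlib.import_module("scrapy").__version__
    except ImportError:
        return None
```

A None result inside an Anaconda environment usually means the install went into a different environment than the one currently active.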

Selenium to scroll through infinite scrollable pages and scraping data through Scrapy

对着背影说爱祢 submitted on 2019-12-24 21:38:36

Question: I am very new to data scraping and Scrapy. I want Scrapy to use the page-source output that I got from the Selenium webdriver to scrape data using XPath. Can anybody help me with that? I am getting an error: AttributeError: 'unicode' object has no attribute 'text'. I think I am getting the output in string format and Scrapy is not able to convert it. Below is the code snippet generating the error: def parse(self, response): # process each category link urls = response.xpath('//div[contains(@class,

Scrapy API - Pass custom logger

强颜欢笑 submitted on 2019-12-24 21:35:30

Question: I am using the API to run Scrapy from a script (Python 3.5, Scrapy 1.5). The main script calls a function to handle its logging: def main(target_year): project = os.path.splitext(os.path.basename(os.path.abspath(__file__)))[0] iso_run_date = datetime.date.today().isoformat() logger = utils.get_logger(project, iso_run_date) scraping.run(project, iso_run_date, target_year) Here is the function in the file "utils.py", with an additional class for formatting, that creates a logger with Python
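Scrapy's loggers propagate to the root logger, so one way to reuse a custom logger from a script is to attach your handlers to the root logger and call scrapy.utils.log.configure_logging(install_root_handler=False) before starting the CrawlerProcess, so Scrapy does not add a competing root handler. The stdlib half of that can be sketched as follows (get_logger and its handlers are the asker's; this helper is an assumption about how to wire them up):

```python
import logging

def route_scrapy_logs(handlers, level=logging.INFO):
    """Attach custom handlers to the root logger so that Scrapy's log
    records, which propagate to the root, reach them.

    Call scrapy.utils.log.configure_logging(install_root_handler=False)
    afterwards to keep Scrapy from installing its own root handler.
    """
    root = logging.getLogger()
    root.setLevel(level)
    for handler in handlers:
        root.addHandler(handler)
    return root
```

With this in place, both the script's own records and Scrapy's internal ones share the same formatting and destinations.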

IF Statement within Scrapy item declaration

徘徊边缘 submitted on 2019-12-24 20:24:23

Question: I'm using Scrapy to build a spider that monitors prices on a website. The website isn't consistent in how it displays its prices. For its standard price it always uses the same CSS class, but when a product goes on promotion it uses one of two CSS classes. The CSS selectors for both are below: response.css('span.price-num:last-child::text').extract_first() response.css('.product-highlight-label') Below is how my items currently look within my spider: item = ScraperItem() item['model'] =
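Rather than an if statement inside the item assignment, a small fallback helper can try the promotion selector first and fall back to the standard one. A sketch; the order (promo price wins over standard) is an assumption:

```python
def first_match(*candidates):
    """Return the first truthy selector result, skipping None and ''."""
    for value in candidates:
        if value:
            return value
    return None

# usage in the spider callback, with the question's selectors:
# item['price'] = first_match(
#     response.css('.product-highlight-label ::text').extract_first(),
#     response.css('span.price-num:last-child::text').extract_first(),
# )
```

extract_first() returns None when a selector matches nothing, which is exactly what the helper skips over.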

I want to add item class within an item class

风流意气都作罢 submitted on 2019-12-24 20:14:18

Question: The final JSON will be: "address": ----, "state": ----, "year": { "first": ----, "second": { "basic": ----, "Information": ----, } } I want to create my items.py like this (just an example): class Item(scrapy.Item): address = scrapy.Field() state = scrapy.Field() year = scrapy.Field(first), scrapy.Field(second) class first(scrapy.Item): amounts = scrapy.Field() class second(scrapy.Item): basic = scrapy.Field() information = scrapy.Field() How do I implement this? I have already checked https://doc.scrapy

Unable to generate csv for next pages using scrapy

百般思念 submitted on 2019-12-24 19:44:30

Question: I am a newbie to Python and Scrapy. Here is my code to get the product name, price, image, and title from all the next pages: import scrapy class TestSpider(scrapy.Spider): name = "testdoc1" start_urls = ["https://www.amazon.in/s/ref=amb_link_46?ie=UTF8&bbn=1389432031&rh=i%3Aelectronics%2Cn%3A976419031%2Cn%3A%21976420031%2Cn%3A1389401031%2Cn%3A1389432031%2Cp_89%3AApple&pf_rd_m=A1VBAL9TL5WCBF&pf_rd_s=merchandised-search-leftnav&pf_rd_r=CYS25V3W021MSYPQ32FB&pf_rd_r=CYS25V3W021MSYPQ32FB&pf_rd_t=101&pf
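For the CSV to contain every page, the parse callback has to yield both the items and a request for the next page; running scrapy crawl testdoc1 -o products.csv then collects items from all pages into one file. In the spider, response.follow(next_href) performs the relative-URL resolution and reuses the same callback; that resolution step is sketched below as a standalone helper:

```python
from urllib.parse import urljoin

def next_page_request(current_url, next_href):
    """Resolve a (possibly relative) 'next page' href against the
    current page's URL; return None when there is no next page.

    Inside the spider this is what `yield response.follow(next_href)`
    does for you, keeping parse() as the callback for every page.
    """
    return urljoin(current_url, next_href) if next_href else None
```

If next_page_request returns None the spider simply stops paginating, which is the usual termination condition for "next" links.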