scrapy

One spider with two different URLs and two parse methods using Scrapy

不羁岁月 submitted on 2019-12-25 00:14:36

Question: Hi, I have two different domains, each needing a different approach, running in one spider. I have tried the code below but nothing works; any ideas, please? class SalesitemSpiderSpider(scrapy.Spider): name = 'salesitem_spider' allowed_domains = ['www2.hm.com', 'www.forever21.com'] url = ['https://www.forever21.com/eu/shop/Catalog/GetProducts', 'https://www2.hm.com/en_us/sale/shopbyproductladies/view-all.html?sort=stock&image-size=small&image=stillLife&offset=0&page-size=20'] #Json Payload code here def start
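One common pattern for a two-site spider is to keep both start URLs but route each one to its own callback. A minimal sketch of the dispatch logic, assuming per-site callbacks named parse_forever21 and parse_hm (names invented here, not from the question):

```python
from urllib.parse import urlparse

# Hypothetical mapping from domain to callback name; inside the spider,
# start_requests() would yield
#   scrapy.Request(url, callback=getattr(self, callback_for(url)))
CALLBACKS = {
    "www.forever21.com": "parse_forever21",
    "www2.hm.com": "parse_hm",
}

def callback_for(url):
    """Pick the parse callback for a start URL by its domain."""
    return CALLBACKS[urlparse(url).netloc]
```

This keeps the two scraping approaches in separate methods instead of branching inside a single parse().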

Scrapy linkextractor ignores parameters behind the sign # and thus will not follow the link

≡放荡痞女 submitted on 2019-12-24 23:55:21

Question: I am trying to crawl a website with Scrapy where the pagination sits behind the "#" sign. This somehow makes Scrapy ignore everything after that character, so it only ever sees the first page, e.g.: http://www.rolex.de/de/watches/find-rolex.html#g=1&p=2 If you enter a question mark manually instead, the site will load the page: http://www.rolex.de/de/watches/find-rolex.html?p=2 The stats from Scrapy tell me it fetched the first page: DEBUG: Crawled (200) http://www.rolex.de/de/watches/datejust
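This is expected HTTP behavior: the fragment after "#" is never sent to the server, so Scrapy (like any client) requests the same URL for every "page". Since this site apparently also accepts the parameters as a query string, one workaround is to rewrite the URLs before requesting them; a sketch, assuming the "?" form really serves the intended page:

```python
from urllib.parse import urlsplit, urlunsplit

def fragment_to_query(url):
    """Move a URL's #fragment into its query string.

    '...find-rolex.html#g=1&p=2' and '...find-rolex.html' are the same
    HTTP request; rewriting to '...find-rolex.html?g=1&p=2' makes each
    page a distinct, crawlable URL.
    """
    scheme, netloc, path, query, fragment = urlsplit(url)
    if fragment and not query:
        query, fragment = fragment, ""
    return urlunsplit((scheme, netloc, path, query, fragment))
```

If the site only renders those pages via JavaScript, a browser-driven approach (e.g. Selenium or Splash) is needed instead.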

No module named CFFI from FFI

淺唱寂寞╮ submitted on 2019-12-24 23:34:18

Question: I am very new to Python. I installed Scrapy and it installed properly, but when I try to run it using the scrapy command it says "No module named CFFI from FFI". Any help, please? Answer 1: Run the Python IDLE shell and type import cffi. If the import succeeds, CFFI is actually installed and the issue is something else. The easiest way to install Python modules is pip install cffi. If you already have it installed, upgrade it with the pip command. Source: https://stackoverflow.com/questions
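The same check the answer suggests can be done without opening IDLE; a small sketch:

```python
import importlib.util

def cffi_installed():
    """Return True when the cffi module can be located on this
    interpreter's path (i.e. it is installed for *this* Python)."""
    return importlib.util.find_spec("cffi") is not None
```

A False result while pip reports cffi as installed usually means pip and the scrapy command are bound to different Python interpreters.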

python scrapy how to use BaseDupeFilter

一曲冷凌霜 submitted on 2019-12-24 23:13:13

Question: I have a website with many pages like this: mywebsite/?page=1 mywebsite/?page=2 ... ... ... Each page has links to players, and clicking any link takes you to that player's page. Users can add players, so I end up with this situation: Player1 has a link on page=1 and Player10 has a link on page=2; an hour later, because users have added new players, Player1 has a link on page=3, Player10 has a link on page=4, and the new players like
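Because the listings shift as players are added, URL-based deduplication (what Scrapy's default dupefilter, a BaseDupeFilter subclass, does) uses the wrong key here. One approach is to request listing pages with dont_filter=True so they are always recrawled, and deduplicate player pages by a stable player id instead; a sketch, under the assumption that player URLs contain a /player/<id> segment:

```python
import re

# Hypothetical URL pattern; adapt to the real player-page URLs.
PLAYER_RE = re.compile(r"/player/(\d+)")

def should_visit(url, seen_ids):
    """Return True for listing pages and unseen player pages;
    record each player id the first time it is seen."""
    match = PLAYER_RE.search(url)
    if match is None:          # not a player page: let Scrapy decide
        return True
    player_id = match.group(1)
    if player_id in seen_ids:
        return False
    seen_ids.add(player_id)
    return True
```

The same idea can be packaged as a custom dupefilter (subclassing BaseDupeFilter and overriding request_seen) if filtering should happen in the scheduler rather than in the spider.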

The Python Road: Installing Scrapy

十年热恋 submitted on 2019-12-24 22:20:34

Installing the Scrapy framework. Environment: Anaconda; version: conda 4.8.0. Running pip install scrapy or conda install scrapy directly may fail, because some third-party packages need to be installed first. In my tests (I use the Anaconda Python distribution), it is enough to install the Twisted module first; after that, conda install scrapy succeeds. 1. Download Twisted from https://www.lfd.uci.edu/~gohlke/pythonlibs/#genshi Find Twisted and pick the build matching your Python version and OS bitness; the download is a .whl file, which is installed with pip (conda cannot install wheels): pip install Twisted-19.10.0-cp38-cp38-win_amd64.whl 2. Install Scrapy with the command conda install scrapy 3. Test: if import scrapy raises no error, the installation succeeded! PS: This method is simple and direct, which suits beginners well, since a pile of errors is easy to find discouraging. This also serves as a study note. Source: CSDN. Author: 包子加入侵. Link: https://blog.csdn.net/Baozijiaruqing/article/details/103688872
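Step 3 above can be scripted so the check reports a version instead of crashing when Scrapy is absent; a small sketch:

```python
import importlib

def scrapy_version():
    """Return Scrapy's version string, or None if it is not installed
    for the current interpreter."""
    try:
        return importlib.import_module("scrapy").__version__
    except ImportError:
        return None
```

A None result inside an Anaconda environment usually means the install went into a different environment than the one currently active.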

Selenium to scroll through infinite scrollable pages and scraping data through Scrapy

对着背影说爱祢 submitted on 2019-12-24 21:38:36

Question: I am very new to data scraping and Scrapy. I want Scrapy to use the page-source output that I got from the Selenium webdriver to scrape data using XPath. Can anybody help me with that? I am getting an error: AttributeError: 'unicode' object has no attribute 'text'. I think I am getting the output in string format and Scrapy is not able to convert it. Below is the code snippet generating the error: def parse(self, response): # process each category link urls = response.xpath('//div[contains(@class,

Scrapy API - Pass custom logger

强颜欢笑 submitted on 2019-12-24 21:35:30

Question: I am using the API to run Scrapy from a script (Python 3.5, Scrapy 1.5). The main script calls a function to handle its logging: def main(target_year): project = os.path.splitext(os.path.basename(os.path.abspath(__file__)))[0] iso_run_date = datetime.date.today().isoformat() logger = utils.get_logger(project, iso_run_date) scraping.run(project, iso_run_date, target_year) Here is the function in the file "utils.py", with an additional class for formatting, that creates a logger with Python
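Scrapy's loggers propagate to the root logger, so one way to reuse a custom logger from a script is to attach your handlers to the root logger and call scrapy.utils.log.configure_logging(install_root_handler=False) before starting the CrawlerProcess, so Scrapy does not add a competing root handler. The stdlib half of that can be sketched as follows (get_logger and its handlers are the asker's; this helper is an assumption about how to wire them up):

```python
import logging

def route_scrapy_logs(handlers, level=logging.INFO):
    """Attach custom handlers to the root logger so that Scrapy's log
    records, which propagate to the root, reach them.

    Call scrapy.utils.log.configure_logging(install_root_handler=False)
    afterwards to keep Scrapy from installing its own root handler.
    """
    root = logging.getLogger()
    root.setLevel(level)
    for handler in handlers:
        root.addHandler(handler)
    return root
```

With this in place, both the script's own records and Scrapy's internal ones share the same formatting and destinations.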

IF Statement within Scrapy item declaration

徘徊边缘 submitted on 2019-12-24 20:24:23

Question: I'm using Scrapy to build a spider that monitors prices on a website. The website isn't consistent in how it displays its prices. For its standard price it always uses the same CSS class, but when a product goes on promotion it uses one of two CSS classes. The CSS selectors for both are below: response.css('span.price-num:last-child::text').extract_first() response.css('.product-highlight-label') Below is how my items currently look within my spider: item = ScraperItem() item['model'] =
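Rather than an if statement inside the item assignment, a small fallback helper can try the promotion selector first and fall back to the standard one. A sketch; the order (promo price wins over standard) is an assumption:

```python
def first_match(*candidates):
    """Return the first truthy selector result, skipping None and ''."""
    for value in candidates:
        if value:
            return value
    return None

# usage in the spider callback, with the question's selectors:
# item['price'] = first_match(
#     response.css('.product-highlight-label ::text').extract_first(),
#     response.css('span.price-num:last-child::text').extract_first(),
# )
```

extract_first() returns None when a selector matches nothing, which is exactly what the helper skips over.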

I want to add item class within an item class

风流意气都作罢 submitted on 2019-12-24 20:14:18

Question: The final JSON will be: "address": ----, "state": ----, "year": { "first": ----, "second": { "basic": ----, "Information": ----, } } I want to create my items.py like this (just an example): class Item(scrapy.Item): address = scrapy.Field() state = scrapy.Field() year = scrapy.Field(first), scrapy.Field(second) class first(scrapy.Item): amounts = scrapy.Field() class second(scrapy.Item): basic = scrapy.Field() information = scrapy.Field() How do I implement this? I have already checked https://doc.scrapy

Unable to generate csv for next pages using scrapy

百般思念 submitted on 2019-12-24 19:44:30

Question: I am a newbie to Python and Scrapy. Here is my code to get the product name, price, image, and title from all the next pages: import scrapy class TestSpider(scrapy.Spider): name = "testdoc1" start_urls = ["https://www.amazon.in/s/ref=amb_link_46?ie=UTF8&bbn=1389432031&rh=i%3Aelectronics%2Cn%3A976419031%2Cn%3A%21976420031%2Cn%3A1389401031%2Cn%3A1389432031%2Cp_89%3AApple&pf_rd_m=A1VBAL9TL5WCBF&pf_rd_s=merchandised-search-leftnav&pf_rd_r=CYS25V3W021MSYPQ32FB&pf_rd_r=CYS25V3W021MSYPQ32FB&pf_rd_t=101&pf
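For the CSV to contain every page, the parse callback has to yield both the items and a request for the next page; running scrapy crawl testdoc1 -o products.csv then collects items from all pages into one file. In the spider, response.follow(next_href) performs the relative-URL resolution and reuses the same callback; that resolution step is sketched below as a standalone helper:

```python
from urllib.parse import urljoin

def next_page_request(current_url, next_href):
    """Resolve a (possibly relative) 'next page' href against the
    current page's URL; return None when there is no next page.

    Inside the spider this is what `yield response.follow(next_href)`
    does for you, keeping parse() as the callback for every page.
    """
    return urljoin(current_url, next_href) if next_href else None
```

If next_page_request returns None the spider simply stops paginating, which is the usual termination condition for "next" links.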