scrapy

How to iterate over divs in Scrapy?

Submitted by 点点圈 on 2021-02-07 20:57:33
Question: It is probably a very trivial question, but I am new to Scrapy. I've tried to find a solution for my problem, but I just can't see what is wrong with this code. My goal is to scrape all of the opera shows from a given website. The data for every show is inside one div with class "row-fluid row-performance ". I am trying to iterate over them to retrieve it, but it doesn't work: it gives me the content of the first div on each iteration (I get the same show 19 times instead of different items). Thanks
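The usual culprit here is an absolute XPath inside the loop: a query starting with `//` searches the whole document, so every iteration returns the first match. Prefixing the inner query with a dot (in Scrapy, `show.xpath('.//h2/text()')`) scopes it to the current div. The same relative-vs-absolute principle, demonstrated with the stdlib `ElementTree` so the sketch is self-contained (the markup and titles are invented):

```python
import xml.etree.ElementTree as ET

# Invented markup in the shape the question describes: one div per show.
HTML = """<div id="content">
  <div class="row-fluid row-performance"><h2>Aida</h2><p>Friday</p></div>
  <div class="row-fluid row-performance"><h2>Tosca</h2><p>Saturday</p></div>
</div>"""

root = ET.fromstring(HTML)
titles = []
for show in root.findall("div[@class='row-fluid row-performance']"):
    # The leading dot keeps the query relative to *this* div; an
    # absolute '//h2' would always hit the first show in the document.
    titles.append(show.find(".//h2").text)

print(titles)  # ['Aida', 'Tosca']
```

In a Scrapy spider the loop is the same shape: iterate over `response.xpath('//div[contains(@class, "row-performance")]')` and use dot-prefixed XPaths on each selector.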

Crawl and Concatenate in Scrapy

Submitted by 和自甴很熟 on 2021-02-07 20:24:06
Question: I'm trying to crawl a movie list with Scrapy (I take only the Director & Movie title fields). Sometimes there are two directors, and Scrapy scrapes them as different items. So the first director will be along with the movie title, but for the second there will be no movie title. So I created a condition like this: if director2: item['director'] = map(unicode.strip, titres.xpath("tbody/tr/td/div/div[2]/div[3]/div[2]/div/h2/div/a/text()").extract()) The last div[2] exists only if there are two directors. And I…
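Rather than branching on the existence of the second `div[2]`, it is often simpler to select *all* director links for the row with a single XPath and fold them into one field. A sketch of the folding step, with plain lists standing in for what `.extract()` returns (the names are invented; note also that `map(unicode.strip, …)` is Python 2 — in Python 3 use a comprehension):

```python
def merge_directors(extracted):
    """Collapse however many director names were extracted for one
    movie row into a single field, instead of branching on a second XPath."""
    return ", ".join(name.strip() for name in extracted)

# Stand-ins for the lists Scrapy's .extract() would return:
one = merge_directors(["Quentin Tarantino"])
two = merge_directors([" Joel Coen ", "Ethan Coen"])
print(two)  # Joel Coen, Ethan Coen
```

With this approach the item always has a movie title, and `director` holds one name or several, comma-separated.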

Scrapy With Splash Only Scrapes 1 Page

Submitted by 时光总嘲笑我的痴心妄想 on 2021-02-07 10:21:13
Question: I am trying to scrape multiple URLs, but for some reason results for only one site show up. In every case it is the last URL in start_urls that is shown. I believe I have narrowed the problem down to my parse function. Any ideas on what I'm doing wrong? Thanks! class HeatSpider(scrapy.Spider): name = "heat" start_urls = ['https://www.expedia.com/Hotel-Search?#&destination=new+york&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2', 'https://www.expedia.com/Hotel-Search?#&destination…

Scrapy - select xpath with a regular expression

Submitted by 。_饼干妹妹 on 2021-02-07 09:33:43
Question: Part of the HTML that I am scraping looks like this: <h2> <span class="headline" id="Profile">Profile</span></h2> <ul><li> <b>Name</b> Albert Einstein </li><li> <b>Birth Name:</b> Alberto Ein </li><li> <b>Birthdate:</b> December 24, 1986 </li><li> <b>Birthplace:</b> <a href="/Ulm" title="Dest">Ulm</a>, Germany </li><li> <b>Height:</b> 178cm </li><li> <b>Blood Type:</b> A </li></ul> I want to extract each component, so name, birth name, birthdate, etc. To extract the name I do: a_name = …
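For label/value pairs like these, a regex via the Scrapy selector's `.re()` method works, but so does plain structure: each value is the text node that follows the `<b>` label (in Scrapy, roughly `li.xpath('b/following-sibling::text()[1]')`). The pairing is shown below with the stdlib `ElementTree`, whose `.tail` attribute is exactly that trailing text, so the sketch runs standalone (markup trimmed from the question):

```python
import xml.etree.ElementTree as ET

# Trimmed version of the markup in the question.
HTML = """<ul><li> <b>Name</b> Albert Einstein </li><li>
<b>Birth Name:</b> Alberto Ein </li><li>
<b>Birthdate:</b> December 24, 1986 </li></ul>"""

root = ET.fromstring(HTML)
profile = {}
for li in root.findall("li"):
    b = li.find("b")
    # b.tail is the raw text sitting between </b> and </li>
    profile[b.text.rstrip(":")] = (b.tail or "").strip()

print(profile["Birthdate"])  # December 24, 1986
```

Entries with nested markup (like the Birthplace link) need one extra step to join the link text, but the label/value pattern is the same.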

Scrapy Splash click button doesn't work

Submitted by 喜欢而已 on 2021-02-07 09:10:48
Question: What I'm trying to do: on avito.ru (a Russian real-estate site), a person's phone number is hidden until you click on it. I want to collect the phone number using Scrapy + Splash. Example URL: https://www.avito.ru/moskva/kvartiry/2-k_kvartira_84_m_412_et._992361048 After you click the button, a pop-up is displayed and the phone number is visible. I'm using the Splash execute API with the following Lua script: function main(splash) splash:go(splash.args.url) splash:wait(10) splash:runjs("document.getElementsByClassName('item-phone…
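Clicks triggered from `runjs` are often ignored by pages that listen for real mouse events; Splash 2.x's `splash:select()` plus `element:mouse_click()` dispatches an actual click, and the script also needs a wait *after* the click for the pop-up to render. A sketch of such a script, kept as a Python string the way Scrapy passes it to the execute endpoint — the selector is a guess (the question's script is truncated) and the waits are untested against the live site:

```python
# Sketch of a Lua script for Splash's /execute endpoint. The class
# name '.item-phone-button' is hypothetical; adjust to the real page.
LUA_SHOW_PHONE = """
function main(splash)
  splash:go(splash.args.url)
  splash:wait(5)
  -- select() + mouse_click() dispatches a real mouse event,
  -- unlike calling .click() from runjs on some pages
  local button = splash:select('.item-phone-button')
  button:mouse_click()
  -- give the pop-up time to render before grabbing the page
  splash:wait(3)
  return {html = splash:html()}
end
"""
```

The spider then sends this as `lua_source` in the `SplashRequest` args and parses the phone number out of the returned HTML.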

Dynamically assembling scrapy GET request string

Submitted by 旧时模样 on 2021-02-07 04:14:00
Question: I've been working with Firebug and I've got the following dictionaries to query an API. url = "htp://my_url.aspx#top" querystring = {"dbkey":"x1","stype":"id","s":"27"} headers = { 'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8", 'upgrade-insecure-requests': "1", 'user-agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.125 } With python requests, using this is as simple as: import requests response = …
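Where `requests` assembles the query string from `params=` for you, a plain Scrapy `Request` takes a finished URL, so the dictionary has to be encoded into it first. The stdlib does that in one call (the host below is a placeholder standing in for the question's obfuscated URL):

```python
from urllib.parse import urlencode

# Same shape as the question's dict; the host is a placeholder.
querystring = {"dbkey": "x1", "stype": "id", "s": "27"}

# Assemble the GET URL the way requests does with params=...
url = "https://my_url.aspx?" + urlencode(querystring)
print(url)  # https://my_url.aspx?dbkey=x1&stype=id&s=27

# Then, inside a spider:
#   yield scrapy.Request(url, headers=headers, callback=self.parse)
```

Alternatively, Scrapy's `FormRequest(url, method='GET', formdata=querystring)` appends the form data to the URL as a query string for GET requests, which saves the manual `urlencode`.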

Web crawlers: setting a proxy in the Scrapy framework

Submitted by 匆匆过客 on 2021-02-07 02:45:00
Preliminaries: an introduction to os.environ
os.environ exposes the environment variables of the current process — note, only the current process: a variable set inside one program is invisible to any other program. The environment is a dictionary-like mapping, so values are read and written with ordinary dict syntax (os.environ['KEY']; it is a mapping, not a function call).
os.environ keys explained
Windows:
os.environ['HOMEPATH']: the current user's home directory.
os.environ['TEMP']: the temporary directory.
os.environ['PATHEXT']: the recognized executable file extensions.
os.environ['SYSTEMROOT']: the system root directory.
os.environ['LOGONSERVER']: the logon server (machine name).
os.environ['PROMPT']: the command-prompt string.
Linux:
os.environ['USER']: the current user.
os.environ['LC_COLLATE']: the collation order used when sorting path-expansion results.
os.environ['SHELL']: the shell in use.
os.environ['LANG']: the language in use.
os.environ['SSH_AUTH_SOCK']: the path of the SSH agent socket.
The built-in approach — how it works: Scrapy already implements proxy support internally. It reads the proxy from the environment variables and then uses it, so all we need to do is set the proxy as key-value pairs in the environment before the program runs.
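Concretely, the built-in approach amounts to setting the proxy variables before the crawl starts, so Scrapy's HttpProxyMiddleware can pick them up from the process environment (the proxy address below is a placeholder):

```python
import os

# Placeholder proxy address; set these before the crawler starts.
# Scrapy's HttpProxyMiddleware reads proxies from the environment.
os.environ["http_proxy"] = "http://192.168.0.1:8888"
os.environ["https_proxy"] = "http://192.168.0.1:8888"

print(os.environ["http_proxy"])  # http://192.168.0.1:8888
```

Because the environment is per-process, these lines must run in the same process as the crawler, e.g. at the top of the launch script, before `CrawlerProcess` starts.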

Scrapy spider not found

Submitted by 北战南征 on 2021-02-06 14:05:02
Question: I'm trying to reproduce the code from this talk: https://www.youtube.com/watch?v=eD8XVXLlUTE When I try to run the spider with scrapy crawl talkspider_basic, I get this error: raise KeyError("Spider not found: {}".format(spider_name)) KeyError: 'Spider not found: talkspider_basic' The code of the spider is: from scrapy.spiders import BaseSpider from scrapy.selector import HtmlXPathSelector from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from scrapy.contrib.loader import …