scrapy

Scrapy - Extract items from table

…衆ロ難τιáo~ submitted on 2019-12-19 02:31:29
Question: I'm trying to get my head around Scrapy but hitting a few dead ends. I have two tables on a page and would like to extract the data from each one, then move along to the next page. The tables look like this (the first is called Y1, the second Y2) and their structures are identical:

<div id="Y1" style="margin-bottom: 0px; margin-top: 15px;">
  <h2>First information</h2><hr style="margin-top: 5px; margin-bottom: 10px;">
  <table class="table table-striped table-hover table-curved">
    <thead>
      <tr>
        <th class="tCol1" style
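A minimal sketch of one way to approach this, assuming both tables follow the structure shown above; the XPath expressions, the placeholder URL, and the next-page selector are assumptions that would need adjusting to the real page:

```python
import scrapy

class TableSpider(scrapy.Spider):
    name = "tables"
    start_urls = ["http://example.com/page1"]  # placeholder URL

    def parse(self, response):
        # Both tables share one structure, so iterate over the two container ids.
        for table_id in ("Y1", "Y2"):
            rows = response.xpath(f'//div[@id="{table_id}"]//table/tbody/tr')
            for row in rows:
                yield {
                    "table": table_id,
                    "cells": row.xpath("./td/text()").getall(),
                }
        # Follow pagination; this link selector is an assumption.
        next_page = response.xpath('//a[@rel="next"]/@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```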

The Scrapy framework ----- Spiders

别说谁变了你拦得住时间么 submitted on 2019-12-19 02:13:07
The Spider class defines how a site (or group of sites) is crawled: the crawling actions (for example, whether to follow links) and how structured data is extracted from page content (the scraped items). In other words, a Spider is where you define both the crawling behaviour and the parsing of a particular page (or pages).

class scrapy.Spider is the most basic class, and every spider you write must inherit from it. The main methods, in the order they are called:

__init__(): initializes the spider name and the start_urls list.
start_requests(), which calls make_requests_from_url(): generates Request objects, hands them to Scrapy for downloading, and returns the responses.
parse(): parses a response and returns Items or further Requests (each with a designated callback). Items are handed to the Item pipeline for persistence, while Requests are downloaded by Scrapy and processed by the specified callback (parse() by default); this loop continues until all data has been processed.

Source reference:

# Base class for all spiders; user-defined spiders must inherit from this class
class Spider(object_ref):
    # A string defining the spider's name. The name determines how Scrapy
    # locates (and instantiates) the spider, so it must be unique.
    # name is the spider's most important attribute, and it is required.
    # The usual convention is to name it after the site (domain)
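A minimal spider illustrating the lifecycle described above, as a sketch (the URL and parsing logic are placeholders); note that make_requests_from_url() was deprecated in later Scrapy releases in favour of overriding start_requests() directly:

```python
import scrapy

class ExampleSpider(scrapy.Spider):
    # name must be unique; Scrapy uses it to locate and instantiate the spider.
    name = "example"
    start_urls = ["http://example.com/"]  # placeholder

    def parse(self, response):
        # Return an item for the Item pipeline...
        yield {"title": response.xpath("//title/text()").get()}
        # ...and/or further Requests, each handled by the named callback.
        for href in response.xpath("//a/@href").getall():
            yield response.follow(href, callback=self.parse)
```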

Scrapy error: exceptions.ValueError: Missing scheme in request url:

纵饮孤独 submitted on 2019-12-18 21:02:30
Question: I use try/except to avoid the error, but my terminal still shows the error rather than the log message:

raise ValueError('Missing scheme in request url: %s' % self._url)
exceptions.ValueError: Missing scheme in request url:

How can I avoid this error when Scrapy didn't get image_urls? Please guide me, thank you very much.

try:
    item['image_urls'] = ["".join(image.extract())]
except:
    log.msg("no image found! url={}".format(response.url), level=log.INFO)

Answer 1: the image_urls field should be a list, not a str
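A sketch of the direction the answer points in, assuming the standard ImagesPipeline. The key is that extract() returns a list (an empty one when nothing matches), so no exception fires; check the result instead of using a bare except. The selector here is an assumption, and self.logger replaces the question's legacy log module:

```python
def parse(self, response):
    item = {}  # stand-in for the question's Item class
    # extract() always returns a list; an empty list just means "no match".
    urls = response.xpath('//img/@src').extract()  # selector is an assumption
    # Keep only absolute URLs so the scheduler never sees a scheme-less URL,
    # which is exactly what triggers "Missing scheme in request url".
    item['image_urls'] = [response.urljoin(u) for u in urls]
    if not item['image_urls']:
        self.logger.info("no image found! url=%s", response.url)
    yield item
```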

Write to a CSV file in Scrapy

Deadly submitted on 2019-12-18 19:09:06
Question: I want to write to a CSV file in Scrapy:

for rss in rsslinks:
    item = AppleItem()
    item['reference_link'] = response.url
    base_url = get_base_url(response)
    item['rss_link'] = urljoin_rfc(base_url, rss)
    #item['rss_link'] = rss
    items.append(item)
#items.append("\n")
f = open(filename, 'a+')  #filename is apple.com.csv
for item in items:
    f.write("%s\n" % item)

My output is this:

{'reference_link': 'http://www.apple.com/'
 'rss_link': 'http://www.apple.com/rss '
{'reference_link': 'http://www.apple.com/rss
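The idiomatic route is usually to yield the items and let Scrapy's feed exports write the file (scrapy crawl apple -o apple.com.csv) rather than writing item reprs by hand. For the manual route, a sketch using the csv module, assuming items is the list from the question and its two fields are all you need:

```python
import csv
import os

filename = 'apple.com.csv'
# Write the header only when the file is new or empty.
write_header = not os.path.exists(filename) or os.path.getsize(filename) == 0

with open(filename, 'a', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['reference_link', 'rss_link'])
    if write_header:
        writer.writeheader()
    # One properly quoted CSV row per item, instead of the dict's repr.
    for item in items:
        writer.writerow({'reference_link': item['reference_link'],
                         'rss_link': item['rss_link']})
```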

How to scrape text included between various tags using Scrapy

别等时光非礼了梦想. submitted on 2019-12-18 18:37:57
Question: I am trying to scrape the product description from this link, but how do I scrape the whole text, including the text inside nested tags? Here is the hxs call:

hxs.select('//div[@class="overview"]/div/text()').extract()

but the original HTML is:

These classic sneakers from <b>Puma</b> are best known for their neat and simple design. These basketball shoes are crafted by novel tooling that brings the sleek retro sneaker look. The pair is equipped with a <b>leather and synthetic upper.</b> A vulcanized non-slip
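The usual fix is to select descendant text nodes with // instead of /, or to take the element's XPath string value. A sketch in the modern response.xpath style (hxs.select is the legacy equivalent), assuming the same overview div:

```python
# Option 1: all descendant text nodes, including text inside <b> etc.,
# joined back into a single description string.
parts = response.xpath('//div[@class="overview"]/div//text()').getall()
description = ' '.join(p.strip() for p in parts if p.strip())

# Option 2: let XPath's string() flatten the element in one step.
description = response.xpath('string(//div[@class="overview"]/div)').get()
```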

Using InitSpider with splash: only parsing the login page?

♀尐吖头ヾ submitted on 2019-12-18 15:04:11
Question: This is sort of a follow-up question to one I asked earlier. I'm trying to scrape a webpage which I have to log in to reach first. But after authentication, the webpage I need requires a little bit of JavaScript to be run before you can view the content. What I've done is followed the instructions here to install Splash to try to render the JavaScript. However... Before I switched to Splash, the authentication with Scrapy's InitSpider was fine. I was getting through the login page and scraping
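A sketch of the pattern that often resolves this with scrapy-splash: do the login with plain Scrapy requests so authentication works exactly as before, and switch to SplashRequest only for the post-login, JavaScript-heavy pages. All URLs and form fields below are placeholders, this assumes scrapy-splash is configured in settings, and sharing session cookies between Scrapy and Splash may need additional middleware:

```python
import scrapy
from scrapy_splash import SplashRequest  # assumes scrapy-splash is installed

class LoginThenSplashSpider(scrapy.Spider):
    name = "login_splash"
    start_urls = ["http://example.com/login"]  # placeholder

    def parse(self, response):
        # Log in with a normal Scrapy request so cookies are handled as usual.
        return scrapy.FormRequest.from_response(
            response,
            formdata={"username": "user", "password": "pass"},  # placeholders
            callback=self.after_login,
        )

    def after_login(self, response):
        # Only the JavaScript-dependent page is rendered through Splash.
        yield SplashRequest(
            "http://example.com/protected",  # placeholder
            callback=self.parse_content,
            args={"wait": 2},  # give the page's JS time to run
        )

    def parse_content(self, response):
        yield {"title": response.xpath("//title/text()").get()}
```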

scrapy: 'module' object has no attribute 'OP_SINGLE_ECDH_USE'

怎甘沉沦 submitted on 2019-12-18 13:36:26
Question: I am new to Scrapy. I created a sample Scrapy project and ran it, and got this error:

AttributeError: 'module' object has no attribute 'OP_SINGLE_ECDH_USE'

Code:

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = ["https://www.grocerygateway.com"]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)

Thanks in advance

Answer 1: I had a similar error, found that pyopenssl
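This error typically means the installed pyOpenSSL is too old to expose OP_SINGLE_ECDH_USE, which Scrapy's TLS layer references. A hedged diagnostic plus the usual remedy of upgrading; the exact versions that fix it depend on your Scrapy release:

```python
# Quick diagnostic: recent pyOpenSSL versions expose this constant.
# If this prints False, upgrading usually resolves the AttributeError:
#   pip install --upgrade pyopenssl cryptography
from OpenSSL import SSL

print(hasattr(SSL, "OP_SINGLE_ECDH_USE"))
```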

How to prevent a twisted.internet.error.ConnectionLost error when using Scrapy?

孤街浪徒 submitted on 2019-12-18 13:02:55
Question: I'm scraping some pages with Scrapy and get the following error: twisted.internet.error.ConnectionLost

My command line output:

2015-05-04 18:40:32+0800 [cnproxy] INFO: Spider opened
2015-05-04 18:40:32+0800 [cnproxy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-05-04 18:40:32+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-05-04 18:40:32+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-05-04 18:40:32+0800 [cnproxy] DEBUG:
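ConnectionLost is usually transient (a dropped connection or a flaky proxy), so the common mitigation is to let Scrapy's built-in RetryMiddleware retry failed requests and to be gentler on the server. A settings sketch; the values are illustrative assumptions, not tuned recommendations:

```python
# settings.py
RETRY_ENABLED = True
RETRY_TIMES = 5            # retry each failed request a few extra times
DOWNLOAD_TIMEOUT = 30      # allow slow responses before giving up
CONCURRENT_REQUESTS = 8    # fewer parallel connections can reduce drops
DOWNLOAD_DELAY = 0.5       # small delay between requests to the same site
```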

How to set different scrapy-settings for different spiders?

随声附和 submitted on 2019-12-18 12:53:17
Question: I want to enable an HTTP proxy for some spiders, and disable it for other spiders. Can I do something like this?

# settings.py
proxy_spiders = ['a1', 'b2']
if spider in proxy_spiders:  # how to get spider name ???
    HTTP_PROXY = 'http://127.0.0.1:8123'
    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.RandomUserAgentMiddleware': 400,
        'myproject.middlewares.ProxyMiddleware': 410,
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
    }
else:
    DOWNLOADER_MIDDLEWARES = {
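settings.py cannot see which spider is running, but Scrapy provides a per-spider override for exactly this case: the custom_settings class attribute. A sketch, keeping the middleware paths from the question (they use the old scrapy.contrib layout; newer releases moved them under scrapy.downloadermiddlewares):

```python
import scrapy

class ProxiedSpider(scrapy.Spider):
    name = 'a1'
    # custom_settings overrides the project settings for this spider only.
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'myproject.middlewares.RandomUserAgentMiddleware': 400,
            'myproject.middlewares.ProxyMiddleware': 410,
        },
    }

class PlainSpider(scrapy.Spider):
    # No custom_settings: this spider runs with the project defaults,
    # i.e. without the proxy middleware.
    name = 'c3'
```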

Windows: scrapyd-deploy is not recognized

别等时光非礼了梦想. submitted on 2019-12-18 12:42:11
Question: I installed scrapyd like this:

pip install scrapyd

I want to use scrapyd-deploy, but when I type scrapyd I get this error in cmd:

'scrapyd' is not recognized as an internal or external command, operable program or batch file.

Answer 1: I ran into the same issue, and I also read some opinions that scrapyd isn't available / can't run on Windows, and nearly gave up (I didn't really need it, as I intend to deploy to a Linux machine; I wanted scrapyd on Windows for debugging purposes). However, after
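The workaround that usually circulates for this is a small .bat wrapper placed in the Python Scripts folder, so cmd can find and run the extensionless scrapyd-deploy script. A sketch; both paths are assumptions and depend on your Python layout:

```bat
@echo off
rem scrapyd-deploy.bat -- place next to the scrapyd-deploy script in Scripts.
rem Both paths below are assumptions; point them at your own installation.
"C:\Python27\python.exe" "C:\Python27\Scripts\scrapyd-deploy" %*
```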