scrapy

Crawling a company's job postings with Python and Scrapy

Submitted by 送分小仙女 on 2019-12-23 16:33:44
1. Create the project: scrapy startproject gosuncn
2. Generate the spider: cd gosuncn, then scrapy genspider gaoxinxing gosuncn.zhiye.com
3. Run the spider: scrapy crawl gaoxinxing
4. Code for gaoxinxing.py:

    # -*- coding: utf-8 -*-
    import scrapy
    import logging

    logger = logging.getLogger(__name__)  # set up a module-level logger

    class GaoxinxingSpider(scrapy.Spider):
        name = 'gaoxinxing'
        allowed_domains = ['gosuncn.zhiye.com']
        start_urls = ['http://gosuncn.zhiye.com/Social']
        next_page_num = 1

        def parse(self, response):
            tr_list = response.xpath("//table[@class='jobsTable']/tr")[1:]  # skip the header row
            #print(tr_list)
            for tr in tr_list:
                item = {}
                item["position"] = tr.xpath(".//td[1]/a/text()").extract_first()
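The excerpt above cuts off inside the loop. A minimal sketch of how such a spider typically finishes the callback and pages through the listing; the PageIndex parameter and the stop condition are assumptions, not taken from the original post:

    import scrapy

    class GaoxinxingSpider(scrapy.Spider):
        name = 'gaoxinxing'
        allowed_domains = ['gosuncn.zhiye.com']
        start_urls = ['http://gosuncn.zhiye.com/Social']
        next_page_num = 1

        def parse(self, response):
            tr_list = response.xpath("//table[@class='jobsTable']/tr")[1:]
            for tr in tr_list:
                item = {"position": tr.xpath(".//td[1]/a/text()").extract_first()}
                yield item  # hand each row to the configured pipelines
            if tr_list:  # stop once a listing page has no job rows
                self.next_page_num += 1
                # hypothetical paging URL; the real site's parameter may differ
                next_url = "http://gosuncn.zhiye.com/Social?PageIndex={}".format(self.next_page_num)
                yield scrapy.Request(next_url, callback=self.parse)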

scrapy python Request is not defined

Submitted by 老子叫甜甜 on 2019-12-23 15:38:12
Question: I found an answer here with this code:

    for site in sites:
        Link = site.xpath('a/@href').extract()
        CompleteLink = urlparse.urljoin(response.url, Link)
        yield Request(Link, callback=self.parseOneCar)

I got this exception:

    exceptions.NameError: global name 'Request' is not defined

What should I import, please?

Answer 1: Short answer:

    from scrapy.http.request import Request

Extended answer: read the docs.

Source: https://stackoverflow.com/questions/21121941/scrapy-python-request-is-not-defined
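For context, a minimal sketch showing that import inside a working callback; the spider name, selector, and callback are illustrative, not from the question (on current Scrapy versions, from scrapy import Request works as well):

    import urlparse  # Python 2, matching the question; on Python 3 use urllib.parse

    import scrapy
    from scrapy.http.request import Request  # resolves the NameError

    class CarsSpider(scrapy.Spider):  # hypothetical spider for illustration
        name = 'cars'
        start_urls = ['http://example.com/cars']

        def parse(self, response):
            for site in response.xpath('//div[@class="car"]'):  # placeholder selector
                link = site.xpath('a/@href').extract_first()  # extract() returns a list; take one string
                complete_link = urlparse.urljoin(response.url, link)  # resolve relative URLs
                yield Request(complete_link, callback=self.parse_one_car)

        def parse_one_car(self, response):
            pass  # per-car parsing would go here

Note the question's original snippet also passed the list returned by extract() straight into Request; the sketch extracts a single string and joins it against the page URL first.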

Scrapy scrapes data but no output to file

Submitted by 懵懂的女人 on 2019-12-23 15:30:27
Question: I've been getting blank JSON files despite successfully being able to execute most of the lines in scrapy shell. When I run the command scrapy crawl courses, with my courses bot being:

    from scrapy.spiders import CrawlSpider
    from scrapy.linkextractors import LinkExtractor
    from tutorial.items import CoursesItem
    from bs4 import BeautifulSoup
    import scrapy

    class CoursesSpider(CrawlSpider):
        name = 'courses'
        allowed_domains = ['guide.berkeley.edu']
        start_urls = ['http://guide.berkeley.edu/courses
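A frequent cause of empty output with CrawlSpider is overriding its reserved parse() method instead of routing links through rules to a differently named callback. A minimal sketch under that assumption; the link pattern and XPath are placeholders, not taken from the question:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    from tutorial.items import CoursesItem

    class CoursesSpider(CrawlSpider):
        name = 'courses'
        allowed_domains = ['guide.berkeley.edu']
        start_urls = ['http://guide.berkeley.edu/courses/']

        # CrawlSpider uses parse() internally, so the callback must have another name
        rules = (
            Rule(LinkExtractor(allow=r'/courses/'), callback='parse_course', follow=True),
        )

        def parse_course(self, response):
            item = CoursesItem()
            item['title'] = response.xpath('//h1/text()').extract_first()  # placeholder XPath
            yield item  # feed exports only see items that are actually yielded

Running scrapy crawl courses -o courses.json then writes whatever the callbacks yield to the file.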

Extracting Images in Scrapy

Submitted by 核能气质少年 on 2019-12-23 12:07:48
Question: I've read through a few other answers here but I'm missing something fundamental. I'm trying to extract the images from a website with a CrawlSpider.

settings.py:

    BOT_NAME = 'healthycomm'
    SPIDER_MODULES = ['healthycomm.spiders']
    NEWSPIDER_MODULE = 'healthycomm.spiders'
    ITEM_PIPELINES = {'scrapy.contrib.pipeline.images.ImagesPipeline': 1}
    IMAGES_STORE = '~/Desktop/scrapy_nsml/healthycomm/images'

items.py:

    class HealthycommItem(scrapy.Item):
        page_heading = scrapy.Field()
        page_title = scrapy.Field
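For the built-in ImagesPipeline to download anything, the item also needs image_urls and images fields, and the spider must fill image_urls with absolute URLs. A minimal sketch of the item under that documented contract (in later Scrapy releases the pipeline moved from scrapy.contrib.pipeline.images to scrapy.pipelines.images.ImagesPipeline):

    import scrapy

    class HealthycommItem(scrapy.Item):
        page_heading = scrapy.Field()
        page_title = scrapy.Field()
        image_urls = scrapy.Field()  # ImagesPipeline reads source URLs from this field
        images = scrapy.Field()      # ImagesPipeline writes download results here

    # In a spider callback, populate the URLs (made absolute) before yielding:
    # item['image_urls'] = [response.urljoin(src) for src in response.xpath('//img/@src').extract()]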

Using beautiful soup to clean up scraped HTML from scrapy

Submitted by 时间秒杀一切 on 2019-12-23 09:58:08
Question: I'm using scrapy to try and scrape some data that I need off Google Scholar. Consider, as an example, the following link: http://scholar.google.com/scholar?q=intitle%3Apython+xpath

Now, I'd like to scrape all the titles off this page. The process that I am following is as follows:

    scrapy shell "http://scholar.google.com/scholar?q=intitle%3Apython+xpath"

which gives me the scrapy shell, inside which I do:

    >>> sel.xpath('//h3[@class="gs_rt"]/a').extract()
    [ u'<a href="http://citeseerx.ist.psu
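As the title of the question suggests, one way to strip the tags from those extracted fragments is to pass each one through BeautifulSoup. A minimal sketch using the selector from the question; sel is the shell's selector object:

    from bs4 import BeautifulSoup

    raw_links = sel.xpath('//h3[@class="gs_rt"]/a').extract()  # HTML fragments as strings
    # get_text() drops the <a> wrapper and any nested <b> highlighting
    titles = [BeautifulSoup(fragment, 'html.parser').get_text() for fragment in raw_links]

Staying inside Scrapy, sel.xpath('//h3[@class="gs_rt"]/a//text()').extract() gets close as well, though it returns each highlighted substring as a separate list entry rather than one title per result.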

error occurs when installing cryptography for scrapy in virtualenv on OS X [closed]

Submitted by 核能气质少年 on 2019-12-23 08:35:59
Question: [Closed on Stack Overflow as off-topic.] I was installing scrapy with pip in a virtualenv on OS X 10.11. While installing cryptography, it said:

    building '_openssl' extension
    cc -fno-strict-aliasing -fno-common -dynamic -arch i386 -arch x86_64 -g -Os -pipe -fno-common -fno-strict-aliasing -fwrapv -DENABLE_DTRACE -DMACOSX -DNDEBUG -Wall -Wstrict
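The usual culprit on OS X 10.11 is that Apple stopped shipping OpenSSL headers, so the _openssl extension cannot find them at build time. A commonly suggested workaround, assuming Homebrew is installed:

    # install OpenSSL with Homebrew, then point the compiler and linker at it
    brew install openssl
    env LDFLAGS="-L$(brew --prefix openssl)/lib" \
        CFLAGS="-I$(brew --prefix openssl)/include" \
        pip install cryptography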

Error when deploying scrapy project on the scrapy cloud

Submitted by 喜你入骨 on 2019-12-23 06:44:00
Question: I am using scrapy 0.20 on Python 2.7. I want to deploy my scrapy project on Scrapy Cloud. I developed my scrapy project with a simple spider, navigated to my scrapy project folder, and typed scrapy deploy scrapyd -p koooraspider on cmd, where koooraspider is my project's name and scrapyd is my target. I got the following error:

    D:\Walid-Project\Tasks\koooraspider>scrapy deploy scrapyd -p koooraspider
    Packing version 1395847344
    Deploying to project "koooraspider" in http://dash.scrapinghub.com/api
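For anyone hitting this today: the scrapy deploy command was later removed from Scrapy, and Scrapy Cloud deployments now go through the shub client instead. A minimal sketch; the project ID is a placeholder:

    pip install shub   # the Scrapinghub/Zyte command-line client
    shub login         # prompts for the Scrapy Cloud API key
    shub deploy 12345  # replace 12345 with the real Scrapy Cloud project ID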

Scrapy does not crawl after first page

Submitted by 。_饼干妹妹 on 2019-12-23 06:09:10
Question: I am hitting a dead end with this problem I have been working on for 4 days. I want to crawl "http://www.ledcor.com/careers/search-careers". On each job listing page (e.g. http://www.ledcor.com/careers/search-careers?page=2) I go inside each job link and get the job title. I have this working so far. Now I am trying to make the spider go to the next job listing page (e.g. from http://www.ledcor.com/careers/search-careers?page=2 to http://www.ledcor.com/careers/search-careers?page=3) and crawl all the jobs.
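A minimal sketch of one way to follow that pagination, assuming the ?page=N parameter from the question keeps working until a listing page comes back empty; the XPaths are placeholders, not taken from the site:

    import scrapy

    class LedcorSpider(scrapy.Spider):  # hypothetical spider for illustration
        name = 'ledcor'
        allowed_domains = ['www.ledcor.com']
        start_urls = ['http://www.ledcor.com/careers/search-careers?page=1']

        def parse(self, response):
            job_links = response.xpath('//a[contains(@href, "/careers/")]/@href').extract()  # placeholder XPath
            for href in job_links:
                yield scrapy.Request(response.urljoin(href), callback=self.parse_job)
            if job_links:  # stop once a listing page has no job links
                page = int(response.url.rsplit('=', 1)[-1])
                next_url = 'http://www.ledcor.com/careers/search-careers?page={}'.format(page + 1)
                yield scrapy.Request(next_url, callback=self.parse)

        def parse_job(self, response):
            yield {'title': response.xpath('//h1/text()').extract_first()}  # placeholder XPath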

scrapy startproject tutorial: Error when running this command

Submitted by 百般思念 on 2019-12-23 05:44:06
Question: I got this error when I was trying to create a new Scrapy project.

    C:\Windows\system32>Scrapy startproject tutorial
    c:\Python27\lib\site-packages\twisted\internet\_sslverify.py:184: UserWarning: You do not have the service_identity module installed. Please install it from <https://pypi.python.org/pypi/service_identity>. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification. Many valid
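This is a warning from Twisted rather than a fatal error, and the message names its own fix: install the module it says is missing.

    pip install service_identity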