scrapy

Crawling a company's job postings with Python and Scrapy

Submitted by 送分小仙女 on 2019-12-23 16:33:44
1. Create the project: scrapy startproject gosuncn
2. Generate the spider: cd gosuncn, then scrapy genspider gaoxinxing gosuncn.zhiye.com
3. Run the spider: scrapy crawl gaoxinxing
4. Code for gaoxinxing.py:

    # -*- coding: utf-8 -*-
    import scrapy
    import logging

    logger = logging.getLogger(__name__)  # set up a module-level logger

    class GaoxinxingSpider(scrapy.Spider):
        name = 'gaoxinxing'
        allowed_domains = ['gosuncn.zhiye.com']
        start_urls = ['http://gosuncn.zhiye.com/Social']
        next_page_num = 1

        def parse(self, response):
            tr_list = response.xpath("//table[@class='jobsTable']/tr")[1:]  # skip the header row
            #print(tr_list)
            for tr in tr_list:
                item = {}
                item["position"] = tr.xpath(".//td[1]/a/text()").extract_first()
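The excerpt above cuts off inside the loop. A minimal sketch of how such a spider typically finishes the callback and pages through the listing; the PageIndex parameter and the stop condition are assumptions, not taken from the original post:

    import scrapy

    class GaoxinxingSpider(scrapy.Spider):
        name = 'gaoxinxing'
        allowed_domains = ['gosuncn.zhiye.com']
        start_urls = ['http://gosuncn.zhiye.com/Social']
        next_page_num = 1

        def parse(self, response):
            tr_list = response.xpath("//table[@class='jobsTable']/tr")[1:]
            for tr in tr_list:
                item = {"position": tr.xpath(".//td[1]/a/text()").extract_first()}
                yield item  # hand each row to the configured pipelines
            if tr_list:  # stop once a listing page has no job rows
                self.next_page_num += 1
                # hypothetical paging URL; the real site's parameter may differ
                next_url = "http://gosuncn.zhiye.com/Social?PageIndex={}".format(self.next_page_num)
                yield scrapy.Request(next_url, callback=self.parse)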

scrapy python Request is not defined

Submitted by 老子叫甜甜 on 2019-12-23 15:38:12
Question: I found an answer here with this code:

    for site in sites:
        Link = site.xpath('a/@href').extract()
        CompleteLink = urlparse.urljoin(response.url, Link)
        yield Request(Link, callback=self.parseOneCar)

I got this exception:

    exceptions.NameError: global name 'Request' is not defined

What should I import, please?

Answer 1: Short answer:

    from scrapy.http.request import Request

Extended answer: read the docs.

Source: https://stackoverflow.com/questions/21121941/scrapy-python-request-is-not-defined
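For context, a minimal sketch showing that import inside a working callback; the spider name, selector, and callback are illustrative, not from the question (on current Scrapy versions, from scrapy import Request works as well):

    import urlparse  # Python 2, matching the question; on Python 3 use urllib.parse

    import scrapy
    from scrapy.http.request import Request  # resolves the NameError

    class CarsSpider(scrapy.Spider):  # hypothetical spider for illustration
        name = 'cars'
        start_urls = ['http://example.com/cars']

        def parse(self, response):
            for site in response.xpath('//div[@class="car"]'):  # placeholder selector
                link = site.xpath('a/@href').extract_first()  # extract() returns a list; take one string
                complete_link = urlparse.urljoin(response.url, link)  # resolve relative URLs
                yield Request(complete_link, callback=self.parse_one_car)

        def parse_one_car(self, response):
            pass  # per-car parsing would go here

Note the question's original snippet also passed the list returned by extract() straight into Request; the sketch extracts a single string and joins it against the page URL first.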

Scrapy scrapes data but no output to file

Submitted by 懵懂的女人 on 2019-12-23 15:30:27
Question: I've been getting blank JSON files despite successfully being able to execute most of the lines in scrapy shell. When I run the command scrapy crawl courses, with my courses bot being:

    from scrapy.spiders import CrawlSpider
    from scrapy.linkextractors import LinkExtractor
    from tutorial.items import CoursesItem
    from bs4 import BeautifulSoup
    import scrapy

    class CoursesSpider(CrawlSpider):
        name = 'courses'
        allowed_domains = ['guide.berkeley.edu']
        start_urls = ['http://guide.berkeley.edu/courses
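A frequent cause of empty output with CrawlSpider is overriding its reserved parse() method instead of routing links through rules to a differently named callback. A minimal sketch under that assumption; the link pattern and XPath are placeholders, not taken from the question:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    from tutorial.items import CoursesItem

    class CoursesSpider(CrawlSpider):
        name = 'courses'
        allowed_domains = ['guide.berkeley.edu']
        start_urls = ['http://guide.berkeley.edu/courses/']

        # CrawlSpider uses parse() internally, so the callback must have another name
        rules = (
            Rule(LinkExtractor(allow=r'/courses/'), callback='parse_course', follow=True),
        )

        def parse_course(self, response):
            item = CoursesItem()
            item['title'] = response.xpath('//h1/text()').extract_first()  # placeholder XPath
            yield item  # feed exports only see items that are actually yielded

Running scrapy crawl courses -o courses.json then writes whatever the callbacks yield to the file.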

Extracting Images in Scrapy

Submitted by 核能气质少年 on 2019-12-23 12:07:48
Question: I've read through a few other answers here but I'm missing something fundamental. I'm trying to extract the images from a website with a CrawlSpider.

settings.py:

    BOT_NAME = 'healthycomm'
    SPIDER_MODULES = ['healthycomm.spiders']
    NEWSPIDER_MODULE = 'healthycomm.spiders'
    ITEM_PIPELINES = {'scrapy.contrib.pipeline.images.ImagesPipeline': 1}
    IMAGES_STORE = '~/Desktop/scrapy_nsml/healthycomm/images'

items.py:

    class HealthycommItem(scrapy.Item):
        page_heading = scrapy.Field()
        page_title = scrapy.Field
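For the built-in ImagesPipeline to download anything, the item also needs image_urls and images fields, and the spider must fill image_urls with absolute URLs. A minimal sketch of the item under that documented contract (in later Scrapy releases the pipeline moved from scrapy.contrib.pipeline.images to scrapy.pipelines.images.ImagesPipeline):

    import scrapy

    class HealthycommItem(scrapy.Item):
        page_heading = scrapy.Field()
        page_title = scrapy.Field()
        image_urls = scrapy.Field()  # ImagesPipeline reads source URLs from this field
        images = scrapy.Field()      # ImagesPipeline writes download results here

    # In a spider callback, populate the URLs (made absolute) before yielding:
    # item['image_urls'] = [response.urljoin(src) for src in response.xpath('//img/@src').extract()]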

Using beautiful soup to clean up scraped HTML from scrapy

Submitted by 时间秒杀一切 on 2019-12-23 09:58:08
Question: I'm using scrapy to try and scrape some data that I need off Google Scholar. Consider, as an example, the following link: http://scholar.google.com/scholar?q=intitle%3Apython+xpath

Now, I'd like to scrape all the titles off this page. The process that I am following is as follows:

    scrapy shell "http://scholar.google.com/scholar?q=intitle%3Apython+xpath"

which gives me the scrapy shell, inside which I do:

    >>> sel.xpath('//h3[@class="gs_rt"]/a').extract()
    [ u'<a href="http://citeseerx.ist.psu
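As the title of the question suggests, one way to strip the tags from those extracted fragments is to pass each one through BeautifulSoup. A minimal sketch using the selector from the question; sel is the shell's selector object:

    from bs4 import BeautifulSoup

    raw_links = sel.xpath('//h3[@class="gs_rt"]/a').extract()  # HTML fragments as strings
    # get_text() drops the <a> wrapper and any nested <b> highlighting
    titles = [BeautifulSoup(fragment, 'html.parser').get_text() for fragment in raw_links]

Staying inside Scrapy, sel.xpath('//h3[@class="gs_rt"]/a//text()').extract() gets close as well, though it returns each highlighted substring as a separate list entry rather than one title per result.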

error occurs when installing cryptography for scrapy in virtualenv on OS X [closed]

Submitted by 核能气质少年 on 2019-12-23 08:35:59
Question: [Closed on Stack Overflow as off-topic.] I was installing scrapy with pip in a virtualenv on OS X 10.11. While installing cryptography, it said:

    building '_openssl' extension
    cc -fno-strict-aliasing -fno-common -dynamic -arch i386 -arch x86_64 -g -Os -pipe -fno-common -fno-strict-aliasing -fwrapv -DENABLE_DTRACE -DMACOSX -DNDEBUG -Wall -Wstrict
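The usual culprit on OS X 10.11 is that Apple stopped shipping OpenSSL headers, so the _openssl extension cannot find them at build time. A commonly suggested workaround, assuming Homebrew is installed:

    # install OpenSSL with Homebrew, then point the compiler and linker at it
    brew install openssl
    env LDFLAGS="-L$(brew --prefix openssl)/lib" \
        CFLAGS="-I$(brew --prefix openssl)/include" \
        pip install cryptography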

Error when deploying scrapy project on the scrapy cloud

Submitted by 喜你入骨 on 2019-12-23 06:44:00
Question: I am using scrapy 0.20 on Python 2.7. I want to deploy my scrapy project on Scrapy Cloud. I developed my scrapy project with a simple spider, navigated to my scrapy project folder, and typed scrapy deploy scrapyd -p koooraspider on cmd, where koooraspider is my project's name and scrapyd is my target. I got the following error:

    D:\Walid-Project\Tasks\koooraspider>scrapy deploy scrapyd -p koooraspider
    Packing version 1395847344
    Deploying to project "koooraspider" in http://dash.scrapinghub.com/api
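For anyone hitting this today: the scrapy deploy command was later removed from Scrapy, and Scrapy Cloud deployments now go through the shub client instead. A minimal sketch; the project ID is a placeholder:

    pip install shub   # the Scrapinghub/Zyte command-line client
    shub login         # prompts for the Scrapy Cloud API key
    shub deploy 12345  # replace 12345 with the real Scrapy Cloud project ID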

Scrapy does not crawl after first page

Submitted by 。_饼干妹妹 on 2019-12-23 06:09:10
Question: I am hitting a dead end with this problem I have been working on for 4 days. I want to crawl "http://www.ledcor.com/careers/search-careers". On each job listing page (e.g. http://www.ledcor.com/careers/search-careers?page=2) I go inside each job link and get the job title. I have this working so far. Now I am trying to make the spider go to the next job listing page (e.g. from http://www.ledcor.com/careers/search-careers?page=2 to http://www.ledcor.com/careers/search-careers?page=3) and crawl all the jobs.
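A minimal sketch of one way to follow that pagination, assuming the ?page=N parameter from the question keeps working until a listing page comes back empty; the XPaths are placeholders, not taken from the site:

    import scrapy

    class LedcorSpider(scrapy.Spider):  # hypothetical spider for illustration
        name = 'ledcor'
        allowed_domains = ['www.ledcor.com']
        start_urls = ['http://www.ledcor.com/careers/search-careers?page=1']

        def parse(self, response):
            job_links = response.xpath('//a[contains(@href, "/careers/")]/@href').extract()  # placeholder XPath
            for href in job_links:
                yield scrapy.Request(response.urljoin(href), callback=self.parse_job)
            if job_links:  # stop once a listing page has no job links
                page = int(response.url.rsplit('=', 1)[-1])
                next_url = 'http://www.ledcor.com/careers/search-careers?page={}'.format(page + 1)
                yield scrapy.Request(next_url, callback=self.parse)

        def parse_job(self, response):
            yield {'title': response.xpath('//h1/text()').extract_first()}  # placeholder XPath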

scrapy startproject tutorial: Error when running this command

Submitted by 百般思念 on 2019-12-23 05:44:06
Question: I got this error when I was trying to create a new Scrapy project.

    C:\Windows\system32>Scrapy startproject tutorial
    c:\Python27\lib\site-packages\twisted\internet\_sslverify.py:184: UserWarning: You do not have the service_identity module installed. Please install it from <https://pypi.python.org/pypi/service_identity>. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification. Many valid
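This is a warning from Twisted rather than a fatal error, and the message names its own fix: install the module it says is missing.

    pip install service_identity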