scrapy

Writing a crawler to parse a site in scrapy using BaseSpider

Submitted by ⅰ亾dé卋堺 on 2020-01-06 02:57:07
Question: I am getting confused about how to design the architecture of the crawler. The search results are paginated: there are next-page links to follow, a list of products on each page, and individual product links that must be crawled to get the description. I have the following code:

    def parse_page(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ol[@id=\'result-set\']/li')
        items = []
        for site in sites[:2]:
            item = MyProduct()
            item['product'] = myfilter(site.select('h2/a').select("string()").extract())
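The usual shape for this kind of crawl is two callbacks: one for the listing page (which also follows pagination) and one for the product detail page. A minimal sketch using the modern selector API rather than the deprecated HtmlXPathSelector; the URL, XPaths, and field names are assumptions, not the asker's site:

    import scrapy

    class ProductSpider(scrapy.Spider):
        name = "products"
        start_urls = ["http://example.com/search"]

        def parse(self, response):
            # one request per product link on the listing page
            for href in response.xpath("//ol[@id='result-set']/li/h2/a/@href").extract():
                yield scrapy.Request(response.urljoin(href), callback=self.parse_product)
            # follow the pagination chain with the same callback
            next_page = response.xpath("//a[@rel='next']/@href").extract_first()
            if next_page:
                yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

        def parse_product(self, response):
            # the detail page yields the final item
            yield {
                "product": response.xpath("//h1/text()").extract_first(),
                "description": " ".join(response.xpath("//div[@id='description']//text()").extract()),
            }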

Pass values into scrapy callback

Submitted by 此生再无相见时 on 2020-01-06 02:51:26
Question: I'm trying to get started crawling and scraping a website to disk, but I'm having trouble getting the callback function working the way I would like. The code below visits the start_url and finds all the "a" tags on the site. For each of them it issues a callback, which saves the text response to disk and uses the crawlerItem to store some metadata about the page. I was hoping someone could help me figure out how to pass a unique id to each callback so it can be used as the filename when …
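One well-known way to do this is to attach the id to the request itself, so each callback can read it back. A minimal sketch using Request.meta (Scrapy 1.7+ also offers cb_kwargs for the same purpose); the spider name, URL, and page_id key are illustrative:

    import uuid
    import scrapy

    class SaveSpider(scrapy.Spider):
        name = "save"
        start_urls = ["http://example.com/"]

        def parse(self, response):
            for href in response.xpath("//a/@href").extract():
                yield scrapy.Request(
                    response.urljoin(href),
                    callback=self.save_page,
                    meta={"page_id": uuid.uuid4().hex},  # unique per request
                )

        def save_page(self, response):
            page_id = response.meta["page_id"]  # read the value back in the callback
            with open(page_id + ".html", "wb") as f:
                f.write(response.body)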

Installing Scrapy with Anaconda

Submitted by 梦想的初衷 on 2020-01-05 21:20:08
Copyright notice: this is the author's original post and may not be reproduced without permission.

1. Download Anaconda, using the Tsinghua mirror: https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/ Installation is just clicking Next through the wizard. The first checkbox asks whether to add Anaconda to the PATH; the second asks whether to make Anaconda's bundled Python 3.6 the system default.

2. After Anaconda is installed, point its package manager at a domestic mirror by running these two commands in cmd:

    conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
    conda config --set show_channel_urls yes

3. Install Scrapy. Running conda install scrapy directly fails with the following error:

    C:\Windows\System32>conda install scrapy
    Fetching package metadata .....
    CondaHTTPError: HTTP None None for url <https://repo.continuum.io/pkgs/free/win-64/repodata.json.bz2>
    Elapsed: None
    An
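Once the install finally succeeds, a quick sanity check from Python confirms that Scrapy is importable; the exact version printed depends on which channel served the package:

    import scrapy

    # prints a tuple such as (1, 5, 0); varies with the installed release
    print(scrapy.version_info)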

How to avoid redirection of the webcrawler to the mobile edition?

Submitted by 时间秒杀一切 on 2020-01-05 12:17:23
Question: I subclassed a CrawlSpider and want to extract data from a website. However, I always get redirected to the site's mobile version. I tried changing the USER_AGENT variable in Scrapy's settings to Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1, but I still get redirected. Is there another way to signal a different client and avoid the redirection?

Answer 1: There are two types of redirection supported in Scrapy: RedirectMiddleware, which handles redirection of requests …
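For the user-agent route, one option is to pin the desktop User-Agent on the spider itself rather than in the global settings; whether this defeats the mobile redirect depends on how the site detects clients (some key off cookies instead). A sketch with an illustrative Chrome UA string:

    import scrapy

    class DesktopSpider(scrapy.Spider):
        name = "desktop"
        start_urls = ["http://example.com/"]

        # per-spider override; the UA string below is illustrative
        custom_settings = {
            "USER_AGENT": (
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36"
            ),
        }

        def parse(self, response):
            self.logger.info("landed on %s", response.url)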

scrapy: request url must be str or unicode got list

Submitted by 可紊 on 2020-01-05 12:16:20
Question: I can't quite figure out what's wrong with this code. I would like to scrape the first page and then, for each link on that page, go to the second page to extract the item description. When I run the code below, I get: exceptions.TypeError: url must be str or unicode, got list. Here is my code:

    from scrapy.spider import Spider
    from scrapy.selector import Selector
    from scrapy.http import Request
    from scrapy.item import Item, Field
    from scrapy.contrib.loader import ItemLoader
    from scrapy
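This error almost always means an extract() result (a list) was passed straight to Request. A sketch of the usual fix: take a single string with extract_first() before building the request (URL and XPaths are illustrative):

    import scrapy
    from scrapy.http import Request

    class DescriptionSpider(scrapy.Spider):
        name = "descriptions"
        start_urls = ["http://example.com/"]

        def parse(self, response):
            # extract() returns a list; Request needs a single string,
            # so take one match with extract_first()
            url = response.xpath("//a[@class='item']/@href").extract_first()
            if url:
                yield Request(response.urljoin(url), callback=self.parse_item)

        def parse_item(self, response):
            yield {"description": response.xpath("//p/text()").extract_first()}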

SplashRequest - Cannot get data attribute

Submitted by 痴心易碎 on 2020-01-05 08:28:12
Question: I'm struggling to find out why I receive the error: AttributeError: 'HtmlResponse' object has no attribute 'data'. From the documentation: SplashJsonResponse provides extra features: the response.data attribute contains response data decoded from JSON; you can access it like response.data['html']. Here is my sample code:

    class HeadphonesSpider(scrapy.Spider):
        name = "headphones"
        handle_httpstatus_list = [404]

        def start_requests(self):
            splash_args = {
                'html': 1,
                'png': 1,
                'width': 600,
                'render_all': 1,
            }
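A plausible cause, based on the scrapy-splash documentation quoted above: response.data exists only on SplashJsonResponse, which is returned by JSON endpoints such as render.json; the default render.html endpoint yields a plain HtmlResponse. A sketch of the request built against the JSON endpoint (URL is illustrative):

    import scrapy
    from scrapy_splash import SplashRequest

    class HeadphonesSpider(scrapy.Spider):
        name = "headphones"

        def start_requests(self):
            splash_args = {"html": 1, "png": 1, "width": 600, "render_all": 1}
            yield SplashRequest(
                "http://example.com/",
                callback=self.parse,
                endpoint="render.json",  # JSON endpoint -> SplashJsonResponse
                args=splash_args,
            )

        def parse(self, response):
            # .data holds the decoded JSON; 'html' and 'png' match splash_args
            self.logger.info("rendered html: %d chars", len(response.data["html"]))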

Use Django's models in a Scrapy project (in the pipeline)

Submitted by 半世苍凉 on 2020-01-05 07:11:08
Question: This has been asked before, but the answer that always comes up is to use DjangoItem. However, its GitHub page states that it is: often not a good choice for write-intensive applications (such as a web crawler) ... may not scale well. This is the crux of my problem: I'd like to use and interact with my Django models the same way I can when I run python manage.py shell and do from myapp.models import Model1, using queries like those seen here. I have tried relative imports and moving my whole scrapy …
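One common alternative to DjangoItem is to bootstrap Django manually inside the Scrapy project so the real models become importable. A minimal sketch, assuming the question's myapp.models.Model1 and a settings module named myproject.settings (both names illustrative), and that the item's fields line up with the model's:

    import os
    import django

    # configure Django before importing any model (settings module name is an assumption)
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")
    django.setup()

    from myapp.models import Model1  # now usable exactly as in manage.py shell

    class DjangoWriterPipeline(object):
        def process_item(self, item, spider):
            # field names must match between the Scrapy item and the Django model
            Model1.objects.create(**dict(item))
            return item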

python scrapy conversion to exe file using pyinstaller

Submitted by 半城伤御伤魂 on 2020-01-05 04:27:13
Question: I am trying to convert a Scrapy script to an exe file. The main.py file looks like this:

    from scrapy.crawler import CrawlerProcess
    from amazon.spiders.amazon_scraper import Spider

    spider = Spider()
    process = CrawlerProcess({
        'FEED_FORMAT': 'csv',
        'FEED_URI': 'data.csv',
        'DOWNLOAD_DELAY': 3,
        'RANDOMIZE_DOWNLOAD_DELAY': True,
        'ROTATING_PROXY_LIST_PATH': 'proxies.txt',
        'USER_AGENT_LIST': 'useragents.txt',
        'DOWNLOADER_MIDDLEWARES': {
            'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
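Whatever else goes wrong here, PyInstaller's static analysis is known to miss the data files Scrapy reads at runtime (its VERSION and mime.types files), a frequent cause of the frozen exe crashing. A small helper, a sketch only, to locate those files so they can be passed to PyInstaller's --add-data option:

    import os
    import scrapy

    # Scrapy reads these two files at runtime; a frozen exe must bundle them
    scrapy_dir = os.path.dirname(scrapy.__file__)
    for fname in ("VERSION", "mime.types"):
        # feed each printed path to PyInstaller via --add-data, targeting "scrapy"
        print(os.path.join(scrapy_dir, fname))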

Scrapy ImportError: No module named Item

Submitted by 青春壹個敷衍的年華 on 2020-01-05 04:10:15
Question: I know that this question has already been widely discussed, but I didn't find an answer. I'm getting the error ImportError: No module named items. I created a new project with $ scrapy startproject pluto, and I have no duplicate names (in the project name, class names, etc.) that could cause a naming conflict. pluto_spider.py:

    import scrapy
    from items import PlutoItem

    class PlutoSpider(scrapy.Spider):
        name = "plutoProj"
        allowed_domains = ['successories.com']
        start_urls = [
            'http://www.successories.com/iquote
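The usual explanation: scrapy startproject pluto places items.py inside the pluto package, so a bare from items import PlutoItem fails when the spider runs from the project root. A sketch of the package-qualified import, assuming the default project layout:

    # bare import fails: items.py is not a top-level module
    # from items import PlutoItem

    # package-qualified import matches the layout that startproject creates
    from pluto.items import PlutoItem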