scrapy

How to scrape addresses from websites using Scrapy? [closed]

有些话、适合烂在心里 submitted on 2019-12-24 14:16:33
Question: Closed. This question needs to be more focused; it is not currently accepting answers. Closed 4 years ago. I am using Scrapy and I need to scrape the address from the contact-us page of a given domain. The domains are provided as results of the Google Search API, so I do not know in advance what the exact structure of the web page is going to be. Is this kind of scraping possible? Any examples?
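Since the page structure is unknown in advance, one hedged approach is to follow links whose text looks like "Contact" and then pattern-match the page text for address-like strings. A rough sketch; the link heuristic and the regex are illustrative assumptions, not a general solution:

```python
import re
import scrapy

class ContactAddressSpider(scrapy.Spider):
    name = "contact_address"
    # In practice, seed this list from the Google Search API results.
    start_urls = ["http://example.com"]  # placeholder domain

    def parse(self, response):
        # Heuristic: follow any link whose text contains "contact" (case-folded).
        hrefs = response.xpath(
            '//a[contains(translate(., "CONTACT", "contact"), "contact")]/@href'
        ).extract()
        for href in hrefs:
            # response.follow requires Scrapy >= 1.4
            yield response.follow(href, callback=self.parse_contact)

    def parse_contact(self, response):
        text = " ".join(response.xpath("//body//text()").extract())
        # Very rough US-style street-address pattern; adapt per country.
        match = re.search(
            r"\d{1,5}\s+\w+(?:\s\w+){0,3}\s+(?:Street|St|Avenue|Ave|Road|Rd|Blvd)\b",
            text,
        )
        if match:
            yield {"url": response.url, "address": match.group(0)}
```

Matching free-form addresses reliably is hard; a dedicated address parser or a geocoding API is usually more robust than a regex.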

Scrapy installation error on Mac OS 10.9.1 using pip

故事扮演 submitted on 2019-12-24 14:13:45
Question: I'm trying to install Scrapy on Mac OS 10.9.1 with sudo pip install scrapy, and the build errors out while compiling the lxml dependency. The compiler invocation is:

cc -fno-strict-aliasing -fno-common -dynamic -arch x86_64 -arch i386 -g -Os -pipe -fno-common -fno-strict-aliasing -fwrapv -mno-fused-madd -DENABLE_DTRACE -DMACOSX -DNDEBUG -Wall -Wstrict-prototypes -Wshorten-64-to-32 -DNDEBUG -g -fwrapv -Os -Wall -Wstrict-prototypes -DENABLE_DTRACE -arch x86_64 -arch i386 -pipe -I/usr/include/libxml2 -I/private/var/folders/k6/g5dx4fj91tdf6f4_28p6fh980000gn/T/pip_build_tommy/lxml/src/lxml

Why is XMLFeedSpider failing to iterate through the designated nodes?

偶尔善良 submitted on 2019-12-24 13:40:27
Question: I'm trying to parse PLoS's RSS feed to pick up new publications. The RSS feed is located here. Below is my spider:

from scrapy.contrib.spiders import XMLFeedSpider

class PLoSSpider(XMLFeedSpider):
    name = "plos"
    itertag = 'entry'
    allowed_domains = ["plosone.org"]
    start_urls = [
        ('http://www.plosone.org/article/feed/search'
         '?unformattedQuery=*%3A*&sort=Date%2C+newest+first')
    ]

    def parse_node(self, response, node):
        pass

This configuration produces the following log output (note the
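The question is cut off above, but a common reason for XMLFeedSpider matching no nodes is that the feed is Atom, whose elements live in an XML namespace, while the default iternodes iterator only matches un-namespaced tags. A minimal sketch of the usual fix, assuming this feed uses the standard Atom namespace:

```python
from scrapy.spiders import XMLFeedSpider  # scrapy.contrib.spiders in older Scrapy

class PLoSSpider(XMLFeedSpider):
    name = "plos"
    allowed_domains = ["plosone.org"]
    start_urls = [
        ('http://www.plosone.org/article/feed/search'
         '?unformattedQuery=*%3A*&sort=Date%2C+newest+first')
    ]
    iterator = 'xml'  # the default 'iternodes' iterator does not handle namespaces
    namespaces = [('atom', 'http://www.w3.org/2005/Atom')]
    itertag = 'atom:entry'  # match <entry> nodes inside the Atom namespace

    def parse_node(self, response, node):
        # node is a Selector scoped to a single <entry> element
        yield {'title': node.xpath('atom:title/text()').extract_first()}
```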

Argument must be bytes or unicode, got list

扶醉桌前 submitted on 2019-12-24 13:33:08
Question: I'm coding a Scrapy project. I've tested everything, but when I parse a page it returns TypeError: Argument must be bytes or unicode, got 'list'. I've tested everything in the shell using this link, and I can't seem to find where the problem is. All of my shell commands returned only one item (i.e. there was no comma). Does anyone know why this might be the case?

from scrapy.spiders import Spider
from scrapy.selector import HtmlXPathSelector
from scrapy.loader import XPathItemLoader
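The spider code is truncated above, but this particular TypeError almost always means a list (typically the output of .extract()) was passed to an API that expects a single string, most often as a request URL. A hypothetical sketch of the failure mode and the fix:

```python
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://example.com"]  # placeholder URL

    def parse(self, response):
        # BUG: .extract() returns a list, but Request(url=...) expects a string:
        #   yield scrapy.Request(url=response.xpath('//a/@href').extract())
        #   -> TypeError: Argument must be bytes or unicode, got 'list'

        # FIX: take a single string instead, e.g. with extract_first():
        href = response.xpath('//a/@href').extract_first()
        if href:
            yield scrapy.Request(url=response.urljoin(href))
```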

Scrapy doesn't crawl the page

你离开我真会死。 submitted on 2019-12-24 13:25:55
Question: I want to crawl the page http://www.jcvi.org/charprotdb/index.cgi/l_search?terms.1.field=all&terms.1.search_text=cancer&submit=+++Search+++&sort.key=organism&sort.order=%2B with Scrapy, but there seems to be a problem: I don't get any data when crawling it. Here is my spider code:

import scrapy
from scrapy.selector import Selector
from scrapy_Data.items import CharProt

class CPSpider(scrapy.Spider):
    name = "CharProt"
    allowed_domains = ["jcvi.org"]
    start_urls = ["http://www.jcvi.org/charprotdb
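The spider is cut off above, but when a crawl completes without yielding data, the first step is to confirm what the spider actually received, either in scrapy shell or by logging inside the callback. A minimal diagnostic sketch; the table XPath is a placeholder, not taken from the question:

```python
import scrapy

class CPSpider(scrapy.Spider):
    name = "CharProt"
    allowed_domains = ["jcvi.org"]
    start_urls = [
        "http://www.jcvi.org/charprotdb/index.cgi/l_search?terms.1.field=all"
        "&terms.1.search_text=cancer&submit=+++Search+++"
        "&sort.key=organism&sort.order=%2B"
    ]

    def parse(self, response):
        # Confirm the page was actually fetched and is not empty or redirected.
        self.logger.info("status=%s url=%s length=%d",
                         response.status, response.url, len(response.body))
        # Placeholder XPath: inspect the real result table in `scrapy shell` first.
        for row in response.xpath('//table//tr'):
            yield {'cells': row.xpath('./td//text()').extract()}
```

If the body turns out to be empty or different from what a browser shows, the results are probably loaded by JavaScript, which plain Scrapy does not execute.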

How to extract data from dynamic websites like Flipkart using selenium and Scrapy?

China☆狼群 submitted on 2019-12-24 12:51:51
Question: Flipkart.com shows only 15 to 20 results on the first page and shows more results when scrolled. Scrapy extracts the results of the first page successfully, but not those of the next pages. I tried using Selenium for it, but couldn't find success. Here is my code:

from scrapy.spider import Spider
from scrapy.selector import Selector
from flipkart.items import FlipkartItem
from scrapy.spider import BaseSpider
from selenium import webdriver

class FlipkartSpider(BaseSpider):
    name = "flip1"
    allowed_domains = [
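The code above is truncated, but the usual pattern for infinite-scroll pages is to let Selenium drive a browser, scroll until the content loads, and then parse driver.page_source with a Scrapy selector. A rough sketch under those assumptions; the URL and the product XPath are illustrative, not Flipkart's real markup:

```python
import time

from scrapy import Spider
from scrapy.selector import Selector
from selenium import webdriver

class FlipkartSpider(Spider):
    name = "flip1"
    allowed_domains = ["flipkart.com"]
    start_urls = ["https://www.flipkart.com/search?q=laptops"]  # illustrative URL

    def parse(self, response):
        driver = webdriver.Firefox()
        try:
            driver.get(response.url)
            for _ in range(5):  # scroll a few times to trigger lazy loading
                driver.execute_script(
                    "window.scrollTo(0, document.body.scrollHeight);")
                time.sleep(2)  # crude wait; WebDriverWait is more robust
            sel = Selector(text=driver.page_source)
            # Placeholder selector for one product card:
            for product in sel.xpath('//div[@class="product"]'):
                yield {'title': product.xpath('.//a/text()').extract_first()}
        finally:
            driver.quit()
```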

Scrapy: No module named items, scraping images

匆匆过客 submitted on 2019-12-24 12:34:15
Question: I'm trying an example that uses Scrapy to download images from a web page. This is the spider file:

from scrapy import Spider, Item, Field, Request
from items import TrousersItem

class TrouserScraper(Spider):
    name, start_urls = "Trousers_spider", ["http://lookatmyfuckingredtrousers.blogspot.co.uk"]

    def parse(self, response):
        for image in response.selector.xpath('//*[contains(@class, "entry-content")]/div[contains(@class, "separator")]/a/img/@src'):
            yield TrousersItem(image_urls=[image.extract(
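Judging from the title, the failure is the bare from items import TrousersItem, which only resolves if items.py happens to be on sys.path. Inside a Scrapy project, items are imported through the project package, and the crawl is started from the project root with scrapy crawl. A sketch, assuming a hypothetical project package named trousers:

```python
from scrapy import Spider

# Import through the project package, not the bare module name;
# "trousers" is a hypothetical project name - match your scrapy.cfg layout.
from trousers.items import TrousersItem

class TrouserScraper(Spider):
    name = "Trousers_spider"
    start_urls = ["http://lookatmyfuckingredtrousers.blogspot.co.uk"]

    def parse(self, response):
        xpath = ('//*[contains(@class, "entry-content")]'
                 '/div[contains(@class, "separator")]/a/img/@src')
        for src in response.xpath(xpath).extract():
            yield TrousersItem(image_urls=[src])
```

For the images to actually download, the images pipeline also has to be enabled in settings.py (ITEM_PIPELINES plus IMAGES_STORE).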

Scrapy needs to crawl all the links on the website and move on to the next page

允我心安 submitted on 2019-12-24 12:22:11
Question: I need my Scrapy spider to move on to the next page. Please give me the correct code for the rule; how should I write it?

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from delh.items import DelhItem

class criticspider(CrawlSpider):
    name = "delh"
    allowed_domains = ["consumercomplaints.in"]
    #start_urls = ["http://www.consumercomplaints.in/?search=delhivery&page=2","http://www.consumercomplaints.in/
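The spider is cut off above, but a pagination rule in a CrawlSpider generally looks like the sketch below. The allow pattern and the extraction XPath are placeholders that must be adapted to consumercomplaints.in's actual URLs and markup:

```python
from scrapy.linkextractors import LinkExtractor  # SgmlLinkExtractor in old Scrapy
from scrapy.spiders import CrawlSpider, Rule

class CriticSpider(CrawlSpider):
    name = "delh"
    allowed_domains = ["consumercomplaints.in"]
    start_urls = ["http://www.consumercomplaints.in/?search=delhivery"]

    rules = (
        # Follow "next page" links and keep crawling (placeholder URL pattern).
        Rule(LinkExtractor(allow=(r'\?search=delhivery&page=\d+',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Placeholder extraction; adapt to the real complaint listing markup.
        for title in response.xpath('//h4/a/text()').extract():
            yield {'title': title}
```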

How to install Portia, a Python application from GitHub (Mac)

不羁岁月 submitted on 2019-12-24 12:09:14
Question: I am attempting to install Portia, a Python app from GitHub: https://github.com/scrapinghub/portia. I use the following steps at the command line: set up a new virtualenv 'portia' in the Mac terminal; git clone https://github.com/scrapinghub/portia.git; follow the readme instructions (cd slyd, then pip install -r requirements.txt); then run Portia (cd slyd, then twistd -n slyd). But every time I attempt the last step to run the program, I get the following error: ImportError: No module named scrapy. Any idea why this error is

Scrapy results are repeating

淺唱寂寞╮ submitted on 2019-12-24 12:00:47
Question: I am trying to get the names of the songs from the site https://pagalworld.me/category/11598/Latest%20Bollywood%20Hindi%20Mp3%20Songs%20-%202017.html using a link extractor, but the results are repeating.

import scrapy
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class RedditSpider(CrawlSpider):
    name = 'pagalworld'
    allowed_domains = ["pagalworld.me"]
    start_urls = ['https://pagalworld.me/category/11598/Latest%20Bollywood%20Hindi
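The spider is truncated above, but repeated items usually mean the rule's callback runs on every listing page and re-extracts the same song links, or the extraction XPath matches each song more than once. A sketch of a tighter setup, where only song pages are scraped and listing pages are merely followed; the allow patterns and XPath are placeholders for the real site structure:

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class PagalworldSpider(CrawlSpider):
    name = 'pagalworld'
    allowed_domains = ["pagalworld.me"]
    start_urls = ['https://pagalworld.me/category/11598/'
                  'Latest%20Bollywood%20Hindi%20Mp3%20Songs%20-%202017.html']

    rules = (
        # Follow pagination pages without scraping them (placeholder pattern).
        Rule(LinkExtractor(allow=(r'/category/11598/',)), follow=True),
        # Scrape only individual song pages (placeholder pattern).
        Rule(LinkExtractor(allow=(r'/files/',)), callback='parse_song'),
    )

    def parse_song(self, response):
        # One item per song page, so the same name is not yielded twice;
        # Scrapy's duplicate request filter also prevents revisiting a URL.
        yield {'name': response.xpath('//h1/text()').extract_first()}
```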