scrapy

How to scrape addresses from websites using Scrapy? [closed]

有些话、适合烂在心里 submitted on 2019-12-24 14:16:33
Question: Closed. This question needs to be more focused; it is not currently accepting answers. Closed 4 years ago. I am using Scrapy and I need to scrape the address from the contact-us page of a given domain. The domains are provided as results of the Google Search API, so I do not know in advance what the exact structure of the web page is going to be. Is this kind of scraping possible? Any examples?
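Since the page structure is unknown in advance, one hedged approach is to follow links whose text looks like "Contact" and then pattern-match the page text for address-like strings. A rough sketch; the link heuristic and the regex are illustrative assumptions, not a general solution:

```python
import re
import scrapy

class ContactAddressSpider(scrapy.Spider):
    name = "contact_address"
    # In practice, seed this list from the Google Search API results.
    start_urls = ["http://example.com"]  # placeholder domain

    def parse(self, response):
        # Heuristic: follow any link whose text contains "contact" (case-folded).
        hrefs = response.xpath(
            '//a[contains(translate(., "CONTACT", "contact"), "contact")]/@href'
        ).extract()
        for href in hrefs:
            # response.follow requires Scrapy >= 1.4
            yield response.follow(href, callback=self.parse_contact)

    def parse_contact(self, response):
        text = " ".join(response.xpath("//body//text()").extract())
        # Very rough US-style street-address pattern; adapt per country.
        match = re.search(
            r"\d{1,5}\s+\w+(?:\s\w+){0,3}\s+(?:Street|St|Avenue|Ave|Road|Rd|Blvd)\b",
            text,
        )
        if match:
            yield {"url": response.url, "address": match.group(0)}
```

Matching free-form addresses reliably is hard; a dedicated address parser or a geocoding API is usually more robust than a regex.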

Scrapy installation error on Mac OS 10.9.1 using pip

故事扮演 submitted on 2019-12-24 14:13:45
Question: I'm trying to install Scrapy on Mac OS 10.9.1 with sudo pip install scrapy, and the build errors out while compiling the lxml dependency. The compiler invocation is:

cc -fno-strict-aliasing -fno-common -dynamic -arch x86_64 -arch i386 -g -Os -pipe -fno-common -fno-strict-aliasing -fwrapv -mno-fused-madd -DENABLE_DTRACE -DMACOSX -DNDEBUG -Wall -Wstrict-prototypes -Wshorten-64-to-32 -DNDEBUG -g -fwrapv -Os -Wall -Wstrict-prototypes -DENABLE_DTRACE -arch x86_64 -arch i386 -pipe -I/usr/include/libxml2 -I/private/var/folders/k6/g5dx4fj91tdf6f4_28p6fh980000gn/T/pip_build_tommy/lxml/src/lxml

Why is XMLFeedSpider failing to iterate through the designated nodes?

偶尔善良 submitted on 2019-12-24 13:40:27
Question: I'm trying to parse PLoS's RSS feed to pick up new publications. The RSS feed is located here. Below is my spider:

from scrapy.contrib.spiders import XMLFeedSpider

class PLoSSpider(XMLFeedSpider):
    name = "plos"
    itertag = 'entry'
    allowed_domains = ["plosone.org"]
    start_urls = [
        ('http://www.plosone.org/article/feed/search'
         '?unformattedQuery=*%3A*&sort=Date%2C+newest+first')
    ]

    def parse_node(self, response, node):
        pass

This configuration produces the following log output (note the
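The question is cut off above, but a common reason for XMLFeedSpider matching no nodes is that the feed is Atom, whose elements live in an XML namespace, while the default iternodes iterator only matches un-namespaced tags. A minimal sketch of the usual fix, assuming this feed uses the standard Atom namespace:

```python
from scrapy.spiders import XMLFeedSpider  # scrapy.contrib.spiders in older Scrapy

class PLoSSpider(XMLFeedSpider):
    name = "plos"
    allowed_domains = ["plosone.org"]
    start_urls = [
        ('http://www.plosone.org/article/feed/search'
         '?unformattedQuery=*%3A*&sort=Date%2C+newest+first')
    ]
    iterator = 'xml'  # the default 'iternodes' iterator does not handle namespaces
    namespaces = [('atom', 'http://www.w3.org/2005/Atom')]
    itertag = 'atom:entry'  # match <entry> nodes inside the Atom namespace

    def parse_node(self, response, node):
        # node is a Selector scoped to a single <entry> element
        yield {'title': node.xpath('atom:title/text()').extract_first()}
```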

Argument must be bytes or unicode, got list

扶醉桌前 submitted on 2019-12-24 13:33:08
Question: I'm coding a Scrapy project. I've tested everything, but when I parse a page it returns TypeError: Argument must be bytes or unicode, got 'list'. I've tested everything in the shell using this link, and I can't seem to find where the problem is. All of my shell commands returned only one item (i.e. there was no comma). Does anyone know why this might be the case?

from scrapy.spiders import Spider
from scrapy.selector import HtmlXPathSelector
from scrapy.loader import XPathItemLoader
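The spider code is truncated above, but this particular TypeError almost always means a list (typically the output of .extract()) was passed to an API that expects a single string, most often as a request URL. A hypothetical sketch of the failure mode and the fix:

```python
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://example.com"]  # placeholder URL

    def parse(self, response):
        # BUG: .extract() returns a list, but Request(url=...) expects a string:
        #   yield scrapy.Request(url=response.xpath('//a/@href').extract())
        #   -> TypeError: Argument must be bytes or unicode, got 'list'

        # FIX: take a single string instead, e.g. with extract_first():
        href = response.xpath('//a/@href').extract_first()
        if href:
            yield scrapy.Request(url=response.urljoin(href))
```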

Scrapy doesn't crawl the page

你离开我真会死。 submitted on 2019-12-24 13:25:55
Question: I want to crawl the page http://www.jcvi.org/charprotdb/index.cgi/l_search?terms.1.field=all&terms.1.search_text=cancer&submit=+++Search+++&sort.key=organism&sort.order=%2B with Scrapy, but there seems to be a problem: I don't get any data when crawling it. Here is my spider code:

import scrapy
from scrapy.selector import Selector
from scrapy_Data.items import CharProt

class CPSpider(scrapy.Spider):
    name = "CharProt"
    allowed_domains = ["jcvi.org"]
    start_urls = ["http://www.jcvi.org/charprotdb
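The spider is cut off above, but when a crawl completes without yielding data, the first step is to confirm what the spider actually received, either in scrapy shell or by logging inside the callback. A minimal diagnostic sketch; the table XPath is a placeholder, not taken from the question:

```python
import scrapy

class CPSpider(scrapy.Spider):
    name = "CharProt"
    allowed_domains = ["jcvi.org"]
    start_urls = [
        "http://www.jcvi.org/charprotdb/index.cgi/l_search?terms.1.field=all"
        "&terms.1.search_text=cancer&submit=+++Search+++"
        "&sort.key=organism&sort.order=%2B"
    ]

    def parse(self, response):
        # Confirm the page was actually fetched and is not empty or redirected.
        self.logger.info("status=%s url=%s length=%d",
                         response.status, response.url, len(response.body))
        # Placeholder XPath: inspect the real result table in `scrapy shell` first.
        for row in response.xpath('//table//tr'):
            yield {'cells': row.xpath('./td//text()').extract()}
```

If the body turns out to be empty or different from what a browser shows, the results are probably loaded by JavaScript, which plain Scrapy does not execute.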

How to extract data from dynamic websites like Flipkart using selenium and Scrapy?

China☆狼群 submitted on 2019-12-24 12:51:51
Question: Flipkart.com shows only 15 to 20 results on the first page and shows more results when scrolled. Scrapy extracts the results of the first page successfully, but not those of the next pages. I tried using Selenium for it, but couldn't find success. Here is my code:

from scrapy.spider import Spider
from scrapy.selector import Selector
from flipkart.items import FlipkartItem
from scrapy.spider import BaseSpider
from selenium import webdriver

class FlipkartSpider(BaseSpider):
    name = "flip1"
    allowed_domains = [
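The code above is truncated, but the usual pattern for infinite-scroll pages is to let Selenium drive a browser, scroll until the content loads, and then parse driver.page_source with a Scrapy selector. A rough sketch under those assumptions; the URL and the product XPath are illustrative, not Flipkart's real markup:

```python
import time

from scrapy import Spider
from scrapy.selector import Selector
from selenium import webdriver

class FlipkartSpider(Spider):
    name = "flip1"
    allowed_domains = ["flipkart.com"]
    start_urls = ["https://www.flipkart.com/search?q=laptops"]  # illustrative URL

    def parse(self, response):
        driver = webdriver.Firefox()
        try:
            driver.get(response.url)
            for _ in range(5):  # scroll a few times to trigger lazy loading
                driver.execute_script(
                    "window.scrollTo(0, document.body.scrollHeight);")
                time.sleep(2)  # crude wait; WebDriverWait is more robust
            sel = Selector(text=driver.page_source)
            # Placeholder selector for one product card:
            for product in sel.xpath('//div[@class="product"]'):
                yield {'title': product.xpath('.//a/text()').extract_first()}
        finally:
            driver.quit()
```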

Scrapy: No module named items, scraping images

匆匆过客 submitted on 2019-12-24 12:34:15
Question: I'm trying an example that uses Scrapy to download images from a web page. This is the spider file:

from scrapy import Spider, Item, Field, Request
from items import TrousersItem

class TrouserScraper(Spider):
    name, start_urls = "Trousers_spider", ["http://lookatmyfuckingredtrousers.blogspot.co.uk"]

    def parse(self, response):
        for image in response.selector.xpath('//*[contains(@class, "entry-content")]/div[contains(@class, "separator")]/a/img/@src'):
            yield TrousersItem(image_urls=[image.extract(
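Judging from the title, the failure is the bare from items import TrousersItem, which only resolves if items.py happens to be on sys.path. Inside a Scrapy project, items are imported through the project package, and the crawl is started from the project root with scrapy crawl. A sketch, assuming a hypothetical project package named trousers:

```python
from scrapy import Spider

# Import through the project package, not the bare module name;
# "trousers" is a hypothetical project name - match your scrapy.cfg layout.
from trousers.items import TrousersItem

class TrouserScraper(Spider):
    name = "Trousers_spider"
    start_urls = ["http://lookatmyfuckingredtrousers.blogspot.co.uk"]

    def parse(self, response):
        xpath = ('//*[contains(@class, "entry-content")]'
                 '/div[contains(@class, "separator")]/a/img/@src')
        for src in response.xpath(xpath).extract():
            yield TrousersItem(image_urls=[src])
```

For the images to actually download, the images pipeline also has to be enabled in settings.py (ITEM_PIPELINES plus IMAGES_STORE).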

Scrapy needs to crawl all the links on the website and move on to the next page

允我心安 submitted on 2019-12-24 12:22:11
Question: I need my Scrapy spider to move on to the next page. Please give me the correct code for the rule; how should I write it?

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from delh.items import DelhItem

class criticspider(CrawlSpider):
    name = "delh"
    allowed_domains = ["consumercomplaints.in"]
    #start_urls = ["http://www.consumercomplaints.in/?search=delhivery&page=2","http://www.consumercomplaints.in/
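The spider is cut off above, but a pagination rule in a CrawlSpider generally looks like the sketch below. The allow pattern and the extraction XPath are placeholders that must be adapted to consumercomplaints.in's actual URLs and markup:

```python
from scrapy.linkextractors import LinkExtractor  # SgmlLinkExtractor in old Scrapy
from scrapy.spiders import CrawlSpider, Rule

class CriticSpider(CrawlSpider):
    name = "delh"
    allowed_domains = ["consumercomplaints.in"]
    start_urls = ["http://www.consumercomplaints.in/?search=delhivery"]

    rules = (
        # Follow "next page" links and keep crawling (placeholder URL pattern).
        Rule(LinkExtractor(allow=(r'\?search=delhivery&page=\d+',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Placeholder extraction; adapt to the real complaint listing markup.
        for title in response.xpath('//h4/a/text()').extract():
            yield {'title': title}
```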

How to install Portia, a Python application from GitHub (Mac)

不羁岁月 submitted on 2019-12-24 12:09:14
Question: I am attempting to install Portia, a Python app from GitHub: https://github.com/scrapinghub/portia. I use the following steps at the command line: set up a new virtualenv 'portia' in the Mac terminal; git clone https://github.com/scrapinghub/portia.git; follow the readme instructions (cd slyd, then pip install -r requirements.txt); then run Portia (cd slyd, then twistd -n slyd). But every time I attempt the last step to run the program, I get the following error: ImportError: No module named scrapy. Any idea why this error is

Scrapy results are repeating

淺唱寂寞╮ submitted on 2019-12-24 12:00:47
Question: I am trying to get the names of the songs from the site https://pagalworld.me/category/11598/Latest%20Bollywood%20Hindi%20Mp3%20Songs%20-%202017.html using a link extractor, but the results are repeating.

import scrapy
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class RedditSpider(CrawlSpider):
    name = 'pagalworld'
    allowed_domains = ["pagalworld.me"]
    start_urls = ['https://pagalworld.me/category/11598/Latest%20Bollywood%20Hindi
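The spider is truncated above, but repeated items usually mean the rule's callback runs on every listing page and re-extracts the same song links, or the extraction XPath matches each song more than once. A sketch of a tighter setup, where only song pages are scraped and listing pages are merely followed; the allow patterns and XPath are placeholders for the real site structure:

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class PagalworldSpider(CrawlSpider):
    name = 'pagalworld'
    allowed_domains = ["pagalworld.me"]
    start_urls = ['https://pagalworld.me/category/11598/'
                  'Latest%20Bollywood%20Hindi%20Mp3%20Songs%20-%202017.html']

    rules = (
        # Follow pagination pages without scraping them (placeholder pattern).
        Rule(LinkExtractor(allow=(r'/category/11598/',)), follow=True),
        # Scrape only individual song pages (placeholder pattern).
        Rule(LinkExtractor(allow=(r'/files/',)), callback='parse_song'),
    )

    def parse_song(self, response):
        # One item per song page, so the same name is not yielded twice;
        # Scrapy's duplicate request filter also prevents revisiting a URL.
        yield {'name': response.xpath('//h1/text()').extract_first()}
```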