scrapy

Issue with scraping JS rendered page with Scrapy and Splash

空扰寡人 submitted on 2019-12-24 11:59:53
Question: I'm trying to scrape this page, which, according to Chrome, includes the following HTML: <p class="title"> Orange Paired </p>. This is my spider:

    import scrapy
    from scrapy_splash import SplashRequest

    class MySpider(scrapy.Spider):
        name = "splash"
        allowed_domains = ["phillips.com"]
        start_urls = ["https://www.phillips.com/detail/BRIDGET-RILEY/UK010417/19"]

        def start_requests(self):
            for url in self.start_urls:
                yield SplashRequest(
                    url, self.parse,
                    endpoint='render.json',
                    args={'har': 1, 'html': 1}
                )

        def
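The spider is cut off right at its parse callback. With the render.json endpoint, scrapy-splash exposes the decoded JSON as response.data, and the JavaScript-rendered page sits under its 'html' key. A minimal, standard-library sketch of the extraction that the missing parse() would perform (the helper name and the single-class markup are assumptions, not the asker's code):

```python
from html.parser import HTMLParser

class TitleTextExtractor(HTMLParser):
    """Collects the text inside <p class="title">, which only exists
    in the page after JavaScript has run (hence Splash)."""
    def __init__(self):
        super().__init__()
        self._in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and ("class", "title") in attrs:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.titles.append(data.strip())

def extract_titles(rendered_html):
    parser = TitleTextExtractor()
    parser.feed(rendered_html)
    return parser.titles

# Inside the spider, parse() would receive the rendered page as
# response.data['html'] and could call extract_titles() on it.
print(extract_titles('<p class="title"> Orange Paired </p>'))  # → ['Orange Paired']
```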

urlparse: ModuleNotFoundError, presumably in Python2.7 and under conda

孤街醉人 submitted on 2019-12-24 11:46:52
Question: I am attempting to run my own Scrapy project. The code is based on a well-written book, and the author provides a great VM playground for running the scripts used as examples in the book. In the VM the code works fine. However, when practicing on my own, I received the following error:

    File "<frozen importlib._bootstrap>", line 978, in _gcd_import
    File "<frozen importlib._bootstrap>", line 961, in _find_and_load
    File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
    File "<frozen
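A ModuleNotFoundError on urlparse almost always means the code was written for Python 2 (where urlparse is a top-level module) but is being run under Python 3, where it was merged into urllib.parse. A compatibility sketch of the usual fix:

```python
# Python 2's `urlparse` module became `urllib.parse` in Python 3, so
# `import urlparse` raises ModuleNotFoundError in a Python 3 conda env.
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2 fallback

parts = urlparse("https://example.com/path?q=1")
print(parts.netloc)  # → example.com
```

If the book's VM ships Python 2 while the local conda environment defaults to Python 3, this one rename explains why identical code works in one place and not the other.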

Recursive Scraping Craigslist with Scrapy and Python 2.7

陌路散爱 submitted on 2019-12-24 11:33:41
Question: I'm having trouble getting the spider to follow the next page of ads without following every link it finds, which eventually returns every Craigslist page. I've played around with the rule, as I know that's where the problem lies, but I either get just the first page, every page on Craigslist, or nothing. Any help? Here's my current code:

    from scrapy.selector import HtmlXPathSelector
    from craigslist_sample.items import CraigslistSampleItem
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from
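An unrestricted LinkExtractor in a CrawlSpider Rule follows every link on every page, which is exactly the "every page on Craigslist" symptom. The usual fix is to restrict the Rule's allow= pattern to the pagination URLs only. A sketch under an assumed pagination scheme (old Craigslist listing pages were paginated as index100.html, index200.html, and so on — the exact pattern should be checked against the live site):

```python
import re

# Hypothetical pagination pattern — adjust to the site's actual URLs.
NEXT_PAGE = re.compile(r"index\d+\.html$")

def is_listing_page(url):
    """True only for pagination links, so the crawler follows next-page
    links without wandering into every individual ad."""
    return bool(NEXT_PAGE.search(url))

# The corresponding CrawlSpider rule (sketch, matching the question's
# scrapy.contrib-era imports; not executed here):
#
# rules = (
#     Rule(SgmlLinkExtractor(allow=(r"index\d+\.html",)),
#          callback="parse_items", follow=True),
# )
```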

Refused to load the script because it violates the following Content Security Policy directive: script-src error with ChromeDriver Chrome and Selenium

我与影子孤独终老i submitted on 2019-12-24 10:44:17
Question: I am trying to scrape the phone number from these links: "https://www.practo.com/delhi/doctor/dr-meeka-gulati-dentist-3?specialization=Dentist&practice_id=722421" and "https://www.practo.com/delhi/doctor/dr-rajeev-puri-ear-nose-throat-ent-specialist?specialization=Ear-Nose-Throat%20(ENT)%20Specialist&practice_id=912154". If the element is present, the spider scrapes the phone number; otherwise the phone number is None. Spider code:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium
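For the "scrape it if present, otherwise None" requirement, the common Selenium pattern is to use find_elements (plural), which returns an empty list on a miss instead of raising NoSuchElementException. A driver-agnostic sketch (the selector in the comment is an assumption, not taken from the site):

```python
def get_text_or_none(driver, by, selector):
    """Return the first matching element's text, or None if absent.

    find_elements (plural) returns [] when nothing matches, so no
    try/except around NoSuchElementException is needed.
    """
    matches = driver.find_elements(by, selector)
    return matches[0].text if matches else None

# With a real driver this would be called as, e.g. (assumed selector):
# phone = get_text_or_none(driver, By.CSS_SELECTOR, ".phone-number")
```

The Content Security Policy message in the title is usually just console noise from the page itself and does not normally block WebDriver from reading the DOM.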

Scrapy twisted connection lost in non-clean fashion. No proxy. Already tried headers

耗尽温柔 submitted on 2019-12-24 10:13:25
Question: I am trying to crawl the site https://www5.apply2jobs.com/jupitermed/ProfExt/index.cfm?fuseaction=mExternal.searchJobs with Scrapy and keep getting Twisted request/disconnection errors. I am not using a proxy, and I tried both setting the user agent and actually setting all the headers, based on this answer. Here is the code generating the request:

    def start_requests(self):
        url = 'https://www5.apply2jobs.com/jupitermed/ProfExt/index.cfm?fuseaction=mExternal.searchJobs'
        headers = {
            'Accept':
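A sketch of attaching browser-like headers per request, the approach the question describes. When headers alone don't help, "connection lost in a non-clean fashion" often points at the TLS handshake rather than HTTP, and Scrapy's DOWNLOADER_CLIENT_TLS_METHOD setting (e.g. 'TLSv1.0' in settings.py) is worth trying against older servers — offered here as a hypothesis, not a verified fix for this site:

```python
# Browser-like headers (values are illustrative, not the asker's exact ones).
BROWSER_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/79.0 Safari/537.36"),
}

def build_request_kwargs(url):
    """Keyword arguments for scrapy.Request(...) with the headers attached,
    so each start_requests() request looks like a browser."""
    return {"url": url, "headers": dict(BROWSER_HEADERS)}
```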

Scrapy returns more results than expected

∥☆過路亽.° submitted on 2019-12-24 09:48:58
Question: This is a continuation of the question: Extract from dynamic JSON response with Scrapy. I have a Scrapy spider that extracts values from a JSON response. It works well and extracts the right values, but somehow it enters a loop and returns more results than expected (duplicate results). For example, for 17 values provided in the test.txt file it returns 289 results, that is, 17 times more than expected. Spider content below:

    import scrapy
    import json
    from whois.items import WhoisItem

    class
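289 is exactly 17², which is the signature of re-iterating the whole test.txt inside every response callback: N requests each yield N items. The usual fix is to carry each request's own value along with it (in Scrapy, via request.meta or cb_kwargs) instead of looping over the file again in parse(). A stripped-down sketch of the two shapes (the function names are illustrative):

```python
def buggy_items(domains, responses):
    """N responses × N domains = N² items — the 17 → 289 pattern."""
    items = []
    for _resp in responses:      # parse() runs once per response...
        for d in domains:        # ...but loops over ALL domains again
            items.append(d)
    return items

def fixed_items(domains, responses):
    """One item per response: each request carries its own domain
    (Scrapy: request.meta['domain'] or cb_kwargs)."""
    return [d for d, _resp in zip(domains, responses)]
```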

How to design a scraper for companies such as owler? [closed]

我是研究僧i submitted on 2019-12-24 09:37:32
Question: Closed. This question needs to be more focused. It is not currently accepting answers. Want to improve this question? Update the question so it focuses on one problem only by editing this post. Closed 2 years ago. I am trying to develop a scraper for various sites like angel.co, but I'm stuck designing a crawler for www.owler.com, as it requires logging in through email when we try to access information about a company. Each time we log in, we get a new login token by email, which will

How to get immediate parent node with scrapy in python?

我的未来我决定 submitted on 2019-12-24 08:46:12
Question: I am new to Scrapy. I want to crawl some data from the web. I got an HTML document like the ones below.

DOM style 1:

    <div class="user-info">
      <p class="user-name"> something in p tag </p>
      text data I want
    </div>

DOM style 2:

    <div class="user-info">
      <div>
        <p class="user-img"> something in p tag </p>
        something in div tag
      </div>
      <div>
        <p class="user-name"> something in p tag </p>
        text data I want
      </div>
    </div>

I want to get the text "text data I want". Right now I can use a CSS or XPath selector to get it by check
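Since the wanted text always follows <p class="user-name"> inside its immediate parent, anchoring on that p and stepping to its parent handles both layouts with one expression — in Scrapy, something like response.xpath('//p[@class="user-name"]/parent::*/text()'). A standard-library sketch of the same idea (in ElementTree, the text that follows </p> inside the parent is the p element's tail):

```python
import xml.etree.ElementTree as ET

STYLE2 = """
<div class="user-info">
  <div>
    <p class="user-img">something in p tag</p>
    something in div tag
  </div>
  <div>
    <p class="user-name">something in p tag</p>
    text data I want
  </div>
</div>
"""

def text_after_user_name(html):
    root = ET.fromstring(html)
    # Anchor on the p.user-name node; its .tail is the text that
    # follows it inside the immediate parent div.
    name_p = root.find('.//p[@class="user-name"]')
    return name_p.tail.strip()

print(text_after_user_name(STYLE2))  # → text data I want
```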

Extract from dynamic JSON response with Scrapy

谁都会走 submitted on 2019-12-24 08:15:04
Question: I want to extract the 'avail' value from JSON output that looks like this:

    {
      "result": {
        "code": 100,
        "message": "Command Successful"
      },
      "domains": {
        "yolotaxpayers.com": {
          "avail": false,
          "tld": "com",
          "price": "49.95",
          "premium": false,
          "backorder": true
        }
      }
    }

The problem is that the ['avail'] value is under ["domains"]["domain_name"] and I can't figure out how to get the domain name. You have my spider below. The first part works fine, but not the second one.

    import scrapy
    import json
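When the key itself is dynamic (here, the queried domain name), the standard move is to iterate over the dict's items rather than index it by a hard-coded name. A minimal sketch using the JSON from the question:

```python
import json

RESPONSE = '''{
  "result": {"code": 100, "message": "Command Successful"},
  "domains": {
    "yolotaxpayers.com": {"avail": false, "tld": "com", "price": "49.95",
                          "premium": false, "backorder": true}
  }
}'''

def availability(response_text):
    """Yield (domain, avail) pairs without knowing the domain name in
    advance: iterate over the "domains" dict instead of indexing it."""
    data = json.loads(response_text)
    for domain, info in data["domains"].items():
        yield domain, info["avail"]

print(dict(availability(RESPONSE)))  # → {'yolotaxpayers.com': False}
```

In the spider's callback the same loop would run over json.loads(response.text)["domains"].items(), yielding one item per domain.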

Why does my scrapy not scrape anything?

落爺英雄遲暮 submitted on 2019-12-24 08:00:18
Question: I don't know where the issue lies; it's probably super easy to fix, since I am new to Scrapy. I hope to find a solution. Thanks in advance. I am using Ubuntu 14.04 and Python 3.4. My spider:

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from name.items import Actress

    class ActressSpider(scrapy.Spider):
        name = "name_list"
        allowed_domains = ["dmm.co.jp"]
        start_urls = ["http://actress.dmm.co.jp/-/list/=/keyword=%s/" % c
                      for c in ['a', 'i', 'u', 'e', 'o', 'ka', 'ki', 'ku', 'ke', 'ko', 'sa',
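When a spider yields nothing, the first sanity check is that the generated start URLs actually point at real pages (e.g. by opening one with `scrapy shell <url>` and testing the selectors there). The comprehension itself is fine, as a quick check shows (the keyword list below is truncated to what the excerpt includes):

```python
# Reproduce the start_urls construction in isolation and inspect it.
KEYWORDS = ['a', 'i', 'u', 'e', 'o', 'ka', 'ki', 'ku', 'ke', 'ko', 'sa']
start_urls = ["http://actress.dmm.co.jp/-/list/=/keyword=%s/" % c
              for c in KEYWORDS]
print(start_urls[0])  # → http://actress.dmm.co.jp/-/list/=/keyword=a/
```

If the URLs are correct, the next suspects are the allowed_domains filter (offsite requests are silently dropped) and the selectors inside the parse method, which the excerpt cuts off.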