scrapy

Get content inside a script tag

做~自己de王妃 submitted on 2021-02-19 03:57:22
Question: Hello everyone, I'm trying to fetch the content inside a script tag. http://www.teknosa.com/urunler/145051447/samsung-hm1500-bluetooth-kulaklik is the website, and this is the script content I want to get at:

$.Teknosa.ProductDetail = {"ProductComputedIndex":145051447,"ProductName":"SAMSUNG HM1500 BLUETOOTH KULAKLIK","ProductSeoName":"samsung-hm1500-bluetooth-kulaklik","ProductBarcode":"8808993790425","ProductPriceInclTax":79.9,"ProductDiscountedPriceInclTax":null,"ProductStockQuantity"
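One way to pull that object out (a minimal sketch; the URL, variable name, and field names come from the snippet above, while the spider name and regex are illustrative) is to locate the script text with XPath and decode the object literal as JSON:

import json
import re

import scrapy

class ProductSpider(scrapy.Spider):
    name = "teknosa"  # hypothetical spider name
    start_urls = ["http://www.teknosa.com/urunler/145051447/samsung-hm1500-bluetooth-kulaklik"]

    def parse(self, response):
        # Select the <script> element that assigns $.Teknosa.ProductDetail
        script = response.xpath('//script[contains(., "$.Teknosa.ProductDetail")]/text()').get()
        if script:
            # Capture the {...} literal on the right-hand side of the assignment
            match = re.search(r'\$\.Teknosa\.ProductDetail\s*=\s*(\{.*?\});', script, re.S)
            if match:
                data = json.loads(match.group(1))
                yield {"name": data["ProductName"], "price": data["ProductPriceInclTax"]}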

HTTPS certificates

耗尽温柔 submitted on 2021-02-19 01:58:40
When accessing a site over HTTPS there are two cases:

1. The site being crawled uses a trusted certificate (supported by default):

DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
DOWNLOADER_CLIENTCONTEXTFACTORY = "scrapy.core.downloader.contextfactory.ScrapyClientContextFactory"

2. The site being crawled uses a custom certificate, in which case you point Scrapy at your own context factory:

DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
DOWNLOADER_CLIENTCONTEXTFACTORY = "step8_king.https.MySSLFactory"

# https.py
from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory
from twisted.internet.ssl import (optionsForClientTLS, CertificateOptions, PrivateCertificate)
class
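The excerpt cuts off at the class definition. A plausible shape for MySSLFactory (a hedged sketch, not necessarily the tutorial's exact code; the key and certificate paths are placeholders, and getCertificateOptions() is the hook that ScrapyClientContextFactory exposes for exactly this):

from OpenSSL import crypto
from twisted.internet.ssl import CertificateOptions
from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory

class MySSLFactory(ScrapyClientContextFactory):
    def getCertificateOptions(self):
        # Load a PEM private key and certificate from disk (placeholder paths)
        key = crypto.load_privatekey(crypto.FILETYPE_PEM, open('/path/to/client.key').read())
        cert = crypto.load_certificate(crypto.FILETYPE_PEM, open('/path/to/client.pem').read())
        return CertificateOptions(privateKey=key, certificate=cert, verify=False)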

Scrapy: parse JavaScript

眉间皱痕 submitted on 2021-02-18 11:22:39
Question: I have JavaScript on the page like this:

new Shopify.OptionSelectors("product-select", { product: {"id":185310341,"title":"10. Design | Siyah \u0026 beyaz kalpli",

I want to get "185310341". I've been searching Google for a few hours but couldn't find anything; I hope you can help me. How can I scrape that JavaScript and get that id? I tried this code:

id = sel.search('"id":(.*?),', text).group(1)
print id

but I got:

exceptions.AttributeError: 'Selector' object has no attribute 'search'

Answer 1:
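The AttributeError comes from calling .search() on a Scrapy Selector, which has no such method; regular-expression search lives in Python's re module, or in the selector's own .re()/.re_first() helpers. A minimal sketch of both options, assuming response is the page response:

import re

# Option 1: plain re against the raw page text
product_id = re.search(r'"id":(\d+),', response.text).group(1)

# Option 2: Scrapy's built-in regex helper on a selector
product_id = response.xpath(
    '//script[contains(., "Shopify.OptionSelectors")]/text()').re_first(r'"id":(\d+),')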

How to use ssl client certificate (p12) with Scrapy?

我与影子孤独终老i submitted on 2021-02-18 10:42:07
Question: I need to use a client certificate file in p12 (PKCS12) format to talk to a web server with Scrapy. Is there a way to do that?

Answer 1: I can't offer you a tested and complete solution here, but I know a few places where some adjustments might give you what you need. The starting point is Scrapy's ContextFactory object, which defines the SSL/TLS configuration. The standard implementation, ScrapyClientContextFactory, doesn't use client certificates and also doesn't do any server certificate
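Building on that hint, one possible route (an untested sketch; the file path, password, and settings value are all placeholders) is to load the PKCS12 bundle with pyOpenSSL inside a custom context factory:

from OpenSSL import crypto
from twisted.internet.ssl import CertificateOptions
from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory

class P12ClientContextFactory(ScrapyClientContextFactory):
    """Hypothetical factory that presents a PKCS12 client certificate."""
    def getCertificateOptions(self):
        with open('/path/to/client.p12', 'rb') as f:
            p12 = crypto.load_pkcs12(f.read(), b'p12-password')
        return CertificateOptions(privateKey=p12.get_privatekey(),
                                  certificate=p12.get_certificate())

# settings.py (placeholder module path):
# DOWNLOADER_CLIENTCONTEXTFACTORY = 'myproject.context.P12ClientContextFactory'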

[Crawler] Selenium: dynamic page requests and simulated login to Zhihu

懵懂的女人 submitted on 2021-02-18 04:05:55
1. Install selenium:

pip install selenium

2. Install the matching driver for your browser (selenium docs: http://selenium-python.readthedocs.io/api.html); Chrome is recommended.

3. Using selenium:

# -*- coding: utf-8 -*-

from selenium import webdriver
from scrapy.selector import Selector

# Simulated login to Zhihu
browser = webdriver.Chrome(executable_path="E:/chromedriver.exe")  # path where chromedriver.exe is stored
browser.get("https://www.zhihu.com/#signin")
browser.find_element_by_css_selector(".view-signin input[name='account']").send_keys("********")   # account
browser.find_element_by_css_selector(".view-signin input[name='password']").send_keys("********")  # password
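The snippet imports scrapy.selector.Selector but stops before using it; the usual follow-up (a sketch; the CSS selector is a made-up example and Zhihu's real markup will differ) hands the rendered page source to a Scrapy Selector:

from scrapy.selector import Selector

sel = Selector(text=browser.page_source)  # parse the JS-rendered HTML with Scrapy's selector
titles = sel.css("h2::text").getall()     # hypothetical selector; adjust for the real markup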

Python Scrapy 301 redirects

不打扰是莪最后的温柔 submitted on 2021-02-17 20:53:23
Question: I have a small problem printing the redirected URLs (the new URLs after a 301 redirect) when scraping a given website. The idea is to only print them, not scrape them. My current piece of code is:

import scrapy
import os
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'rust'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']
    rules = (
        # Extract links matching 'category.php' (but not matching
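When Scrapy's RedirectMiddleware follows a 301, it records the chain of URLs in request.meta['redirect_urls'], so both the original and final URLs can be printed from a callback. A minimal sketch (parse_item is an assumed callback name for the rules above):

def parse_item(self, response):
    # RedirectMiddleware stores every URL it followed under this meta key
    redirects = response.request.meta.get('redirect_urls')
    if redirects:
        self.logger.info('%s redirected to %s', redirects[0], response.url)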

CrawlerRunner not crawling pages with Crochet

。_饼干妹妹 submitted on 2021-02-17 07:04:07
Question: I am trying to launch a Scrapy spider from a script with CrawlerRunner(), to run in AWS Lambda. I saw the solution on Stack Overflow using the crochet library, but it doesn't work for me. Links: StackOverflow 1, StackOverflow 2. This is the code:

import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from scrapy.utils.log import configure_logging

# From response in Stackoverflow: https://stackoverflow.com/questions/41495052/scrapy-reactor-not
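For comparison, the crochet pattern those answers describe normally looks like this (a sketch; MySpider stands in for whatever spider class the script defines, and the timeout value is arbitrary):

from crochet import setup, wait_for
from scrapy.crawler import CrawlerRunner

setup()  # must run before the Twisted reactor is touched

@wait_for(timeout=120.0)
def run_spider():
    runner = CrawlerRunner()
    return runner.crawl(MySpider)  # returns a Deferred; wait_for blocks until it fires

run_spider()  # e.g. called from the Lambda handler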

Scrapy - Correct way to change User Agent in Request

旧时模样 submitted on 2021-02-16 14:13:22
Question: I have created a custom middleware in Scrapy by overriding RetryMiddleware, so that both the proxy and the User-Agent change before retrying. It looks like this:

class CustomRetryMiddleware(RetryMiddleware):
    def _retry(self, request, reason, spider):
        retries = request.meta.get('retry_times', 0) + 1
        if retries <= self.max_retry_times:
            Proxy_UA_Middleware.switch_proxy()
            Proxy_UA_Middleware.switch_ua()
            logger.debug("Retrying %(request)s (failed %(retries)d times): %(reason)s",
                         {'request': request,
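One way to make the new User-Agent actually take effect (a sketch; it relies on the fact that RetryMiddleware._retry() returns the copied request that will be re-sent, and USER_AGENTS is a placeholder list):

import random
from scrapy.downloadermiddlewares.retry import RetryMiddleware

USER_AGENTS = ["Mozilla/5.0 ...", "Mozilla/5.0 ..."]  # placeholder UA strings

class CustomRetryMiddleware(RetryMiddleware):
    def _retry(self, request, reason, spider):
        retry_request = super()._retry(request, reason, spider)
        if retry_request is not None:
            # Headers set here are the ones the retried request is sent with
            retry_request.headers['User-Agent'] = random.choice(USER_AGENTS)
        return retry_request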

Unbalanced parenthesis error with Regex

六眼飞鱼酱① submitted on 2021-02-16 05:32:53
Question: I am using the following regex to obtain all data from a website's JavaScript data source that is contained within the character pattern [[]]); The code I am using is this:

regex = r'\[\[.*?\]]);'
match2 = re.findall(regex, response.body, re.S)
print match2

This throws the error:

raise error, v  # invalid expression
sre_constants.error: unbalanced parenthesis

I think I am fairly safe in assuming that this is being caused by the closing bracket within my regex. How can I fix this?
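The culprit is the unescaped ), which the regex engine treats as closing a group that was never opened; escaping it (and, for clarity, the second ]) fixes the pattern. A sketch of the corrected code, using response.text instead of response.body to avoid a bytes/str mismatch on Python 3:

import re

regex = r'\[\[.*?\]\]\);'  # ) and ] escaped so they match literally
matches = re.findall(regex, response.text, re.S)
print(matches)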