scrapy

Scrapy+Selenium: getting the document inside an iframe

心不动则不痛 submitted on 2020-01-06 13:45:33
Requirement: from inside the iframe, get the title under the h3, the src of the img, and the landing page of the a tag. Fetching the iframe's content with XPath returns nothing, as shown:

    <iframe data-v-5a33f2b6="" id="preview-iframe-18769" class="idea-preview-iframe" style="height: 259.817px;" frameborder="0"></iframe>

Instead, run JavaScript with execute_script: the document inside an iframe is reachable as [iframe element].contentWindow.document.

    # There are several iframes on the page and their ids are dynamic, so find the id first
    temp_iframe_id = box.xpath('.//td[3]/div/div/div/iframe/@id').extract()[0]
    # Ad landing page: retry three times, because the iframe is rendered dynamically
    # and may not have finished rendering yet
    for i in range(0, 3):
        try:
            item['landing_page'] = self.browser.execute_script(
                'return document.getElementById("' + temp
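
The snippet is cut off mid-call, but the technique it uses can be sketched end to end as follows; browser, iframe_id, and the querySelector target are illustrative names, not taken verbatim from the post:

    import time

    def get_iframe_landing_page(browser, iframe_id, retries=3):
        # Build JS that reaches into the iframe's document and pulls out the
        # first <a>'s landing page; the h3 text and img src work the same way.
        script = (
            f'var doc = document.getElementById("{iframe_id}").contentWindow.document;'
            'var a = doc.querySelector("a");'
            'return a ? a.href : null;'
        )
        for _ in range(retries):  # the iframe renders dynamically, so retry
            try:
                href = browser.execute_script(script)
                if href:
                    return href
            except Exception:
                pass  # element not there yet
            time.sleep(1)  # give the render a moment before the next attempt
        return None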

Scrapy+Selenium: clear() has no effect

北慕城南 submitted on 2020-01-06 13:33:13
When using Selenium, I found that on some popup dialogs an input box stops responding to clear() once text has been entered. For example, when switching login accounts: after logging out and landing back on the login page, the account <input> is pre-filled with the previous account. clear() has no effect, and send_keys simply appends to the existing text, so the wrong account is submitted and the login fails. Clicking the box with click() before typing didn't help either. What finally worked was double-click -> Ctrl+A -> Delete; clearing the box from JavaScript would also do.

    # Find the input box to clear
    userName = self.wait.until(EC.presence_of_element_located((By.XPATH, './/div[@class="login-content"]//div[@class="el-tabs__content"]//form//div[@class="el-form-item"][./label/@for="username"]//input[@class="el-input__inner"]')))
    # If the box is not empty, select everything and clear it
    if userName.get_attribute("value"):
        # click
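
The excerpt is cut off, but the select-all-then-delete workaround it describes can be sketched as follows; driver and input_el are illustrative names for the WebDriver and the pre-filled input:

    from selenium.webdriver.common.action_chains import ActionChains
    from selenium.webdriver.common.keys import Keys

    def force_clear(driver, input_el):
        if input_el.get_attribute("value"):  # only act if something is pre-filled
            ActionChains(driver).double_click(input_el).perform()  # focus the box
            input_el.send_keys(Keys.CONTROL, "a")  # select all of the text
            input_el.send_keys(Keys.DELETE)        # delete the selection
        # The JS alternative the post mentions:
        # driver.execute_script("arguments[0].value = '';", input_el)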

scrapy and selenium seem to interfere with each other

大憨熊 submitted on 2020-01-06 09:08:49
Question: Hi, I don't have much experience with web scraping or with using Scrapy and Selenium. Apologies first if there are too many bad practices in my code. Brief background for my code: I tried to scrape product information from multiple websites using Scrapy, and I also use Selenium because I need to click the "view more" button and the "No thanks" button on the web page. Since there are hrefs for different categories on the website, I also need to request those "sublinks" to make sure I don't miss any
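
The question's code is not included in the excerpt, but the setup it describes, a Scrapy spider driving one shared Selenium browser to click the buttons before parsing, typically looks something like the sketch below (the URL and all selectors are placeholders). Sharing one browser across Scrapy's concurrent callbacks is also the usual source of the interference the title describes.

    import scrapy
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.common.exceptions import NoSuchElementException

    class ProductSpider(scrapy.Spider):
        name = "products"
        start_urls = ["https://example.com/catalog"]  # placeholder

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.driver = webdriver.Chrome()  # one browser shared by every callback

        def parse(self, response):
            self.driver.get(response.url)
            for xpath in ('//button[text()="No thanks"]',
                          '//button[text()="view more"]'):
                try:
                    self.driver.find_element(By.XPATH, xpath).click()
                except NoSuchElementException:
                    pass  # not every page shows both buttons
            # hand the fully expanded page back to Scrapy's selectors
            page = scrapy.Selector(text=self.driver.page_source)
            for href in page.xpath('//a[contains(@class, "category")]/@href').getall():
                yield response.follow(href, callback=self.parse)

        def closed(self, reason):
            self.driver.quit()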

By what library and how can I scrape text from an HTML page by its heading and paragraph tags?

久未见 submitted on 2020-01-06 06:52:11
Question: My input will be web documents with no fixed HTML structure. What I want to do is extract the text in each heading (which may be nested) and in the paragraph tags that follow it (there may be several), and output them as pairs. A simple HTML example:

    <h1>House rule</h1>
    <h2>Rule 1</h2>
    <p>A</p>
    <p>B</p>
    <h2>Rule 2</h2>
    <h3>Rule 2.1</h3>
    <p>C</p>
    <h3>Rule 2.2</h3>
    <p>D</p>

For this example, I would like an output of pairs: Rule 2.2, D; Rule 2.1, C; Rule 2, D; Rule 2, C; House rule, D
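
One way to produce such pairs is with BeautifulSoup (the library choice is an assumption; the question leaves it open): walk the tags in document order, keep a stack of the headings currently in scope, and pair every <p> with each heading on the stack.

    from bs4 import BeautifulSoup

    HEADINGS = ["h1", "h2", "h3", "h4", "h5", "h6"]

    def heading_paragraph_pairs(html):
        soup = BeautifulSoup(html, "html.parser")
        stack = []   # (level, heading text), outermost first
        pairs = []
        for tag in soup.find_all(HEADINGS + ["p"]):
            if tag.name in HEADINGS:
                level = int(tag.name[1])
                # a new heading closes every heading at the same or deeper level
                while stack and stack[-1][0] >= level:
                    stack.pop()
                stack.append((level, tag.get_text(strip=True)))
            else:  # a <p>: pair it with every heading currently in scope
                pairs.extend((text, tag.get_text(strip=True)) for _, text in stack)
        return pairs

    html = """<h1>House rule</h1><h2>Rule 1</h2><p>A</p><p>B</p>
    <h2>Rule 2</h2><h3>Rule 2.1</h3><p>C</p><h3>Rule 2.2</h3><p>D</p>"""
    print(heading_paragraph_pairs(html))
    # [('House rule', 'A'), ('Rule 1', 'A'), ('House rule', 'B'), ('Rule 1', 'B'),
    #  ('House rule', 'C'), ('Rule 2', 'C'), ('Rule 2.1', 'C'),
    #  ('House rule', 'D'), ('Rule 2', 'D'), ('Rule 2.2', 'D')]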

Difference Between Public and Private Selector Methods

穿精又带淫゛_ submitted on 2020-01-06 06:26:32
Question: I'm just reading this documentation here and was curious: what is the difference between public and private methods in this context? To find multiple elements (these methods will return a list):

    find_elements_by_name
    find_elements_by_xpath
    find_elements_by_link_text
    find_elements_by_partial_link_text
    find_elements_by_tag_name
    find_elements_by_class_name
    find_elements_by_css_selector

Apart from the public methods given above, there are two private methods which might be useful with locators in
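
In the selenium-python docs this quotes, the two "private" methods are find_element and find_elements, which take the locator strategy as a By argument instead of baking it into the method name. A minimal sketch of the contrast:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get("https://example.com")  # placeholder URL

    # "Public" convenience form: the strategy is part of the method name.
    # (These named helpers were removed in Selenium 4; shown for comparison.)
    # links = driver.find_elements_by_tag_name("a")

    # Generic form: the strategy is passed as a By constant, which suits
    # page objects that keep their locators as plain data.
    locator = (By.TAG_NAME, "a")
    links = driver.find_elements(*locator)
    first_link = driver.find_element(*locator)

    driver.quit()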

Scrapy: Following pagination link to scrape data [duplicate]

北城余情 submitted on 2020-01-06 05:43:11
Question: This question already has answers here: Scrapy: scraping data from Pagination (2 answers). Closed last year. I am trying to scrape data from a page and continue scraping by following the pagination link. The page I am trying to scrape is here:

    # -*- coding: utf-8 -*-
    import scrapy

    class AlibabaSpider(scrapy.Spider):
        name = 'alibaba'
        allowed_domains = ['alibaba.com']
        start_urls = ['https://www.alibaba.com/catalog/agricultural-growing-media_cid144?page=1']

        def parse(self, response):
            for
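
The spider is cut off at its parse loop, but the usual Scrapy pagination pattern it is reaching for looks like this; the item fields and the next-page selector are hypothetical, not taken from the original:

    import scrapy

    class AlibabaSpider(scrapy.Spider):
        name = 'alibaba'
        allowed_domains = ['alibaba.com']
        start_urls = ['https://www.alibaba.com/catalog/agricultural-growing-media_cid144?page=1']

        def parse(self, response):
            # one item per product card on the current page
            for product in response.css('div.item-main'):
                yield {
                    'title': product.css('h2 a::text').get(),
                    'url': product.css('h2 a::attr(href)').get(),
                }
            # then queue the next page, if any
            next_page = response.css('a.next::attr(href)').get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)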

Scrapy - Grab all product details

。_饼干妹妹 submitted on 2020-01-06 04:38:09
Question: I need to grab all Product Details (the ones with green tick marks) from this page: https://sourceforge.net/software/product/Budget-Maestro/

    divs = response.xpath("//section[@class='row psp-section m-section-comm-details m-section-emphasized grey']/div[@class='list-outer column']/div")
    for div in divs:
        detail = div.xpath("./h3/text()").extract_first().strip() + ":"
        if detail != "Company Information:":
            divs2 = div.xpath(".//div[@class='list']/div")
            for div2 in divs2:
                dd = [val for val in div2.xpath(".
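
The inner list comprehension is cut off; a sketch of the overall extraction in the same spirit (the section XPath comes from the snippet above, while the inner text selector is an assumption about the page's markup):

    import scrapy

    class ProductDetailsSpider(scrapy.Spider):
        name = 'product_details'
        start_urls = ['https://sourceforge.net/software/product/Budget-Maestro/']

        def parse(self, response):
            sections = response.xpath(
                "//section[@class='row psp-section m-section-comm-details "
                "m-section-emphasized grey']/div[@class='list-outer column']/div")
            for section in sections:
                heading = section.xpath("./h3/text()").get(default="").strip()
                if heading == "Company Information":
                    continue  # the question skips this section
                details = [t.strip()
                           for t in section.xpath(".//div[@class='list']/div//text()").getall()
                           if t.strip()]
                yield {heading: details}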

Empty list for hrefs to achieve pagination through JavaScript onclick functions

老子叫甜甜 submitted on 2020-01-06 04:20:29
Question: My intention is to handle pagination implemented through JavaScript functions. Take, for example, the URL http://events.justdial.com/events/index.php?city=Hyderabad: the pagination at the end of that page is written through JavaScript functions whose href attributes are just "#", and I am trying to collect those href tags even though they are "#". The following is my code:

    class justdialdotcomSpider(BaseSpider):
        name = "justdialdotcom"
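
Because the anchors carry href="#", scraping their hrefs yields an empty list: the page number lives in the onclick handler, not in the URL. One workaround is to drive the clicks with Selenium and read each rendered page; the "Next" link text below is an assumption about the pager's markup:

    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.common.exceptions import NoSuchElementException

    driver = webdriver.Firefox()
    driver.get("http://events.justdial.com/events/index.php?city=Hyderabad")

    while True:
        html = driver.page_source  # parse this with Scrapy/lxml selectors as needed
        try:
            # follow the JS-driven pager by clicking it, not by reading its href
            driver.find_element(By.LINK_TEXT, "Next").click()
        except NoSuchElementException:
            break  # no "Next" link left; assume this was the last page
        time.sleep(2)  # crude wait for the onclick handler to render the next page

    driver.quit()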