scrapy

Scrapy+Selenium: getting the document inside an iframe

心不动则不痛 submitted on 2020-01-06 13:45:33
Requirement: from inside the iframe, get the title under the h3, the src of the img, and the landing page of the a tag. Fetching the iframe's content with XPath returns nothing, as shown:

    <iframe data-v-5a33f2b6="" id="preview-iframe-18769" class="idea-preview-iframe" style="height: 259.817px;" frameborder="0"></iframe>

Instead, run JavaScript with execute_script: the document inside an iframe is reachable as [iframe element].contentWindow.document.

    # There are several iframes on the page and their ids are dynamic, so find the id first
    temp_iframe_id = box.xpath('.//td[3]/div/div/div/iframe/@id').extract()[0]
    # Ad landing page: retry three times, because the iframe is rendered dynamically
    # and may not have finished rendering yet
    for i in range(0, 3):
        try:
            item['landing_page'] = self.browser.execute_script(
                'return document.getElementById("' + temp
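
The snippet is cut off mid-call, but the technique it uses can be sketched end to end as follows; browser, iframe_id, and the querySelector target are illustrative names, not taken verbatim from the post:

    import time

    def get_iframe_landing_page(browser, iframe_id, retries=3):
        # Build JS that reaches into the iframe's document and pulls out the
        # first <a>'s landing page; the h3 text and img src work the same way.
        script = (
            f'var doc = document.getElementById("{iframe_id}").contentWindow.document;'
            'var a = doc.querySelector("a");'
            'return a ? a.href : null;'
        )
        for _ in range(retries):  # the iframe renders dynamically, so retry
            try:
                href = browser.execute_script(script)
                if href:
                    return href
            except Exception:
                pass  # element not there yet
            time.sleep(1)  # give the render a moment before the next attempt
        return None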

Scrapy+Selenium: clear() has no effect

北慕城南 submitted on 2020-01-06 13:33:13
When using Selenium, I found that on some popup dialogs an input box stops responding to clear() once text has been entered. For example, when switching login accounts: after logging out and landing back on the login page, the account <input> is pre-filled with the previous account. clear() has no effect, and send_keys simply appends to the existing text, so the wrong account is submitted and the login fails. Clicking the box with click() before typing didn't help either. What finally worked was double-click -> Ctrl+A -> Delete; clearing the box from JavaScript would also do.

    # Find the input box to clear
    userName = self.wait.until(EC.presence_of_element_located((By.XPATH, './/div[@class="login-content"]//div[@class="el-tabs__content"]//form//div[@class="el-form-item"][./label/@for="username"]//input[@class="el-input__inner"]')))
    # If the box is not empty, select everything and clear it
    if userName.get_attribute("value"):
        # click
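
The excerpt is cut off, but the select-all-then-delete workaround it describes can be sketched as follows; driver and input_el are illustrative names for the WebDriver and the pre-filled input:

    from selenium.webdriver.common.action_chains import ActionChains
    from selenium.webdriver.common.keys import Keys

    def force_clear(driver, input_el):
        if input_el.get_attribute("value"):  # only act if something is pre-filled
            ActionChains(driver).double_click(input_el).perform()  # focus the box
            input_el.send_keys(Keys.CONTROL, "a")  # select all of the text
            input_el.send_keys(Keys.DELETE)        # delete the selection
        # The JS alternative the post mentions:
        # driver.execute_script("arguments[0].value = '';", input_el)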

scrapy and selenium seem to interfere with each other

大憨熊 submitted on 2020-01-06 09:08:49
Question: Hi, I don't have much experience with web scraping or with using Scrapy and Selenium. Apologies first if there are too many bad practices in my code. Brief background for my code: I tried to scrape product information from multiple websites using Scrapy, and I also use Selenium because I need to click the "view more" button and the "No thanks" button on the web page. Since there are hrefs for different categories on the website, I also need to request those "sublinks" to make sure I don't miss any
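
The question's code is not included in the excerpt, but the setup it describes, a Scrapy spider driving one shared Selenium browser to click the buttons before parsing, typically looks something like the sketch below (the URL and all selectors are placeholders). Sharing one browser across Scrapy's concurrent callbacks is also the usual source of the interference the title describes.

    import scrapy
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.common.exceptions import NoSuchElementException

    class ProductSpider(scrapy.Spider):
        name = "products"
        start_urls = ["https://example.com/catalog"]  # placeholder

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.driver = webdriver.Chrome()  # one browser shared by every callback

        def parse(self, response):
            self.driver.get(response.url)
            for xpath in ('//button[text()="No thanks"]',
                          '//button[text()="view more"]'):
                try:
                    self.driver.find_element(By.XPATH, xpath).click()
                except NoSuchElementException:
                    pass  # not every page shows both buttons
            # hand the fully expanded page back to Scrapy's selectors
            page = scrapy.Selector(text=self.driver.page_source)
            for href in page.xpath('//a[contains(@class, "category")]/@href').getall():
                yield response.follow(href, callback=self.parse)

        def closed(self, reason):
            self.driver.quit()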

By what library and how can I scrape text from an HTML page by its heading and paragraph tags?

久未见 submitted on 2020-01-06 06:52:11
Question: My input will be web documents with no fixed HTML structure. What I want to do is extract the text in each heading (which may be nested) and in the paragraph tags that follow it (there may be several), and output them as pairs. A simple HTML example:

    <h1>House rule</h1>
    <h2>Rule 1</h2>
    <p>A</p>
    <p>B</p>
    <h2>Rule 2</h2>
    <h3>Rule 2.1</h3>
    <p>C</p>
    <h3>Rule 2.2</h3>
    <p>D</p>

For this example, I would like an output of pairs: Rule 2.2, D; Rule 2.1, C; Rule 2, D; Rule 2, C; House rule, D
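
One way to produce such pairs is with BeautifulSoup (the library choice is an assumption; the question leaves it open): walk the tags in document order, keep a stack of the headings currently in scope, and pair every <p> with each heading on the stack.

    from bs4 import BeautifulSoup

    HEADINGS = ["h1", "h2", "h3", "h4", "h5", "h6"]

    def heading_paragraph_pairs(html):
        soup = BeautifulSoup(html, "html.parser")
        stack = []   # (level, heading text), outermost first
        pairs = []
        for tag in soup.find_all(HEADINGS + ["p"]):
            if tag.name in HEADINGS:
                level = int(tag.name[1])
                # a new heading closes every heading at the same or deeper level
                while stack and stack[-1][0] >= level:
                    stack.pop()
                stack.append((level, tag.get_text(strip=True)))
            else:  # a <p>: pair it with every heading currently in scope
                pairs.extend((text, tag.get_text(strip=True)) for _, text in stack)
        return pairs

    html = """<h1>House rule</h1><h2>Rule 1</h2><p>A</p><p>B</p>
    <h2>Rule 2</h2><h3>Rule 2.1</h3><p>C</p><h3>Rule 2.2</h3><p>D</p>"""
    print(heading_paragraph_pairs(html))
    # [('House rule', 'A'), ('Rule 1', 'A'), ('House rule', 'B'), ('Rule 1', 'B'),
    #  ('House rule', 'C'), ('Rule 2', 'C'), ('Rule 2.1', 'C'),
    #  ('House rule', 'D'), ('Rule 2', 'D'), ('Rule 2.2', 'D')]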

Difference Between Public and Private Selector Methods

穿精又带淫゛_ submitted on 2020-01-06 06:26:32
Question: I'm just reading this documentation here and was curious: what is the difference between public and private methods in this context? To find multiple elements (these methods will return a list):

    find_elements_by_name
    find_elements_by_xpath
    find_elements_by_link_text
    find_elements_by_partial_link_text
    find_elements_by_tag_name
    find_elements_by_class_name
    find_elements_by_css_selector

Apart from the public methods given above, there are two private methods which might be useful with locators in
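
In the selenium-python docs this quotes, the two "private" methods are find_element and find_elements, which take the locator strategy as a By argument instead of baking it into the method name. A minimal sketch of the contrast:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get("https://example.com")  # placeholder URL

    # "Public" convenience form: the strategy is part of the method name.
    # (These named helpers were removed in Selenium 4; shown for comparison.)
    # links = driver.find_elements_by_tag_name("a")

    # Generic form: the strategy is passed as a By constant, which suits
    # page objects that keep their locators as plain data.
    locator = (By.TAG_NAME, "a")
    links = driver.find_elements(*locator)
    first_link = driver.find_element(*locator)

    driver.quit()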

Scrapy: Following pagination link to scrape data [duplicate]

北城余情 submitted on 2020-01-06 05:43:11
Question: This question already has answers here: Scrapy: scraping data from Pagination (2 answers). Closed last year. I am trying to scrape data from a page and continue scraping by following the pagination link. The page I am trying to scrape is here:

    # -*- coding: utf-8 -*-
    import scrapy

    class AlibabaSpider(scrapy.Spider):
        name = 'alibaba'
        allowed_domains = ['alibaba.com']
        start_urls = ['https://www.alibaba.com/catalog/agricultural-growing-media_cid144?page=1']

        def parse(self, response):
            for
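
The spider is cut off at its parse loop, but the usual Scrapy pagination pattern it is reaching for looks like this; the item fields and the next-page selector are hypothetical, not taken from the original:

    import scrapy

    class AlibabaSpider(scrapy.Spider):
        name = 'alibaba'
        allowed_domains = ['alibaba.com']
        start_urls = ['https://www.alibaba.com/catalog/agricultural-growing-media_cid144?page=1']

        def parse(self, response):
            # one item per product card on the current page
            for product in response.css('div.item-main'):
                yield {
                    'title': product.css('h2 a::text').get(),
                    'url': product.css('h2 a::attr(href)').get(),
                }
            # then queue the next page, if any
            next_page = response.css('a.next::attr(href)').get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)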

Scrapy - Grab all product details

。_饼干妹妹 submitted on 2020-01-06 04:38:09
Question: I need to grab all Product Details (the ones with green tick marks) from this page: https://sourceforge.net/software/product/Budget-Maestro/

    divs = response.xpath("//section[@class='row psp-section m-section-comm-details m-section-emphasized grey']/div[@class='list-outer column']/div")
    for div in divs:
        detail = div.xpath("./h3/text()").extract_first().strip() + ":"
        if detail != "Company Information:":
            divs2 = div.xpath(".//div[@class='list']/div")
            for div2 in divs2:
                dd = [val for val in div2.xpath(".
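
The inner list comprehension is cut off; a sketch of the overall extraction in the same spirit (the section XPath comes from the snippet above, while the inner text selector is an assumption about the page's markup):

    import scrapy

    class ProductDetailsSpider(scrapy.Spider):
        name = 'product_details'
        start_urls = ['https://sourceforge.net/software/product/Budget-Maestro/']

        def parse(self, response):
            sections = response.xpath(
                "//section[@class='row psp-section m-section-comm-details "
                "m-section-emphasized grey']/div[@class='list-outer column']/div")
            for section in sections:
                heading = section.xpath("./h3/text()").get(default="").strip()
                if heading == "Company Information":
                    continue  # the question skips this section
                details = [t.strip()
                           for t in section.xpath(".//div[@class='list']/div//text()").getall()
                           if t.strip()]
                yield {heading: details}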

Empty list for hrefs to achieve pagination through JavaScript onclick functions

老子叫甜甜 submitted on 2020-01-06 04:20:29
Question: My intention is to handle pagination implemented through JavaScript functions. Take, for example, the URL http://events.justdial.com/events/index.php?city=Hyderabad: the pagination at the end of that page is written through JavaScript functions whose href attributes are just "#", and I am trying to collect those href tags even though they are "#". The following is my code:

    class justdialdotcomSpider(BaseSpider):
        name = "justdialdotcom"
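
Because the anchors carry href="#", scraping their hrefs yields an empty list: the page number lives in the onclick handler, not in the URL. One workaround is to drive the clicks with Selenium and read each rendered page; the "Next" link text below is an assumption about the pager's markup:

    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.common.exceptions import NoSuchElementException

    driver = webdriver.Firefox()
    driver.get("http://events.justdial.com/events/index.php?city=Hyderabad")

    while True:
        html = driver.page_source  # parse this with Scrapy/lxml selectors as needed
        try:
            # follow the JS-driven pager by clicking it, not by reading its href
            driver.find_element(By.LINK_TEXT, "Next").click()
        except NoSuchElementException:
            break  # no "Next" link left; assume this was the last page
        time.sleep(2)  # crude wait for the onclick handler to render the next page

    driver.quit()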