scraper

How to Parse this HTML with Web::Scraper?

Submitted by 流过昼夜 on 2019-12-24 00:42:31

Question: I am trying to use Web::Scraper to parse the following HTML:

```html
<div>
  <p><strong>TITLE1</strong> <br> DESCRIPTION1 </p>
  <p><strong>TITLE2</strong> <br> DESCRIPTION2 </p>
  <p><strong>TITLE3</strong> <br> DESCRIPTION3 </p>
</div>
```

into:

```perl
'test' => [
  { 'name' => 'TITLE1', 'desc' => 'DESCRIPTION1 ' },
  { 'name' => 'TITLE2', 'desc' => 'DESCRIPTION2 ' },
  { 'name' => 'TITLE3', 'desc' => 'DESCRIPTION3 ' }
]
```

I have the following code but haven't had much luck: 'TEXT' when processing 'p' gives both the text …
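The Perl snippet is cut off above, so as a point of reference only — switching both language and library — here is how the same title/description pairing could be sketched with Python's BeautifulSoup. This illustrates the extraction logic, not the Web::Scraper answer:

```python
from bs4 import BeautifulSoup

html = """
<div>
  <p><strong>TITLE1</strong> <br> DESCRIPTION1 </p>
  <p><strong>TITLE2</strong> <br> DESCRIPTION2 </p>
  <p><strong>TITLE3</strong> <br> DESCRIPTION3 </p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
test = [
    {
        "name": p.strong.get_text(strip=True),  # the text inside <strong>
        "desc": p.br.next_sibling.strip(),      # the text node right after <br>
    }
    for p in soup.select("div > p")
]
print(test)
# [{'name': 'TITLE1', 'desc': 'DESCRIPTION1'}, ...]
```

The key move is treating the text node after each `<br>` as the description instead of taking the whole `<p>` text — which is exactly what the 'TEXT' approach in the question stumbles over, since 'TEXT' on `p` returns title and description together.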

delay in a for loop for http request

Submitted by 风格不统一 on 2019-12-23 16:09:12

Question: I am just getting started with JS and Node.js. I am trying to build a simple scraper as a first project, using Node.js and modules such as request and cheerio. I would like to add a 5-second delay between the HTTP requests for each domain contained in the array. Can you explain how to do this? Here is my code:

```javascript
var request = require('request');

var arr = [
  "http://allrecipes.com/",
  "http://www.gossip.fr/"
];

for (var i = 0; i < arr.length; i++) {
  request(arr[i], function (error, response, body) {
    // … (the rest of the callback is cut off in the source)
  });
}
```
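The underlying problem: request() is asynchronous, so this plain for loop fires every request at once; the delay has to be scheduled (for example with setTimeout) rather than written inline. Purely to illustrate the throttling pattern, here is the equivalent loop in synchronous Python, where a sleep between iterations is enough — a sketch, not the Node answer:

```python
import time
import requests

urls = ["http://allrecipes.com/", "http://www.gossip.fr/"]

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    time.sleep(5)  # pause 5 seconds before the next request
```

In Node itself the usual fix is to stagger each request with setTimeout(..., i * 5000) or to chain the requests with promises, since blocking sleeps are not available.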

Python selenium get inside a #document

Submitted by 百般思念 on 2019-12-22 08:35:15

Question: How can I keep looking for elements inside a #document:

```html
<div>
  <iframe>
    #document
    <html>
      <body>
        <div> Element I want to find </div>
      </body>
    </html>
  </iframe>
</div>
```

Answer 1: I think your problem is not with the #document but with the iframe.

```python
from selenium import webdriver

driver = webdriver.Firefox()
iframe = driver.find_elements_by_tag_name('iframe')[0]
driver.switch_to_frame(iframe)
driver.find_element_by_xpath("//div")
```

Source: https://stackoverflow.com/questions/38363643/python-selenium-get-inside-a
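The answer's method names (find_elements_by_tag_name, switch_to_frame) were later deprecated; in Selenium 4 the same steps look like this — a sketch of the current API, same logic as the answer above:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
iframe = driver.find_elements(By.TAG_NAME, "iframe")[0]
driver.switch_to.frame(iframe)           # enter the iframe's #document
element = driver.find_element(By.XPATH, "//div")
driver.switch_to.default_content()       # jump back to the top-level page when done
```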

XPath to select between two HTML comments is not working?

Submitted by 一个人想着一个人 on 2019-12-20 07:13:35

Question: I'm trying to select some content between two HTML comments, but I'm having trouble getting it right (as seen in "XPath to select between two HTML comments?"). There seems to be a problem when two comments fall on the same line. My HTML:

```html
<html>
........
<!-- begin content -->
<div>some text</div>
<div>
  <p>Some more elements</p>
</div>
<!-- end content --><!-- begin content -->
<div>more text</div>
<!-- end content -->
.......
</html>
```

I use:

```
doc.xpath("//node()[preceding-sibling::comment(…
```
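Counting preceding/following comment siblings in pure XPath pairs the markers up incorrectly when an end marker and the next begin marker sit back to back. A line-agnostic alternative — sketched with Python's lxml rather than whatever library the question uses, and assuming the markers and the content are siblings — is to walk the siblings after each begin marker until the matching end marker appears:

```python
from lxml import etree, html

page = """<html><body>
<!-- begin content --><div>some text</div><div><p>Some more elements</p></div>
<!-- end content --><!-- begin content --><div>more text</div><!-- end content -->
</body></html>"""

doc = html.fromstring(page)
blocks = []
for begin in doc.xpath("//comment()[contains(., 'begin content')]"):
    nodes = []
    for sib in begin.itersiblings():  # document order; comments are included
        if sib.tag is etree.Comment and "end content" in sib.text:
            break                     # reached the matching end marker
        nodes.append(sib)
    blocks.append(nodes)

print([[etree.tostring(n, encoding="unicode") for n in b] for b in blocks])
```

Because the walk is over sibling nodes rather than lines or comment counts, the adjacent `<!-- end content --><!-- begin content -->` pair is handled like any other.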

How to crawl with php Goutte and Guzzle if data is loaded by Javascript?

Submitted by 南楼画角 on 2019-12-19 02:43:04

Question: Many times when crawling, we run into problems where content rendered on the page is generated with Javascript, and the scraper is therefore unable to crawl it (e.g. ajax requests, jQuery).

Answer 1: You want to have a look at phantomjs. There is this PHP implementation: http://jonnnnyw.github.io/php-phantomjs/ — if you need to have it working with PHP, of course. You could read the page and then feed the contents to Guzzle, in order to use the nice functions that Guzzle gives you (like searching for …
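The same render-first idea works in any stack: let a headless browser execute the JavaScript, then hand the finished HTML to an ordinary parser. A Python sketch of the pattern, with Selenium standing in for phantomjs and BeautifulSoup standing in for Guzzle's crawler, and a placeholder URL:

```python
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

opts = Options()
opts.add_argument("--headless")              # render without opening a window
driver = webdriver.Chrome(options=opts)
driver.get("http://example.com/")            # hypothetical URL
rendered = driver.page_source                # the DOM *after* JavaScript has run
driver.quit()

soup = BeautifulSoup(rendered, "html.parser")  # now parse the rendered markup
print(soup.title.get_text())
```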

Facebook scraper doesn't load dynamic meta-tags

Submitted by 只谈情不闲聊 on 2019-12-18 07:39:42

Question: I am creating the HTML meta-tags dynamically using the function below (GWT). It takes 1 second for them to appear in the DOM. This works fine everywhere except Facebook: when I share a link from my web app, the scraper gets only the meta-tags that are in the static HTML — none. How can I fix this?

```java
/**
 * Include the HTML attributes: title, description and keywords (meta tags)
 */
private void createHTMLheader(MyClass thing) {
    String title = thing.getTitle();
    String description = thing.getDescription();
    Document.get()…
```
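Facebook's scraper does not execute JavaScript, so meta-tags that appear in the DOM a second after load are invisible to it; the tags have to be in the HTML the server sends. One common workaround is to pre-render the tags server-side for crawler requests. A sketch of the shape in Python/Flask — the route, data, and page names are all hypothetical, not part of the original question:

```python
from flask import Flask, request, render_template_string

app = Flask(__name__)

# Stand-in data; in the real app this would come from wherever MyClass does.
THINGS = {"42": {"title": "Example title", "description": "Example description"}}

CRAWLER_PAGE = """<html><head>
<meta property="og:title" content="{{ title }}">
<meta property="og:description" content="{{ description }}">
</head><body></body></html>"""

@app.route("/thing/<thing_id>")
def thing(thing_id):
    ua = request.headers.get("User-Agent", "")
    if "facebookexternalhit" in ua:              # Facebook's crawler identifies itself
        data = THINGS.get(thing_id, {"title": "", "description": ""})
        return render_template_string(CRAWLER_PAGE, **data)
    return app.send_static_file("index.html")   # the normal GWT bootstrap page
```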

FF Xpather to Nokogiri — Can I just copy and paste?

Submitted by 你离开我真会死。 on 2019-12-12 04:39:46

Question: I was doing this manually, then got stuck, and I can't figure out why it's not working. I downloaded XPather and it gives me:

/html/body/center/table/tbody/tr[3]/td/table

as the path to the item I want. I have manually confirmed that this path is correct, but when I paste it into my code, all it returns is nil. Here is my code:

```ruby
a = parentdoc.at_xpath("//html/body/center/table/tbody/tr[3]/td/table[1]")
puts a
```

If I do something like this:

```ruby
a = parentdoc.at_xpath("//html/body/center")
puts a
```
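A likely cause, stated here as a general note since the quoted question is cut off: Firefox — which XPather reads from — inserts a <tbody> into every table in its DOM, while Nokogiri parses the raw markup, which often has no <tbody> at all, so the browser-copied path points at an element that does not exist. Dropping /tbody usually fixes it. A quick demonstration of the mismatch with Python's lxml (which, like Nokogiri, is built on libxml2):

```python
from lxml import html

# Minimal stand-in markup: note the source has no <tbody> anywhere.
raw = """<html><body><center><table>
<tr><td>one</td></tr>
<tr><td>two</td></tr>
<tr><td><table><tr><td>target</td></tr></table></td></tr>
</table></center></body></html>"""

doc = html.fromstring(raw)

# The browser-copied path, complete with tbody, matches nothing:
print(doc.xpath("/html/body/center/table/tbody/tr[3]/td/table"))  # => []
# Dropping /tbody matches the markup as the parser actually sees it:
print(doc.xpath("/html/body/center/table/tr[3]/td/table"))        # => [<Element table ...>]
```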

Ruby scraper. How to export to CSV?

Submitted by 。_饼干妹妹 on 2019-12-11 05:25:43

Question: I wrote this Ruby script to scrape product info from the manufacturer's website. Scraping and storing the product objects in an array works, but I can't figure out how to export the array data to a CSV file. This error is thrown:

```
scraper.rb:45: undefined method `send_data' for main:Object (NoMethodError)
```

I do not understand this piece of code. What is it doing, and why isn't it working?

```ruby
send_data csv_data, :type => 'text/csv; charset=iso-8859-1; header=present', :disposition…
```
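The error is telling the truth: send_data is a Rails controller helper that streams a file to the browser as an HTTP response, so it does not exist in a standalone script, and NoMethodError is exactly right. In a plain script you write the file directly. The pattern, sketched in Python for illustration (the original is Ruby, and the field names here are made up):

```python
import csv

# Stand-ins for the scraped product objects; the fields are hypothetical.
products = [
    {"name": "Widget", "price": "9.99"},
    {"name": "Gadget", "price": "19.99"},
]

with open("products.csv", "w", newline="", encoding="iso-8859-1") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()     # the 'header=present' part of the original call
    writer.writerows(products)
```

In Ruby itself the standard library's CSV.open gives the same result without any Rails machinery.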

Scraper fails on files over ~390KB

Submitted by 与世无争的帅哥 on 2019-12-11 03:58:02

Question: Does Facebook's URL scraper have a size limitation? We have several books available on a website. Those with an HTML file size under a certain threshold (~390 KB) are scraped and read properly, but the 4 larger ones are not. The larger items get a 200 response code, and the canonical URL opens. All of these pages are built from the same template; the only differences are the size of the content within each book and the number of links each book makes to other pages on the site.

Scraping sites with javascript screen delay [closed]

Submitted by 可紊 on 2019-12-10 15:15:52

Question: I'm attempting to scrape a site that has a split-second Javascript delay. I'm currently using Python for scraping. Whenever I 'get' the page, the Javascript delay has not finished and the new DOM has not completely loaded yet. How would I scrape such a page?

Answer 1: You can extend …
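The quoted answer breaks off mid-sentence. One standard approach — a sketch, not necessarily what that answer went on to say: drive a real browser and explicitly wait for the element the JavaScript builds before reading the page. The URL and element id below are hypothetical:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get("http://example.com/")                       # hypothetical URL
element = WebDriverWait(driver, 10).until(              # poll for up to 10 seconds
    EC.presence_of_element_located((By.ID, "content"))  # hypothetical element id
)
rendered = driver.page_source                           # the DOM after the JS delay
driver.quit()
```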