scraper

How to Parse this HTML with Web::Scraper?

Submitted by 流过昼夜 on 2019-12-24 00:42:31

Question: I am trying to use Web::Scraper to parse the following HTML:

```html
<div>
  <p><strong>TITLE1</strong> <br> DESCRIPTION1 </p>
  <p><strong>TITLE2</strong> <br> DESCRIPTION2 </p>
  <p><strong>TITLE3</strong> <br> DESCRIPTION3 </p>
</div>
```

into:

```perl
'test' => [
  { 'name' => 'TITLE1', 'desc' => 'DESCRIPTION1 ' },
  { 'name' => 'TITLE2', 'desc' => 'DESCRIPTION2 ' },
  { 'name' => 'TITLE3', 'desc' => 'DESCRIPTION3 ' }
]
```

I have the following code but haven't had much luck: 'TEXT' when processing 'p' gives both the text …
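The Perl snippet is cut off above, so as a point of reference only — switching both language and library — here is how the same title/description pairing could be sketched with Python's BeautifulSoup. This illustrates the extraction logic, not the Web::Scraper answer:

```python
from bs4 import BeautifulSoup

html = """
<div>
  <p><strong>TITLE1</strong> <br> DESCRIPTION1 </p>
  <p><strong>TITLE2</strong> <br> DESCRIPTION2 </p>
  <p><strong>TITLE3</strong> <br> DESCRIPTION3 </p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
test = [
    {
        "name": p.strong.get_text(strip=True),  # the text inside <strong>
        "desc": p.br.next_sibling.strip(),      # the text node right after <br>
    }
    for p in soup.select("div > p")
]
print(test)
# [{'name': 'TITLE1', 'desc': 'DESCRIPTION1'}, ...]
```

The key move is treating the text node after each `<br>` as the description instead of taking the whole `<p>` text — which is exactly what the 'TEXT' approach in the question stumbles over, since 'TEXT' on `p` returns title and description together.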

delay in a for loop for http request

Submitted by 风格不统一 on 2019-12-23 16:09:12

Question: I am just getting started with JS and Node.js. I am trying to build a simple scraper as a first project, using Node.js and modules such as request and cheerio. I would like to add a 5-second delay between the HTTP requests for each domain contained in the array. Can you explain how to do this? Here is my code:

```javascript
var request = require('request');

var arr = [
  "http://allrecipes.com/",
  "http://www.gossip.fr/"
];

for (var i = 0; i < arr.length; i++) {
  request(arr[i], function (error, response, body) {
    // … (the rest of the callback is cut off in the source)
  });
}
```
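The underlying problem: request() is asynchronous, so this plain for loop fires every request at once; the delay has to be scheduled (for example with setTimeout) rather than written inline. Purely to illustrate the throttling pattern, here is the equivalent loop in synchronous Python, where a sleep between iterations is enough — a sketch, not the Node answer:

```python
import time
import requests

urls = ["http://allrecipes.com/", "http://www.gossip.fr/"]

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    time.sleep(5)  # pause 5 seconds before the next request
```

In Node itself the usual fix is to stagger each request with setTimeout(..., i * 5000) or to chain the requests with promises, since blocking sleeps are not available.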

Python selenium get inside a #document

Submitted by 百般思念 on 2019-12-22 08:35:15

Question: How can I keep looking for elements inside a #document:

```html
<div>
  <iframe>
    #document
    <html>
      <body>
        <div> Element I want to find </div>
      </body>
    </html>
  </iframe>
</div>
```

Answer 1: I think your problem is not with the #document but with the iframe.

```python
from selenium import webdriver

driver = webdriver.Firefox()
iframe = driver.find_elements_by_tag_name('iframe')[0]
driver.switch_to_frame(iframe)
driver.find_element_by_xpath("//div")
```

Source: https://stackoverflow.com/questions/38363643/python-selenium-get-inside-a
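The answer's method names (find_elements_by_tag_name, switch_to_frame) were later deprecated; in Selenium 4 the same steps look like this — a sketch of the current API, same logic as the answer above:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
iframe = driver.find_elements(By.TAG_NAME, "iframe")[0]
driver.switch_to.frame(iframe)           # enter the iframe's #document
element = driver.find_element(By.XPATH, "//div")
driver.switch_to.default_content()       # jump back to the top-level page when done
```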

XPath to select between two HTML comments is not working?

Submitted by 一个人想着一个人 on 2019-12-20 07:13:35

Question: I'm trying to select some content between two HTML comments, but I'm having trouble getting it right (as seen in "XPath to select between two HTML comments?"). There seems to be a problem when two comments fall on the same line. My HTML:

```html
<html>
........
<!-- begin content -->
<div>some text</div>
<div>
  <p>Some more elements</p>
</div>
<!-- end content --><!-- begin content -->
<div>more text</div>
<!-- end content -->
.......
</html>
```

I use:

```
doc.xpath("//node()[preceding-sibling::comment(…
```
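Counting preceding/following comment siblings in pure XPath pairs the markers up incorrectly when an end marker and the next begin marker sit back to back. A line-agnostic alternative — sketched with Python's lxml rather than whatever library the question uses, and assuming the markers and the content are siblings — is to walk the siblings after each begin marker until the matching end marker appears:

```python
from lxml import etree, html

page = """<html><body>
<!-- begin content --><div>some text</div><div><p>Some more elements</p></div>
<!-- end content --><!-- begin content --><div>more text</div><!-- end content -->
</body></html>"""

doc = html.fromstring(page)
blocks = []
for begin in doc.xpath("//comment()[contains(., 'begin content')]"):
    nodes = []
    for sib in begin.itersiblings():  # document order; comments are included
        if sib.tag is etree.Comment and "end content" in sib.text:
            break                     # reached the matching end marker
        nodes.append(sib)
    blocks.append(nodes)

print([[etree.tostring(n, encoding="unicode") for n in b] for b in blocks])
```

Because the walk is over sibling nodes rather than lines or comment counts, the adjacent `<!-- end content --><!-- begin content -->` pair is handled like any other.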

How to crawl with php Goutte and Guzzle if data is loaded by Javascript?

Submitted by 南楼画角 on 2019-12-19 02:43:04

Question: Many times when crawling, we run into problems where content rendered on the page is generated with Javascript, and the scraper is therefore unable to crawl it (e.g. ajax requests, jQuery).

Answer 1: You want to have a look at phantomjs. There is this PHP implementation: http://jonnnnyw.github.io/php-phantomjs/ — if you need to have it working with PHP, of course. You could read the page and then feed the contents to Guzzle, in order to use the nice functions that Guzzle gives you (like searching for …
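The same render-first idea works in any stack: let a headless browser execute the JavaScript, then hand the finished HTML to an ordinary parser. A Python sketch of the pattern, with Selenium standing in for phantomjs and BeautifulSoup standing in for Guzzle's crawler, and a placeholder URL:

```python
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

opts = Options()
opts.add_argument("--headless")              # render without opening a window
driver = webdriver.Chrome(options=opts)
driver.get("http://example.com/")            # hypothetical URL
rendered = driver.page_source                # the DOM *after* JavaScript has run
driver.quit()

soup = BeautifulSoup(rendered, "html.parser")  # now parse the rendered markup
print(soup.title.get_text())
```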

Facebook scraper doesn't load dynamic meta-tags

Submitted by 只谈情不闲聊 on 2019-12-18 07:39:42

Question: I am creating the HTML meta-tags dynamically using the function below (GWT). It takes 1 second for them to appear in the DOM. This works fine everywhere except Facebook: when I share a link from my web app, the scraper gets only the meta-tags that are in the static HTML — none. How can I fix this?

```java
/**
 * Include the HTML attributes: title, description and keywords (meta tags)
 */
private void createHTMLheader(MyClass thing) {
    String title = thing.getTitle();
    String description = thing.getDescription();
    Document.get()…
```
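Facebook's scraper does not execute JavaScript, so meta-tags that appear in the DOM a second after load are invisible to it; the tags have to be in the HTML the server sends. One common workaround is to pre-render the tags server-side for crawler requests. A sketch of the shape in Python/Flask — the route, data, and page names are all hypothetical, not part of the original question:

```python
from flask import Flask, request, render_template_string

app = Flask(__name__)

# Stand-in data; in the real app this would come from wherever MyClass does.
THINGS = {"42": {"title": "Example title", "description": "Example description"}}

CRAWLER_PAGE = """<html><head>
<meta property="og:title" content="{{ title }}">
<meta property="og:description" content="{{ description }}">
</head><body></body></html>"""

@app.route("/thing/<thing_id>")
def thing(thing_id):
    ua = request.headers.get("User-Agent", "")
    if "facebookexternalhit" in ua:              # Facebook's crawler identifies itself
        data = THINGS.get(thing_id, {"title": "", "description": ""})
        return render_template_string(CRAWLER_PAGE, **data)
    return app.send_static_file("index.html")   # the normal GWT bootstrap page
```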

FF Xpather to Nokogiri — Can I just copy and paste?

Submitted by 你离开我真会死。 on 2019-12-12 04:39:46

Question: I was doing this manually, then got stuck, and I can't figure out why it's not working. I downloaded XPather and it gives me:

/html/body/center/table/tbody/tr[3]/td/table

as the path to the item I want. I have manually confirmed that this path is correct, but when I paste it into my code, all it returns is nil. Here is my code:

```ruby
a = parentdoc.at_xpath("//html/body/center/table/tbody/tr[3]/td/table[1]")
puts a
```

If I do something like this:

```ruby
a = parentdoc.at_xpath("//html/body/center")
puts a
```
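A likely cause, stated here as a general note since the quoted question is cut off: Firefox — which XPather reads from — inserts a <tbody> into every table in its DOM, while Nokogiri parses the raw markup, which often has no <tbody> at all, so the browser-copied path points at an element that does not exist. Dropping /tbody usually fixes it. A quick demonstration of the mismatch with Python's lxml (which, like Nokogiri, is built on libxml2):

```python
from lxml import html

# Minimal stand-in markup: note the source has no <tbody> anywhere.
raw = """<html><body><center><table>
<tr><td>one</td></tr>
<tr><td>two</td></tr>
<tr><td><table><tr><td>target</td></tr></table></td></tr>
</table></center></body></html>"""

doc = html.fromstring(raw)

# The browser-copied path, complete with tbody, matches nothing:
print(doc.xpath("/html/body/center/table/tbody/tr[3]/td/table"))  # => []
# Dropping /tbody matches the markup as the parser actually sees it:
print(doc.xpath("/html/body/center/table/tr[3]/td/table"))        # => [<Element table ...>]
```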

Ruby scraper. How to export to CSV?

Submitted by 。_饼干妹妹 on 2019-12-11 05:25:43

Question: I wrote this Ruby script to scrape product info from the manufacturer's website. Scraping and storing the product objects in an array works, but I can't figure out how to export the array data to a CSV file. This error is thrown:

```
scraper.rb:45: undefined method `send_data' for main:Object (NoMethodError)
```

I do not understand this piece of code. What is it doing, and why isn't it working?

```ruby
send_data csv_data, :type => 'text/csv; charset=iso-8859-1; header=present', :disposition…
```
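The error is telling the truth: send_data is a Rails controller helper that streams a file to the browser as an HTTP response, so it does not exist in a standalone script, and NoMethodError is exactly right. In a plain script you write the file directly. The pattern, sketched in Python for illustration (the original is Ruby, and the field names here are made up):

```python
import csv

# Stand-ins for the scraped product objects; the fields are hypothetical.
products = [
    {"name": "Widget", "price": "9.99"},
    {"name": "Gadget", "price": "19.99"},
]

with open("products.csv", "w", newline="", encoding="iso-8859-1") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()     # the 'header=present' part of the original call
    writer.writerows(products)
```

In Ruby itself the standard library's CSV.open gives the same result without any Rails machinery.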

Scraper fails on files over ~390KB

Submitted by 与世无争的帅哥 on 2019-12-11 03:58:02

Question: Does Facebook's URL scraper have a size limitation? We have several books available on a website. Those with an HTML file size under a certain threshold (~390 KB) are scraped and read properly, but the 4 larger ones are not. The larger items get a 200 response code, and the canonical URL opens. All of these pages are built from the same template; the only differences are the size of the content within each book and the number of links each book makes to other pages on the site.

Scraping sites with javascript screen delay [closed]

Submitted by 可紊 on 2019-12-10 15:15:52

Question: I'm attempting to scrape a site that has a split-second Javascript delay. I'm currently using Python for scraping. Whenever I 'get' the page, the Javascript delay has not finished and the new DOM has not completely loaded yet. How would I scrape such a page?

Answer 1: You can extend …
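The quoted answer breaks off mid-sentence. One standard approach — a sketch, not necessarily what that answer went on to say: drive a real browser and explicitly wait for the element the JavaScript builds before reading the page. The URL and element id below are hypothetical:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get("http://example.com/")                       # hypothetical URL
element = WebDriverWait(driver, 10).until(              # poll for up to 10 seconds
    EC.presence_of_element_located((By.ID, "content"))  # hypothetical element id
)
rendered = driver.page_source                           # the DOM after the JS delay
driver.quit()
```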