nokogiri | 易学教程

Data scraping multiple array creation and ordering

Question: We're trying to scrape the course name, qualification and duration of each course and store each in a separate array. The code below pulls all of that, but the results seem to come back in random order, with some parts possibly ordered by page. Can anybody help? require 'mechanize' mechanize = Mechanize.new @duration_array = [] @qual_array = [] @courses_array = [] page = mechanize.get('http://search.ucas.com/search/results?Vac=2&AvailableIn=2016&IsFeatherProcessed=True&page=1

Parsing XML to hash with Nori and Nokogiri with undesired result

Question: I am attempting to convert an XML document to a Ruby hash using Nori. But instead of receiving a collection under the root element, I get back a new node that contains the collection. This is what I am doing: @xml = content_for(:layout) @hash = Nori.new(:parser => :nokogiri, :advanced_typecasting => false).parse(@xml) or @hash = Hash.from_xml(@xml) Where the content of @xml is: <bundles> <bundle> <id>6073</id> <name>Bundle-1</name> <status>1</status> <bundle_type> <id>6713</id> <name>BundleType-1<
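XML-to-hash converters generally nest repeated children under their own element name, so the collection ends up one level below the root key. The sketch below does not call Nori; it uses a literal hash in the shape such a parse is assumed to produce, and a hypothetical helper `bundles_from` that always returns an Array. The second bundle is invented sample data.

```ruby
# Assumed result shape for a <bundles> root with repeated <bundle> children.
# The second entry is made-up sample data for illustration.
parsed = {
  'bundles' => {
    'bundle' => [
      { 'id' => '6073', 'name' => 'Bundle-1', 'status' => '1' },
      { 'id' => '6074', 'name' => 'Bundle-2', 'status' => '0' }
    ]
  }
}

# Hypothetical helper: unwrap the nested node and normalise it to an Array.
# A document with a single <bundle> typically yields a Hash instead of an
# Array, and an empty root yields nil, so both cases are handled.
def bundles_from(parsed)
  node = parsed.dig('bundles', 'bundle')
  node.is_a?(Hash) ? [node] : Array(node)
end

bundles = bundles_from(parsed)
single  = bundles_from('bundles' => { 'bundle' => { 'id' => '6073' } })
empty   = bundles_from('bundles' => nil)
```

The explicit `is_a?(Hash)` check matters because `Array(some_hash)` would turn the hash into a list of key/value pairs rather than wrapping it.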

Find a table containing specific text

Question: I have a table: html =' <table cellpadding="1" cellspacing="0" width="100%" border="0"> <tr> <td colspan="9" class="csoGreen"><b class="white">Bill Statement Detail</b></td> </tr> <tr style="background-color: #D8E4F6;vertical-align: top;"> <td nowrap="nowrap"><b>Bill Date</b></td> <td nowrap="nowrap"><b>Bill Amount</b></td> <td nowrap="nowrap"><b>Bill Due Date</b></td> <td nowrap="nowrap"><b>Bill (PDF)</b></td> </tr> </table> ' I use the code suggested in this post (XPath matching text in a

Parsing: Can I pick up the URL of embedded CSS Background in Nokogiri?

Question: The HTML I am parsing contains images set through inline CSS on a table. Can I use Nokogiri to determine what the URL component is? Here is a snippet of the code I'd like to parse. tl;dr: I'd like to get the .png in this HTML snippet using Nokogiri: <table border="0" cellspacing="0" cellpadding="0" width="300" height="300" background="http://s3.amazonaws.com/static.example.com/sale/homepage/3166-300x300-1328107072.png" style="background-image:url('http://s3.amazonaws.com/static.example.com/sale/homepage/3166

Nokogiri HTML parsing not working

Question: I am trying to parse some HTML with Nokogiri, but I am not getting anything back from the css or xpath methods. require 'rubygems' require 'open-uri' require 'nokogiri' doc = Nokogiri::HTML(open("http://www.google.com")) doc.css('div').each do |div| puts div.content end doc.xpath('//div').each do |div| puts div.content end Nothing gets printed to the screen, so css and xpath are returning empty arrays. There are at least 100 divs in Google's homepage. doc.to_html returns: <!DOCTYPE html>\n\n

Get value from an HTTP GET response body via Nokogiri?

Question: I get this result from an HTTP page: <!DOCTYPE html> <html> <head> <title>Captchaservice</title> </head> <body> 15 </body> </html> And I use this Nokogiri code: doc = Nokogiri::HTML( response ) id = doc.xpath('//').text But I get \n 15 \n etc. I tried to write: id = doc.xpath('//').text.to_i And I get this value, but when I use this ID I get: undefined method `empty?' for 15:Fixnum What am I doing wrong, and how do I get this integer value? Answer 1: That's because your id is an instance of

Cannot install Mechanize for Ruby on Mac

Question: I am trying to install mechanize on Mac OS X version 10.7.3 with Ruby version 1.8.7. The problem is with one of its dependencies, nokogiri. I have seen other posts about having Xcode installed, and I do have it; it is version 4.3.2. Here is the error I am receiving. Thank you in advance. sudo gem install mechanize Building native extensions. This could take a while... ERROR: Error installing mechanize: ERROR: Failed to build gem native extension. /System/Library/Frameworks/Ruby.framework/Versions/1.8

:has CSS pseudo class in Nokogiri

Question: I'm looking for the pseudo-class :has in Nokogiri. It should work just like jQuery's has selector. For example: <li><h1><a href="dfd">ex1</a></h1><span class="string">sdfsdf</span></li> <li><h1><a href="dsfsdf">ex2</a></h1><span class="string"></span></li> <li><h1><a href="sdfd">ex3</a></h1></li> The CSS selector should return only the first link, the one with the non-empty span.string sibling. In jQuery this selector works well: $('li:has(span.string:not(:empty))>h1>a') but not in Nokogiri:

Nokogiri for selecting text and HTML between unique sets of tags

Question: I am trying to use Nokogiri to extract the text in between two unique sets of tags. What is the best way to get the text within the p tag between <h2 class="point">The problem</h2> and <h2 class="point">The solution</h2> , and then all of the HTML between <h2 class="point">The solution</h2> and <div class="frame box sketh"> ? Sample of the full HTML: <h2 class="point">The problem</h2> <p>TEXT I WANT </p> <h2 class="point">The solution</h2> HTML I WANT with its own set of tags (but never

How do I scrape data through Mechanize and Nokogiri?

Question: I am working on an application which gets the HTML from http://www.screener.in/. I can enter a company name like "Atul Auto Ltd" and submit it and, from the next page, scrape the following details: "CMP/BV" and "CMP". I am using this code: require 'mechanize' require 'rubygems' require 'nokogiri' Company_name='Atul Auto Ltd.' agent = Mechanize.new page = agent.get('http://www.screener.in/') form = agent.page.forms[0] print agent.page.forms[0].fields agent.page.forms[0]["q"]=Company_name