nokogiri | 易学教程

How to avoid creating non-significant white space text nodes when creating a `Nokogiri::XML` or `Nokogiri::HTML` object

Question: When parsing indented XML, non-significant whitespace text nodes are created from the whitespace between a closing tag and the next opening tag. For example, from the following XML: <note> <to>Tove</to> <from>Jani</from> <heading>Reminder</heading> <body>Don't forget me this weekend!</body> </note> whose string representation is as follows, "<note>\n <to>Tove</to>\n <from>Jani</from>\n <heading>Reminder</heading>\n <body>Don't forget me this weekend!</body>\n</note>\n" the following Document is

What is the absolutely cheapest way to select a child node in Nokogiri?

Question: I know that there are dozens of ways to select the first child element in Nokogiri, but which is the cheapest? I can't get around using Node#children, which sounds awfully expensive. Say that there are 10000 child nodes, and I don't want to touch the 9999 others... Answer 1: Node#child is the fastest way to get the first child element. However, if the node you're looking for is NOT the first (e.g., the 99th), then there is no faster way to select that node than to call #children and index into it. You are correct in stating that it's expensive to build a NodeSet for all children if you only want the

Nokogiri parsing multiple XML feeds at once and sort by date

Question: I am using Rails and Nokogiri to parse some XML feeds. I have parsed one XML feed, and I want to parse multiple feeds and sort the items by date. They are WordPress feeds, so they have the same structure. In my controller I have: def index doc = Nokogiri::XML(open('http://somewordpressfeed')) @content = doc.xpath('//item').map do |i| {'title' => i.xpath('title').text, 'url' => i.xpath('link').text, 'date' => i.xpath('pubDate').text.to_datetime} end end In my view I have: <ul> <% @content.each

How do I integrate these two conditions block codes to mine in Ruby?

Question: How do I integrate these two conditions if my code scrapes without them? My code is working already, but it scrapes all rows (non-bold and bold values) and doesn't scrape the title attribute string. Condition 1: parses a table row only if one of its fields is bold: doc = Nokogiri::HTML(html) doc.xpath('//table[@class="articulos"]/tr[td[5]/p/b]').each do |row| puts row.at_xpath('td[3]/text()') end Condition 2: gets only the number from the title attribute string: doc = Nokogiri::HTML(html)

How do I iterate through all records and pass database value to a variable?

Question: I have two tables, "Que" and "Opts". I want to iterate through all the records in Que and add them to the variables rikt_nr, start_nr and end_nr, because they will go on the end of a URL, which will look like: api.url.number=8-00001 How do I make it iterate through Que and pass rikt_nr, start_nr and end_nr to the rest of the code? The Que table has these fields: create_table "ques", force: true do |t| t.integer "rikt_nr" t.integer "start_nr" t.integer "end_nr" t.datetime "created_at" t

How do I parse XML using Nokogiri and split a node value?

Question: I'm using Nokogiri to parse XML. doc = Nokogiri::XML("http://www.enhancetv.com.au/tvguide/rss/melbournerss.php") I wasn't sure how to actually retrieve node values correctly. I'm after the title, link, and description nodes in particular that sit under the item parent nodes. <item> <title>Toasted TV - TEN - 07:00:00 - 21/12/2011</title> <link>http://www.enhancetv.com.au/tvguide/</link> <description>Join the team for the latest in gaming, sport, gadgets, pop culture, movies, music and other

Find table in an array with the most rows using Ruby, Nokogiri and Mechanize

Question: @p = mechanize.get(url) tables = @p.search('table.someclass') I'm basically going over about 200 pages, putting the tables in an array, and the only way to sort is to find the table with the greatest number of rows. So I want to be able to look at each item in the array and select the first item with the greatest number of rows. I've been trying to use max_by, but that won't work because I need to search the table that is the array item to find the tr.count. Answer 1: Two ways: biggest =

Nokogiri generating invalid HTML?

Question: I need to process an HTML document and insert some nodes in a few places. The content I'm processing is not valid, but Nokogiri is smart enough to figure out what it should be. The problem is that I don't want to change the original document's formatting, other than the pieces I'm inserting. Here is an example: require 'nokogiri' orig_html = ' <html> <meta name="Generator" content="Microsoft Word 97 O.o"> <body> 1 <b><p>2</p></b> 3 </body> </html>' puts Nokogiri::HTML(orig_html).inner_html #

Extract background-image from an HTML element in ruby

Question: I am trying to extract the background URL from a div using Nokogiri, but I am not able to parse it. While searching on Stack Overflow I found this link, Parsing: Can I pick up the URL of embedded CSS Background in Nokogiri?, but the solution given there doesn't work. Answer 1: Nokogiri is not a web browser. It stands on top of libxml2 to provide fast and excellent parsing of XML and HTML, and manipulation and extraction of data from this. It only deals with the HTML in a web page. It does not

Data scraping multiple page clicks loops

Question: Trying to figure out a way to use Mechanize to scrape all of the data we want from the UCAS website and add it to arrays. Currently we're struggling with coding the link clicks in Mechanize. Wondering if anyone can help: there are three successive link clicks amidst loops to progress through all search result pages. The first link, to display all courses for a university, is within div class morecourseslink; the second link, to display course names, duration and qual, is in div class