nokogiri | 易学教程

Nokogiri adds characters during parsing on Heroku

Question: It seems like Nokogiri has a problem with UTF-8 conversion of the nbsp character. I've gathered this is an issue related to LibXML2. Nokogiri recommends upgrading LibXML2 to 2.7.7 instead of the 2.7.6 that's running on Heroku. Does anyone know how I can use LibXML2 2.7.7 (or higher) on Heroku? The problem is as follows:

    doc = Nokogiri::HTML("<html><p>Hi Hello</p></html>")
    doc.inner_html
    => "<html><body><p>Hi Hello</p></body></html>"
    doc.inner_html = "<p>Hello World</p>"
    => "<p>Hello World</p>"
    doc

How do I run all rake tasks?

Question: I have just installed the whenever gem (https://github.com/javan/whenever) to run my rake tasks, which are Nokogiri/Feedzilla-dependent scraping tasks. For example, my tasks are called grab_bbc, grab_guardian, etc. My question: as I update my site, I keep adding more tasks to scheduler.rake. What should I write in my config/schedule.rb to make all rake tasks run, no matter what they are called? Would something like this work?

    every 12.hours do
      rake:task.each do |task|
        runner task
      end
    end

Am new to Cron, using
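
One possible approach, sketched under assumptions rather than taken from the original thread: define a single aggregate Rake task that invokes every task sharing the `grab_` prefix, so the schedule only ever has to name that one task. The task bodies and the names `grab_tasks` and `scrape_all` are hypothetical; only `grab_bbc` and `grab_guardian` come from the question.

```ruby
require 'rake'

# Hypothetical stand-ins for the real scraping tasks in scheduler.rake.
task(:grab_bbc)      { puts 'scraping BBC' }
task(:grab_guardian) { puts 'scraping Guardian' }

# Assumption: every scraping task shares the "grab_" name prefix.
def grab_tasks
  Rake::Task.tasks.select { |t| t.name.start_with?('grab_') }
end

desc 'Run every grab_* task, whatever has been added since'
task :scrape_all do
  grab_tasks.each(&:invoke)
end
```

With a task like this in place, config/schedule.rb could hold a single `rake "scrape_all"` entry inside the `every 12.hours do ... end` block, and newly added `grab_*` tasks would be picked up without touching the schedule.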

Use XPath to group siblings from an HTML/XML document?

Question: I want to transform an HTML or XML document by grouping previously ungrouped sibling nodes. For example, I want to take the following fragment:

    <h2>Header</h2>
    <p>First paragraph</p>
    <p>Second paragraph</p>
    <h2>Second header</h2>
    <p>Third paragraph</p>
    <p>Fourth paragraph</p>

Into this:

    <section>
      <h2>Header</h2>
      <p>First paragraph</p>
      <p>Second paragraph</p>
    </section>
    <section>
      <h2>Second header</h2>
      <p>Third paragraph</p>
      <p>Fourth paragraph</p>
    </section>

Is this possible using simple

nokogiri: how to insert tbody tag immediately after table tag?

Question: I want to make sure that the immediate child of every table is a tbody. How can I write this with XPath or Nokogiri?

    doc.search("//table/").each do |j|
      new_parent = Nokogiri::XML::Node.new('tbody',doc)
      j.replace new_parent
      new_parent << j
    end

Answer 1:

    require 'rubygems'
    require 'nokogiri'

    html = Nokogiri::HTML(DATA)
    html.xpath('//table').each do |table|
      # Remove all existing tbody tags to avoid nesting them.
      table.xpath('tbody').each do |existing_tbody|
        existing_tbody.swap(existing_tbody.children)
      end

nokogiri xpath attribute - strange results

Question: I have a bunch of fields, and when I try to run:

    src.xpath('//RECORD').each do |record|
      tbegin = record.xpath('//FIELD/TOKEN')

the tbegin array returns the fields from other records. I've checked that the first line is giving me the appropriate array of "record" subtrees, but the next call for tbegin doesn't limit the search to just the "record" subtree. In fact, it consistently returns the field subtree of record[0]. Thus far, I've gotten around this by using:

    tbegin = record.css('TOKEN')

Extracting elements with Nokogiri

Question: I was wondering if someone could help out with the following. I am using Nokogiri to scrape some data from http://www.bbc.co.uk/sport/football/tables. I would like to get the league table info. So far I've got this:

    def get_league_table
      # Get me Premier League Table
      doc = Nokogiri::HTML(open(FIXTURE_URL))
      table = doc.css('.table-stats')
      teams = table.xpath('following-sibling::*[1]').css('tr.team')
      teams.each do |team|
        position = team.css('.position-number').text.strip
        League.create!(position:

How to split a HTML document using Nokogiri?

Question: Right now, I'm splitting the HTML document into small pieces like this (regular expression simplified, skipping header tag content and closing tag):

    document.at('body').inner_html.split(/<\s*h[2-6][^>]*>/i).collect do |fragment|
      Nokogiri::HTML(fragment)
    end

Is there an easier way to perform that splitting? The document is very simple, just headers, paragraphs and formatted text in it. For example:

    <body>
    <h1>Main</h1>
    <h2>Sub 1</h2>
    <p>Text</p>
    -----
    <h2>Sub 2</h2>
    <p>Text</p>
    -----
    <h3>Sub 2

How to tidy up malformed xml in ruby

Question: I'm having issues tidying up malformed XML code I'm getting back from the SEC's EDGAR database. For some reason they have horribly formed XML. Tags that contain any sort of string aren't closed, and a tag can actually contain other XML or HTML documents inside other tags. Normally I'd hand this off to Tidy, but that isn't being maintained. I've tried using Nokogiri::XML::SAX::Parser, but that seems to choke because the tags aren't closed. It seems to work alright until it hits the first ending tag

Nokogiri Error: undefined method `radiobutton_with' - Why?

Question: I am trying to access a form using Mechanize (Ruby). On my form I have a group of radio buttons, and I want to check one of them. I wrote:

    target_form = (page/:form).find{ |elem| elem['id'] == 'formid'}
    target_form.radiobutton_with(:name => "radiobuttonname")[2].check

In this line I want to check the radio button with the value of 2. But in this line, I get an error:

    : undefined method `radiobutton_with' for #<Nokogiri::XML::Element:0x9b86ea> (NoMethodError)

Answer 1: The problem occurred because using a

Screen scraping in clojure

Question: I googled, but I can't find a satisfactory answer. This SO question is related but kind of old, as well as the exact opposite of what I am looking for: a way to do screen scraping using XPath, not CSS selectors. I've used Enlive for some basic screen scraping, but sometimes one needs the power of XPath selectors. So here it is: Is there any equivalent to Nokogiri or lxml for Clojure (Java)? What is the state of the "pure Java Nokogiri"? Any way to use the library from Clojure? Any better