nokogiri

Nokogiri adds characters during parsing on Heroku

假装没事ソ submitted on 2019-12-08 06:05:15
Question: It seems that Nokogiri has a problem with UTF-8 conversion of the nbsp character. I've gathered that this is an issue with LibXML2. Nokogiri recommends upgrading LibXML2 to 2.7.7 from the 2.7.6 that runs on Heroku. Does anyone know how I can use LibXML2 2.7.7 (or higher) on Heroku? The problem is as follows:

    doc = Nokogiri::HTML("<html><p>Hi Hello</p></html>")
    doc.inner_html
    => "<html><body><p>Hi Hello</p></body></html>"
    doc.inner_html = "<p>Hello World</p>"
    => "<p>Hello World</p>"
    doc

How do I run all rake tasks?

十年热恋 submitted on 2019-12-08 04:24:00
Question: I have just installed the whenever gem (https://github.com/javan/whenever) to run my rake tasks, which are nokogiri / feedzilla dependent scraping tasks. For example, my tasks are called grab_bbc, grab_guardian, etc. My question: as I update my site, I keep adding more tasks to scheduler.rake. What should I write in my config/schedule.rb to make all rake tasks run, no matter what they are called? Would something like this work?

    every 12.hours do
      rake:task.each do |task|
        runner task
      end
    end

I am new to Cron, using

Use XPath to group siblings from an HTML/XML document?

穿精又带淫゛_ submitted on 2019-12-08 00:23:57
Question: I want to transform an HTML or XML document by grouping previously ungrouped sibling nodes. For example, I want to turn the following fragment:

    <h2>Header</h2>
    <p>First paragraph</p>
    <p>Second paragraph</p>
    <h2>Second header</h2>
    <p>Third paragraph</p>
    <p>Fourth paragraph</p>

into this:

    <section>
      <h2>Header</h2>
      <p>First paragraph</p>
      <p>Second paragraph</p>
    </section>
    <section>
      <h2>Second header</h2>
      <p>Third paragraph</p>
      <p>Fourth paragraph</p>
    </section>

Is this possible using simple

nokogiri: how to insert tbody tag immediately after table tag?

自古美人都是妖i submitted on 2019-12-08 00:08:18
Question: I want to make sure that the immediate child of every table is a tbody. How can I write this with XPath or Nokogiri?

    doc.search("//table/").each do |j|
      new_parent = Nokogiri::XML::Node.new('tbody',doc)
      j.replace new_parent
      new_parent << j
    end

Answer 1:

    require 'rubygems'
    require 'nokogiri'

    html = Nokogiri::HTML(DATA)
    html.xpath('//table').each do |table|
      # Remove all existing tbody tags to avoid nesting them.
      table.xpath('tbody').each do |existing_tbody|
        existing_tbody.swap(existing_tbody.children)
      end

nokogiri xpath attribute - strange results

只愿长相守 submitted on 2019-12-07 23:02:05
Question: I have a bunch of fields, and when I try to run:

    src.xpath('//RECORD').each do |record|
      tbegin = record.xpath('//FIELD/TOKEN')

the tbegin array returns the fields from other records. I've checked that the first line gives me the appropriate array of "record" subtrees, but the next call for tbegin doesn't limit the search to just the "record" subtree. In fact, it consistently returns the field subtree of record[0]. Thus far, I've worked around this by using:

    tbegin = record.css('TOKEN')

Extracting elements with Nokogiri

两盒软妹~` submitted on 2019-12-07 22:53:15
Question: I was wondering if someone could help out with the following. I am using Nokogiri to scrape some data from http://www.bbc.co.uk/sport/football/tables and I would like to get the league table info. So far I've got this:

    def get_league_table
      # Get me Premier League Table
      doc = Nokogiri::HTML(open(FIXTURE_URL))
      table = doc.css('.table-stats')
      teams = table.xpath('following-sibling::*[1]').css('tr.team')
      teams.each do |team|
        position = team.css('.position-number').text.strip
        League.create!(position:

How to split a HTML document using Nokogiri?

徘徊边缘 submitted on 2019-12-07 17:20:32
Question: Right now, I'm splitting the HTML document into small pieces like this (regular expression simplified, skipping header tag content and closing tag):

    document.at('body').inner_html.split(/<\s*h[2-6][^>]*>/i).collect do |fragment|
      Nokogiri::HTML(fragment)
    end

Is there an easier way to perform that splitting? The document is very simple, with just headers, paragraphs and formatted text in it. For example:

    <body>
    <h1>Main</h1>
    <h2>Sub 1</h2>
    <p>Text</p>
    -----
    <h2>Sub 2</h2>
    <p>Text</p>
    -----
    <h3>Sub 2

How to tidy up malformed xml in ruby

送分小仙女□ submitted on 2019-12-07 13:04:57
Question: I'm having issues tidying up the malformed XML I'm getting back from the SEC's EDGAR database. For some reason they have horribly formed XML. Tags that contain any sort of string aren't closed, and a tag can actually contain other XML or HTML documents. Normally I'd hand this off to Tidy, but that isn't being maintained. I've tried using Nokogiri::XML::SAX::Parser, but that seems to choke because the tags aren't closed. It seems to work all right until it hits the first ending tag

Nokogiri Error: undefined method `radiobutton_with' - Why?

时间秒杀一切 submitted on 2019-12-07 12:59:50
Question: I am trying to access a form using mechanize (Ruby). My form has a group of radio buttons, and I want to check one of them. I wrote:

    target_form = (page/:form).find{ |elem| elem['id'] == 'formid'}
    target_form.radiobutton_with(:name => "radiobuttonname")[2].check

In the second line I want to check the radio button with the value 2, but that line raises an error:

    : undefined method `radiobutton_with' for #<Nokogiri::XML::Element:0x9b86ea> (NoMethodError)

Answer 1: The problem occurred because using a

Screen scraping in clojure

纵然是瞬间 submitted on 2019-12-07 12:29:49
Question: I googled, but I can't find a satisfactory answer. This SO question is related but rather old, and it is the exact opposite of what I am looking for: a way to do screen scraping using XPath, not CSS selectors. I've used enlive for some basic screen scraping, but sometimes one needs the power of XPath selectors. So here it is: Is there any equivalent to Nokogiri or lxml for Clojure (Java)? What is the state of the "pure Java Nokogiri"? Is there any way to use the library from Clojure? Any better