nokogiri

nokogiri: how to insert tbody tag immediately after table tag?

你说的曾经没有我的故事 · submitted on 2019-12-06 10:34:06
I want to make sure every table's immediate child is a tbody. How can I write this with XPath or Nokogiri?

```ruby
doc.search("//table/").each do |j|
  new_parent = Nokogiri::XML::Node.new('tbody', doc)
  j.replace new_parent
  new_parent << j
end
```

```ruby
require 'rubygems'
require 'nokogiri'

html = Nokogiri::HTML(DATA)
html.xpath('//table').each do |table|
  # Remove all existing tbody tags to avoid nesting them.
  table.xpath('tbody').each do |existing_tbody|
    existing_tbody.swap(existing_tbody.children)
  end
  tbody = html.create_element('tbody')
  tbody.children = table.children
  table.children = tbody
end
puts html.xpath(
```

How to extract text from <script> tag by using nokogiri and mechanize?

落花浮王杯 · submitted on 2019-12-06 09:27:38
This is a part of the source code of a bookings web site:

```html
<script>
booking.ensureNamespaceExists('env');
booking.env.b_map_center_latitude = 53.36480155016638;
booking.env.b_map_center_longitude = -2.2752803564071655;
booking.env.b_hotel_id = '35523';
booking.env.b_query_params_no_ext = '?label=gen173nr-17CAEoggJCAlhYSDNiBW5vcmVmaFCIAQGYAS64AQTIAQTYAQHoAQH4AQs;sid=e1c9e4c7a000518d8a3725b9bb6e5306;dcid=1';
</script>
```

And I want to extract booking.env.b_hotel_id, so that I would get the value '35523'. How do I achieve this with Nokogiri and Mechanize? Hope somebody can help! Thanks! :) Jason

How do I scrape HTML between two HTML comments using Nokogiri?

纵饮孤独 · submitted on 2019-12-06 09:12:48
I have some HTML pages where the contents to be extracted are marked with HTML comments like below:

```html
<html>
.....
<!-- begin content -->
<div>some text</div>
<div><p>Some more elements</p></div>
<!-- end content -->
...
</html>
```

I am using Nokogiri and trying to extract the HTML between the <!-- begin content --> and <!-- end content --> comments. I want to extract the full elements between these two HTML comments:

```html
<div>some text</div>
<div><p>Some more elements</p></div>
```

I can get the text-only version using this characters callback:

```ruby
class TextExtractor < Nokogiri::XML::SAX::Document
  def
```

How do I scrape data from a page that loads specific data after the main page load?

不羁的心 · submitted on 2019-12-06 09:09:33
I have been using Ruby and Nokogiri to pull data from a URL similar to this one from the Hollister website: http://www.hollisterco.com/webapp/wcs/stores/servlet/TrackDetail?storeId=10251&catalogId=10201&langId=-1&URL=TrackDetailView&orderNumber=1316358

My script looks like this right now:

```ruby
require 'rubygems'
require 'nokogiri'
require 'open-uri'

page = Nokogiri::HTML(open("http://www.hollisterco.com/webapp/wcs/stores/servlet/TrackDetail?storeId=10251&catalogId=10201&langId=-1&URL
```
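Data that appears only after the main page load is usually fetched by a second XHR request, which open-uri/Nokogiri never see in the initial HTML. The common fix is to find that request in the browser's network tab and fetch its URL directly; such endpoints often return JSON rather than HTML. A sketch with a hypothetical response body standing in for the tracking endpoint's reply (the real field names would come from inspecting the actual XHR):

```ruby
require 'json'

# Hypothetical JSON such an endpoint might return for this order.
response_body = '{"orderNumber":"1316358","status":"Shipped"}'

data = JSON.parse(response_body)
puts data['status']
```

If the endpoint cannot be isolated, the alternative is a tool that executes JavaScript (e.g. a headless browser) instead of plain open-uri.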

how to get horizontal depth of a node?

冷暖自知 · submitted on 2019-12-06 09:02:01
Note: I made up the term "horizontal depth" to measure the sub-dimension of a node within a tree. So imagine a `<td>` which would have an XPath something like `/html/table/tbody/tr/td`, and a "horizontal depth" of 5. I am trying to see if there is a way to identify and select elements based on this horizontal depth. How can I find the maximum depth?

If you need all the nodes with depth >= 5: `/*/*/*/*//*`

And if you need all the nodes with depth == 5: `/*/*/*/*/*`

Actually, there is an XPath function `count`, which you can combine with the `ancestor` axis: `//*[count(ancestor::*) >= 4]`

I think that "vertical depth" and

nokogiri xpath attribute - strange results

人走茶凉 · submitted on 2019-12-06 08:04:45
I have a bunch of fields, and when I try to run:

```ruby
src.xpath('//RECORD').each do |record|
  tbegin = record.xpath('//FIELD/TOKEN')
```

the tbegin array returns the fields from other records. I've checked that the first line is giving me the appropriate array of "record" subtrees, but the next call for tbegin doesn't limit the search to just the "record" subtree. In fact, it consistently returns the field subtree of record[0]. Thus far, I've gotten around this by using:

```ruby
tbegin = record.css('TOKEN')
```

but I want to understand what I'm doing wrong. The problem is the leading double-slash in xpath('//FIELD

Ruby 2 Upgrade Breaks Nokogiri and/or open-uri Encoding?

好久不见. · submitted on 2019-12-06 05:42:13
I have a mystery to solve when upgrading our Rails 3.2 / Ruby 1.9 app to a Rails 3.2 / Ruby 2.1.2 one. Nokogiri seems to break, in that it changes its behavior using open-uri. No gem versions are changed, just the Ruby version (this is all on OS X Mavericks, using brew, gcc4, etc.). Steps to reproduce:

```
$ ruby -v
ruby 1.9.3p484 (2013-11-22 revision 43786) [x86_64-darwin13.1.0]
$ rails console
Connecting to database specified by database.yml
Loading development environment (Rails 3.2.18)
> feed = Nokogiri::XML(open(URI.encode("http://anyblog.wordpress.org/feed/")))
=> #(Document:0x3fcb82f08448 { name =
```

Extracting elements with Nokogiri

99封情书 · submitted on 2019-12-06 05:23:29
I was wondering if someone could help out with the following. I am using Nokogiri to scrape some data from http://www.bbc.co.uk/sport/football/tables. I would like to get the league table info; so far I've got this:

```ruby
def get_league_table
  # Get me Premier League Table
  doc = Nokogiri::HTML(open(FIXTURE_URL))
  table = doc.css('.table-stats')
  teams = table.xpath('following-sibling::*[1]').css('tr.team')
  teams.each do |team|
    position = team.css('.position-number').text.strip
    League.create!(position: position)
  end
end
```

So I thought I would grab the .table-stats and then get each row in the table with a

Adjusting timeouts for Nokogiri connections

跟風遠走 · submitted on 2019-12-06 04:41:04
Why does Nokogiri wait a couple of seconds (3-5) when the server is busy and I'm requesting pages one by one, but when these requests are in a loop, Nokogiri does not wait and throws the timeout message? I'm wrapping the request in a timeout block, but Nokogiri does not wait for that time at all. Any suggested procedure on this?

```ruby
# this is a method from the eng class
def get_page(url, page_type)
  begin
    timeout(10) do
      # Get a Nokogiri::HTML::Document for the page we’re interested in...
      @@doc = Nokogiri::HTML(open(url))
    end
  rescue Timeout::Error
    puts "Time out connection request"
    raise
  end
end
# this
```

How to click link in Mechanize and Nokogiri?

那年仲夏 · submitted on 2019-12-06 04:12:00
I'm using Mechanize to scrape Google Wallet for order data. I am capturing all the data from the first page; however, I need to automatically link to subsequent pages to get more info. The #purchaseOrderPager-pagerNextButton will move to the next page so I can pick up more records to capture. The element looks like this; I need to click on it to keep going:

```html
<a id="purchaseOrderPager-pagerNextButton" class="kd-button small right" href="purchaseorderlist?startTime=0&... ;currentPageStart=1
```