nokogiri

nokogiri: how to insert tbody tag immediately after table tag?

你说的曾经没有我的故事 · submitted on 2019-12-06 10:34:06
I want to make sure every table's immediate child is a tbody. How can I write this with XPath or Nokogiri?

```ruby
doc.search("//table/").each do |j|
  new_parent = Nokogiri::XML::Node.new('tbody', doc)
  j.replace new_parent
  new_parent << j
end
```

```ruby
require 'rubygems'
require 'nokogiri'

html = Nokogiri::HTML(DATA)
html.xpath('//table').each do |table|
  # Remove all existing tbody tags to avoid nesting them.
  table.xpath('tbody').each do |existing_tbody|
    existing_tbody.swap(existing_tbody.children)
  end
  tbody = html.create_element('tbody')
  tbody.children = table.children
  table.children = tbody
end
puts html.xpath(
```

How to extract text from <script> tag by using nokogiri and mechanize?

落花浮王杯 · submitted on 2019-12-06 09:27:38
This is a part of the source code of a bookings web site:

```html
<script>
booking.ensureNamespaceExists('env');
booking.env.b_map_center_latitude = 53.36480155016638;
booking.env.b_map_center_longitude = -2.2752803564071655;
booking.env.b_hotel_id = '35523';
booking.env.b_query_params_no_ext = '?label=gen173nr-17CAEoggJCAlhYSDNiBW5vcmVmaFCIAQGYAS64AQTIAQTYAQHoAQH4AQs;sid=e1c9e4c7a000518d8a3725b9bb6e5306;dcid=1';
</script>
```

And I want to extract booking.env.b_hotel_id, so that I would get the value '35523'. How do I achieve this with Nokogiri and Mechanize? Hope somebody can help! Thanks! :) Jason

How do I scrape HTML between two HTML comments using Nokogiri?

纵饮孤独 · submitted on 2019-12-06 09:12:48
I have some HTML pages where the contents to be extracted are marked with HTML comments like below:

```html
<html>
.....
<!-- begin content -->
<div>some text</div>
<div><p>Some more elements</p></div>
<!-- end content -->
...
</html>
```

I am using Nokogiri and trying to extract the HTML between the <!-- begin content --> and <!-- end content --> comments. I want to extract the full elements between these two HTML comments:

```html
<div>some text</div>
<div><p>Some more elements</p></div>
```

I can get the text-only version using this characters callback:

```ruby
class TextExtractor < Nokogiri::XML::SAX::Document
  def
```

How do I scrape data from a page that loads specific data after the main page load?

不羁的心 · submitted on 2019-12-06 09:09:33
I have been using Ruby and Nokogiri to pull data from a URL similar to this one from the Hollister website: http://www.hollisterco.com/webapp/wcs/stores/servlet/TrackDetail?storeId=10251&catalogId=10201&langId=-1&URL=TrackDetailView&orderNumber=1316358

My script looks like this right now:

```ruby
require 'rubygems'
require 'nokogiri'
require 'open-uri'

page = Nokogiri::HTML(open("http://www.hollisterco.com/webapp/wcs/stores/servlet/TrackDetail?storeId=10251&catalogId=10201&langId=-1&URL
```
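Data that appears only after the main page load is usually fetched by a second XHR request, which open-uri/Nokogiri never see in the initial HTML. The common fix is to find that request in the browser's network tab and fetch its URL directly; such endpoints often return JSON rather than HTML. A sketch with a hypothetical response body standing in for the tracking endpoint's reply (the real field names would come from inspecting the actual XHR):

```ruby
require 'json'

# Hypothetical JSON such an endpoint might return for this order.
response_body = '{"orderNumber":"1316358","status":"Shipped"}'

data = JSON.parse(response_body)
puts data['status']
```

If the endpoint cannot be isolated, the alternative is a tool that executes JavaScript (e.g. a headless browser) instead of plain open-uri.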

how to get horizontal depth of a node?

冷暖自知 · submitted on 2019-12-06 09:02:01
Note: I made up the term "horizontal depth" to measure the sub-dimension of a node within a tree. So imagine a `<td>` which would have an XPath something like `/html/table/tbody/tr/td`, and a "horizontal depth" of 5. I am trying to see if there is a way to identify and select elements based on this horizontal depth. How can I find the maximum depth?

If you need all the nodes with depth >= 5: `/*/*/*/*//*`

And if you need all the nodes with depth == 5: `/*/*/*/*/*`

Actually, there is an XPath function `count`, which you can combine with the `ancestor` axis: `//*[count(ancestor::*) >= 4]`

I think that "vertical depth" and

nokogiri xpath attribute - strange results

人走茶凉 · submitted on 2019-12-06 08:04:45
I have a bunch of fields, and when I try to run:

```ruby
src.xpath('//RECORD').each do |record|
  tbegin = record.xpath('//FIELD/TOKEN')
```

the tbegin array returns the fields from other records. I've checked that the first line is giving me the appropriate array of "record" subtrees, but the next call for tbegin doesn't limit the search to just the "record" subtree. In fact, it consistently returns the field subtree of record[0]. Thus far, I've gotten around this by using:

```ruby
tbegin = record.css('TOKEN')
```

but I want to understand what I'm doing wrong. The problem is the leading double-slash in xpath('//FIELD

Ruby 2 Upgrade Breaks Nokogiri and/or open-uri Encoding?

好久不见. · submitted on 2019-12-06 05:42:13
I have a mystery to solve when upgrading our Rails 3.2 / Ruby 1.9 app to a Rails 3.2 / Ruby 2.1.2 one. Nokogiri seems to break, in that it changes its behavior using open-uri. No gem versions are changed, just the Ruby version (this is all on OS X Mavericks, using brew, gcc4, etc.). Steps to reproduce:

```
$ ruby -v
ruby 1.9.3p484 (2013-11-22 revision 43786) [x86_64-darwin13.1.0]
$ rails console
Connecting to database specified by database.yml
Loading development environment (Rails 3.2.18)
> feed = Nokogiri::XML(open(URI.encode("http://anyblog.wordpress.org/feed/")))
=> #(Document:0x3fcb82f08448 { name =
```

Extracting elements with Nokogiri

99封情书 · submitted on 2019-12-06 05:23:29
I was wondering if someone could help out with the following. I am using Nokogiri to scrape some data from http://www.bbc.co.uk/sport/football/tables. I would like to get the league table info; so far I've got this:

```ruby
def get_league_table
  # Get me Premier League Table
  doc = Nokogiri::HTML(open(FIXTURE_URL))
  table = doc.css('.table-stats')
  teams = table.xpath('following-sibling::*[1]').css('tr.team')
  teams.each do |team|
    position = team.css('.position-number').text.strip
    League.create!(position: position)
  end
end
```

So I thought I would grab the .table-stats and then get each row in the table with a

Adjusting timeouts for Nokogiri connections

跟風遠走 · submitted on 2019-12-06 04:41:04
Why does Nokogiri wait a couple of seconds (3-5) when the server is busy and I'm requesting pages one by one, but when these requests are in a loop, Nokogiri does not wait and throws the timeout message? I'm wrapping the request in a timeout block, but Nokogiri does not wait for that time at all. Any suggested procedure on this?

```ruby
# this is a method from the eng class
def get_page(url, page_type)
  begin
    timeout(10) do
      # Get a Nokogiri::HTML::Document for the page we’re interested in...
      @@doc = Nokogiri::HTML(open(url))
    end
  rescue Timeout::Error
    puts "Time out connection request"
    raise
  end
end
# this
```

How to click link in Mechanize and Nokogiri?

那年仲夏 · submitted on 2019-12-06 04:12:00
I'm using Mechanize to scrape Google Wallet for order data. I am capturing all the data from the first page; however, I need to automatically link to subsequent pages to get more info. The #purchaseOrderPager-pagerNextButton will move to the next page so I can pick up more records to capture. The element looks like this; I need to click on it to keep going:

```html
<a id="purchaseOrderPager-pagerNextButton" class="kd-button small right" href="purchaseorderlist?startTime=0&... ;currentPageStart=1
```