nokogiri

Ruby Nokogiri Javascript Parsing

…衆ロ難τιáo~ submitted on 2019-12-01 03:31:41
Question: I need to parse an array out of a website. The part of the JavaScript I want to parse looks like this: _arPic[0] = "http://example.org/image1.jpg"; _arPic[1] = "http://example.org/image2.jpg"; _arPic[2] = "http://example.org/image3.jpg"; _arPic[3] = "http://example.org/image4.jpg"; _arPic[4] = "http://example.org/image5.jpg"; _arPic[5] = "http://example.org/image6.jpg"; I get the whole JavaScript with something like this: product_page = Nokogiri::HTML(open(full_url)) product_page.css("div#main
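One way to pull the URLs out of script text like the `_arPic` assignments quoted in the question is a regular expression over the raw string. This is a minimal sketch, not the asker's code: the `script` string is a hypothetical stand-in for whatever text the page's script element contains, and the pattern assumes every entry is a double-quoted string literal.

```ruby
# Hypothetical stand-in for the text of the <script> element.
script = <<~JS
  _arPic[0] = "http://example.org/image1.jpg";
  _arPic[1] = "http://example.org/image2.jpg";
  _arPic[2] = "http://example.org/image3.jpg";
JS

# Capture the quoted value of every `_arPic[n] = "..."` assignment.
# scan returns one single-element array per match, so flatten the result.
urls = script.scan(/_arPic\[\d+\]\s*=\s*"([^"]+)"/).flatten

puts urls
```

A regular expression is enough here because the assignments are simple literals; script that builds the array dynamically would need a JavaScript interpreter instead.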

Scraping an AngularJS application

半世苍凉 submitted on 2019-12-01 01:38:29
I'm scraping some HTML pages with Rails, using Nokogiri. I had some problems when I tried to scrape an AngularJS page because the gem is opening the HTML before it has been fully rendered. Is there some way to scrape this type of page? How can I have the page fully rendered before scraping it? If you're trying to scrape AngularJS pages in a fully generic fashion, then you're likely going to need something like what @tadman mentioned in the comments (PhantomJS): some type of headless browser that fully processes the AngularJS JavaScript and opens the DOM up to inspection afterwards. If you

Parse table using Nokogiri

拜拜、爱过 submitted on 2019-12-01 00:14:34
I would like to parse a table using Nokogiri. I'm doing it this way: def parse_table_nokogiri(html) doc = Nokogiri::HTML(html) doc.search('table > tr').each do |row| row.search('td/font/text()').each do |col| p col.to_s end end end Some of the tables I have contain rows like this: <tr> <td> Some text </td> </tr> ...and some contain rows like this: <tr> <td> <font> Some text </font> </td> </tr> My XPath expression works for the second scenario but not the first. Is there an XPath expression that I could use that would give me the text from the innermost node of the cell so that I can handle both scenarios?

Escape single quote in XPath with Nokogiri?

有些话、适合烂在心里 submitted on 2019-12-01 00:13:56
Question: I have an XPath query that looks like this, with both single and double quotes. How do I escape the apostrophe properly so that the query works? I tried: "//li[text()='Frank&apos;s car']" but it doesn't seem to do it for me. Any ideas? "//li[text()='Frank's car']" Answer 1: XPath doesn’t have any way of escaping special characters, so this is a little tricky. A solution in this specific case would be to use double quotes instead of single quotes in the XPath expression: text()="Frank's car" If you
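For strings that may contain both quote characters, a commonly used workaround is to build the literal with XPath's `concat()` function. The helper below is a sketch of that idea; `xpath_literal` is a hypothetical name, not part of Nokogiri.

```ruby
# Hypothetical helper: turn an arbitrary Ruby string into an XPath 1.0
# string expression, since XPath has no escape sequence for quotes.
def xpath_literal(str)
  # No apostrophe: single quotes are safe.
  return "'#{str}'" unless str.include?("'")
  # Apostrophe but no double quote: double quotes are safe.
  return "\"#{str}\"" unless str.include?('"')

  # Both kinds present: split on apostrophes and glue the pieces back
  # together with concat(), quoting each apostrophe with double quotes.
  parts = str.split("'", -1).map { |part| "'#{part}'" }
  "concat(#{parts.join(%q{, "'", })})"
end

puts xpath_literal("Frank's car")
puts xpath_literal(%q{Frank's "big" car})
```

The result can be interpolated into a query string, for example `"//li[text()=#{xpath_literal(name)}]"`, where `name` is whatever value is being searched for.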

Nokogiri issues with Ruby on Rails

こ雲淡風輕ζ submitted on 2019-11-30 23:35:16
I'm trying to install nokogiri on my machine but I am receiving the following error: Building native extensions. This could take a while... ERROR: Error installing nokogiri: ERROR: Failed to build gem native extension. current directory: /Users/username/.rbenv/versions/2.0.0-p481/lib/ruby/gems/2.0.0/gems/nokogiri-1.6.6.4/ext/nokogiri /Users/username/.rbenv/versions/2.0.0-p481/bin/ruby -r ./siteconf20151127-29540-11ahx4h.rb extconf.rb checking if the C compiler accepts ... *** extconf.rb failed *** Could not create Makefile due to some reason, probably lack of necessary libraries and/or headers

How do I wrap HTML untagged text with <p> tag using Nokogiri?

人走茶凉 submitted on 2019-11-30 21:51:21
I have to parse an HTML document into different new files. The problem is that there are text nodes which have not been wrapped in "<p>" tags; instead they have "<br>" tags at the end of each paragraph. I want to wrap this text in <p> tags using Nokogiri: <div id="f15"><b>Footnote 15</b>: Catullus iii, 12.</div> <div class="pgmonospaced pgheader"><br/> <br/> End of the Project abc<br/> <br/> *** END OF THIS PROJECT XYZ ***<br/> <br/> ***** This file should be named new file.html... *****<br/> <br/></div> After searching around some forums and doing some debugging locally, I have found

Nokogiri parse ajax-loaded content

[亡魂溺海] submitted on 2019-11-30 20:45:29
Question: Is it possible for Nokogiri to parse content loaded via AJAX? If not, how would I accomplish this? Answer 1: Nokogiri can't see the AJAX content because it isn't a JavaScript interpreter and, as a result, can't run the script or make the needed request. What you want is something like Watir, or one of its spinoffs, depending on your OS. They will launch a browser, which can process the JavaScript and the resulting AJAX request. Then, you can request the page's contents, and do your parsing of the DOM using

Preventing Nokogiri from escaping characters?

浪子不回头ぞ submitted on 2019-11-30 20:15:35
I have created a text node and inserted it into my document like so: #<Nokogiri::XML::Text:0x3fcce081481c "<%= stylesheet_link_tag 'style'%>">]> When I try to save the document with this: File.open('ng.html', 'w+'){|f| f << page.to_html} I get this in the actual document: &lt;%= stylesheet_link_tag 'style'%&gt; Is there a way to disable the escaping and save my page with my ERB tags intact? Thanks! You are obliged to escape some characters in text elements: " as &quot;, ' as &apos;, < as &lt;, > as &gt;, and & as &amp;. If you want your text verbatim, use a CDATA section, since everything inside a CDATA section is ignored by the parser.

Getting the siblings of a node with Nokogiri

六月ゝ 毕业季﹏ submitted on 2019-11-30 19:59:25
Is there a way to find a specific value in a node and then return all its sibling values? For example, I would like to find the id node that contains ID 5678 and then get the email address and all images associated with ID 5678. Nokogiri::XML.parse(File.open('info.xml')) Here's a sample XML file. <xmlcontainer> <details> <id>1234</id> <email>sdfsdf@sdasd.com</email> <image>images/1.jpg</image> <image>images/2.jpg</image> <image>images/3.jpg</image> </details> <details> <id>5678</id> <email>zzzz@zzz.com</email> <image>images/4.jpg</image> <image>images/5.jpg</image> </details> <details>

Converting nested hash into XML using nokogiri

烂漫一生 submitted on 2019-11-30 19:52:05
I have many levels of nested hashes, like: { :foo => 'bar', :foo1 => { :foo2 => 'bar2', :foo3 => 'bar3', :foo4 => { :foo5 => 'bar5' }}} How can I convert them into XML like this: <foo>bar</foo> <foo1> <foo2>bar2</foo2> <foo3>bar3</foo3> <foo4> <foo5>bar5</foo5> </foo4> </foo1> I have tried the xml.send method, but it converts the above nested hash to: <foo1 foo3="bar3" foo4="foo5bar5" foo2="bar2"/> <foo>bar</foo> How about this? class Hash def to_xml map do |k, v| text = Hash === v ? v.to_xml : v "<%s>%s</%s>" % [k, text, k] end.join end end h.to_xml #=> "<foo>bar</foo><foo1><foo2>bar2</foo2>