hpricot | 易学教程

Tbody tag in xpath produced by fire bug

I'm trying to extract some data from online HTML pages using the Ruby Hpricot library. I use the Firefox extension Firebug to get the XPath of a selected item. There is always an extra tbody tag in the XPath expression it produces. In some cases I must remove the tbody tag from the expression to get results, while in other cases I must keep it. I can't figure out when to keep the tbody tag and when not to.

To account for this and avoid the problem, use XPath expressions of the following kind: /locStep1/locStep2/.../table/YourSubExpression |
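One way to avoid deciding case by case is to ask for both forms at once with an XPath union. The expression quoted above is cut off after the "|", so the sketch below assumes it continues with the same path plus a tbody step. The helper name tbody_tolerant_xpath is hypothetical and is not part of Hpricot or Firebug.

```ruby
# Build an XPath union that matches a table's rows or cells whether or not
# the markup contains an explicit <tbody> wrapper.
# ASSUMPTION: the truncated expression in the excerpt continues with the
# same path plus a tbody step. The helper name is hypothetical.
def tbody_tolerant_xpath(table_path, sub_expression)
  without_tbody = "#{table_path}/#{sub_expression}"
  with_tbody    = "#{table_path}/tbody/#{sub_expression}"
  "#{without_tbody} | #{with_tbody}"
end

xpath = tbody_tolerant_xpath("/html/body/div/table", "tr[2]/td")
puts xpath
```

The resulting string is plain XPath 1.0. Whether a particular parser accepts the union operator is worth checking before relying on it.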

hpricot with firebug's XPath

I'm trying to extract some info from a table-based website with Hpricot. I get the XPath with Firebug:

    /html/body/div/table/tbody/tr/td/table/tbody/tr[2]/td/table/tbody/tr/td[2]/table/tbody/tr[3]/td/table[3]/tbody/tr

This doesn't work. Apparently Firebug's XPath is the path in the rendered HTML, not in the actual HTML from the site. I read that removing tbody may resolve the problem, so I tried:

    /html/body/div/table/tr/td/table/tr[2]/td/table/tr/td[2]/table/tr[3]/td/table[3]/tr

That still doesn't work. I did a little more research, and some people report they get their XPath removing
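The manual edit described in the question, dropping every tbody step from the path Firebug reports, can be written as a small string transformation. This is a minimal sketch: strip_tbody is a hypothetical helper name, and, as the question itself notes, removing tbody alone may not be enough to make the path match.

```ruby
# Remove the tbody steps that the browser adds to the rendered DOM.
# strip_tbody is a hypothetical helper, not an Hpricot or Firebug API.
def strip_tbody(xpath)
  # Match "/tbody", with an optional positional predicate such as "[1]",
  # only when it is a whole path step.
  xpath.gsub(%r{/tbody(\[\d+\])?(?=/|\z)}, "")
end

firebug_path = "/html/body/div/table/tbody/tr/td/table/tbody/tr[2]/td" \
               "/table/tbody/tr/td[2]/table/tbody/tr[3]/td/table[3]/tbody/tr"
puts strip_tbody(firebug_path)
```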

Using Ruby with Mechanize to log into a website

I need to scrape data from a site, but it requires my login first. I've been using Hpricot to successfully scrape other sites, but I'm new to using Mechanize, and I'm truly baffled by how to work it. I see this example commonly quoted:

    require 'rubygems'
    require 'mechanize'

    a = Mechanize.new
    a.get('http://rubyforge.org/') do |page|
      # Click the login link
      login_page = a.click(page.link_with(:text => /Log In/))

      # Submit the login form
      my_page = login_page.form_with(:action => '/account/login.php') do |f|
        f.form_loginname = ARGV[0]
        f.form_pw = ARGV[1]
      end.click_button

      my_page.links.each do |link|

How do I do a regex search in Nokogiri for text that matches a certain beginning?

Given:

    require 'rubygems'
    require 'nokogiri'

    value = Nokogiri::HTML.parse(<<-HTML_END)
    "<html>
      <body>
        <p id='para-1'>A</p>
        <div class='block' id='X1'>
          <h1>Foo</h1>
          <p id='para-2'>B</p>
        </div>
        <p id='para-3'>C</p>
        <h2>Bar</h2>
        <p id='para-4'>D</p>
        <p id='para-5'>E</p>
        <div class='block' id='X2'>
          <p id='para-6'>F</p>
        </div>
      </body>
    </html>"
    HTML_END

I want to do something like what I can do in Hpricot:

    divs = value.search('//div[@id^="para-"]')

How do I do a pattern search for elements in XPath style? Where would I find the documentation to help me? I didn't see this in the rdocs.

Aaron
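XPath 1.0 has no regular-expression matching, but its starts-with() function covers the "matches a certain beginning" case. The sketch below is illustrative rather than the answer from the original thread. It uses REXML (normally bundled with Ruby) instead of Nokogiri so that it runs without extra gems, works on a trimmed, well-formed copy of the markup, and selects p elements because those are the ones carrying the para- ids.

```ruby
require "rexml/document"

# Trimmed, well-formed copy of the markup from the question.
xml = <<~XML
  <html>
    <body>
      <p id="para-1">A</p>
      <div class="block" id="X1">
        <h1>Foo</h1>
        <p id="para-2">B</p>
      </div>
      <p id="para-3">C</p>
    </body>
  </html>
XML

doc = REXML::Document.new(xml)

# starts-with(@id, "para-") keeps elements whose id begins with "para-".
matches = REXML::XPath.match(doc, '//p[starts-with(@id, "para-")]')
ids = matches.map { |element| element.attributes["id"] }
ids.each { |id| puts id }
```

The same XPath string should also be usable with Nokogiri's xpath method on the parsed document from the question.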

Nokogiri vs Hpricot?

Which one would you choose? My important attributes are (not in order):

1. Support and future enhancements.
2. Community and general knowledge base (on the Internet).
3. Comprehensive (i.e., proven to parse a wide range of *.*ml pages).
4. Performance.
5. Memory footprint (runtime, not the code-base).

Marc-André Lafortune: Pick Nokogiri, for all points and especially point one: Hpricot is no longer maintained. Meta answer: see ruby-toolbox to get an idea of the popularity of different tools in a given area.

SztupY: Only pick Hpricot if you don't have, or can't install, LibXML on the computer you're using. If

Rails Bundler on windows refuses to install hpricot (even on manual gem install get Error: no such file to load — hpricot)

Upgraded to Rails 3, and using Bundler for gems, in a mixed-platform development group. I am on Windows. When I run bundle install it completes successfully but does not install hpricot. The hpricot line is:

    gem "hpricot", "0.8.3", :platform => :mswin

I also tried:

    gem "hpricot", :platform => :mswin

Both complete fine, but when I run "bundle show hpricot" I get:

    Could not find gem 'hpricot' in the current bundle.

If I start a Rails console and try "require 'hpricot'" I get:

    LoadError: no such file to load -- hpricot

I have manually installed hpricot as well, and still get the above error

Strip text from HTML document using Ruby

There are lots of examples of how to strip HTML tags from a document using Ruby; Hpricot and Nokogiri have inner_text methods that remove all HTML for you easily and quickly. What I am trying to do is the opposite: remove all the text from an HTML document, leaving just the tags and their attributes. I considered looping through the document setting inner_html to nil, but then you'd really have to do this in reverse, as the first element (root) has an inner_html of the entire rest of the
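Rather than assigning to inner_html from the root down, one option is to walk the tree and delete only the text nodes, which leaves every element and attribute in place. The sketch below is one possible approach, not the answer from the original thread. It uses REXML (normally bundled with Ruby) on a small well-formed fragment as a stand-in for Hpricot or Nokogiri, and strip_text! is a hypothetical helper name.

```ruby
require "rexml/document"

# Recursively delete text nodes, keeping elements and their attributes.
# strip_text! is a hypothetical helper, not part of Hpricot or Nokogiri.
def strip_text!(element)
  # texts returns a separate array, so deleting while iterating is safe.
  element.texts.each { |text| element.delete(text) }
  element.elements.each { |child| strip_text!(child) }
  element
end

doc = REXML::Document.new('<div class="note"><p id="intro">Hello <b>world</b>!</p></div>')
strip_text!(doc.root)
puts doc.root
```

With Nokogiri, removing the node set returned by xpath('//text()') should give a comparable result on real-world HTML.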