hpricot

XML => HTML with Hpricot and Rails

倾然丶 夕夏残阳落幕 posted on 2019-12-14 02:29:37
Question: I have never worked with web services and Rails, and this is obviously something I need to learn. I chose Hpricot because it looks great. _why has been nice enough to provide the following example on the Hpricot website:
    #!ruby
    require 'hpricot'
    require 'open-uri'
    # load the RedHanded home page
    doc = Hpricot(open("http://redhanded.hobix.com/index.html"))
    # change the CSS class on links
    (doc/"span.entryPermalink").set("class", "newLinks")
    # remove the sidebar
    (doc/"#sidebar"

Parse XML with JRuby (Hpricot?) with tags like <foo.bar>

我怕爱的太早我们不能终老 posted on 2019-12-13 15:11:29
Question: I'm trying to consume some legacy XML in JRuby with elements like this:
    <x-doc attr="value">
      <nested>
        <with.dot>content</with.dot >
      </nested>
    </x-doc>
I've been working with Hpricot, but its HTML-oriented shortcuts are working against me: doc.search("//with.dot") seems to look for <with class="dot" />. (I ran into this problem with jQuery too, a few years ago.) Can I do this with Hpricot, or do I need to use a different library?
Answer 1: Check out Nokogiri. It's said to be "A Faster,
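The dotted element name is only a problem for CSS-flavoured shortcuts; in plain XPath, with.dot is an ordinary element name. A minimal sketch of that point, assuming REXML (bundled with Ruby) as a stand-in XML parser rather than Hpricot or Nokogiri; the constant LEGACY_XML and the helper dotted_texts are hypothetical names introduced for illustration:

```ruby
require 'rexml/document'

# The sample document from the question, including the stray space
# inside the closing tag.
LEGACY_XML = <<~XML
  <x-doc attr="value">
    <nested>
      <with.dot>content</with.dot >
    </nested>
  </x-doc>
XML

# Hypothetical helper: return the text of every element with the given
# name. The name is used as an XPath name test, so a dot in it is just
# part of the element name and not a class selector.
def dotted_texts(xml, name)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "//#{name}").map(&:text)
end

puts dotted_texts(LEGACY_XML, "with.dot").inspect
```

Whether the same XPath string behaves this way in Hpricot depends on how that library interprets the dot, which is exactly what the question is asking.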

Issue with unclosed img tag

夙愿已清 posted on 2019-12-12 12:46:58
Question: Data is presented in HTML format and submitted to a server, which does some preprocessing. The preprocessing operates on the "src" attribute of "img" tags. After preprocessing and saving, none of the preprocessed "img" tags are self-closed. For example, if the "img" tag was:
    <img src="image.png" />
then after preprocessing with Nokogiri or Hpricot it becomes:
    <img src="/preprocessed_path/image.png">
The code is pretty simple:
    doc = Hpricot(self.content)
    doc.search("img").each do |tag|
      preprocess tag
    end
    self

Using Ruby with Mechanize to log into a website

流过昼夜 posted on 2019-12-12 08:09:42
Question: I need to scrape data from a site, but it requires me to log in first. I've been using Hpricot to scrape other sites successfully, but I'm new to Mechanize and truly baffled by how to work it. I see this example commonly quoted:
    require 'rubygems'
    require 'mechanize'
    a = Mechanize.new
    a.get('http://rubyforge.org/') do |page|
      # Click the login link
      login_page = a.click(page.link_with(:text => /Log In/))
      # Submit the login form
      my_page = login_page.form_with(:action => '/account/login

Scraping hidden HTML (when visible = false) using Hpricot (Ruby on Rails)

|▌冷眼眸甩不掉的悲伤 posted on 2019-12-12 05:48:35
Question: I've come across an issue that I can't seem to get past. I'm also brand new to Ruby on Rails, hence the number of questions. I am attempting to scrape a webpage such as the following: http://www.yellowpages.com.mt/Malta/Grocers-Mini-Markets-Retail-In-Malta-Gozo.aspx I would like to scrape the addresses, the phone numbers, and the URL of the next page, which in this case is http://www.yellowpages.com.mt/Malta/Grocers-Mini-Markets-Retail-In-Malta-Gozo+Ismol.aspx I've been trying

Installing hpricot for JRuby

寵の児 posted on 2019-12-11 06:42:44
Question: I'm trying out Cucumber for JRuby on Rails. One of its prerequisites is Webrat, which in turn requires Hpricot. I installed the Hpricot gem using:
    gem install hpricot --source http://code.whytheluckystiff.net --version 0.6.1 --platform java
This installs the Java version of Hpricot. I added hpricot_scan.jar to the CLASSPATH, but when I run:
    cucumber features -n
I get the following output:
    HpricotScanService.java:931:in `hpricot_scan': java.lang.NoSuchMethodError: org.jruby

Hpricot - UTF-8 issues

拟墨画扇 posted on 2019-12-11 05:45:53
Question: I get the following error when running the code below:
    invalid byte sequence in UTF-8 (ArgumentError)
The code:
    require 'hpricot'
    require 'open-uri'
    doc = open('http://www.amazon.co.jp/') {|f| Hpricot(f.read) }
    puts doc.to_html
Hpricot cannot parse the Japanese content. Any suggestions on fixing this issue?
Answer 1: The site doesn't seem to be using UTF-8:
    <meta http-equiv="content-type" content="text/html; charset=Shift_JIS" />
Try this instead:
    open('http://www.amazon.co.jp/') {|f| Hpricot(f
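The answer's own code is cut off in this excerpt, so the following is not necessarily what it went on to suggest. It is a minimal sketch of the general idea, transcoding Shift_JIS bytes to UTF-8 before handing them to a parser, using only Ruby's built-in String#encode. RAW_BYTES is a hypothetical stand-in for what f.read would return, and to_utf8 is a helper name introduced for illustration:

```ruby
# Hypothetical stand-in for the raw bytes of a Shift_JIS page body.
RAW_BYTES = "日本語のページ".encode("Shift_JIS").force_encoding("ASCII-8BIT")

# Treating those bytes as UTF-8 is what triggers
# "invalid byte sequence in UTF-8".
puts RAW_BYTES.dup.force_encoding("UTF-8").valid_encoding?

# Hypothetical helper: label the bytes with the charset the page
# declares, then transcode to UTF-8.
def to_utf8(bytes, source_charset)
  bytes.dup.force_encoding(source_charset).encode("UTF-8")
end

puts to_utf8(RAW_BYTES, "Shift_JIS")
```

The source charset is passed in explicitly here; in practice it would come from the page's meta tag or the HTTP response headers.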

Hpricot, Get all text from document

孤人 posted on 2019-12-10 15:41:06
Question: I have just started learning Ruby. It's a very cool language and I like it a lot. I am using the very handy Hpricot HTML parser. I want to grab all the text from the page, excluding the HTML tags. Example:
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    <html>
    <head>
    <title>Data Protection Checks</title>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    </head>
    <body>
    <div> This is what I want to grab. </div>
    <p> I also want to grab this text </p>
    <

Tbody tag in xpath produced by fire bug

拜拜、爱过 posted on 2019-12-06 12:38:58
Question: I'm trying to extract some data from online HTML pages using the Ruby Hpricot library. I use the Firefox extension Firebug to get the XPath of a selected item. The generated XPath expression always contains an extra tbody tag. In some cases I must remove the tbody tag from the expression to get results, while in other cases I must keep it. I can't figure out when to keep the tbody tag and when not to.
Answer 1: In order to take into account and avoid this
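Some background on the symptom: browsers insert a tbody element into the DOM when the table markup omits it, so a browser tool reports an XPath for the DOM rather than for the raw source a scraper downloads. The answer is cut off here, so the following is only one possible workaround and not necessarily the one it proposed: a path that uses the descendant axis and matches rows with or without tbody. The sketch assumes REXML on well-formed snippets instead of Hpricot, and all names in it are hypothetical:

```ruby
require 'rexml/document'

# Two versions of the same table: as written in the source, and as a
# browser would expose it in the DOM.
WITHOUT_TBODY = "<table><tr><td>cell</td></tr></table>"
WITH_TBODY    = "<table><tbody><tr><td>cell</td></tr></tbody></table>"

# A path copied from a browser tool versus one that tolerates both forms.
STRICT_PATH   = "/table/tbody/tr/td"
TOLERANT_PATH = "/table//tr/td"

# Hypothetical helper: text of every node the XPath selects.
def cell_texts(markup, xpath)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, xpath).map(&:text)
end

puts cell_texts(WITHOUT_TBODY, STRICT_PATH).inspect
puts cell_texts(WITHOUT_TBODY, TOLERANT_PATH).inspect
puts cell_texts(WITH_TBODY, TOLERANT_PATH).inspect
```

The tolerant path trades precision for robustness: it would also match rows of a table nested inside the outer one.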

Convert HTML to plain text and maintain structure/formatting, with ruby

纵然是瞬间 posted on 2019-12-05 11:07:55
Question: I'd like to convert HTML to plain text. I don't want to just strip the tags, though; I'd like to intelligently retain as much formatting as possible: inserting line breaks for <br> tags, detecting paragraphs and formatting them as such, etc. The input is pretty simple, usually well-formatted HTML (not entire documents, just a bunch of content, usually with no anchors or images). I could put together a couple of regexes that get me 80% of the way there, but figured there might be some existing solutions with more intelligence.
Answer 1: First, don't try to use regex for this. The odds are really good you'll come up with