Using SAX Parser to get several sub-nodes?

Submitted by 最后都变了 on 2019-12-23 04:38:18

Question


I have a large local XML file (24 GB) with a structure like this:

<id>****</id>
<url> ****</url> (several times within an id...)

I need a result like this:

id1;url1
id1;url2
id1;url3
id2;url4
....

I want to use Nokogiri, either with the SAX parser or the Reader, since I can't load the whole file into memory. I am running the code from a Ruby Rake task.

My code with SAX is:

task :fetch_saxxml => :environment do

  require 'nokogiri'
  require 'open-uri'

  class MyDocument < Nokogiri::XML::SAX::Document
    attr_accessor :is_name

    def initialize
      @is_name = false
    end

    def start_element name, attributes = []
      @is_name = name.eql?("id")
    end

    def characters string
      string.strip!
      if @is_name and !string.empty?
        puts "ID: #{string}"
      end
    end

    def end_document
      puts "the document has ended"
    end

  end

  parser = Nokogiri::XML::SAX::Parser.new(MyDocument.new)
  parser.parse_file('/path_to_my_file.xml')

end

That is fine in order to fetch the IDs in the file but I need to fetch the URLs within each id node, too.

How do I put something like "each do" within that code to fetch the URLs and have an output like that shown above? Or is it possible to call several actions within "characters"?


Answer 1:


Here is a way to handle several kinds of node as they occur. One caveat with SAX parsers is that you have to find a way to handle special characters such as "&", but that is another story.

Here is my code:

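require 'nokogiri'

class MyDoc < Nokogiri::XML::SAX::Document
  # Placeholder names: replace 'yourvalue' with the enclosing element and
  # the SUBNODES entries with the child elements you want to read.
  SUBNODES = %w[your_subnodes here].freeze

  def start_element(name, attrs = [])
    @inside_content = true if name == 'yourvalue'
    @current_element = name
  end

  # Called for each chunk of text; one text node may arrive in several chunks.
  def characters(str)
    case @current_element
    when 'your_1st_subnode'
      # handle the first subnode here
    when 'your_2nd_subnode'
      # handle the second subnode here
    end
    puts "#{@current_element} - #{str}" if @inside_content && SUBNODES.include?(@current_element)
  end

  def end_element(name)
    @inside_content = false if name == 'yourvalue'
    @current_element = nil
  end
end

parser = Nokogiri::XML::SAX::Parser.new(MyDoc.new)
parser.parse_file('/path_to_your.xml')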


Source: https://stackoverflow.com/questions/14662728/using-sax-parser-to-get-several-sub-nodes
