How do I pretty-print HTML with Nokogiri?

梦如初夏 2020-12-01 10:58

I wrote a web crawler in Ruby and I'm using Nokogiri::HTML to parse the page. I need to print the page out and while messing around in IRB I noticed a pr

7 Answers
  • 2020-12-01 11:06

    Why don't you try the pp method?

    require 'pp'
    pp some_var
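
    As a minimal sketch of what pp gives you, the snippet below pretty-prints a made-up hash (the page variable and its URLs are illustrative assumptions, not data from the question). The output is Ruby's inspect-style representation wrapped over several lines, which is useful for debugging but is not re-indented markup.

```ruby
require 'pp'

# A hypothetical nested structure standing in for data scraped from a page.
page = {
  title: "Example page",
  links: Array.new(8) { |i| "https://example.com/articles/#{i}" },
}

# pp wraps structures that do not fit on one line; short ones stay on one line.
pp page

# pretty_inspect returns the same text as a String instead of printing it.
text = page.pretty_inspect
```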
    
  • 2020-12-01 11:14

    You can try REXML:

    require "rexml/document"
    
    doc = REXML::Document.new(xml)
    doc.write($stdout, 2)
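
    A self-contained variant of the same idea, as a sketch: the xml string here is a hypothetical, well-formed input, and the result is collected in a String rather than written to $stdout. REXML is an XML parser, so it is likely to raise on real-world HTML that is not well-formed, and its pretty formatter moves text onto its own lines.

```ruby
require "rexml/document"

# Hypothetical input; it must be well-formed XML for REXML to accept it.
xml = "<section><h1>Title</h1><p>Intro</p></section>"

doc = REXML::Document.new(xml)

# Document#write appends to anything that responds to <<;
# the second argument is the indentation width.
pretty = String.new
doc.write(pretty, 2)

puts pretty
```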
    
  • 2020-12-01 11:20

    By "pretty printing" an HTML page, I presume you mean reformatting the HTML structure with proper indentation. Nokogiri doesn't support this; the pretty_print method is for the "pp" library and its output is useful for debugging only.

    Several projects understand HTML well enough to reformat it without destroying whitespace that is actually significant (the best known is HTML Tidy). A quick search also turned up a post titled "Pretty printing XHTML with Nokogiri and XSLT".

    It comes down to this:

    require 'nokogiri'

    xsl = Nokogiri::XSLT(File.open("pretty_print.xsl"))
    html = Nokogiri(File.open("source.html"))
    puts xsl.apply_to(html).to_s
    

    It requires you, of course, to download the linked XSL file to your filesystem. I've tried it very quickly on my machine and it works like a charm.

  • 2020-12-01 11:21

    The answer by @mislav is somewhat wrong. Nokogiri does support pretty-printing if you:

    • Parse the document as XML
    • Instruct Nokogiri to ignore whitespace-only nodes ("blanks") during parsing
    • Use to_xhtml or to_xml to specify pretty-printing parameters

    In action:

    html = '<section>
    <h1>Main Section 1</h1><p>Intro</p>
    <section>
    <h2>Subhead 1.1</h2><p>Meat</p><p>MOAR MEAT</p>
    </section><section>
    <h2>Subhead 1.2</h2><p>Meat</p>
    </section></section>'
    
    require 'nokogiri'
    doc = Nokogiri::XML(html, &:noblanks)
    puts doc
    #=> <?xml version="1.0"?>
    #=> <section>
    #=>   <h1>Main Section 1</h1>
    #=>   <p>Intro</p>
    #=>   <section>
    #=>     <h2>Subhead 1.1</h2>
    #=>     <p>Meat</p>
    #=>     <p>MOAR MEAT</p>
    #=>   </section>
    #=>   <section>
    #=>     <h2>Subhead 1.2</h2>
    #=>     <p>Meat</p>
    #=>   </section>
    #=> </section>
    
    puts doc.to_xhtml(indent: 3, indent_text: ".")
    #=> <section>
    #=> ...<h1>Main Section 1</h1>
    #=> ...<p>Intro</p>
    #=> ...<section>
    #=> ......<h2>Subhead 1.1</h2>
    #=> ......<p>Meat</p>
    #=> ......<p>MOAR MEAT</p>
    #=> ...</section>
    #=> ...<section>
    #=> ......<h2>Subhead 1.2</h2>
    #=> ......<p>Meat</p>
    #=> ...</section>
    #=> </section>
    
  • 2020-12-01 11:21

    This worked for me:

    pretty_html = Nokogiri::HTML(html).to_xhtml(indent: 3)
    

    I tried the REXML version above, but it corrupted some of my documents, and I would hate to bring XSLT into a new project. Both feel antiquated. :)

  • 2020-12-01 11:24

    I know I am extremely late to this question, but I'll leave an answer anyway. I tried all of the approaches above, and they work only to an extent.

    Nokogiri does reformat the HTML, but it pays no attention to where opening and closing tags end up, so the output is not really pretty-printed.

    I found a gem called htmlbeautifier that works like a charm. I hope other people who are still searching for the answer will find this valuable.
