Custom indent width for BeautifulSoup .prettify()

一生所求 2020-12-15 17:29

Is there any way to define a custom indent width for the .prettify() function? From what I can get from its source -

def prettify(self, encoding=No         


        
4 Answers
  • 2020-12-15 17:43

    Here's a way to increase the indentation without modifying the library's own functions. Create the following function:

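    # Increase the indentation of 'text': every leading space gains 'n' extra spaces,
    # so prettify()'s one space per level becomes n + 1 spaces per level
    def add_indent(text, n):
        sp = " " * n
        # Use CRLF as the line separator if the text contains a carriage return, else LF
        lsep = "\r\n" if "\r" in text else "\n"
        lines = text.split(lsep)
        for i in range(len(lines)):
            spacediff = len(lines[i]) - len(lines[i].lstrip())
            if spacediff:
                lines[i] = sp * spacediff + lines[i]
        return lsep.join(lines)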
    

    Then convert the text you obtained using the above function:

    import bs4

    x = '''<section><article><h1></h1><p></p></article></section>'''
    soup = bs4.BeautifulSoup(x, 'html.parser')  # naming the parser explicitly avoids a bs4 warning
    text = soup.prettify()
    text = add_indent(text, 1)  # adds 1 space per level, for 2 spaces per level in total
    print(text)
    '''
    Output (html.parser does not add <html> or <body> wrappers):
    <section>
      <article>
        <h1>
        </h1>
        <p>
        </p>
      </article>
    </section>
    '''
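
A variant of the same idea, as a minimal sketch: instead of adding spaces to what is already there, it sets the width per level directly. The name `reindent` and its `base` parameter are made up for this example, and it assumes the input uses one space per nesting level, as prettify() output does. Whitespace at the start of lines inside <pre> blocks would be rescaled too, so treat this as cosmetic post-processing only.

```python
def reindent(text, width, base=1):
    """Return text with each run of leading spaces rescaled.

    Assumes the input uses `base` spaces per nesting level (prettify()
    emits one) and rewrites it to `width` spaces per level.
    """
    out = []
    # keepends=True preserves whatever line endings the input used.
    for line in text.splitlines(keepends=True):
        stripped = line.lstrip(" ")
        level = (len(line) - len(stripped)) // base
        out.append(" " * (level * width) + stripped)
    return "".join(out)


# Hand-written stand-in for prettify() output, so the sketch runs without bs4.
sample = "<section>\n <article>\n  <h1>\n  </h1>\n </article>\n</section>\n"
print(reindent(sample, 4))
```

With bs4 installed, the call would presumably be `reindent(soup.prettify(), 4)`.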
    
  • 2020-12-15 17:45

    As far as I can tell, this feature is not built in, as there are a handful of solutions out there for this problem.

    Assuming you are using BeautifulSoup 4, here are the solutions I came up with.

    Hardcode it in. This requires minimal changes and is fine if you don't need the indent to differ between circumstances:

    myTab = 4  # add this
    if pretty_print:
        # space = (' ' * (indent_level - 1))
        space = (' ' * (indent_level - myTab))
        # indent_contents = indent_level + 1
        indent_contents = indent_level + myTab
    

    One drawback of the hardcoded approach is that the text content won't be indented entirely consistently, although the result still looks reasonable. If you need a more flexible and consistent solution, you can modify the class itself.

    Find the prettify function and modify it as such (it is located in the Tag class in element.py):

    # Add the myTab keyword (or whatever you want to call it) to the function's parameters and set it to your preferred default.
    def prettify(self, encoding=None, formatter="minimal", myTab=2): 
        Tag.myTab= myTab # add a reference to it in the Tag class
        if encoding is None:
            return self.decode(True, formatter=formatter)
        else:
            return self.encode(encoding, True, formatter=formatter)
    

    And then scroll up to the decode method in the Tag class and make the following changes:

    if pretty_print:
        # space = (' ' * (indent_level - 1))
        space = (' ' * (indent_level - Tag.myTab))
        # indent_contents = indent_level + 1
        indent_contents = indent_level + Tag.myTab
    

    Then go to the decode_contents method in the Tag class and make these changes:

    #s.append(" " * (indent_level - 1))
    s.append(" " * (indent_level - Tag.myTab))
    

    Now BeautifulSoup('<root><child><desc>Text</desc></child></root>').prettify(myTab=4) will return:

    <root>
        <child>
            <desc>
                Text
            </desc>
        </child>
    </root>
    

    Note: there is no need to patch the BeautifulSoup class, since it inherits from the Tag class. Patching the Tag class is sufficient.

  • 2020-12-15 17:49

    I actually dealt with this myself, in the hackiest way possible: by post-processing the result.

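    import re
    r = re.compile(r'^( +)', re.MULTILINE)  # leading spaces only; \s would also swallow newlines on blank lines
    def prettify_2space(s, encoding=None, formatter="minimal"):
        # Re-indent the str output first: prettify(encoding) returns bytes, which a str pattern cannot match
        text = r.sub(r'\1\1', s.prettify(formatter=formatter))
        return text if encoding is None else text.encode(encoding, "xmlcharrefreplace")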
    

    Actually, I monkeypatched prettify_2space in place of prettify in the class. That's not essential to the solution, but let's do it anyway, and make the indent width a parameter instead of hardcoding it to 2:

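    import re
    import bs4
    orig_prettify = bs4.BeautifulSoup.prettify
    r = re.compile(r'^( +)', re.MULTILINE)  # leading spaces only
    def prettify(self, encoding=None, formatter="minimal", indent_width=4):
        # prettify() uses one space per level, so repeating the match indent_width times sets the width
        text = r.sub(r'\1' * indent_width, orig_prettify(self, formatter=formatter))
        return text if encoding is None else text.encode(encoding, "xmlcharrefreplace")
    bs4.BeautifulSoup.prettify = prettify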
    

    So:

    x = '''<section><article><h1></h1><p></p></article></section>'''
    soup = bs4.BeautifulSoup(x)
    print(soup.prettify(indent_width=3))
    

    … gives:

    <html>
       <body>
          <section>
             <article>
                <h1>
                </h1>
                <p>
                </p>
             </article>
          </section>
       </body>
    </html>
    

    Obviously if you want to patch Tag.prettify as well as BeautifulSoup.prettify, you have to do the same thing there. (You might want to create a generic wrapper that you can apply to both, instead of repeating yourself.) And if there are any other prettify methods, same deal.
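
The generic wrapper mentioned above could look roughly like this. It is only a sketch: `with_indent_width` and `FakeTag` are names invented for the example, and it assumes the wrapped method indents by one space per level and returns str (no encoding passed). `FakeTag` stands in for a bs4 class so the snippet runs without bs4 installed.

```python
import re

# Matches the run of spaces at the start of each line (MULTILINE makes ^ per-line).
_LEADING_SPACES = re.compile(r"^( +)", re.MULTILINE)


def with_indent_width(orig_prettify, default_width=4):
    """Wrap a prettify-style method so it accepts an indent_width keyword.

    Assumes orig_prettify indents by one space per level and returns str.
    """
    def prettify(self, *args, indent_width=default_width, **kwargs):
        text = orig_prettify(self, *args, **kwargs)
        # Repeat each leading run of spaces indent_width times.
        return _LEADING_SPACES.sub(lambda m: m.group(1) * indent_width, text)
    return prettify


# Stand-in class for illustration only; it is not part of bs4.
class FakeTag:
    def prettify(self):
        return "<a>\n <b>\n  text\n </b>\n</a>\n"


FakeTag.prettify = with_indent_width(FakeTag.prettify)
print(FakeTag().prettify(indent_width=3))
```

With bs4, the same wrapper would presumably be applied as `bs4.Tag.prettify = with_indent_width(bs4.Tag.prettify)`, once per class you want patched.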

  • 2020-12-15 17:54

    If you're using PyCharm, you can automatically reformat an HTML file by pressing:

    Ctrl + Alt + L

    while the prettified HTML file is open in the editor.

    This will change the indentation from one space to 4, or to whatever you set for HTML in Settings > Editor > Code Style > HTML (default = 4).

    https://www.jetbrains.com/pycharm/guide/tips/reformat-code/

    You would have to do this for every HTML file you run prettify() on.
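
If repeating that by hand for many files gets tedious, a small script can do a similar re-indent in bulk. This is a sketch using only the standard library, not a PyCharm feature: `reindent_html_files` is a made-up name, and it assumes the files contain prettify() output with one space per level. It is not idempotent, so running it twice scales the indentation twice.

```python
import re
from pathlib import Path

# Matches the run of spaces at the start of each line.
_LEADING_SPACES = re.compile(r"^( +)", re.MULTILINE)


def reindent_html_files(directory, width=4, pattern="*.html"):
    """Rewrite every matching file in place, scaling leading spaces by `width`.

    Assumes the files hold prettify() output (one space per level).
    Returns the list of paths that were changed.
    """
    changed = []
    for path in sorted(Path(directory).glob(pattern)):
        original = path.read_text(encoding="utf-8")
        updated = _LEADING_SPACES.sub(lambda m: m.group(1) * width, original)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
            changed.append(path)
    return changed
```

Calling `reindent_html_files("output_dir")` (the directory name is hypothetical) would then rewrite each `.html` file found directly in that folder.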

    Sorry if I'm reviving an old thread but I had the same problem and found a simple solution.
