Convert tabbed text to html unordered list?

前端未结

关注

 4  634

I\'m a beginner programmer so this question might sound trivial: I have some text files containg tab-delimited text like:


                      
              相关标签:


      
      
        
          4条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  孤城傲影        
                
              
                            
                2021-01-03 03:58
              
            
            
                                                                       
The algorithm is simple. You take the depth level of a line that is indicated with a tab \t and shift the next bullet to the right \t+\t or to the left \t\t-\t or leave it at the same level \t. 

Make sure your "in.txt" contains tabs or replace indent with tabs if you copy it from here. If indent is made of blank spaces nothing works. And the separator is a blank line at the end. You can change it in the code, if you want.

J.F. Sebastian's solution is fine but doesn't process unicode.

Create a text file "in.txt" in UTF-8 encoding:

qqq
    www
    www
        яяя
        яяя
    ыыы
    ыыы
qqq
qqq


and run the script "ul.py". The script will create the "out.html" and open it in Firefox.

#!/usr/bin/python
# -*- coding: utf-8 -*-

# The script exports a tabbed list from string into a HTML unordered list.

import io, subprocess, sys

f=io.open('in.txt', 'r',  encoding='utf8')
s=f.read()
f.close()

#---------------------------------------------

def ul(s):

    L=s.split('\n\n')

    s='<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">\n\
<html><head><meta content="text/html; charset=UTF-8" http-equiv="content-type"><title>List Out</title></head><body>'

    for p in L:
        e=''
        if p.find('\t') != -1:

            l=p.split('\n')
            depth=0
            e='<ul>'
            i=0

            for line in l:
                if len(line) >0:
                    a=line.split('\t')
                    d=len(a)-1

                    if depth==d:
                        e=e+'<li>'+line+'</li>'


                    elif depth < d:
                        i=i+1
                        e=e+'<ul><li>'+line+'</li>'
                        depth=d


                    elif depth > d:
                        e=e+'</ul>'*(depth-d)+'<li>'+line+'</li>'
                        depth=d
                        i=depth


            e=e+'</ul>'*i+'</ul>'
            p=e.replace('\t','')

            l=e.split('<ul>')
            n1= len(l)-1

            l=e.split('</ul>')
            n2= len(l)-1

            if n1 != n2:
                msg='<div style="color: red;">Wrong bullets position.<br>&lt;ul&gt;: '+str(n1)+'<br>&lt;&frasl;ul&gt;: '+str(n2)+'<br> Correct your source.</div>'
                p=p+msg

        s=s+p+'\n\n'

    return s

#-------------------------------------      

def detach(cmd):
    process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
    sys.exit()

s=ul(s)

f=io.open('out.html', 'w',  encoding='utf8')
s=f.write(s)
f.close()

cmd='firefox out.html'
detach(cmd)


HTML will be:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html><head><meta content="text/html; charset=UTF-8" http-equiv="content-type"><title>List Out</title></head><body><ul><li>qqq</li><ul><li>www</li><li>www</li><ul><li>яяя</li><li>яяя</li></ul><li>ыыы</li><li>ыыы</li></ul><li>qqq</li><li>qqq</li></ul>

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  眼角桃花        
                
              
                            
                2021-01-03 04:10
              
            
            
                                                                       
tokenize module understands your input format: lines contain a valid Python identifiers, the indentation level of the statements is significant. ElementTree module allows you to manipulate tree structures in memory so it might be more flexable to separate a tree creation from a rendering it as html:

from tokenize import NAME, INDENT, DEDENT, ENDMARKER, NEWLINE, generate_tokens
from xml.etree import ElementTree as etree

def parse(file, TreeBuilder=etree.TreeBuilder):
    tb = TreeBuilder()
    tb.start('ul', {})
    for type_, text, start, end, line in generate_tokens(file.readline):
        if type_ == NAME: # convert name to <li> item
            tb.start('li', {})
            tb.data(text)
            tb.end('li')
        elif type_ == NEWLINE:
            continue
        elif type_ == INDENT: # start <ul>
            tb.start('ul', {})
        elif type_ == DEDENT: # end </ul>
            tb.end('ul')
        elif type_ == ENDMARKER: # done
            tb.end('ul') # end parent list
            break
        else: # unexpected token
            assert 0, (type_, text, start, end, line)
    return tb.close() # return root element


Any class that provides .start(), .end(), .data(), .close() methods can be used as a TreeBuilder e.g., you could just write html on the fly instead of building a tree.

To parse stdin and write html to stdout you could use ElementTree.write():

import sys

etree.ElementTree(parse(sys.stdin)).write(sys.stdout, method='html')


Output:

<ul><li>A</li><ul><li>B</li><li>C</li><ul><li>D</li><li>E</li></ul></ul></ul>


You can use any file, not just sys.stdin/sys.stdout.

Note: To write to stdout on Python 3 use sys.stdout.buffer or encoding="unicode" due to  bytes/Unicode distinction. 
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  失恋的感觉        
                
              
                            
                2021-01-03 04:14
              
            
            
                                                                       
I think the algorithm goes like this:


keep track of the current indentation level (by counting the number of tabs per line)
if the indentation level increase: emit <ul> <li>current item</li>
if the indentation level decreases: emit <li>current item</li></ul>
if the indentation level remains the same: emit <li>current item</li>


Putting this into code is left to the OP as exercise
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  暗喜        
                
              
                            
                2021-01-03 04:16
              
            
            
                                                                       
Try this (works on your test case):

import itertools
def listify(filepath):
    depth = 0
    print "<ul>"*(depth+1)
    for line in open(filepath):
        line = line.rstrip()
        newDepth = sum(1 for i in itertools.takewhile(lambda c: c=='\t', line))
        if newDepth > depth:
            print "<ul>"*(newDepth-depth)
        elif depth > newDepth:
            print "</ul>"*(depth-newDepth)
        print "<li>%s</li>" %(line.strip())
        depth = newDepth
    print "</ul>"*(depth+1)


Hope this helps
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复