Extract HTML tags from a text file through iteration, append them to a list, and ignore all other characters in Python

我只是一个虾纸丫 submitted on 2019-12-13 03:15:04

Question


I want to read an HTML file and extract only the tags from it.

  1. Read one character at a time from the file, ignoring everything up to and including "<".
  2. Read one character at a time, appending each character to a string until ">" or whitespace is reached (the ">" itself is ignored as well).
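
A minimal sketch of the two steps above, written as a small state machine. The function name extract_tags and the variable names are hypothetical and chosen only for illustration. The sketch assumes the whole file fits in memory as one string and that no attribute value contains "<".

```python
def extract_tags(text):
    """Collect tag names by scanning the text one character at a time."""
    tags = []
    current = None  # None while outside a tag, a list of characters while inside one

    for ch in text:
        if current is None:
            # Step 1: ignore everything up to and including "<".
            if ch == "<":
                current = []
        elif ch == ">" or ch.isspace():
            # Step 2 ends at ">" or whitespace; the delimiter itself is dropped.
            if current:
                tags.append("".join(current))
            current = None
        else:
            # Step 2: still inside the tag name, keep the character.
            current.append(ch)

    return tags


sample = "<html>\n<body>\n<h1>This is test</h1>\n<h2> This is test 2</h2>\n</body>\n</html>\n"
print(extract_tags(sample))
# ['html', 'body', 'h1', '/h1', 'h2', '/h2', '/body', '/html']
```

To run it on a file, something like extract_tags(open('doc.txt').read()) should work. Because scanning stops at the first whitespace, attributes such as href="..." are skipped and only the tag name is kept.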

    <html>
    <body>
    <h1>This is test</h1>
    <h2> This is test 2<h2>
    </body>
    </html>
    
    
    with open('doc.txt', 'r') as f:
        all_lines = []
        # loop through all lines using the f.readlines() method
        for line in f.readlines():
            new_line = []
            # loop through each character in the line
            for char in line:
                new_line.append(char)
            all_lines.append(new_line)

        print(all_lines)
    

I can iterate through the text file and get a list like the one below:

[['<', 'h', 't', 'm', 'l', '>', '\n'], ['<', 'b', 'o', 'd', 'y', '>', '\n'], ['<', '/', 'b', 'o', 'd', 'y', '>', '\n'], ['<', '/', 'h', 't', 'm', 'l', '>']]

but the expected output should be: [html, body, h1, /h1, /h2, /body, /html]


Answer 1:


In [10]: re.findall('<(.*?)>', html)
Out[10]: ['html', 'body', 'h1', '/h1', 'h2', 'h2', '/body', '/html']

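Simply use a regex or an HTMLParser. Here html is assumed to be a string holding the contents of the file, with the re module already imported.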
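
For the HTMLParser route, a minimal sketch using the standard library's html.parser module might look like the following. The class name TagCollector and the helper collect_tags are hypothetical names used only for illustration. Prefixing closing tags with "/" is a choice made here to match the output format asked for in the question.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record every opening and closing tag name in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Called for each opening tag; attributes are ignored.
        self.tags.append(tag)

    def handle_endtag(self, tag):
        # Called for each closing tag; add "/" to mirror the question's format.
        self.tags.append("/" + tag)


def collect_tags(html_text):
    """Feed the text to a fresh parser and return the collected tag names."""
    parser = TagCollector()
    parser.feed(html_text)
    parser.close()
    return parser.tags


sample = "<html>\n<body>\n<h1>This is test</h1>\n<h2> This is test 2</h2>\n</body>\n</html>\n"
print(collect_tags(sample))
# ['html', 'body', 'h1', '/h1', 'h2', '/h2', '/body', '/html']
```

Unlike the regex above, which captures everything between "<" and ">" (attributes included), the parser reports only the tag name and lowercases it.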



Source: https://stackoverflow.com/questions/52239686/extract-html-tags-from-a-text-file-through-iteration-and-append-them-to-a-list-a
