Question
I want to read an HTML file and extract only the tags from it.
- Read one character at a time from the file, ignoring everything until "<" is reached (ignore the "<" as well).
- Read one character at a time, appending each to a string until ">" or whitespace is reached (ignore the ">" as well).
doc.txt:

    <html>
    <body>
    <h1>This is test</h1>
    <h2> This is test 2<h2>
    </body>
    </html>

My attempt so far:

    with open('doc.txt', 'r') as f:
        all_lines = []
        # loop through all lines using f.readlines() method
        for line in f.readlines():
            new_line = []
            # this is how you would loop through each character
            for chars in line:
                new_line.append(chars)
            all_lines.append(new_line)
    print(all_lines)
I can iterate through the text file and get the following list:
[['<', 'h', 't', 'm', 'l', '>', '\n'], ['<', 'b', 'o', 'd', 'y', '>', '\n'], ['<', '/', 'b', 'o', 'd', 'y', '>', '\n'], ['<', '/', 'h', 't', 'm', 'l', '>']]
But the expected output is: [html,body,h1,/h1,/h2,/body,/html]
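For reference, the two reading steps described above could be sketched roughly as follows. This is only a minimal illustration, not a full HTML tokenizer: the function name `extract_tags` and the inline `sample` string are assumptions made for the example, and anything after whitespace inside a tag (such as attributes) is simply skipped.

```python
import io


def extract_tags(stream):
    """Collect tag names by reading one character at a time (hypothetical helper)."""
    tags = []
    while True:
        ch = stream.read(1)
        if ch == "":  # end of input
            break
        if ch != "<":  # step 1: ignore everything up to and including "<"
            continue
        name = []
        while True:  # step 2: collect characters until ">" or whitespace
            ch = stream.read(1)
            if ch == "" or ch == ">" or ch.isspace():
                break
            name.append(ch)
        if name:
            tags.append("".join(name))
    return tags


# Assumed sample mirroring the file in the question.
sample = (
    "<html>\n<body>\n<h1>This is test</h1>\n"
    "<h2> This is test 2<h2>\n</body>\n</html>"
)
print(extract_tags(io.StringIO(sample)))
```

To run it against the real file, an open file object can be passed instead of the `io.StringIO` wrapper, for example `with open('doc.txt') as f: print(extract_tags(f))`.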
Answer 1:
In [10]: re.findall('<(.*?)>', html)
Out[10]: ['html', 'body', 'h1', '/h1', 'h2', 'h2', '/body', '/html']
Simply use a regex, as above (with re imported and html holding the file's contents as a string), or an HTMLParser.
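The HTMLParser route is not spelled out in the answer, so here is one way it might look using the standard library's `html.parser` module. The class name `TagCollector`, the helper `collect_tags`, and the sample string are assumptions for illustration. Note that this parser reports tag names in lowercase and without attributes, so the result can differ slightly from the regex output.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records every start and end tag name in document order (hypothetical class)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Called for each opening tag; attributes are ignored here.
        self.tags.append(tag)

    def handle_endtag(self, tag):
        # Prefix closing tags with "/" to match the format asked for.
        self.tags.append("/" + tag)


def collect_tags(html):
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.tags


# Assumed sample; in practice html could come from open('doc.txt').read().
sample = "<html><body><h1>This is test</h1></body></html>"
print(collect_tags(sample))
```

Compared with the regex, the parser handles comments and attribute values containing ">" more gracefully, at the cost of a little more code.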
Source: https://stackoverflow.com/questions/52239686/extract-html-tags-from-a-text-file-through-iteration-and-append-them-to-a-list-a