Python how to search and correct html tags and attributes?

问题

I have to fix all the closing tags of the <img> tag as shown in the text below. Instead of closing the <img> with a >, it should close with />.

Is there any easy way to search for all the <img> in this text and fix the > ?

(If it is closed with a /> already then there is no action required).

Other question, if there is no "width" or "height" to the <img> specified, what is the best way to solve the issue?

Download all the images and get the corresponding attributes of width and height, then add them back to the string?

The correct <img> tag is the one that closes with /> and have the valid width & height.

<a href="http://www.cultofmac.com/daily-deals749-mac-mini-1199-3-0ghz-imac-new-mac-pros/52674"><img align="left" hspace="5" width="150" src="http://s3.dlnws.com/images/products/images/749000/749208-large" alt="" title=""></a>
Apple today unleashed a number of goodies, including giving iMacs and Mac Pros more oomph with new processors and increased storage options. We have those deals today, along with many more items for the Mac lover. Along with the refreshed line of iMacs and Mac Pros, we’ll also look at a number of software deals [...]
<p><a href="http://feedads.g.doubleclick.net/~a/DL_-gOGSR1JMzKDbErt1EG3re3I/0/da"><img src="http://feedads.g.doubleclick.net/~a/DL_-gOGSR1JMzKDbErt1EG3re3I/0/di" border="0" ismap></a><br>
<a href="http://feedads.g.doubleclick.net/~a/DL_-gOGSR1JMzKDbErt1EG3re3I/1/da"><img src="http://feedads.g.doubleclick.net/~a/DL_-gOGSR1JMzKDbErt1EG3re3I/1/di" border="0" ismap></a></p><img src="http://feeds.feedburner.com/~r/cultofmac/bFow/~4/Mq5iLOaT50k" height="1" width="1">

I really need to have width and height in the output, because it will be used as the input to other parser. And that parser says that the <img tag MUST close with a />. I am not using the output to view on the web page. Please suggest a simple solution to achieve this!

回答1:

For the sake of simplicity, I would outsource the potentially irritating issues around parsing (X)HTML to a dedicated library:

Here is a simple example with lxml.html:

import lxml.html

page = """<html>...</html>"""
page = lxml.html.document_fromstring(page)
lxml.html.tostring(page)

lxml.html has a really handy module clean, designed to remove malicious code. It's simple as well:

from lxml.html.clean import clean_html
clean_html(page)

回答2:

This is still the leading response for this google query, and perhaps it's because I didn't understand the question well enough.

What I was looking for (and perhaps what OP was looking for) was an xml dump instead of an html dump.

So to parse and get the output that I needed to properly hand it off, I used lxml.html like @Tim McNamara said.

import lxml.html
# read in the file
html_obj = lxml.html.fromstring(raw_html)
# whatever other dom manipulation you need to do
lxml.html.tostring(html_obj, method='xml')

回答3:

Well, <img ...> is correct HTML, <img .../> not. Dunno what HTML5 says, but XHTML is mostly dead before alive.

Nevertheless, I think the easiest thing would be a regular expression:

re.sub(r"<img(.*?)(?<!/)>", lambda m: "<img%s/>" % m.groups()[0],  html_code)

For the other things, well difficult. I would parse the code, add the tags to the img nodes and write the html from the ast. Parsing should be possible with http://code.google.com/p/html5lib/. But to have the valid height & width you have to read the images (use PIL) probably not worth the effort.

来源：https://stackoverflow.com/questions/3360968/python-how-to-search-and-correct-html-tags-and-attributes

标签

python

html

string

html-parsing