Parsing an invalid anchor tag with BeautifulSoup or regex

Question

I want to parse a raw document containing HTML anchor tags, but unfortunately it contains invalid tags such as:

<a href="A 4"drive bay">some text here</a>

I know the href value may not be an actual link, but let's leave it that way. What I need is to retrieve the href value 'A 4"drive bay' and the link text 'some text here'.

I am using Python and have tried the BeautifulSoup library, which works well for retrieving all the anchor tags. The problem is that it raises an error when it encounters the invalid anchor tag shown above, where the href value contains a '"'. Such cases exist in the original data I am parsing, and modifying that data is not an option.

A section of my python code using BeautifulSoup is:

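from bs4 import BeautifulSoup

sub_s = BeautifulSoup(line, 'html.parser')
for l in sub_s.find_all('a'):
    # replace each anchor tag with its text content
    l.replace_with(l.get_text())
print(str(sub_s), end='')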

The code just replaces each anchor tag with its plain text.

If someone could help me with this problem I would really appreciate it. A regex would also do.

Answer 1:

I guess you could pre-filter your input text through a regular expression to correct this particular problem. Something like:

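>>> import re
>>> text = '<a href="A 4"drive bay">some text here</a>'
>>> r = re.compile(r'<a[^>]+href="([^>]+)">')
>>> m = r.search(text)
>>> m.group(1)
'A 4"drive bay'
>>> # a function replacement fixes each matched tag using its own href value
>>> r.sub(lambda m: '<a href="%s">' % m.group(1).replace('"', ' '), text)
'<a href="A 4 drive bay">some text here</a>'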

This isn't a complete solution; just an idea of how to move forward.
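
Building on that idea, a regex can also pull out the href value and the link text directly, without repairing the markup first. The following is only a minimal sketch: the names ANCHOR_RE and extract_anchors are hypothetical, and it assumes that href is the last attribute in the tag and that neither the href value nor the link text contains a '>' that would confuse the pattern.

```python
import re

# Assumption: href is the last attribute of the tag, so the value ends
# at the first '">' that follows it. Embedded double quotes are allowed.
ANCHOR_RE = re.compile(
    r'<a\b[^>]*?\bhref="(.*?)">(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)

def extract_anchors(text):
    """Return a list of (href, link_text) tuples found in text."""
    return ANCHOR_RE.findall(text)

line = '<a href="A 4"drive bay">some text here</a>'
print(extract_anchors(line))  # [('A 4"drive bay', 'some text here')]
```

If the tags can carry attributes after href, the captured value would include them, so the pattern would need to be adapted to the actual data.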

Answer 2:

SELFHTML 8.1.2 (an HTML documentation that is very widely used in Germany) recommends the following for anchor names:

First character: a Latin letter (a-z, A-Z)
Subsequent characters: a Latin letter, a digit (0-9), -, _ or .

I use the following regex to find names that violate the first requirement:

name="[^a-zA-Z]

(N.B. a leading space before name does not seem to matter much; the pattern works in most regex implementations, e.g. the TextPad editor from Helios.)

To make the work easier I also have a regex for the second requirement. It also catches one-character anchor names (which are valid), but it helps to identify possible problems:

name=".?[^a-zA-Z0-9_\.-][^"]*"

I find most other problems with a syntax checker.
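
The two naming rules can also be checked directly in Python by validating each whole name rather than searching for violations. This is a rough sketch, not the patterns used above: VALID_NAME, NAME_ATTR and invalid_anchor_names are hypothetical names, and it assumes that name values are double-quoted and contain no embedded quotes.

```python
import re

# Rule 1: the first character is a Latin letter.
# Rule 2: the remaining characters are Latin letters, digits, '-', '_' or '.'.
VALID_NAME = re.compile(r'[A-Za-z][A-Za-z0-9_.-]*')

# Assumption: name values are double-quoted and contain no embedded quotes.
NAME_ATTR = re.compile(r'name="([^"]*)"')

def invalid_anchor_names(html):
    """Return the name values in html that break either rule."""
    return [n for n in NAME_ATTR.findall(html) if not VALID_NAME.fullmatch(n)]

sample = '<a name="top"></a><a name="1st"></a><a name="sec tion"></a><a name="a"></a>'
print(invalid_anchor_names(sample))  # ['1st', 'sec tion']
```

Because each name is matched in full, a one-character name such as "a" is accepted and a bad character anywhere in the name is reported.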

Source: https://stackoverflow.com/questions/10487494/parsing-invalid-anchor-tag-with-beautifulsoup-or-regex

Tags

python

regex

parsing

html-parsing

beautifulsoup