Parsing an invalid anchor tag with BeautifulSoup or regex

醉酒当歌 submitted on 2019-12-13 02:53:52

Question


I want to parse a raw document containing HTML anchor tags, but unfortunately it contains invalid tags such as:

<a href="A 4"drive bay">some text here</a>

I know the href value may not be an actual link, but let's leave it that way. What I need is to retrieve the href value 'A 4"drive bay' and the link text 'some text here'.

I am using Python and have tried the BeautifulSoup library, which works well for retrieving all the anchor tags. The problem is that it reports an error when it encounters the invalid anchor tag shown above, where the href value contains a '"'. Such cases exist in the original data I am parsing, and modifying that data is not an option.

A section of my Python code using BeautifulSoup is:

from bs4 import BeautifulSoup

sub_s = BeautifulSoup(line, 'html.parser')
for l in sub_s.find_all('a'):
    # replace each anchor with its link text
    l.replace_with(l.get_text())
print(str(sub_s), end=' ')

The code just replaces each anchor tag with its plain text.

If someone could help me with this problem I would really appreciate it. A regex would also do.


Answer 1:


I guess you could pre-filter your input text through a regular expression to correct this particular problem. Something like:

>>> import re
>>> text = '<a href="A 4"drive bay">some text here</a>'
>>> r = re.compile(r'''<a[^>]+href="([^>]+)">''')
>>> m = r.match(text)
>>> m.group(1)
'A 4"drive bay'
>>> r.sub('<a href="%s">' % m.group(1).replace('"', ' '), text)
'<a href="A 4 drive bay">some text here</a>'

This isn't a complete solution; just an idea of how to move forward.
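
A minimal sketch of how this idea might be extended, under stated assumptions: the pattern is applied per anchor through a replacement function, so each tag is repaired from its own href instead of from the first match, and the same pattern is reused to read out the href value and link text directly. The names ANCHOR_RE, extract_anchors and fix_anchors are hypothetical. The sketch assumes href is the last attribute in the opening tag and that the value contains no '>'. Escaping the stray quote as &quot; rather than replacing it with a space is a choice made here, so that an HTML parser can decode it back to the original value.

```python
import re

# Assumption: href is the LAST attribute, so its value ends at the `">`
# that closes the opening tag, and the value itself contains no '>'.
ANCHOR_RE = re.compile(
    r'(<a\b[^>]*?\bhref=")([^>]*)(">)(.*?)</a>',
    re.DOTALL | re.IGNORECASE,
)


def extract_anchors(text):
    """Return (href, link_text) pairs, keeping stray quotes in the href."""
    return [(m.group(2), m.group(4)) for m in ANCHOR_RE.finditer(text)]


def fix_anchors(text):
    """Escape stray double quotes inside each href so the tag becomes valid."""
    def repl(m):
        href = m.group(2).replace('"', '&quot;')
        return m.group(1) + href + m.group(3) + m.group(4) + '</a>'
    return ANCHOR_RE.sub(repl, text)


sample = ('x <a href="A 4"drive bay">some text here</a> '
          'y <a class="c" href="B "2"">other</a>')
print(extract_anchors(sample))
print(fix_anchors(sample))
```

The repaired string can then be passed to a parser such as BeautifulSoup. A valid tag with attributes after href would break the stated assumption and be mangled, so this only suits input shaped like the example in the question.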




Answer 2:


SELFHTML 8.1.2 (an HTML documentation that is very widely used in Germany) recommends the following for anchor names:

  1. First character: a Latin letter (a-z, A-Z)
  2. Subsequent characters: Latin letters, digits (0-9), -, _ or .

I use the following regex to find names that violate the first requirement:

name="[^a-zA-Z]

(N.B.: the leading space does not seem to matter much; the pattern works in most regex implementations, e.g. the TextPad editor from Helios.)

To make the work easier, I also have a regex for the second requirement. It also catches one-character anchor names (which are valid), but it helps to identify possible problems:

name=".?[^a-zA-Z0-9_\.-][^"]*"

I find most other problems with a syntax checker.
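
As a rough sketch, the two naming rules above could also be checked in Python by validating each name value against a single positive pattern, which avoids flagging valid one-character names. VALID_NAME, NAME_ATTR and invalid_names are hypothetical names, and the sketch assumes name values are double-quoted and contain no embedded double quote.

```python
import re

# Rule 1: first character is a Latin letter.
# Rule 2: later characters are Latin letters, digits, '-', '_' or '.'.
VALID_NAME = re.compile(r'[A-Za-z][A-Za-z0-9_.-]*')

# Assumption: name values are double-quoted and contain no embedded '"'.
NAME_ATTR = re.compile(r'name="([^"]*)"')


def invalid_names(html):
    """Return every name="..." value that breaks either rule."""
    return [v for v in NAME_ATTR.findall(html) if not VALID_NAME.fullmatch(v)]


html = ('<a name="top"></a><a name="1st"></a><a name="a"></a>'
        '<a name="sec tion"></a><a name="x-1_.b"></a>')
print(invalid_names(html))
```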



Source: https://stackoverflow.com/questions/10487494/parsing-invalid-anchor-tag-with-beautifulsoup-or-regex
