Find specific comments in HTML code using Python

Submitted by 南楼画角 on 2019-12-19 10:05:41

Question


I can't find a specific comment in Python, for example <!-- why -->. My main goal is to find all the links between two specific comments, something like a parser. I tried this with BeautifulSoup:

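from urllib import urlopen
from bs4 import BeautifulSoup

over = urlopen("http://www.gamespot.com").read()
soup = BeautifulSoup(over)
print soup.find("<!--why-->")  # prints None: find() matches tag names, not comments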

But it doesn't work. I think I might have to use a regex instead of BeautifulSoup.

Please help.

EXAMPLE: we have HTML code like this:

<!--why-->
www.godaddy.com
<p> nice one</p>
www.wwf.com
<!-- why not-->

EDIT: Between the 2 comments, other stuff, like tags, might exist.

And I need to store all the links.


Answer 1:


If you want all the comments, you can use findAll with a callable:

>>> from bs4 import BeautifulSoup, Comment
>>> 
>>> s = """
... <p>header</p>
... <!-- why -->
... www.test1.com
... www.test2.org
... <!-- why not -->
... <p>tail</p>
... """
>>> 
>>> soup = BeautifulSoup(s)
>>> comments = soup.findAll(text = lambda text: isinstance(text, Comment))
>>> 
>>> comments
[u' why ', u' why not ']

And once you've got them, you can use the usual tricks to move around:

>>> comments[0].next
u'\nwww.test1.com\nwww.test2.org\n'
>>> comments[0].next.split()
[u'www.test1.com', u'www.test2.org']

Depending on what the page actually looks like, you may have to tweak it a bit, and you'll have to choose which comments you want, but that should work to get you started.
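
One way the "tweak it a bit" step might look when tags sit between the two comments, as in the question's example: instead of reading only the first string after the opening comment, walk every following node until the closing comment is reached. This is a minimal sketch, assuming Python 3 and bs4 with the built-in "html.parser" backend. The names text_between_comments and looks_like_link, and the rule for what counts as a link, are hypothetical and not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, Comment

HTML = """
<p>header</p>
<!-- why -->
www.godaddy.com
<p> nice one</p>
www.wwf.com
<!-- why not -->
<p>tail www.ignored.com</p>
"""


def text_between_comments(html, start, end):
    """Return whitespace-separated tokens found between two comments.

    `start` and `end` are compared against the stripped comment text.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Locate the opening comment by its stripped text.
    opening = soup.find(
        string=lambda t: isinstance(t, Comment) and t.strip() == start
    )
    if opening is None:
        return []
    tokens = []
    # next_elements yields nodes in document order, descending into tags,
    # so text nested inside <p> is visited as well.
    for node in opening.next_elements:
        if isinstance(node, Comment):
            if node.strip() == end:
                break
            continue  # ignore any unrelated comment in between
        if isinstance(node, str):  # NavigableString is a str subclass
            tokens.extend(node.split())
    return tokens


def looks_like_link(token):
    # Hypothetical heuristic: the question never defines what a link is.
    return token.startswith(("www.", "http://", "https://"))


links = [
    t for t in text_between_comments(HTML, "why", "why not")
    if looks_like_link(t)
]
print(links)
```

The Comment check comes before the str check because Comment is itself a string type. Testing in the other order would add the closing comment's text to the tokens.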

Edit:

If you really want only the ones which look like some specific text, you can do something like

>>> comments = soup.findAll(text = lambda text: isinstance(text, Comment) and text.strip() == 'why')
>>> comments
[u' why ']

or you could filter them after the fact using a list comprehension:

>>> [c for c in comments if c.strip().startswith("why")]
[u' why ', u' why not ']
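
The question also wonders whether a regex would do. For small, predictable pages it can, and it needs only the standard library. The sketch below is an alternative to the BeautifulSoup approach, not the method from this answer. It assumes Python 3 and that the two comments appear once each and in order. The function name links_between and the link prefixes it checks are hypothetical.

```python
import re

HTML = """
<p>header</p>
<!-- why -->
www.godaddy.com
<p> nice one</p>
www.wwf.com
<!-- why not -->
<p>tail www.ignored.com</p>
"""


def links_between(html, start="why", end="why not"):
    """Return link-like tokens between <!-- start --> and <!-- end -->."""
    # Allow optional whitespace inside the comment markers; DOTALL lets
    # the lazy group span several lines.
    pattern = re.compile(
        r"<!--\s*" + re.escape(start) + r"\s*-->"
        r"(.*?)"
        r"<!--\s*" + re.escape(end) + r"\s*-->",
        re.DOTALL,
    )
    match = pattern.search(html)
    if match is None:
        return []
    # Replace tags with spaces so markup never sticks to neighbouring text.
    body = re.sub(r"<[^>]+>", " ", match.group(1))
    # Assumed definition of a link: a token with one of these prefixes.
    prefixes = ("www.", "http://", "https://")
    return [tok for tok in body.split() if tok.startswith(prefixes)]


print(links_between(HTML))
```

A regex like this is brittle on real-world HTML (nested or malformed comments, links inside attributes), so the parser-based approach is usually the safer default.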


Source: https://stackoverflow.com/questions/12773921/find-specific-comments-in-html-code-using-python
