Python HTML scraping

Submitted by 元气小坏坏 on 2019-12-03 21:14:07

Regex is usually a bad idea here; try using BeautifulSoup.

Quick example:

from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = ""  # get html
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all('a', attrs={'class': 'myclass'})
for link in links:
    pass  # process link
Daniel Roseman

Aargh, not regex for parsing HTML!

Luckily in Python we have BeautifulSoup or lxml to do that job for us.

Regex would be a bad choice. HTML is not a regular language. How about Beautiful Soup?
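
To make that concrete, here is a minimal sketch comparing a naive regex with a real parser. It uses the standard library's html.parser as a stand-in for BeautifulSoup or lxml so that it runs without extra packages; the sample markup, the pattern, and the names LinkCollector, regex_links and parser_links are all made up for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: the same kind of link written three different ways.
SAMPLE = '''
<a class="myclass" href="/one">one</a>
<a href="/two" class="myclass">two</a>
<a class='myclass big'
   href='/three'>three</a>
'''

# Naive pattern: assumes class comes first, double quotes, everything on one line.
NAIVE = re.compile(r'<a class="myclass" href="([^"]*)">')


class LinkCollector(HTMLParser):
    """Collects href values of <a> tags whose class list contains a given name."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # A class attribute may hold several space-separated names.
        classes = (attr_map.get("class") or "").split()
        if self.wanted_class in classes and attr_map.get("href"):
            self.hrefs.append(attr_map["href"])


def regex_links(html):
    """Return hrefs found by the naive regex."""
    return NAIVE.findall(html)


def parser_links(html, wanted_class="myclass"):
    """Return hrefs found by walking the parsed start tags."""
    collector = LinkCollector(wanted_class)
    collector.feed(html)
    return collector.hrefs


if __name__ == "__main__":
    print(regex_links(SAMPLE))   # ['/one']
    print(parser_links(SAMPLE))  # ['/one', '/two', '/three']
```

The regex only matches the one link that happens to be written exactly as the pattern expects. The parser does not care about attribute order, quote style, or line breaks inside the tag.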

John Keyes

Regex should not be used to parse HTML. See the first answer to this question for an explanation :)

+1 for BeautifulSoup.

If your task really is this simple, plain string manipulation (without even regex) is enough:

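# Assumes each link sits on one line, with href as the last attribute
with open("htmlfile") as f:
    for line in f:
        if "<a class" in line and "myclass" in line and "href" in line:
            # Skip past href=" and keep everything up to the closing ">
            s = line[line.index("href") + len('href="'):]
            print(s[:s.index('">')])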

An HTML parser is not a must for such cases.
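
What "such cases" means can be spelled out with a slightly more forgiving variant of the same string approach. This is only a sketch under stated assumptions: each link sits on a single line, attributes use double quotes, and the class attribute is exactly class="myclass". The function name hrefs_from_lines and the sample lines are hypothetical.

```python
def hrefs_from_lines(lines, wanted_class="myclass"):
    """Yield href values from lines that contain <a ... class="wanted_class" ... href="...">."""
    marker = 'href="'
    for line in lines:
        # Cheap substring checks instead of parsing; single-quoted attributes are missed.
        if "<a " not in line or f'class="{wanted_class}"' not in line:
            continue
        start = line.find(marker)
        if start == -1:
            continue
        rest = line[start + len(marker):]
        # Stop at the closing quote, so attributes after href do not leak into the result.
        end = rest.find('"')
        if end != -1:
            yield rest[:end]


if __name__ == "__main__":
    sample = [
        '<p>intro</p>',
        '<a class="myclass" href="/one">one</a>',
        '<a class="myclass" href="/two" id="x">two</a>',
        '<a class="other" href="/skip">skip</a>',
    ]
    print(list(hrefs_from_lines(sample)))  # ['/one', '/two']
```

As soon as the markup stops following those assumptions (single quotes, tags split across lines, several class names), this silently returns nothing, which is where a parser becomes the safer choice.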

The thing is I know the structure of the HTML page, and I just want to find that specific kind of links (where class="myclass"). BeautifulSoup anyway?

George Godik