Parsing HTML page containing & using Python

Posted by 冷眼眸甩不掉的悲伤 on 2021-01-27 16:10:47

Question


I am trying to parse an HTML page in Python using urllib2 and ElementTree, but parsing fails. The page contains a bare "&" inside a quoted attribute value, and ElementTree raises a ParseError on every line that contains one.

Script:

import urllib2

url = 'http://eciresults.nic.in/ConstituencywiseU011.htm'
req = urllib2.Request(url, headers={'Content-type': 'text/xml'})
r = urllib2.urlopen(req).read()

import xml.etree.ElementTree as ET
htmlpage=ET.fromstring(r)

This raises the following error in Python 2.7:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1282, in XML
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1624, in feed
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1488, in _raiseerror
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 676, column 73

The error corresponds to the following line:

<input type="hidden" id="HdnFldAndamanNicobar" value="1,Andaman & Nicobar Islands;" />

It looks like the & sign is not converted to &amp; when the page is read into the variable r.

I also tried parsing the page with htmlTreeParse in R, and there the "&" is converted to &amp; correctly.

Let me know if I am missing something in urllib2.

EDIT: I replaced "&" with &amp;, but line 904 contains a < sign inside JavaScript, which raises the same error. There should be a better option than replacing characters.

LINE:904    for (i = 0; i < strac.length - 1; i++) {

Answer 1:


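First of all, xml.etree.ElementTree is an XML parser, not an HTML parser. It does not handle HTML entities out of the box, and it rejects input that is not well-formed XML. A bare & that does not start an entity reference is not well-formed XML, and this is why parsing fails. The unescaped < inside the JavaScript fails for the same reason. Nothing is missing in urllib2: it simply returns the page as the server sent it.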
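To make the failure concrete, here is a minimal sketch that feeds the offending line from the question to ElementTree twice: once as it appears on the page and once with the ampersand escaped. It is written for Python 3 (an assumption; the question uses Python 2.7), and the helper name try_parse is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The line from the question, with a bare "&" in the attribute value.
raw = '<input type="hidden" id="HdnFldAndamanNicobar" value="1,Andaman & Nicobar Islands;" />'

# The same line with the ampersand written as an XML entity reference.
escaped = raw.replace(" & ", " &amp; ")


def try_parse(markup):
    """Hypothetical helper: return the 'value' attribute, or the ParseError text."""
    try:
        return ET.fromstring(markup).get("value")
    except ET.ParseError as exc:
        return "ParseError: {}".format(exc)


raw_result = try_parse(raw)          # the XML parser rejects the bare "&"
escaped_result = try_parse(escaped)  # parses; "&amp;" is decoded back to "&"

print(raw_result)
print(escaped_result)
```

Escaping by hand only moves the problem along, as the EDIT in the question shows, because the page is HTML and was never meant to be well-formed XML.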

Use a dedicated HTML parser such as BeautifulSoup instead:

>>> from urllib2 import urlopen
>>> from bs4 import BeautifulSoup
>>> url = 'http://eciresults.nic.in/ConstituencywiseU011.htm'
>>> soup = BeautifulSoup(urlopen(url))
>>> soup.find('td').text.strip()
u'ELECTION COMMISSION OF INDIA'

See also:

  • How to parse malformed HTML in python, using standard libraries
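If installing BeautifulSoup is not an option, the standard library's HTML parser is also lenient about this kind of markup. The following is only a sketch under some assumptions: it targets Python 3 (where the module is html.parser; in Python 2.7 it is HTMLParser), the SAMPLE string is stitched together from the two lines quoted in the question rather than fetched from the real page, and the class name HiddenInputCollector is invented for the example.

```python
from html.parser import HTMLParser

# Assumed stand-in for the page: only the two problem lines quoted in the question.
SAMPLE = """<html><body>
<input type="hidden" id="HdnFldAndamanNicobar" value="1,Andaman & Nicobar Islands;" />
<script>
for (i = 0; i < strac.length - 1; i++) { }
</script>
</body></html>"""


class HiddenInputCollector(HTMLParser):
    """Hypothetical helper: collect id -> value for every hidden <input>."""

    def __init__(self):
        super().__init__()
        self.values = {}

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden":
            self.values[attrs.get("id")] = attrs.get("value")


collector = HiddenInputCollector()
collector.feed(SAMPLE)
collector.close()

hidden_values = collector.values
print(hidden_values)
```

Neither the bare & in the attribute nor the < inside the script block stops the parser, so no character replacement is needed before parsing.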


Source: https://stackoverflow.com/questions/23707647/parsing-html-page-containing-using-python
