问题
I just installed python, mplayer, beautifulsoup and sipie to run Sirius on my Ubuntu 10.04 machine. I followed some docs that seem straightforward, but am encountering some issues. I'm not that familiar with Python, so this may be out of my league.
I was able to get everything installed, but then running sipie gives this:
/usr/bin/Sipie/Sipie/Config.py:12: DeprecationWarning: the md5 module is deprecated; use hashlib instead import md5Traceback (most recent call last):
File "/usr/bin/Sipie/sipie.py", line 22, in <module>
Sipie.cliPlayer()File "/usr/bin/Sipie/Sipie/cliPlayer.py", line 74, in cliPlayer
completer = Completer(sipie.getStreams())File "/usr/bin/Sipie/Sipie/Factory.py", line 374, in getStreams
streams = self.tryGetStreams()File "/usr/bin/Sipie/Sipie/Factory.py", line 298, in tryGetStreams
soup = BeautifulSoup(data)File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.py", line 1499, in __init__
BeautifulStoneSoup.__init__(self, *args, **kwargs)File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.py", line 1230, in __init__
self._feed(isHTML=isHTML)File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.py", line 1263, in _feed
self.builder.feed(markup)File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
self.goahead(0)File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
k = self.parse_starttag(i)File "/usr/lib/python2.6/HTMLParser.py", line 226, in parse_starttag
endpos = self.check_for_whole_start_tag(i)File "/usr/lib/python2.6/HTMLParser.py", line 301, in check_for_whole_start_tag
self.error("malformed start tag")File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
raise HTMLParseError(message, self.getpos())HTMLParser.HTMLParseError: malformed start tag, at line 100, column 3
I looked through these files and the line numbers, but since I am unfamiliar with Python, it doesn't make much sense. Any advice on what to do next?
回答1:
The issues you are encountering are pretty common, and they deal specifically with mal-formed HTML. In my case, there was an HTML element which had double quoted an attribute's value. I ran into this issue today actually, and in so doing so came across your post. I was FINALLY able to resolve this issue through parsing the HTML through html5lib before handing it off the BeautifulSoup 4.
First off, you'll need to:
sudo easy_install bs4
sudo apt-get install python-html5lib
Then, run this example code:
from bs4 import BeautifulSoup
import html5lib
from html5lib import sanitizer
from html5lib import treebuilders
import urllib
url = 'http://the-url-to-scrape'
fp = urllib.urlopen(url)
# Create an html5lib parser. Not sure if the sanitizer is required.
parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"), tokenizer=sanitizer.HTMLSanitizer)
# Load the source file's HTML into html5lib
html5lib_object = parser.parse(file_pointer)
# In theory we shouldn't need to convert this to a string before passing to BS. Didn't work passing directly to BS for me however.
html_string = str(html5lib_object)
# Load the string into BeautifulSoup for parsing.
soup = BeautifulSoup(html_string)
for content in soup.findAll('div'):
print content
If you have any questions about this code or need a little more specific guidance, just let me know. :)
回答2:
Suppose you are using BeautifulSoup4, I found out something in the official document about this: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
If you’re using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it’s essential that you install lxml or html5lib–Python’s built-in HTML parser is just not very good in older versions.
I tried this and it works well, just like what @Joshua
soup = BeautifulSoup(r.text, 'html5lib')
回答3:
Newer versions of BeautifulSoup uses HTMLParser rather than SGMLParser (due to SGMLParser being removed from the Python 3.0 standard library). As a result, BeautifulSoup can no longer process many malformed HTML documents correctly, which is what I believe you are encountering here.
A solution to your problem is likely to be to uninstall BeautifulSoup, and install an older version (which will still work with Python 2.6 on Ubuntu 10.04LTS):
sudo apt-get remove python-beautifulsoup
sudo easy_install -U "BeautifulSoup==3.0.7a"
Just be aware that this temporary solution will no longer work with Python 3.0 (which may become the default in future versions of Ubuntu).
回答4:
Command Line:
$ pip install beautifulsoup4
$ pip install html5lib
Python 3:
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = 'http://www.example.com'
page = urlopen(url)
soup = BeautifulSoup(page.read(), 'html5lib')
links = soup.findAll('a')
for link in links:
print(link.string, link['href'])
回答5:
Look at column 3 of line 100 in the "data" that is mentioned in File "/usr/bin/Sipie/Sipie/Factory.py", line 298
来源:https://stackoverflow.com/questions/3198874/malformed-start-tag-error-python-beautifulsoup-and-sipie-ubuntu-10-04