Python - 'ascii' codec can't decode byte \xbd in position

Question

I'm using LXML to scrape some text from webpages. Some of the text includes fractions.

5½

I need to get this into a float format. These fail:

ugly_fraction.encode('utf-8')  # doesn't change to a usable format
ugly_fraction.replace('\xbd', '')  # throws error
ugly_fraction.encode('utf-8').replace('\xbd', '')  # throws error

Answer 1:

Use unicodedata.numeric. From the documentation:

Returns the numeric value assigned to the Unicode character unichr as float. If no such value is defined, default is returned, or, if not given, ValueError is raised.
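
As a quick illustration of that behaviour, here is a minimal sketch, assuming Python 3 (where every str is already Unicode). The variable names are only for illustration:

```python
import unicodedata

# Vulgar fraction characters carry a numeric value in the Unicode database.
half = unicodedata.numeric('\u00bd')            # the character '½'
three_quarters = unicodedata.numeric('\u00be')  # the character '¾'
five = unicodedata.numeric('5')                 # ordinary digits work too

# A character with no numeric value returns the default, if one is given...
fallback = unicodedata.numeric('x', -1.0)

# ...and raises ValueError otherwise.
try:
    unicodedata.numeric('x')
    raised = False
except ValueError:
    raised = True

print(half, three_quarters, five, fallback, raised)
```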

Note that it only handles a single character, not a string, so you still need to write the code that turns a "mixed fraction" (an integer followed by a fraction character) into a float. That's easy once you settle on the rule for how mixed fractions are represented in your data. For example, if pure ints, pure fractions, and ints followed by a fraction character with no space in between are the only possibilities, this works (and raises a reasonable exception, typically ValueError, for most invalid input):

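import unicodedata

def parse_mixed_fraction(s):
    if s.isdigit():
        # Pure integer, e.g. '5'
        return float(s)
    elif len(s) == 1:
        # Pure fraction character, e.g. '½'
        return unicodedata.numeric(s[-1])
    else:
        # Integer followed by a fraction character, e.g. '5½'
        return float(s[:-1]) + unicodedata.numeric(s[-1])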
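
If your scraped data is messier than that rule allows, a stricter variant may help. The sketch below is not part of the original answer: parse_mixed_fraction_checked is a hypothetical name, and it assumes Python 3, that surrounding whitespace or a space between the integer and the fraction character should be tolerated, and that anything else should raise ValueError. It also rejects input such as '5.5', which the function above would quietly turn into 10.0 (float('5.') plus the numeric value of '5').

```python
import unicodedata

def parse_mixed_fraction_checked(s):
    """Hypothetical stricter variant of parse_mixed_fraction (an assumption,
    not the original answer's code)."""
    s = s.strip()
    if not s:
        raise ValueError("empty input")
    if s.isdigit():
        # Pure integer, e.g. '5'
        return float(s)
    # Split into the integer part (may be empty) and the last character.
    whole, frac = s[:-1].rstrip(), s[-1]
    # The last character must not be a plain digit, and the integer part,
    # if present, must consist of digits only.
    if frac.isdigit() or (whole and not whole.isdigit()):
        raise ValueError("not a mixed fraction: %r" % s)
    # unicodedata.numeric raises ValueError for non-numeric characters.
    return (float(whole) if whole else 0.0) + unicodedata.numeric(frac)

print(parse_mixed_fraction_checked('5\u00bd'))     # '5½'
print(parse_mixed_fraction_checked(' 5 \u00bd '))  # ' 5 ½ '
```

Both calls print 5.5. Whether a space inside the value should be accepted depends on your data, so treat that rule as a choice rather than a requirement.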

Source: https://stackoverflow.com/questions/20110578/python-ascii-codec-cant-decode-byte-xbd-in-position

Tags

python

unicode

web-scraping

lxml
