parsing XML file gets UnicodeEncodeError (ElementTree) / ValueError (lxml)

后端 未结 3 1478
情深已故
情深已故 2020-12-03 16:14

I send a GET request to the CareerBuilder API :

import requests

url = \"http://api.careerbuilder.com/v1/jobsearch\"
payload = {\'DeveloperKey\': \'MY_DEVLOP         


        
3条回答
  •  孤城傲影
    2020-12-03 16:50

    Correction!

    See below how I got it all wrong. Basically, when we use the method .text then the result is a unicode encoded string. Using it raises the following exception in lxml

    ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

    Which basically means that @martijn-pieters was right, we must use the raw response as returned by .content

    Incorrect answer (but might be interesting to someone)

    For whoever is interested. I believe the reason this error occurs is probably an invalid guess taken by requests as explained in Response.text documentation:

    Content of the response, in unicode.

    If Response.encoding is None, encoding will be guessed using chardet.

    The encoding of the response content is determined based solely on HTTP headers, following RFC 2616 to the letter. If you can take advantage of non-HTTP knowledge to make a better guess at the encoding, you should set r.encoding appropriately before accessing this property.

    So, following this, one could also make sure requests' r.text encodes the response content correctly by explicitly setting the encoding with r.encoding = 'UTF-8'

    This approach adds another validation that the received response is indeed in the correct encoding prior to parsing it with lxml.

提交回复
热议问题