Question
I only have a few weeks of Python training, so I suspect there's a simple solution to this problem. But it's quite frustrating for me, and after working on this for several hours I'm now asking for help!
The website I'm trying to scrape is well organized (see https://twam2dcppennla6s.onion.to/), and the code I've written scrapes about half of the 26 pages until I receive this error message:
Traceback (most recent call last):
File "SR2works4real2.py", line 18, in <module>
csvWriter.writerows(jsonObj['vendors'])
File "/usr/lib/python2.7/csv.py", line 154, in writerows
return self.writer.writerows(rows)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 8: ordinal not in range(128)
My code is:
import urllib2, json,csv
htmlTxt=""
urlpart1='https://twam2dcppennla6s.onion.to/vendors.php?_dc=1393967362998&start='
pageNum=0
urlpart2='&limit=30&sort=%5B%7B%22property%22%3A%22totalFeedback%22%2C%22direction%22%3A%22DESC%22%7D%5D'
csvFile=open('S141.csv','wb')
csvWriter=csv.DictWriter(csvFile,['name','vendoringTime','lastSeen','avgFeedback','id','totalFeedback','united','shipsTo','shipsFrom'],delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
csvWriter.writeheader()
while htmlTxt != "{\"vendors\":[]}":
print("Page "+str(pageNum)+"...")
pageNum+=30
response=urllib2.urlopen((urlpart1)+str(pageNum)+(urlpart2))
htmlTxt=response.read()
htmlTxt.encode('utf-8')
jsonObj=json.loads(htmlTxt)
csvWriter.writerows(jsonObj['vendors'])
#print(str(jsonObj))
csvFile.close()
I hope there's someone out there that can help!
Answer 1:
That is the Unicode code point for the trademark sign (™): http://www.marathon-studios.com/unicode/U2122/Trade_Mark_Sign
Since you're scraping the web, you'll likely see many more errors of this type, so replacing this one character might work for this page but not for other pages with other symbols.
The csv module is converting your unicode to ASCII before writing it. I'd recommend you do that conversion yourself before handing the text to the writer, so you can clean it up on your own terms. That is, instead of
htmlTxt.encode('utf-8')
do
htmlTxt.encode('ascii', 'ignore')
Then check the resulting text to see whether it is acceptable for your purposes.
EDIT
Here's my output in Python 3:
>>> u'\u2122'.encode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\u2122' in position 0: ordinal not in range(128)
>>> u'\u2122'.encode('ascii', 'ignore')
b''
and Python 2.6:
>>> u'\u2122'.encode('ascii')
Traceback (most recent call last):
File "<pyshell#92>", line 1, in <module>
u'\u2122'.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 0: ordinal not in range(128)
>>> u'\u2122'.encode('ascii', 'ignore')
''
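As an illustration, here is a minimal Python 2 sketch of this approach (the scrub_to_ascii helper and the sample data are made up for the example). Note that encode() returns a new string rather than modifying anything in place, so the result has to be kept, and the UTF-8 response body is decoded before the ASCII step:
import json

def scrub_to_ascii(raw_bytes):
    # Decode the UTF-8 byte string, then drop any characters outside ASCII.
    return raw_bytes.decode('utf-8').encode('ascii', 'ignore')

# A fake response body containing a trademark sign (U+2122) encoded as UTF-8 bytes.
raw = '{"vendors": [{"name": "Acme\xe2\x84\xa2", "totalFeedback": 42}]}'
cleaned = scrub_to_ascii(raw)      # the trademark sign is silently dropped
print(json.loads(cleaned)['vendors'][0]['name'])   # prints: Acme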
Answer 2:
The strings in jsonObj will be of unicode type, because Python's json module produces unicode strings. Your csv writer wants everything as str. In Python 2.7 it will try to convert unicode to str automatically, assuming ASCII, and this will of course fail whenever the unicode string contains non-ASCII characters.
The simplest fix would be to change this line:
csvWriter.writerows(jsonObj['vendors'])
so that the unicode is encoded into UTF-8 str just before it is handed to the csv writer. jsonObj['vendors'] is a list of dictionaries with unicode keys and values, so we can do this:
unicode_vendors = jsonObj['vendors']
str_vendors = []
for unicode_dict in unicode_vendors:
    str_dict = {}
    for key, value in unicode_dict.items():
        # Encode unicode keys/values to UTF-8 bytes; leave non-string values (e.g. numbers, None) untouched.
        str_dict[key.encode('utf8')] = value.encode('utf8') if isinstance(value, unicode) else value
    str_vendors.append(str_dict)
csvWriter.writerows(str_vendors)
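For reference, the same transformation can be written more compactly with comprehensions (Python 2.7+); this is just a sketch equivalent to the loop above:
str_vendors = [
    {key.encode('utf8'): (value.encode('utf8') if isinstance(value, unicode) else value)
     for key, value in vendor.items()}
    for vendor in jsonObj['vendors']
]
csvWriter.writerows(str_vendors)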
Source: https://stackoverflow.com/questions/22184178/scraping-works-well-until-i-get-this-error-ascii-codec-cant-encode-character