[Update] Appreciate the answers and input all around, but working code would be most welcome. If you can supply code that can read the sample files you are
You are attempting to apply a solution to a different problem. Note this:
def utf_8_encoder(unicode_csv_data):
You are feeding it str objects.
The problem with reading your non-ASCII CSV files is that you don't know the encoding and you don't know the delimiter. If you do know the encoding (and it's an ASCII-based encoding, e.g. cp125x, any East Asian encoding, or UTF-8, but not UTF-16 or UTF-32) and the delimiter, this will work:
for row in csv.reader(open("foo.csv", "rb"), delimiter=known_delimiter):
    row = [item.decode(encoding) for item in row]
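The loop above is Python 2, where csv.reader yields byte strings that are decoded afterwards. For comparison, here is a minimal Python 3 sketch of the same idea, assuming the encoding and delimiter are already known. The function name read_csv_rows is hypothetical and chosen only for this illustration.

```python
import csv

def read_csv_rows(pathname, encoding, delimiter):
    """Return all rows of a CSV file as lists of str (Python 3 only).

    Python 3's csv module works on text, so the decoding happens in open()
    instead of being applied to each field after parsing.
    """
    # newline='' lets the csv module handle line endings inside quoted fields.
    with open(pathname, encoding=encoding, newline='') as f:
        return list(csv.reader(f, delimiter=delimiter))
```

Called as read_csv_rows("foo.csv", "cp1251", ";"), it should give the same lists of text that the Python 2 loop builds field by field.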
Your sample_euro.csv looks like cp1252 with comma delimiter. The Russian one looks like cp1251 with semicolon delimiter. By the way, it seems from the contents that you will also need to determine what date format is being used and maybe the currency also -- the Russian example has money amounts followed by a space and the Cyrillic abbreviation for "roubles".
Note carefully: Resist all attempts to persuade you that you have files encoded in ISO-8859-1. They are encoded in cp1252.
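The two encodings differ only in the byte range 0x80-0x9F, which is where cp1252 puts characters such as the euro sign. A small sketch of why the wrong label goes unnoticed: ISO-8859-1 never fails to decode, it just turns those bytes into invisible control characters. The helper name decode_both is hypothetical.

```python
def decode_both(raw):
    """Decode the same bytes as cp1252 and as ISO-8859-1 for comparison."""
    return raw.decode('cp1252'), raw.decode('iso-8859-1')

# 0x80 is the euro sign in cp1252 but a C1 control character in ISO-8859-1.
as_cp1252, as_latin1 = decode_both(b'\x80')
assert as_cp1252 == u'\u20ac'
assert as_latin1 == u'\x80'

# Outside 0x80-0x9F the two agree, e.g. 0xF8 is "o with stroke" in both.
assert decode_both(b'\xf8') == (u'\xf8', u'\xf8')
```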
Update in response to comment """If I understand what you're saying I must know the encoding in order for this to work? In the general case I won't know the encoding and based on the other answer guessing the encoding is very difficult, so I'm out of luck?"""
You must know the encoding for ANY file-reading exercise to work.
Guessing the encoding correctly all the time for any encoding in any size file is not very difficult -- it's impossible. However, if you restrict the scope to CSV files of a reasonable size, saved out of Excel or OpenOffice in the default encoding of the user's locale, it's not such a big task. I'd suggest giving chardet a try; it guesses windows-1252 for your euro file and windows-1251 for your Russian file -- a fantastic achievement given their tiny size.
Update 2 in response to """working code would be most welcome"""
Working code (Python 2.x):
from chardet.universaldetector import UniversalDetector
chardet_detector = UniversalDetector()

def charset_detect(f, chunk_size=4096):
    global chardet_detector
    chardet_detector.reset()
    while 1:
        chunk = f.read(chunk_size)
        if not chunk: break
        chardet_detector.feed(chunk)
        if chardet_detector.done: break
    chardet_detector.close()
    return chardet_detector.result

# Exercise for the reader: replace the above with a class

import csv
import sys
from pprint import pprint

pathname = sys.argv[1]
delim = sys.argv[2] # allegedly known
print "delim=%r pathname=%r" % (delim, pathname)

with open(pathname, 'rb') as f:
    cd_result = charset_detect(f)
    encoding = cd_result['encoding']
    confidence = cd_result['confidence']
    print "chardet: encoding=%s confidence=%.3f" % (encoding, confidence)
    # insert actions contingent on encoding and confidence here
    f.seek(0)
    csv_reader = csv.reader(f, delimiter=delim)
    for bytes_row in csv_reader:
        unicode_row = [x.decode(encoding) for x in bytes_row]
        pprint(unicode_row)
Output 1:
delim=',' pathname='sample-euro.csv'
chardet: encoding=windows-1252 confidence=0.500
[u'31-01-11',
 u'Overf\xf8rsel utland',
 u'UTLBET; ID 9710032001647082',
 u'1990.00',
 u'']
[u'31-01-11',
 u'Overf\xf8ring',
 u'OVERF\xd8RING MELLOM EGNE KONTI',
 u'5750.00',
 u';']
Output 2:
delim=';' pathname='sample-russian.csv'
chardet: encoding=windows-1251 confidence=0.602
[u'-',
 u'04.02.2011 23:20',
 u'300,00\xa0\u0440\u0443\u0431.',
 u'',
 u'\u041c\u0422\u0421',
 u'']
[u'-',
 u'04.02.2011 23:15',
 u'450,00\xa0\u0440\u0443\u0431.',
 u'',
 u'\u041e\u043f\u043b\u0430\u0442\u0430 Interzet',
 u'']
[u'-',
 u'13.01.2011 02:05',
 u'100,00\xa0\u0440\u0443\u0431.',
 u'',
 u'\u041c\u0422\u0421 kolombina',
 u'']
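The script leaves a placeholder comment for "actions contingent on encoding and confidence". One possible way to fill it in is sketched below. The helper name choose_encoding, the 0.5 threshold, and the cp1252 fallback are all assumptions made for illustration, not chardet defaults; the threshold is set so that the 0.500 result for the euro file is still accepted.

```python
def choose_encoding(cd_result, min_confidence=0.5, fallback='cp1252'):
    """Pick an encoding from a chardet-style result dict.

    min_confidence and fallback are illustrative assumptions; tune them
    for the files you actually expect to receive.
    """
    encoding = cd_result.get('encoding')
    confidence = cd_result.get('confidence') or 0.0
    # chardet reports encoding=None when it cannot make any guess at all.
    if encoding is None or confidence < min_confidence:
        return fallback
    return encoding
```

In the script, encoding = choose_encoding(cd_result) would replace the direct lookup of cd_result['encoding']. Whether to fall back silently, warn, or refuse the file is a policy decision that depends on where the files come from.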
Update 3 What is the source of these files? If they are being "saved as CSV" from Excel or OpenOffice Calc or Gnumeric, you could avoid the whole encoding drama by having them saved as "Excel 97-2003 Workbook (*.xls)" and use xlrd to read them. This would also save the hassles of having to inspect each csv file to determine the delimiter (comma vs semicolon), date format (31-01-11 vs 04.02.2011), and "decimal point" (5750.00 vs 450,00) -- all those differences presumably being created by saving as CSV. [Dis]claimer: I'm the author of xlrd.
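If the files have to stay as CSV, the date and decimal-point differences still need handling after decoding. The following is a rough sketch only, assuming the two date layouts seen in the sample output are day-first and that the caller already knows which decimal mark a file uses. The names DATE_FORMATS, parse_date, and parse_amount are hypothetical, and thousands separators are not handled.

```python
from datetime import datetime
from decimal import Decimal

# Layouts seen in the two sample outputs (assumed day-first); extend as needed.
DATE_FORMATS = ('%d-%m-%y', '%d.%m.%Y %H:%M')

def parse_date(text):
    """Try each known layout in turn and return the first that matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError('unrecognised date: %r' % (text,))

def parse_amount(text, decimal_mark):
    """Convert '5750.00' or '300,00<NBSP>rub.' style text to a Decimal."""
    # Keep the leading number; anything after a (no-break) space is a label.
    number = text.replace(u'\xa0', u' ').split(u' ')[0]
    if decimal_mark == ',':
        number = number.replace(',', '.')
    return Decimal(number)
```

For example, parse_amount(u'300,00\xa0\u0440\u0443\u0431.', ',') drops the currency abbreviation and returns Decimal('300.00').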