A resilient, actually working CSV implementation for non-ASCII?

萌比男神i 2020-12-30 06:30

[Update] Appreciate the answers and input all around, but working code would be most welcome. If you can supply code that can read the sample files, you are …

4 Answers
  • 2020-12-30 07:02

    I don't know if you've already tried this, but in the example section of the official Python documentation for the csv module, you'll find a pair of classes, UnicodeReader and UnicodeWriter. They have worked fine for me so far.
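
    For reference, the UnicodeReader half of that recipe looks roughly like this (condensed from the examples section of the Python 2 csv documentation; UTF8Recoder is the helper class from the same example):

    import csv, codecs

    class UTF8Recoder:
        # iterator that reads a stream in the given encoding and
        # re-encodes each line to UTF-8 for the csv module
        def __init__(self, f, encoding):
            self.reader = codecs.getreader(encoding)(f)
        def __iter__(self):
            return self
        def next(self):
            return self.reader.next().encode('utf-8')

    class UnicodeReader:
        # CSV reader that yields rows of unicode strings from a file
        # in the given encoding
        def __init__(self, f, dialect=csv.excel, encoding='utf-8', **kwds):
            self.reader = csv.reader(UTF8Recoder(f, encoding), dialect=dialect, **kwds)
        def next(self):
            return [unicode(s, 'utf-8') for s in self.reader.next()]
        def __iter__(self):
            return self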

    Correctly detecting the encoding of a file seems to be a very hard problem. You can read the discussion in this StackOverflow thread.

  • 2020-12-30 07:03

    What you are asking is impossible. There is no way to write a program in any language that will accept input in an unknown encoding and correctly convert it to Unicode internal representation.

    You have to find a way to tell the application which encoding to use.

    It is possible to recognize many, but not all, encodings -- chardet attempts exactly this -- but it really depends on what the content of the files is and whether there are enough data points. This is similar to the issue of correctly decoding filenames on network servers. When a file is created on a network server, there is no way to tell the server what encoding is used, so if you have a folder with names in multiple encodings, they are guaranteed to look odd to some, if not all, users, and different files will look odd to different users.

    However, don't give up. Try the chardet encoding detector mentioned in this question: https://serverfault.com/questions/82821/how-to-tell-the-language-encoding-of-a-filename-on-linux and if you are lucky, you won't get many failures.
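
    For a quick experiment, the one-shot chardet.detect call is enough for small files (a sketch; the filename is illustrative):

    import chardet

    with open('sample.csv', 'rb') as f:
        # returns a dict like {'encoding': 'windows-1251', 'confidence': 0.6}
        print chardet.detect(f.read())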

  • 2020-12-30 07:06

    You are attempting to apply a solution to a different problem. Note this:

    def utf_8_encoder(unicode_csv_data):

    You are feeding it str objects.

    The problem with reading your non-ASCII CSV files is that you don't know the encoding and you don't know the delimiter. If you do know the encoding (and it's an ASCII-compatible encoding, e.g. cp125x, any East Asian encoding, or UTF-8 -- not UTF-16, not UTF-32) and the delimiter, this will work:

    # open in binary mode; csv.reader wants an iterable of lines, not a filename
    for row in csv.reader(open("foo.csv", "rb"), delimiter=known_delimiter):
        row = [item.decode(encoding) for item in row]
    

    Your sample_euro.csv looks like cp1252 with a comma delimiter. The Russian one looks like cp1251 with a semicolon delimiter. By the way, it seems from the contents that you will also need to determine what date format is being used, and maybe the currency as well -- the Russian example has money amounts followed by a space and the Cyrillic abbreviation for "roubles".
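
    As a hypothetical post-processing sketch for the Russian rows (field positions and formats taken from Output 2 further down; the helper name is made up, not part of the original answer):

    import datetime

    def parse_russian_row(unicode_row):
        # '04.02.2011 23:20' -> datetime
        when = datetime.datetime.strptime(unicode_row[1], '%d.%m.%Y %H:%M')
        # u'300,00\xa0\u0440\u0443\u0431.' -> 300.0: drop the rouble tag after
        # the no-break space, swap the decimal comma for a point
        amount = float(unicode_row[2].split(u'\xa0')[0].replace(u',', u'.'))
        return when, amount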

    Note carefully: Resist all attempts to persuade you that you have files encoded in ISO-8859-1. They are encoded in cp1252.

    Update in response to comment """If I understand what you're saying I must know the encoding in order for this to work? In the general case I won't know the encoding and based on the other answer guessing the encoding is very difficult, so I'm out of luck?"""

    You must know the encoding for ANY file-reading exercise to work.

    Guessing the encoding correctly all the time, for any encoding, in any size file, is not just very difficult -- it's impossible. However, restricting the scope to CSV files saved out of Excel or OpenOffice in the user's locale's default encoding, and of a reasonable size, makes it not such a big task. I'd suggest giving chardet a try; it guesses windows-1252 for your euro file and windows-1251 for your Russian file -- a fantastic achievement given their tiny size.

    Update 2 in response to """working code would be most welcome"""

    Working code (Python 2.x):

    from chardet.universaldetector import UniversalDetector
    chardet_detector = UniversalDetector()
    
    def charset_detect(f, chunk_size=4096):
        # feed the file to chardet chunk by chunk until it reaches a verdict
        global chardet_detector
        chardet_detector.reset()
        while 1:
            chunk = f.read(chunk_size)
            if not chunk: break
            chardet_detector.feed(chunk)
            if chardet_detector.done: break
        chardet_detector.close()
        return chardet_detector.result
    
    # Exercise for the reader: replace the above with a class
    
    import csv    
    import sys
    from pprint import pprint
    
    pathname = sys.argv[1]
    delim = sys.argv[2] # allegedly known
    print "delim=%r pathname=%r" % (delim, pathname)
    
    with open(pathname, 'rb') as f:
        cd_result = charset_detect(f)
        encoding = cd_result['encoding']
        confidence = cd_result['confidence']
        print "chardet: encoding=%s confidence=%.3f" % (encoding, confidence)
        # insert actions contingent on encoding and confidence here
        f.seek(0)
        csv_reader = csv.reader(f, delimiter=delim)
        for bytes_row in csv_reader:
            unicode_row = [x.decode(encoding) for x in bytes_row]
            pprint(unicode_row)
    

    Output 1:

    delim=',' pathname='sample-euro.csv'
    chardet: encoding=windows-1252 confidence=0.500
    [u'31-01-11',
     u'Overf\xf8rsel utland',
     u'UTLBET; ID 9710032001647082',
     u'1990.00',
     u'']
    [u'31-01-11',
     u'Overf\xf8ring',
     u'OVERF\xd8RING MELLOM EGNE KONTI',
     u'5750.00',
     u';']
    

    Output 2:

    delim=';' pathname='sample-russian.csv'
    chardet: encoding=windows-1251 confidence=0.602
    [u'-',
     u'04.02.2011 23:20',
     u'300,00\xa0\u0440\u0443\u0431.',
     u'',
     u'\u041c\u0422\u0421',
     u'']
    [u'-',
     u'04.02.2011 23:15',
     u'450,00\xa0\u0440\u0443\u0431.',
     u'',
     u'\u041e\u043f\u043b\u0430\u0442\u0430 Interzet',
     u'']
    [u'-',
     u'13.01.2011 02:05',
     u'100,00\xa0\u0440\u0443\u0431.',
     u'',
     u'\u041c\u0422\u0421 kolombina',
     u'']
    

    Update 3 What is the source of these files? If they are being "saved as CSV" from Excel, OpenOffice Calc, or Gnumeric, you could avoid the whole encoding drama by having them saved as "Excel 97-2003 Workbook (*.xls)" and using xlrd to read them. This would also save the hassle of having to inspect each CSV file to determine the delimiter (comma vs semicolon), date format (31-01-11 vs 04.02.2011), and "decimal point" (5750.00 vs 450,00) -- all those differences presumably being created by saving as CSV. [Dis]claimer: I'm the author of xlrd.
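
    A minimal xlrd sketch, assuming the data were re-saved as sample.xls (the filename is hypothetical); xlrd hands text cells back as unicode, so no decoding step is needed:

    import xlrd

    book = xlrd.open_workbook('sample.xls')  # hypothetical filename
    sheet = book.sheet_by_index(0)
    for rownum in range(sheet.nrows):
        # text cells arrive as unicode; dates arrive as Excel float serials
        print sheet.row_values(rownum)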

  • 2020-12-30 07:09

    You are doing the wrong thing in your code by trying to .encode('utf-8'); you should be decoding it instead. And by the way, unicode(bytestr, 'utf-8') == bytestr.decode('utf-8').
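
    A quick check of that equivalence (the byte string is the UTF-8 encoding of u'\u041c\u0422\u0421', the "МТС" from your samples):

    bytestr = '\xd0\x9c\xd0\xa2\xd0\xa1'  # UTF-8 bytes for u'\u041c\u0422\u0421'
    assert unicode(bytestr, 'utf-8') == bytestr.decode('utf-8')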

    But most importantly, WHY are you trying to decode the strings?

    It sounds a bit absurd, but you can actually work with those CSVs without caring whether they are cp1251, cp1252 or utf-8. The beauty of it all is that the regional characters are all bytes > 0x7F, and utf-8 likewise uses sequences of bytes > 0x7F to represent non-ASCII symbols.

    Since the separators CSV cares about (be it , or ; or \n) are all within ASCII, its parsing won't be affected by the encoding used (as long as it is a single-byte encoding or utf-8!).

    An important thing to note is that you should give the Python 2.x csv module files opened in binary mode -- that is, 'rb' or 'wb' -- because of the peculiar way it was implemented.
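
    A minimal sketch of this encoding-agnostic approach (the filename and delimiter are assumptions for illustration):

    import csv

    # binary mode, as required by the Python 2.x csv module
    with open('sample-russian.csv', 'rb') as f:
        for row in csv.reader(f, delimiter=';'):
            # items are raw byte strings; the ASCII delimiter splits correctly
            # whether the payload is cp1251, cp1252 or utf-8
            print row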
