Weird characters added to first column name after reading a Toad-exported CSV file

Asked by 后悔当初 on 2020-12-29 02:50

Whenever I read a csv file in R (read.csv("file_name.csv")) that was exported using Toad, the first column name is preceded by the following characters

4 Answers
  •  天涯浪人
    2020-12-29 03:14

    I recently ran into this with both the clipboard and Microsoft Excel.

    With the ever-increasing multilingual content used for data science, there simply isn't a safe way to assume UTF-8 any longer (in my case, Excel assumed UTF-16 because most of my data included Traditional Chinese (Mandarin?)).

    According to the Microsoft Docs, the following BOMs are used on Windows:

    |----------------------|-------------|-----------------------|
    | Encoding             | BOM         | Python encoding kwarg |
    |----------------------|-------------|-----------------------|
    | UTF-8                | EF BB BF    | 'utf-8'               |
    | UTF-16 big-endian    | FE FF       | 'utf-16-be'           |
    | UTF-16 little-endian | FF FE       | 'utf-16-le'           |
    | UTF-32 big-endian    | 00 00 FE FF | 'utf-32-be'           |
    | UTF-32 little-endian | FF FE 00 00 | 'utf-32-le'           |
    |----------------------|-------------|-----------------------|
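
    To see which of these signatures a given file actually starts with (for example the Toad export from the question), you can dump its first few bytes in binary mode. This is just a quick illustrative sketch, and 'file_name.csv' is only a placeholder for the file you are inspecting:

    # Read the first four bytes and print them as hex so they can be compared
    # against the BOM table above.
    with open('file_name.csv', 'rb') as f:
        first_bytes = f.read(4)
    print(' '.join('{:02X}'.format(b) for b in first_bytes))
    # A file that starts with a UTF-8 BOM will print something beginning EF BB BF.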
    

    I came up with the following approach that seems to work well to detect encoding using the Byte Order Mark at the start of the file:

    def guess_encoding_from_bom(filename, default='utf-8'):
        msboms = dict((bom['sig'], bom) for bom in (
            {'name': 'UTF-8', 'sig': b'\xEF\xBB\xBF', 'encoding': 'utf-8'},
            {'name': 'UTF-16 big-endian', 'sig': b'\xFE\xFF', 'encoding':
                'utf-16-be'},
            {'name': 'UTF-16 little-endian', 'sig': b'\xFF\xFE', 'encoding':
                'utf-16-le'},
            {'name': 'UTF-32 big-endian', 'sig': b'\x00\x00\xFE\xFF', 'encoding':
                'utf-32-be'},
            {'name': 'UTF-32 little-endian', 'sig': b'\xFF\xFE\x00\x00',
                'encoding': 'utf-32-le'}))
    
        with open(filename, 'rb') as f:
            sig = f.read(4)
            # Check the longest signatures first so the 4-byte UTF-32 BOMs are
            # matched before the 2-byte UTF-16 ones (FF FE is a prefix of
            # FF FE 00 00); the upper bound must be 4 or UTF-32 can never match.
            for sl in range(4, 0, -1):
                if sig[0:sl] in msboms:
                    return msboms[sig[0:sl]]['encoding']
            return default
    
    
    # Example using the Python csv module
    import csv
    import os

    def excelcsvreader(path, delimiter=',',
                    doublequote=False, quotechar='"', dialect='excel',
                    escapechar='\\', fileEncoding='UTF-8'):
        filepath = os.path.expanduser(path)
        fileEncoding = guess_encoding_from_bom(filepath, default=fileEncoding)
        if os.path.exists(filepath):
            # ok let's open it and parse the data
            with open(filepath, 'r', encoding=fileEncoding) as csvfile:
                csvreader = csv.DictReader(csvfile, delimiter=delimiter,
                    doublequote=doublequote, quotechar=quotechar, dialect=dialect,
                    escapechar=escapechar)  # use the parameter instead of a hard-coded backslash
                for (rnum, row) in enumerate(csvreader):
                    yield (rnum, row)
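
    For example, a minimal usage sketch (the file name here is only an illustrative placeholder):

    for rnum, row in excelcsvreader('export_from_toad.csv'):
        print(rnum, row)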
    

    I realize that this requires opening the file twice (once in binary mode and once as encoded text), but the API doesn't really make it easy to avoid that in this particular case.
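
    If you really want to avoid the second open, one option (just a sketch of mine, not part of the code above) is to open the file once in binary mode, peek at the BOM yourself, skip past it, and wrap the same handle in io.TextIOWrapper before handing it to csv:

    import io

    def open_text_skipping_bom(path, default='utf-8'):
        # Open once in binary, sniff the BOM, then reuse the handle as text.
        f = open(path, 'rb')
        sig = f.read(4)
        # Longest signatures first so the UTF-32 BOMs are not mistaken for UTF-16.
        boms = [(b'\xFF\xFE\x00\x00', 'utf-32-le'),
                (b'\x00\x00\xFE\xFF', 'utf-32-be'),
                (b'\xEF\xBB\xBF', 'utf-8'),
                (b'\xFE\xFF', 'utf-16-be'),
                (b'\xFF\xFE', 'utf-16-le')]
        encoding, offset = default, 0
        for bom, enc in boms:
            if sig.startswith(bom):
                encoding, offset = enc, len(bom)
                break
        f.seek(offset)  # start the text stream just past the BOM, if there was one
        return io.TextIOWrapper(f, encoding=encoding, newline='')

    # e.g. csv.DictReader(open_text_skipping_bom('export_from_toad.csv'))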

    At any rate, I think this is a bit more robust than simply assuming UTF-8, and the automatic encoding detection obviously isn't working for you here anyway.
