How to determine the encoding of text?

一向 2020-11-21 07:47

I received some text that is encoded, but I don't know what charset was used. Is there a way to determine the encoding of a text file using Python? How can I detect the enc

10 answers
  •  夕颜 (original poster)
    2020-11-21 08:13

    Some encoding strategies; uncomment to taste:

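    #!/bin/bash
    #
    # Usage: pass the file to inspect as the first argument.
    tmpfile=$1
    echo '-- info about file ........'
    # Report the MIME type and charset, then let enca guess the encoding.
    file -i "$tmpfile"
    enca -g "$tmpfile"
    echo 'recoding ........'
    #iconv -f iso-8859-2 -t utf-8 back_test.xml > "$tmpfile"
    #enca -x utf-8 "$tmpfile"
    #enca -g "$tmpfile"
    # Convert the file in place from CP1250 to UTF-8.
    recode CP1250..UTF-8 "$tmpfile"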
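
Since the question asks about Python, here is a minimal sketch of the recoding step in Python, assuming the source encoding is already known (CP1250, as in the `recode` line). The function name `recode_to_utf8` and its parameters are hypothetical, and unlike `recode` it writes to a separate output file instead of converting in place.

```python
import shutil


def recode_to_utf8(src_path, dst_path, src_encoding="cp1250"):
    """Copy src_path to dst_path, re-encoding its text as UTF-8.

    Hypothetical helper: src_encoding must be the file's real encoding,
    otherwise the output will contain the wrong characters.
    """
    # newline="" disables newline translation, so line endings are preserved.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        # Stream the text in chunks instead of loading the whole file.
        shutil.copyfileobj(src, dst)
```

Calling `recode_to_utf8("in.txt", "out.txt")` leaves the original file untouched, which makes it easy to compare the two if the guessed encoding turns out to be wrong.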
    

    You can also check the encoding in Python by trying to open and read the file with each candidate encoding in a loop. You may want to check the file size first, because this reads the whole file:

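    import codecs

    # Candidate encodings, tried in order; extend the list as needed.
    encodings = ['utf-8', 'windows-1250', 'windows-1252']  # ...etc
    for e in encodings:
        fh = codecs.open('file.txt', 'r', encoding=e)
        try:
            # Decoding the whole file raises UnicodeDecodeError on invalid bytes.
            fh.readlines()
            fh.seek(0)
        except UnicodeDecodeError:
            fh.close()
            print('got unicode error with %s , trying different encoding' % e)
        else:
            # fh stays open, rewound to the start, ready for reading.
            print('opening the file with encoding:  %s ' % e)
            break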
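
The file-size concern can be handled by decoding only a bounded sample of the file. The sketch below is one possible way to do that, not a definitive detector: `guess_encoding`, `candidates`, and `max_bytes` are hypothetical names, and the default limit is arbitrary. It uses an incremental decoder so that a multi-byte character cut off at the end of the sample is not reported as an error.

```python
import codecs
import os


def guess_encoding(path,
                   candidates=("utf-8", "windows-1250", "windows-1252"),
                   max_bytes=1_000_000):
    """Return the first candidate that decodes the sampled bytes, else None."""
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        data = fh.read(max_bytes)  # read at most max_bytes, never the whole file
    whole_file = size <= max_bytes
    for name in candidates:
        decoder = codecs.getincrementaldecoder(name)(errors="strict")
        try:
            # final=False tolerates a character truncated at the sample boundary.
            decoder.decode(data, final=whole_file)
        except UnicodeDecodeError:
            continue
        return name
    return None
```

A successful decode is only a hint. Single-byte encodings such as windows-1250 and windows-1252 accept nearly any byte sequence, so the order of `candidates` matters: put strict encodings like UTF-8 first and treat a match on a single-byte encoding as a guess.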
    
