Read a text file with non-ASCII characters in an unknown encoding

一生所求 2020-12-16 14:59

I want to read a file that contains German characters as well, not only ASCII ones. I found that I can do it like this:

  >>> import codecs


        
2 Answers
  • 2020-12-16 15:30

    I believe the file is being read correctly, but the wrong encoding is being used when the text is output. I say this because you get the proper results in IDLE.

    I would suggest trying print(line.encode('utf-8')), but I'm afraid I don't know whether Python 3 will print a bytes object properly.
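
    A note on that last point: in Python 3, print() of a bytes object shows its repr (b'...') rather than the decoded characters. A minimal sketch of the difference, and of writing the encoded bytes to a binary stream instead. write_utf8 is a hypothetical helper name, and the sample line is made up for illustration:

```python
import io

def write_utf8(text, stream):
    """Encode text as UTF-8 and write the raw bytes to a binary stream."""
    stream.write(text.encode("utf-8"))

line = "Grüße aus München\n"  # made-up sample text

# print(line.encode("utf-8")) would display this repr, not the characters.
as_printed = str(line.encode("utf-8"))

# Writing the bytes to a binary stream keeps them intact.
# An in-memory buffer stands in for sys.stdout.buffer here.
buffer = io.BytesIO()
write_utf8(line, buffer)
round_trip = buffer.getvalue().decode("utf-8")
```

    On a real terminal, write_utf8(line, sys.stdout.buffer) would send the bytes directly. That only helps if the terminal itself expects UTF-8.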

  • 2020-12-16 15:40

    You need to know which character encoding the text is encoded in. If you don't know that beforehand, you can try guessing it with the chardet module. First install it:

    $ pip install chardet
    

    Then, for example, read the file in binary mode and pass the bytes to chardet:

    >>> import chardet
    >>> chardet.detect(open("file.txt", "rb").read())
    {'confidence': 0.9690625, 'encoding': 'utf-8'}
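
    The detected encoding is only a guess, and the 'encoding' value can be None when detection fails, so it may help to keep a fallback. A rough sketch, assuming a hypothetical helper decode_with_guess that takes an optional detector callable and otherwise tries a list of candidate encodings. The candidate list is an assumption that suits German text, not something chardet prescribes:

```python
def decode_with_guess(raw, detector=None, candidates=("utf-8", "cp1252")):
    """Decode bytes using a detector's guess, then candidate encodings.

    Returns (text, encoding_used). Falls back to latin-1, which maps
    every byte to a character and therefore never fails.
    """
    guesses = []
    if detector is not None:
        guess = detector(raw)  # may be None if detection fails
        if guess:
            guesses.append(guess)
    guesses.extend(candidates)

    for encoding in guesses:
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue  # wrong guess or unknown codec name: try the next one
    return raw.decode("latin-1"), "latin-1"
```

    With chardet installed, the detector could be lambda raw: chardet.detect(raw)['encoding']. Without it, the function simply walks the candidate list.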
    

    So then:

    >>> import codecs
    >>> lines = codecs.open('file.txt', 'r', encoding='utf-8').readlines()
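
    In Python 3 the built-in open() accepts an encoding argument directly, so codecs.open is not required there. A minimal sketch, assuming the encoding has already been determined. read_lines is a hypothetical helper name, and the file is a throwaway created just for the demo:

```python
import os
import tempfile

def read_lines(path, encoding="utf-8"):
    """Read all lines of a text file, decoding with the given encoding."""
    # The with-block closes the file even if decoding raises an error.
    with open(path, "r", encoding=encoding) as handle:
        return handle.readlines()

# Demo: write a small Latin-1 file, then read it back with that encoding.
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "file.txt")
    with open(sample_path, "w", encoding="latin-1", newline="") as handle:
        handle.write("Straße\nÜbung\n")
    lines = read_lines(sample_path, encoding="latin-1")
```

    Reading the same file with the wrong encoding (for example utf-8) would raise UnicodeDecodeError, which is a useful signal that the guess was wrong.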
    