UnicodeEncodeError in Python 3 when redirection is used

Posted by ⅰ亾dé卋堺 on 2021-02-10 20:14:13

Question


What I want to do: extract the text from a PDF file and redirect it to a txt file.

What I did:

pip install pdfminer

pdf2txt.py file.pdf > output.txt

What I got:

UnicodeEncodeError: 'gbk' codec can't encode character '\u2022' in position 0: illegal multibyte sequence

My observation:

\u2022 is the bullet point character, •.

pdf2txt.py works well without redirection: the bullet point character is written to stdout without any error.

My question:

Why does redirection cause a Python error? As far as I know, redirection is the operating system's job, and it simply copies the output after the program has finished.

How can I fix this error? I cannot modify pdf2txt.py because it is not my code.


Answer 1:


Redirection causes an error because the encoding Python uses by default for redirected output cannot represent one of the characters you are writing. In your case Python is trying to encode the bullet character with the GBK codec, which probably means you are using a Chinese version of Windows.
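The failure can be reproduced without pdf2txt.py or any redirection. A minimal sketch using only Python's built-in codecs; the helper name can_encode is made up for this illustration:

```python
def can_encode(text, encoding):
    """Return True if `text` can be encoded with `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

bullet = "\u2022"  # the character from the error message

print(can_encode(bullet, "gbk"))    # False: GBK has no mapping for U+2022
print(can_encode(bullet, "utf-8"))  # True
```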

Python 3.6 and later can print to the console window on Windows without trouble, because the text is handed to the console as Unicode and no code page encoding is involved. Only when the output is redirected to a file must the Unicode text be encoded to a byte stream.
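To see which case applies, a script can inspect its own standard output. The sketch below is only a diagnostic aid, and the function name describe_stdout is hypothetical. It reports on stderr so that the message stays visible even when stdout is redirected:

```python
import sys

def describe_stdout():
    """Summarise where stdout goes and which encoding Python picked for it."""
    target = "terminal" if sys.stdout.isatty() else "file or pipe"
    encoding = getattr(sys.stdout, "encoding", None)
    return f"stdout -> {target}, encoding={encoding}"

# Written to stderr, so "script.py > output.txt" still shows the report.
print(describe_stdout(), file=sys.stderr)
```

Running the script once normally and once with "> output.txt" should show the target change. The reported encoding depends on the Python version and system settings.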

You can set the environment variable PYTHONIOENCODING to change the encoding used for stdin, stdout and stderr. If you choose UTF-8, it is guaranteed to work with any Unicode character.

set PYTHONIOENCODING=utf-8
pdf2txt.py file.pdf > output.txt
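The same setting can be applied from a small Python wrapper instead of the shell, which leaves pdf2txt.py untouched. This is a sketch under assumptions: the function name run_with_utf8_stdout is invented here, and a child Python process that prints a bullet point stands in for pdf2txt.py.

```python
import os
import subprocess
import sys
import tempfile

def run_with_utf8_stdout(args, output_path):
    """Run a command with PYTHONIOENCODING=utf-8, saving stdout to a file."""
    # Copy the current environment and override only the I/O encoding.
    env = dict(os.environ, PYTHONIOENCODING="utf-8")
    with open(output_path, "wb") as out:
        completed = subprocess.run(args, stdout=out, env=env)
    return completed.returncode

# Stand-in for pdf2txt.py: a child Python process that prints a bullet point.
child_code = "print('\\u2022 item')"
output_path = os.path.join(tempfile.mkdtemp(), "output.txt")
status = run_with_utf8_stdout([sys.executable, "-c", child_code], output_path)

with open(output_path, encoding="utf-8") as f:
    text = f.read()
```

For the real tool, the argument list would contain the Python interpreter, the path to pdf2txt.py and the PDF file name. The script's location depends on the installation, so it is left out here.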



Answer 2:


You seem to have obtained Unicode characters from the raw bytes, but you still need to encode them when writing them out. I recommend using UTF-8 encoding for txt files.

Making the encoding parameter explicit is probably what you want.

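def gbk_to_utf8(source, target):
    """Re-encode the GBK text file `source` as the UTF-8 file `target`."""
    with open(source, "r", encoding="gbk") as src, \
            open(target, "w", encoding="utf-8") as dst:
        # Iterate lazily instead of loading every line with readlines().
        for line in src:
            dst.write(line)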


Source: https://stackoverflow.com/questions/59779618/unicodeencodeerror-in-python3-when-redirection-is-used
