Python read from file and remove non-ascii characters

喜你入骨 submitted on 2019-12-04 12:38:49

codecs.open() doesn't support universal newlines; for example, it doesn't translate \r\n to \n when reading on Windows.

Use io.open() instead:

#!/usr/bin/env python
from __future__ import print_function
import io

with io.open('d.txt','r',encoding='utf-8',errors='ignore') as infile, \
     io.open('d_parsed.txt','w',encoding='ascii',errors='ignore') as outfile:
    for line in infile:
        print(*line.split(), file=outfile)

By the way, if you want to remove non-ASCII characters, you should write the output with the ascii encoding instead of utf-8.
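The same idea can be tried on an in-memory string before touching any files. This is only a minimal sketch: the helper name strip_non_ascii and the sample text are made up for illustration.

```python
def strip_non_ascii(text):
    """Drop every character that cannot be encoded as ASCII."""
    # errors='ignore' silently skips characters the ascii codec cannot encode;
    # decoding the resulting bytes as ASCII gives back a text string.
    return text.encode('ascii', errors='ignore').decode('ascii')

# Hypothetical sample: accented Latin letters plus two CJK characters.
sample = u'caf\xe9 na\xefve \u4f60\u597d ok'
cleaned = strip_non_ascii(sample)
print(repr(cleaned))
```

Note that removing characters this way leaves the surrounding whitespace untouched, so the two spaces around the removed CJK characters remain.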

If the input encoding is compatible with ascii (such as utf-8) then you could open the file in binary mode and use bytes.translate() to remove non-ascii characters:

#!/usr/bin/env python
nonascii = bytearray(range(0x80, 0x100))
with open('d.txt','rb') as infile, open('d_parsed.txt','wb') as outfile:
    for line in infile: # b'\n'-separated lines (Linux, OSX, Windows)
        outfile.write(line.translate(None, nonascii))

Unlike the first code example, this version doesn't normalize whitespace.
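To see what the bytes-level approach keeps and what it drops, here is a small sketch on an in-memory byte string. The helper name strip_non_ascii_bytes and the sample data are assumptions made for this illustration.

```python
# Bytes 0x80-0xFF; every byte of a multi-byte UTF-8 sequence falls in this range.
nonascii = bytearray(range(0x80, 0x100))

def strip_non_ascii_bytes(data):
    """Delete all bytes >= 0x80 and leave everything else as it is."""
    # With table=None no byte is remapped; the second argument lists
    # the byte values to delete.
    return data.translate(None, nonascii)

# Hypothetical sample line: accented letter, double space, CJK text, tab, CRLF.
raw = u'caf\xe9  \u4f60\u597d\tok\r\n'.encode('utf-8')
cleaned = strip_non_ascii_bytes(raw)
print(repr(cleaned))
```

The double space, the tab and the \r\n line ending all survive, because only bytes outside the ASCII range are deleted.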

From the docs for codecs.open:

Note: Files are always opened in binary mode, even if no binary mode was specified. This is done to avoid data loss due to encodings using 8-bit values. This means that no automatic conversion of '\n' is done on reading and writing.

I presume you're using Windows, where the newline sequence is actually '\r\n'. A file opened in text mode converts '\n' to '\r\n' automatically when writing, but that doesn't happen with codecs.open.

Simply write "\r\n" instead of "\n" and it should work fine, at least on Windows.
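As an alternative to writing "\r\n" by hand, io.open() accepts a newline argument that controls the translation explicitly. The following is a minimal sketch, not the only way to do it; the temporary file name is made up for the example.

```python
import io
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'newline_demo.txt')

# newline='\r\n' makes every '\n' written come out as '\r\n' on any platform.
with io.open(path, 'w', encoding='ascii', newline='\r\n') as outfile:
    outfile.write(u'first\nsecond\n')

# Read the raw bytes back to see what actually landed on disk.
with open(path, 'rb') as infile:
    raw = infile.read()
print(repr(raw))

# With the default newline=None, reading translates '\r\n' back to '\n'.
with io.open(path, 'r', encoding='ascii') as infile:
    text = infile.read()
print(repr(text))
```

Passing newline explicitly keeps the output identical across platforms, whereas the default text-mode behaviour depends on the operating system's line separator.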

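Use codecs to open the CSV file with the ascii encoding and errors='ignore'; the non-ASCII characters are then dropped as the file is read:

import codecs

# errors='ignore' silently skips bytes that cannot be decoded as ASCII
with codecs.open("example.csv", 'r', encoding='ascii', errors='ignore') as reader:
    for line in reader:
        print(line)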