How to convert a file to utf-8 in Python?

Anonymous (unverified), submitted 2019-12-03 02:08:02

Question:

I need to convert a bunch of files to utf-8 in Python, and I have trouble with the "converting the file" part.

I'd like to do the equivalent of:

iconv -t utf-8 $file > converted/$file # this is shell code 

Thanks!

Answer 1:

You can use the codecs module, like this:

import codecs

BLOCKSIZE = 1048576  # or some other, desired size in bytes

with codecs.open(sourceFileName, "r", "your-source-encoding") as sourceFile:
    with codecs.open(targetFileName, "w", "utf-8") as targetFile:
        while True:
            contents = sourceFile.read(BLOCKSIZE)
            if not contents:
                break
            targetFile.write(contents)

EDIT: added BLOCKSIZE parameter to control file chunk size.



Answer 2:

This worked for me in a small test:

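# Python 2 only: unicode() does not exist in Python 3.
sourceEncoding = "iso-8859-1"
targetEncoding = "utf-8"

# Binary mode keeps the raw bytes intact on every platform,
# and the with blocks close both files when done.
with open("source", "rb") as source:
    with open("target", "wb") as target:
        target.write(unicode(source.read(), sourceEncoding).encode(targetEncoding))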
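The snippet above relies on unicode(), which only exists in Python 2. A rough Python 3 sketch of the same read-everything-then-re-encode idea might look like this; convert_whole_file and its default source encoding are illustrative choices, not part of the answer:

```python
from pathlib import Path


def convert_whole_file(source_path, target_path, source_encoding="iso-8859-1"):
    """Re-encode a file as UTF-8 by reading it into memory in one go."""
    raw = Path(source_path).read_bytes()        # undecoded bytes from disk
    text = raw.decode(source_encoding)          # bytes -> str
    Path(target_path).write_bytes(text.encode("utf-8"))  # str -> UTF-8 bytes
```

Because the whole file is held in memory, this only suits files that comfortably fit in RAM. Working on bytes also leaves the line endings exactly as they were.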


Answer 3:

Thanks for the replies, it works!

And since the source files are in mixed formats, I added a list of source formats to be tried in sequence (sourceFormats), and on UnicodeDecodeError I try the next format:

import os
import sys
import codecs
from chardet.universaldetector import UniversalDetector

targetFormat = 'utf-8'
outputDir = 'converted'
detector = UniversalDetector()

def get_encoding_type(current_file):
    detector.reset()
    # chardet expects bytes, so read the file in binary mode
    with open(current_file, 'rb') as f:
        for line in f:
            detector.feed(line)
            if detector.done:
                break
    detector.close()
    return detector.result['encoding']

def convertFileBestGuess(fileName):
    sourceFormats = ['ascii', 'iso-8859-1']
    for sourceFormat in sourceFormats:
        try:
            with codecs.open(fileName, 'r', sourceFormat) as sourceFile:
                writeConversion(sourceFile, fileName)
                print('Done.')
                return
        except UnicodeDecodeError:
            pass

def convertFileWithDetection(fileName):
    print("Converting '" + fileName + "'...")
    sourceFormat = get_encoding_type(fileName)
    try:
        with codecs.open(fileName, 'r', sourceFormat) as sourceFile:
            writeConversion(sourceFile, fileName)
            print('Done.')
            return
    except UnicodeDecodeError:
        pass

    print("Error: failed to convert '" + fileName + "'.")

def writeConversion(sourceFile, fileName):
    # fileName is passed in explicitly; outputDir must already exist
    with codecs.open(os.path.join(outputDir, fileName), 'w', targetFormat) as targetFile:
        for line in sourceFile:
            targetFile.write(line)

# Off topic: get the file list and call convertFile on each file
# ...

(EDIT by Rudro Badhon: this incorporates the original approach of trying multiple formats until no exception is raised, as well as an alternative approach that uses chardet.universaldetector.)
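One caveat with a fallback list like the one above: iso-8859-1 assigns a character to every byte value, so decoding with it never raises UnicodeDecodeError, and any candidate listed after it is never tried. Below is a minimal Python 3 sketch of the same try-in-sequence idea, working on bytes already read into memory. The name decode_with_fallback and its default candidate list are assumptions for illustration, not part of the answer's code:

```python
def decode_with_fallback(raw, candidates=("utf-8", "iso-8859-1")):
    """Return (text, encoding) for the first candidate that decodes raw.

    Strict encodings such as utf-8 go first; iso-8859-1 accepts any
    byte sequence, so it only makes sense as the last resort.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this guess was wrong, try the next one
    raise ValueError("none of the candidate encodings could decode the data")
```

A successful decode does not prove the guess was right, only that the bytes were valid in that encoding, so the result is still worth a spot check.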



Answer 4:

To guess the source encoding, you can use the *nix file command.

Example:

$ file --mime jumper.xml
jumper.xml: application/xml; charset=utf-8
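If the guess is needed inside a Python script, the output shown above can be parsed for its charset= field. This is only a sketch: parse_charset and guess_charset are hypothetical helper names, guess_charset assumes the file command is installed and on the PATH, and the name it reports is not guaranteed to be one that Python accepts as a codec.

```python
import re
import subprocess


def parse_charset(file_output):
    """Extract the charset from a line of `file --mime` output, or None."""
    match = re.search(r"charset=([^\s;]+)", file_output)
    return match.group(1) if match else None


def guess_charset(path):
    """Run `file --mime` on path and return the charset it reports."""
    result = subprocess.run(
        ["file", "--mime", path],
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError on failure
    )
    return parse_charset(result.stdout)
```

For the example output above, parse_charset returns "utf-8".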


Answer 5:

This is a Python 3 function that converts a text file to UTF-8 using only the built-in open(), with no extra packages. Note that it also rewrites every line ending as CRLF.

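def correctSubtitleEncoding(filename, newFilename, encoding_from, encoding_to='UTF-8'):
    with open(filename, 'r', encoding=encoding_from) as fr:
        # newline='' stops Python from translating the '\n' in '\r\n' again on Windows
        with open(newFilename, 'w', encoding=encoding_to, newline='') as fw:
            for line in fr:
                # rstrip keeps the last character of a final line that has no newline
                fw.write(line.rstrip('\n') + '\r\n')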

You can use it easily in a loop to convert a list of files.
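As a rough sketch of such a loop, the following converts a list of files into a converted directory, mirroring the shell one-liner in the question. The helper names, the default source encoding, and the choice to preserve line endings are all assumptions made for this example:

```python
import os
import shutil


def convert_to_utf8(source_path, target_path, encoding_from):
    """Copy source_path to target_path, re-encoding the text as UTF-8."""
    # newline='' disables newline translation, so line endings are preserved.
    with open(source_path, "r", encoding=encoding_from, newline="") as fr:
        with open(target_path, "w", encoding="utf-8", newline="") as fw:
            shutil.copyfileobj(fr, fw)  # copies in chunks, not all at once


def convert_all(filenames, output_dir="converted", encoding_from="iso-8859-1"):
    """Convert each file into output_dir, keeping its base file name."""
    os.makedirs(output_dir, exist_ok=True)
    converted = []
    for filename in filenames:
        target = os.path.join(output_dir, os.path.basename(filename))
        convert_to_utf8(filename, target, encoding_from)
        converted.append(target)
    return converted
```

Usage would be something like convert_all(["a.srt", "b.srt"], encoding_from="cp1250"), where the file names and source encoding are placeholders for your own.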


