Reading UTF-8 text files with ReadList

Posted by 我的未来我决定 on 2019-12-01 03:19:08

Question


Is it possible to read UTF-8 (or any other) encoded text files using ReadList[..., Word], or is ReadList ASCII-only? If it is ASCII-only, is it possible to "fix" the encoding of the already-read data with good performance (i.e. preserving the performance advantage of ReadList over Import)?

Import[..., CharacterEncoding -> "UTF8"] works, but it's quite a bit slower than ReadList. Setting $CharacterEncoding has no effect on ReadList.

Download a sample UTF-8 encoded file here.

For testing performance on a large input, see the test file in this question.


Here are the timings of the answers on a large-ish text file:

Import

In[2]:= Timing[
 data = Import[file, "Text"];
 ]

Out[2]= {5.234, Null}

Heike

In[4]:= Timing[
 data = ReadList[file, String];
 FromCharacterCode[ToCharacterCode[data], "UTF8"];
 ]

Out[4]= {4.328, Null}

Mr. Wizard

In[5]:= Timing[
 string = FromCharacterCode[BinaryReadList[file], "UTF-8"];
 ]

Out[5]= {2.281, Null}

Answer 1:


If I leave out Word, this works:

$CharacterEncoding = "UTF-8";

ReadList["UTF8.txt"]

This, however, does not solve the problem, because the data is not read as strings.

Please try this on a larger file and report its performance:

FromCharacterCode[BinaryReadList["UTF8.txt"], "UTF-8"]
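
Since the question asks for the equivalent of ReadList[..., Word], the decoded string can then be split into words. This is only a minimal sketch, assuming whitespace-separated words are what is wanted. The variable names are illustrative, and StringSplit's default whitespace splitting may not match ReadList's Word separators in every case:

```wolfram
(* File name taken from the example above; replace with your own UTF-8 file *)
file = "UTF8.txt";

(* Read the raw bytes of the file as integers in the range 0-255 *)
bytes = BinaryReadList[file];

(* Decode the byte sequence as UTF-8 into a single string *)
text = FromCharacterCode[bytes, "UTF-8"];

(* Split on runs of whitespace to get a list of words *)
words = StringSplit[text];
```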



Answer 2:


This seems to work:

FromCharacterCode[ToCharacterCode[ReadList["raw.php.txt", Word]], "UTF-8"]
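
A plausible explanation for why this works, stated here as an assumption rather than documented behaviour: ReadList appears to turn each byte of the file into one character with code 0-255, so ToCharacterCode recovers the original bytes and FromCharacterCode can then decode them as UTF-8. The round trip can be checked on a single character without any file:

```wolfram
(* The two UTF-8 bytes of the character é *)
bytes = ToCharacterCode["é", "UTF-8"]    (* {195, 169} *)

(* Simulate a byte-per-character read: each byte becomes its own character *)
misread = FromCharacterCode[bytes]        (* "Ã©" *)

(* Recover the bytes and decode them as UTF-8 *)
FromCharacterCode[ToCharacterCode[misread], "UTF-8"]    (* "é" *)
```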

The timings I get for the linked test file are:

FromCharacterCode[ToCharacterCode[ReadList["test.txt", Word]], "UTF-8"]; // Timing

(* ==> {0.000195, Null} *)

Import["test.txt", "Text"]; // Timing

(* ==> {0.01784, Null} *)


Source: https://stackoverflow.com/questions/8254429/reading-utf-8-text-files-with-readlist
