How does GHC/Haskell decide what character encoding it's going to decode/encode from/to?

刺人心

GHC seems to be inconsistent, at the very least, about which character encoding it uses when decoding input.

Consider a file, omatase-shimashita.txt, with the follow

2 Answers

梦毁少年i

2020-12-16 17:16

Which version of GHC are you using? Older versions in particular did not handle Unicode I/O very well.

This section in the GHC documentation describes how to change input/output encodings:

http://haskell.org/ghc/docs/6.12.2/html/libraries/base-4.2.0.1/System-IO.html#23

Also, the documentation says this:

A text-mode Handle has an associated TextEncoding, which is used to decode bytes into Unicode characters when reading, and encode Unicode characters into bytes when writing.

The default TextEncoding is the same as the default encoding on your system, which is also available as localeEncoding. (GHC note: on Windows, we currently do not support double-byte encodings; if the console's code page is unsupported, then localeEncoding will be latin1.)

Encoding and decoding errors are always detected and reported, except during lazy I/O (hGetContents, getContents, and readFile), where a decoding error merely results in termination of the character stream, as with other I/O errors.
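
To see which encoding GHC has actually picked on a given machine, something like the following sketch may help. It only uses localeEncoding and hGetEncoding from System.IO; the helper name describeEncoding is made up for this example.

```haskell
import System.IO

-- Hypothetical helper: describe the encoding attached to a handle.
-- hGetEncoding returns Nothing when the handle is in binary mode.
describeEncoding :: Handle -> IO String
describeEncoding h = maybe "binary (no encoding)" show <$> hGetEncoding h

main :: IO ()
main = do
  -- The default that new text-mode handles receive.
  putStrLn ("localeEncoding: " ++ show localeEncoding)
  -- The encodings currently set on the standard handles.
  inEnc  <- describeEncoding stdin
  outEnc <- describeEncoding stdout
  putStrLn ("stdin:  " ++ inEnc)
  putStrLn ("stdout: " ++ outEnc)
```

The output depends on the system locale, so it will differ between machines and terminals.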

Maybe this has something to do with your problem? If GHC has defaulted to something other than UTF-8 somewhere, or your handle has been manually set to use a different encoding, that might explain it. If you're just trying to echo text at the console, then some kind of console code-page issue is probably involved. I've had similar problems in the past with other languages, such as Python, when printing Unicode in a Windows console.

Try calling hSetEncoding handle utf8 on the handle before you read from or write to it, and see if that fixes your problem.
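
As a rough sketch of what that could look like: the helpers writeUtf8 and readUtf8, the file name, and the sample string below are all invented for illustration, since the contents of the file in the question are not shown here.

```haskell
import System.IO

-- Hypothetical helper: write a String as UTF-8, whatever the locale says.
writeUtf8 :: FilePath -> String -> IO ()
writeUtf8 path s = withFile path WriteMode $ \h -> do
  hSetEncoding h utf8
  hPutStr h s

-- Hypothetical helper: read a whole file as UTF-8.
-- The contents are forced before withFile closes the handle,
-- because hGetContents is lazy.
readUtf8 :: FilePath -> IO String
readUtf8 path = withFile path ReadMode $ \h -> do
  hSetEncoding h utf8
  s <- hGetContents h
  length s `seq` return s

main :: IO ()
main = do
  -- Make console output UTF-8 as well; the terminal still has to agree.
  hSetEncoding stdout utf8
  let path = "encoding-demo.txt"  -- assumed file name
  writeUtf8 path "日本語\n"        -- assumed sample text
  readUtf8 path >>= putStr
```

Setting the encoding per handle keeps the program's behaviour independent of the locale it happens to run under, which is usually what you want for files with a known encoding.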

长情又很酷

2020-12-16 17:29

Your first example uses the standard IO library, System.IO. Operations in this library use the default system encoding (also known as localeEncoding) unless you specify otherwise. Presumably your system is set up to use UTF-8, so that is the encoding used by putStrLn, hGetContents and so on.

Your second example uses Data.ByteString. Since this library deals in sequences of bytes only, it does no encoding or decoding. So Data.ByteString.hGetLine converts the bytes in the file directly to a ByteString.
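
The difference can be made visible by counting bytes and characters in the same file. This is only a sketch: byteCount, charCount, the file name and the sample text are assumptions for the example, not taken from the question.

```haskell
import qualified Data.ByteString as B
import System.IO

-- Hypothetical helper: number of raw bytes in the file (no decoding at all).
byteCount :: FilePath -> IO Int
byteCount path = B.length <$> B.readFile path

-- Hypothetical helper: number of Chars after decoding the file as UTF-8.
charCount :: FilePath -> IO Int
charCount path = withFile path ReadMode $ \h -> do
  hSetEncoding h utf8
  s <- hGetContents h
  let n = length s
  n `seq` return n  -- force the count before the handle is closed

main :: IO ()
main = do
  let path = "bytes-vs-chars.txt"  -- assumed file name
  -- Write three characters that each take three bytes in UTF-8.
  withFile path WriteMode $ \h -> hSetEncoding h utf8 >> hPutStr h "日本語"
  byteCount path >>= print  -- 9
  charCount path >>= print  -- 3
```

When a ByteString is written back to stdout, the same bytes are passed through untouched, so a UTF-8 terminal will display them correctly even though the program never decoded anything.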

The best way to do text I/O in general is to use the text package.
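
One possible way to do that, sketched under the assumption that the file is known to be UTF-8, is to read raw bytes with Data.ByteString and decode them explicitly with decodeUtf8' from Data.Text.Encoding. The helper names, the file name and the sample text are made up for this example.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import Data.Text.Encoding (decodeUtf8', encodeUtf8)
import System.IO (hSetEncoding, stdout, utf8)

-- Hypothetical helper: read a file and decode it as UTF-8.
-- Invalid input is reported as a Left value instead of an exception.
readUtf8Text :: FilePath -> IO (Either String T.Text)
readUtf8Text path = do
  bytes <- B.readFile path
  return (either (Left . show) Right (decodeUtf8' bytes))

-- Hypothetical helper: encode Text as UTF-8 and write the bytes out.
writeUtf8Text :: FilePath -> T.Text -> IO ()
writeUtf8Text path = B.writeFile path . encodeUtf8

main :: IO ()
main = do
  hSetEncoding stdout utf8
  let path = "text-demo.txt"  -- assumed file name
  writeUtf8Text path (T.pack "日本語\n")
  result <- readUtf8Text path
  case result of
    Left err -> putStrLn ("decode error: " ++ err)
    Right t  -> TIO.putStr t
```

Decoding explicitly makes the choice of encoding part of the program rather than part of the environment. Note that Data.Text.IO functions such as TIO.putStr still go through the handle's encoding, which is why stdout is set to utf8 above.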
