How do locales work in Linux / POSIX and what transformations are applied?
Question

I'm working with huge files of (I hope) UTF-8 text. While investigating a bug I've encountered strange behavior, which I can reproduce on Ubuntu 13.10 (3.11.0-14-generic) and 12.04:

    $ export LC_ALL=en_US.UTF-8
    $ sort part-r-00000 | uniq -d
    ɥ ɨ ɞ ɧ 251
    ɨ ɡ ɞ ɭ ɯ 291
    ɢ ɫ ɬ ɜ 301
    ɪ ɳ 475
    ʈ ʂ 565
    $ export LC_ALL=C
    $ sort part-r-00000 | uniq -d
    $ # no duplicates found

The duplicates also appear when running a custom C++ program that reads the file using std::stringstream - it fails due to