Why does sed fail with international characters, and how do I fix it?

陌清茗 2020-12-03 14:28

GNU sed version 4.1.5 seems to fail with international characters. Here is my input file:

Gras Och Stenar Trad - From Moja to Minneapolis DVD [G2007D         


        
2 answers
  • 2020-12-03 14:55

    sed is not very well set up for non-ASCII text. However, you can use (almost) the same code in Perl and get the result you want:

    perl -pe 's/.*\| //' x
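
    To try this without the original file, here is a minimal sketch. The sample line and the file name sample.txt are hypothetical (the input shown in the question is cut off); \366 is the single byte for "ö" in ISO-8859-1. It assumes perl runs without -C or PERL_UNICODE, so the input is handled as raw bytes.

    ```shell
    # Hypothetical ISO-8859-1 sample line; \366 is the single byte for "ö".
    printf 'From M\366ja to Minneapolis | X\n' > sample.txt

    # Without -C or PERL_UNICODE, perl works on bytes, so '.' also matches \366.
    perl_result=$(perl -pe 's/.*\| //' sample.txt)
    printf '%s\n' "$perl_result"
    ```

    Because nothing is decoded, the same command behaves the same way whether the file is UTF-8 or ISO-8859-1.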
    
  • 2020-12-03 15:07

    I think the error occurs when the encoding of the input file differs from the encoding your locale expects.

    Example: the file in is UTF-8 encoded

    $ LANG=de_DE.UTF-8 sed 's/.*| //' < in
    X
    Y
    $ LANG=de_DE.iso88591 sed 's/.*| //' < in
    X 
    Y
    

    UTF-8 can safely be interpreted as ISO-8859-1, because every byte value is a valid ISO-8859-1 character. You'll get strange characters in place of the non-ASCII ones, but apart from that everything works.
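
    A rough way to check this claim, assuming iconv is installed. The sample line and the file name utf8-sample.txt are hypothetical, and the C locale stands in for a single-byte locale such as de_DE.iso88591, which may not be installed on your system.

    ```shell
    # Hypothetical sample line; "ö" in UTF-8 is the two bytes \303\266.
    printf 'From M\303\266ja to Minneapolis | X\n' > utf8-sample.txt

    # Decode the bytes as ISO-8859-1: every byte maps to some character,
    # so the conversion succeeds even though "ö" comes out as two characters.
    if iconv -f ISO-8859-1 -t UTF-8 utf8-sample.txt > /dev/null; then
        latin1_decode=ok
    else
        latin1_decode=failed
    fi
    printf '%s\n' "$latin1_decode"

    # Byte-oriented matching (C locale) handles the UTF-8 file without trouble.
    utf8_result=$(LC_ALL=C sed 's/.*| //' utf8-sample.txt)
    printf '%s\n' "$utf8_result"
    ```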

    Example: the file in is ISO-8859-1 encoded

    $ LANG=de_DE.UTF-8 sed 's/.*| //' < in
    X
    Gras Och Stenar Trad - From MöY
    $ LANG=de_DE.iso88591 sed 's/.*| //' < in
    X 
    Y
    

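    ISO-8859-1 text with non-ASCII characters cannot be interpreted as UTF-8, so decoding the input file fails: the single byte that encodes ö in ISO-8859-1 (0xF6) is not a valid UTF-8 sequence. The strange match is probably because sed tries to recover rather than fail completely. In a UTF-8 locale, . does not match the invalid byte, so .* cannot extend across it and the match starts right after the ö.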
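
    The decoding failure can be seen without sed. This is only a sketch, assuming iconv is installed; the sample line and the file name latin1-sample.txt are hypothetical.

    ```shell
    # Hypothetical sample line; \366 is "ö" as a single ISO-8859-1 byte.
    printf 'From M\366ja to Minneapolis | X\n' > latin1-sample.txt

    # Strict UTF-8 decoding rejects the lone \366 byte, so iconv exits non-zero.
    if iconv -f UTF-8 -t UTF-16 latin1-sample.txt > /dev/null 2>&1; then
        utf8_decode=ok
    else
        utf8_decode=failed
    fi
    printf '%s\n' "$utf8_decode"
    ```

    iconv stops with an error here, whereas sed carries on and produces the odd partial match shown above.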

    The answer is based on Debian Lenny/Sid and sed 4.1.5.
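
    As for the "how to fix" part, two common workarounds are sketched below. Neither comes from the tests above: the sample line and file names are hypothetical, and the second option assumes iconv is installed and that you know the file really is ISO-8859-1.

    ```shell
    # Hypothetical sample line; \366 is "ö" as a single ISO-8859-1 byte.
    printf 'From M\366ja to Minneapolis | X\n' > latin1.txt

    # Option 1: run sed in the C locale so it matches bytes, not decoded characters.
    c_result=$(LC_ALL=C sed 's/.*| //' latin1.txt)
    printf '%s\n' "$c_result"

    # Option 2: convert the file to UTF-8 first, then run sed in a UTF-8 locale.
    iconv -f ISO-8859-1 -t UTF-8 latin1.txt > latin1-as-utf8.txt
    conv_result=$(sed 's/.*| //' latin1-as-utf8.txt)
    printf '%s\n' "$conv_result"
    ```

    Option 1 is enough when the pattern itself is plain ASCII, as it is here. Option 2 is the safer choice if the pattern needs to match non-ASCII characters.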
