Invalid characters in File.ReadAllText

离开以前 2020-12-06 17:52

I'm calling File.ReadAllText() in a program designed to format some files that I have.

Some of these files contain the ® (174) symbol. H

4 answers
  • 2020-12-06 18:06

    The character you are seeing is the Unicode replacement character (U+FFFD), which is described as:

    "used to replace an incoming character whose value is unknown or unrepresentable in Unicode; compare the use of U+001A as a control character to indicate the substitute function"

    http://www.fileformat.info/info/unicode/char/fffd/index.htm

    You are getting this because the actual encoding of the file does not match the encoding your program expects.

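    By default, ReadAllText expects UTF-8. It is encountering a byte sequence that is not valid UTF-8, so it substitutes the replacement character. A lone byte 174 (0xAE), which is how ® is stored in single-byte encodings such as ISO-8859-1, is not a valid UTF-8 sequence on its own.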
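
    To make the mismatch concrete, here is a minimal sketch that decodes the single byte 0xAE two ways. The class and method names are made up for illustration, and ISO-8859-1 is only an assumption about what the files might really use:

    ```csharp
    using System;
    using System.Text;

    public static class ReplacementCharDemo
    {
        // Hypothetical helper: decodes the single byte 0xAE (174) with the given encoding.
        public static string DecodeRegisteredSign(Encoding encoding)
        {
            byte[] bytes = { 0xAE };
            return encoding.GetString(bytes);
        }

        public static void Main()
        {
            // 0xAE on its own is not a valid UTF-8 sequence, so the decoder
            // substitutes U+FFFD (the replacement character).
            string asUtf8 = DecodeRegisteredSign(Encoding.UTF8);

            // In ISO-8859-1 (code page 28591) the same byte maps directly to U+00AE (®).
            string asLatin1 = DecodeRegisteredSign(Encoding.GetEncoding(28591));

            Console.WriteLine(asUtf8 == "\uFFFD");   // True
            Console.WriteLine(asLatin1 == "\u00AE"); // True
        }
    }
    ```

    The byte on disk is the same in both cases; only the encoding used to interpret it differs.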

  • 2020-12-06 18:14

    You need to specify the encoding when you call File.ReadAllText, unless the file is actually in UTF-8, which it sounds like it's not. (Basically the one-parameter overload is equivalent to passing in UTF-8 as the second argument. It will also detect UTF-32 with an appropriate byte-order mark, I believe.)

    The first thing is to work out which encoding it is in (e.g. ISO-8859-1 - but you need to check this) and then pass that as a second argument.

    For example:

    // 28591 is the code page number for ISO-8859-1 (Latin-1).
    Encoding isoLatin1 = Encoding.GetEncoding(28591);
    string text = File.ReadAllText(path, isoLatin1);
    

    It's always important that you know what encoding binary data is using before you try to read it as text. That's true for files, network streams, anything.

  • 2020-12-06 18:24

    Most likely the file uses a different encoding than the default. If you know which one, you can specify it using the File.ReadAllText(String, Encoding) overload.

    Code sample:

    string readText = File.ReadAllText(path, Encoding.Default);  // <-- change the encoding to whatever the encoding really is
    

    If you DON'T know the encoding, see this previous SO question: How to use ReadAllText when file encoding unknown
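
    When the encoding is unknown, one possible approach (a sketch, not the method from the linked question) is to test whether the bytes are valid UTF-8 and otherwise fall back to an encoding you choose. EncodingProbe, IsValidUtf8 and ReadTextWithFallback are hypothetical names, and the fallback encoding is a guess that you have to supply yourself:

    ```csharp
    using System;
    using System.IO;
    using System.Text;

    public static class EncodingProbe
    {
        // Hypothetical helper: returns true if the bytes decode as strict UTF-8.
        public static bool IsValidUtf8(byte[] bytes)
        {
            // throwOnInvalidBytes: true makes the decoder throw instead of
            // silently substituting U+FFFD.
            var strictUtf8 = new UTF8Encoding(
                encoderShouldEmitUTF8Identifier: false,
                throwOnInvalidBytes: true);
            try
            {
                strictUtf8.GetString(bytes);
                return true;
            }
            catch (DecoderFallbackException)
            {
                return false;
            }
        }

        // Hypothetical helper: decode as UTF-8 when the bytes are valid UTF-8,
        // otherwise decode with the caller-supplied fallback encoding.
        // Note: a leading byte-order mark, if present, is kept as U+FEFF.
        public static string ReadTextWithFallback(string path, Encoding fallback)
        {
            byte[] bytes = File.ReadAllBytes(path);
            return IsValidUtf8(bytes)
                ? Encoding.UTF8.GetString(bytes)
                : fallback.GetString(bytes);
        }
    }
    ```

    For example, ReadTextWithFallback(path, Encoding.GetEncoding(28591)) would read a file containing a lone 0xAE byte as ISO-8859-1. This only tells you that a file is not UTF-8; it cannot tell you which single-byte encoding it really is.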

  • 2020-12-06 18:28

    This is likely due to a mismatch in the Encoding. Use the ReadAllText overload which allows you to specify the proper Encoding to use when reading the file.

    The single-parameter overload assumes UTF-8 unless it detects a byte-order mark for another encoding such as UTF-32. Bytes in any other encoding that are not valid UTF-8 will come through incorrectly.
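
    If you want to see what byte-order-mark detection decides for a particular file, a StreamReader can report it. This is a rough sketch; BomDetection and DetectEncoding are made-up names, and the result only reflects a BOM, so a file without one is reported as the UTF-8 default whatever it actually contains:

    ```csharp
    using System;
    using System.IO;
    using System.Text;

    public static class BomDetection
    {
        // Hypothetical helper: returns the encoding a StreamReader settles on
        // after looking for a byte-order mark at the start of the file.
        public static Encoding DetectEncoding(string path)
        {
            using (var reader = new StreamReader(
                path, Encoding.UTF8, detectEncodingFromByteOrderMarks: true))
            {
                // CurrentEncoding is only updated after the first read,
                // so peek at the stream before asking for it.
                reader.Peek();
                return reader.CurrentEncoding;
            }
        }
    }
    ```

    Calling DetectEncoding(path).WebName then gives a readable name for the detected encoding.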
