Character-based file stream in .NET

Question

I need to modify a text file of unknown encoding: I need to insert some text after the first occurrence of a predefined string (e.g. "#markx#"). Is there a class in .NET that allows me to randomly access the content of a file based on characters (as opposed to bytes)? Since the Stream.Seek methods work on a byte basis, I would need to know not only the encoding but also whether there are special control bytes (such as the byte order mark at the beginning of a Unicode file). I would love to have a class that abstracts all this away and lets me say: seek to the 25th character and add some string there, just as a text editor would do.

Answer 1:

You can use a StreamReader to go through the file one character at a time. There isn't a Seek method, but you can still read character by character and so effectively implement your own seek.
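For the concrete task in the question, a minimal sketch could read the whole file through a StreamReader, edit the text in memory, and write it back. The names MarkerInsert, InsertAfterFirst and InsertInFile are made up for illustration, and the sketch assumes the file fits in memory and is either UTF-8 or starts with a byte order mark.

```csharp
using System;
using System.IO;
using System.Text;

static class MarkerInsert
{
    // Inserts `addition` right after the first occurrence of `marker`.
    // Returns the text unchanged when the marker is not found.
    public static string InsertAfterFirst(string text, string marker, string addition)
    {
        int index = text.IndexOf(marker, StringComparison.Ordinal);
        return index < 0 ? text : text.Insert(index + marker.Length, addition);
    }

    // Reads the file as characters, edits it in memory, and writes it back
    // using the encoding the reader settled on.
    public static void InsertInFile(string path, string marker, string addition)
    {
        string text;
        Encoding encoding;
        // Fallback is UTF-8 without a BOM (an assumption), so a BOM-less file
        // does not gain a BOM when it is written back.
        using (var reader = new StreamReader(path, new UTF8Encoding(false),
                                             detectEncodingFromByteOrderMarks: true))
        {
            text = reader.ReadToEnd();
            encoding = reader.CurrentEncoding; // only meaningful after the first read
        }
        File.WriteAllText(path, InsertAfterFirst(text, marker, addition), encoding);
    }
}
```

Calling MarkerInsert.InsertInFile(path, "#markx#", "some text") would then do the edit in one step. For very large files a streaming copy to a temporary file would be the safer design.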

With regard to encodings - you will need to have identified the encoding in order to use the StreamReader.

However, the StreamReader itself can help if you create it with one of the constructor overloads that allows you to supply the flag detectEncodingFromByteOrderMarks as true (or you can use Encoding.GetPreamble and look at the byte preamble yourself).
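As a rough illustration of the do-it-yourself route, the sketch below compares the first bytes of a file against the preambles returned by Encoding.GetPreamble. BomSniffer and DetectFromBom are hypothetical names, and the candidate list is an assumption that covers only the common Unicode BOMs.

```csharp
using System;
using System.Text;

static class BomSniffer
{
    // Longer preambles come first: the UTF-32 LE BOM (FF FE 00 00)
    // begins with the UTF-16 LE BOM (FF FE).
    static readonly Encoding[] Candidates =
    {
        Encoding.UTF32,                 // FF FE 00 00
        new UTF32Encoding(true, true),  // 00 00 FE FF
        Encoding.UTF8,                  // EF BB BF
        Encoding.Unicode,               // FF FE
        Encoding.BigEndianUnicode,      // FE FF
    };

    // Returns the candidate whose preamble matches the start of `head`,
    // or null when no known BOM is present.
    public static Encoding DetectFromBom(byte[] head)
    {
        foreach (Encoding candidate in Candidates)
        {
            byte[] bom = candidate.GetPreamble();
            if (bom.Length == 0 || head.Length < bom.Length) continue;

            bool match = true;
            for (int i = 0; i < bom.Length; i++)
            {
                if (head[i] != bom[i]) { match = false; break; }
            }
            if (match) return candidate;
        }
        return null;
    }
}
```

A null result only means that no BOM was found; it says nothing about which encoding the file actually uses. A UTF-16 LE file whose first character is U+0000 would also be misread as UTF-32 here, which is rare but possible.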

Both of these methods will only help auto-detect UTF-based encodings that start with a byte order mark, though, so any ANSI encoding with a specific code page will probably not be parsed correctly.

Answer 2:

Given that characters can take a variable number of bytes, this would be pretty tough to do without converting the bytes to characters with a TextReader.

You could wrap up a TextReader and give it a Seek method that ensures enough characters have been loaded to satisfy each request.
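One way such a wrapper might look is sketched below. CharSeekReader is a hypothetical class, not something that ships with .NET, and it assumes it is acceptable to keep every character read so far in memory.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical wrapper: buffers every character read so far,
// so earlier positions can be revisited.
sealed class CharSeekReader : IDisposable
{
    private readonly TextReader _reader;
    private readonly StringBuilder _loaded = new StringBuilder();
    private readonly char[] _buffer = new char[4096];

    public CharSeekReader(TextReader reader) { _reader = reader; }

    // Current position, counted in .NET chars (UTF-16 code units), not bytes.
    public int Position { get; private set; }

    // Moves to character index `index`, reading more input if needed.
    // Returns false when the input is shorter than `index` characters.
    public bool Seek(int index)
    {
        if (index < 0) throw new ArgumentOutOfRangeException(nameof(index));
        if (!EnsureLoaded(index)) return false;
        Position = index;
        return true;
    }

    // Returns the next character as an int, or -1 at the end of the input.
    public int Read()
    {
        if (!EnsureLoaded(Position + 1)) return -1;
        return _loaded[Position++];
    }

    // Pulls characters from the underlying reader until `count` are buffered.
    private bool EnsureLoaded(int count)
    {
        while (_loaded.Length < count)
        {
            int wanted = Math.Min(_buffer.Length, count - _loaded.Length);
            int read = _reader.Read(_buffer, 0, wanted);
            if (read <= 0) return false;
            _loaded.Append(_buffer, 0, read);
        }
        return true;
    }

    public void Dispose() { _reader.Dispose(); }
}
```

Wrapping a StreamReader gives character-based seeking over a file, for example new CharSeekReader(new StreamReader(path)). Writing at a position is deliberately left out; this only covers the reading half.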

Answer 3:

You can't know what each character is without knowing what encoding the file is using.

You can loop through all encodings and try them one by one, or guess at the encoding.
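A hedged sketch of the try-them-one-by-one idea: decode the bytes with each candidate using a decoder fallback that throws on invalid input, and keep the first candidate that succeeds. EncodingGuesser and FirstThatDecodes are illustrative names, and the choice and order of candidates is left to the caller.

```csharp
using System;
using System.Text;

static class EncodingGuesser
{
    // Returns the first candidate that decodes `data` without hitting
    // invalid bytes, or null when none of them does.
    public static Encoding FirstThatDecodes(byte[] data, params Encoding[] candidates)
    {
        foreach (Encoding candidate in candidates)
        {
            // Clone so the default "replace with ?" fallback can be swapped
            // for one that throws (the shared instances are read-only).
            var strict = (Encoding)candidate.Clone();
            strict.DecoderFallback = DecoderFallback.ExceptionFallback;
            try
            {
                strict.GetString(data);
                return candidate;
            }
            catch (DecoderFallbackException)
            {
                // Not valid in this encoding; try the next one.
            }
        }
        return null;
    }
}
```

A successful decode is only evidence, not proof. Single-byte code pages such as ISO-8859-1 accept any byte sequence, so they are best placed last in the candidate list.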

Answer 4:

A layer of abstraction over the standard stream seek would involve reading each character in turn from the file. By default .NET assumes files are UTF-8, so any file that doesn't start with a BOM is treated as UTF-8.

UTF-8 has variable-size characters, so you can't know how many bytes a character takes up until you read its first byte.
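To make the variable width concrete, here is a small sketch that derives the sequence length from the lead byte and walks the data to find where a given code point starts. Utf8Walker and its methods are hypothetical helpers, and the walk assumes well-formed UTF-8 with no BOM.

```csharp
static class Utf8Walker
{
    // Length in bytes of the UTF-8 sequence starting with `lead`;
    // 0 if `lead` cannot start a sequence.
    public static int SequenceLength(byte lead)
    {
        if (lead < 0x80) return 1;                   // 0xxxxxxx: ASCII
        if (lead >= 0xC2 && lead <= 0xDF) return 2;  // 110xxxxx
        if (lead >= 0xE0 && lead <= 0xEF) return 3;  // 1110xxxx
        if (lead >= 0xF0 && lead <= 0xF4) return 4;  // 11110xxx
        return 0; // continuation byte (10xxxxxx) or invalid lead byte
    }

    // Byte offset of code point number `index` (0-based), found by walking
    // from the start. Returns -1 if `index` is out of range or a byte that
    // cannot start a sequence is encountered.
    public static int ByteOffsetOfCodePoint(byte[] data, int index)
    {
        int offset = 0;
        for (int seen = 0; offset < data.Length; seen++)
        {
            if (seen == index) return offset;
            int length = SequenceLength(data[offset]);
            if (length == 0) return -1;
            offset += length;
        }
        return -1;
    }
}
```

Note that this counts Unicode code points, while a .NET string counts UTF-16 code units, so the two indexes differ for characters outside the Basic Multilingual Plane.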

Therefore, you have to sequentially access each byte in the file to know where each character starts and ends.

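In conclusion, if you know the file is ASCII, UTF-16 or UTF-32, you can do this because you know the size of each character (as far as I know; if I'm wrong, please correct me). One caveat: UTF-16 is fixed-width only for characters in the Basic Multilingual Plane; anything outside it is stored as a surrogate pair of two 16-bit code units.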
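For the fixed-width case the byte position is plain arithmetic, roughly as sketched here. FixedWidthSeek and ByteOffset are illustrative names, and the sketch treats UTF-16 as two bytes per character, which holds only while the text contains no surrogate pairs.

```csharp
using System;
using System.Text;

static class FixedWidthSeek
{
    // Byte offset of character number `index` (0-based) in a file whose
    // encoding uses a fixed number of bytes per character.
    // `hasBom` says whether the file starts with the encoding's byte order mark.
    public static long ByteOffset(Encoding encoding, long index, bool hasBom)
    {
        int width;
        if (encoding is ASCIIEncoding) width = 1;
        else if (encoding is UnicodeEncoding) width = 2; // assumes no surrogate pairs
        else if (encoding is UTF32Encoding) width = 4;
        else throw new NotSupportedException("Not a fixed-width encoding: " + encoding.WebName);

        long bomLength = hasBom ? encoding.GetPreamble().Length : 0;
        return bomLength + index * width;
    }
}
```

The result can be passed to Stream.Seek with SeekOrigin.Begin. For example, character 25 of a UTF-16 file with a BOM starts at byte 2 + 25 * 2 = 52.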

If it's UTF-8, you can't "seek" directly to a character without reading the bytes before it.

Hope this helps,

Source: https://stackoverflow.com/questions/1927687/character-based-file-stream-in-net

Tags

.net

file

text