I have a text file containing just lowercase letters and no punctuation except for spaces. I would like to know the best way of reading the file char by char, in a way that
If you want to read it without splitting the string (for example, because the lines are so long that you might hit an OutOfMemoryException), you can read it character by character with a StreamReader:
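// Requires: using System.IO; using System.Text;
// Reads characters up to the next whitespace character and returns them.
// Returns an empty string if the reader is positioned at whitespace.
static string ReadWord(StreamReader sr)
{
    var word = new StringBuilder();   // avoids repeated string concatenation
    while (sr.Peek() >= 0)
    {
        char c = (char)sr.Read();
        if (c == ' ' || c == '\t' || c == '\n' || c == '\r')
        {
            break;
        }
        word.Append(c);
    }
    return word.ToString();
}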
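To collect every word in the file rather than just the next one, the same character loop can be wrapped in an iterator. This is only a minimal sketch: the class `WordReader` and the method `ReadWords` are hypothetical names, not part of the answer above, and it treats any whitespace character as a separator.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class WordReader
{
    // Hypothetical helper: yields whitespace-delimited words while
    // holding only the current word in memory.
    public static IEnumerable<string> ReadWords(TextReader reader)
    {
        var word = new StringBuilder();
        int next;
        while ((next = reader.Read()) != -1)   // Read() returns -1 at end of input
        {
            char c = (char)next;
            if (char.IsWhiteSpace(c))
            {
                if (word.Length > 0)           // ignore runs of whitespace
                {
                    yield return word.ToString();
                    word.Clear();
                }
            }
            else
            {
                word.Append(c);
            }
        }
        if (word.Length > 0)                   // final word without trailing whitespace
        {
            yield return word.ToString();
        }
    }
}
```

Passing a StreamReader opened on the file to `WordReader.ReadWords` and iterating with foreach processes one word at a time, so memory use does not depend on line length.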
If you're interested in good performance even on very large files, have a look at the MemoryMappedFile class, which was introduced in .NET Framework 4.0.
For example:
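// Requires: using System.IO; using System.IO.MemoryMappedFiles; using System.Text;
long fileLength = new FileInfo(filePath).Length;
using (var mappedFile1 = MemoryMappedFile.CreateFromFile(filePath))
// Size the view to the file: a default view may be rounded up to a page
// boundary, which would append '\0' characters after the last line.
using (Stream mmStream = mappedFile1.CreateViewStream(0, fileLength))
using (StreamReader sr = new StreamReader(mmStream, Encoding.ASCII))
{
    while (!sr.EndOfStream)
    {
        var line = sr.ReadLine();
        var lineWords = line.Split(' ');
        // process lineWords here
    }
}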
From MSDN:
A memory-mapped file maps the contents of a file to an application’s logical address space. Memory-mapped files enable programmers to work with extremely large files because memory can be managed concurrently, and they allow complete, random access to a file without the need for seeking. Memory-mapped files can also be shared across multiple processes.
The CreateFromFile methods create a memory-mapped file from a specified path or a FileStream of an existing file on disk. Changes are automatically propagated to disk when the file is unmapped.
The CreateNew methods create a memory-mapped file that is not mapped to an existing file on disk; and are suitable for creating shared memory for interprocess communication (IPC).
A memory-mapped file is associated with a name.
You can create multiple views of the memory-mapped file, including views of parts of the file. You can map the same part of a file to more than one address to create concurrent memory. For two views to remain concurrent, they have to be created from the same memory-mapped file. Creating two file mappings of the same file with two views does not provide concurrency.
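The point about views of parts of the file can be illustrated with a small sketch. It assumes a hypothetical helper `MappedSlice.ReadAscii` (not from MSDN or the answer above), an ASCII-encoded, non-empty file, and a mapping opened with the default read/write access, so the file must be writable by the process.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;

static class MappedSlice
{
    // Hypothetical helper: returns `count` ASCII characters starting at
    // byte `offset`, mapping only that part of the file into memory.
    public static string ReadAscii(string path, long offset, int count)
    {
        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        using (var view = mmf.CreateViewAccessor(offset, count))
        {
            var buffer = new byte[count];
            // Position 0 is relative to the start of the view, not the file.
            view.ReadArray(0, buffer, 0, count);
            return Encoding.ASCII.GetString(buffer);
        }
    }
}
```

Because each call maps only the requested range, different parts of a very large file can be read in any order without loading the whole file or seeking through a stream.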
This code will extract words from a text file based on the Regex pattern. You can try playing with other patterns to see what works best for you.
// Requires: using System; using System.IO; using System.Text.RegularExpressions;
var pattern = new Regex(
    @"( [^\W_\d]                  # starting with a letter
        (?:                       # optionally followed by a run of either...
          (?: [^\W_\d] |          # more letters or
              [-'\d](?=[^\W_\d])  # ', -, or digit followed by a letter
          )*
          [^\W_\d]                # and finishing with a letter
        )?                        # optional, so one-letter words match too
      )",
    RegexOptions.IgnorePatternWhitespace);
string input;
using (var reader = new StreamReader(fileName))   // disposed even if reading throws
{
    input = reader.ReadToEnd();
}
foreach (Match m in pattern.Matches(input))
    Console.WriteLine(m.Groups[1].Value);