Reading a text file word by word

梦谈多话 2020-12-10 18:58

I have a text file containing just lowercase letters and no punctuation except for spaces. I would like to know the best way of reading the file char by char, in a way that

9 Answers
  • 2020-12-10 19:30

    If you want to read the file without splitting one big string (for example, when lines are so long that you might hit an OutOfMemoryException), read it character by character with a StreamReader:

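    // Requires: using System.IO; using System.Text;
    // Reads the next whitespace-delimited word; returns "" at end of stream.
    static string ReadWord(StreamReader sr)
    {
        var word = new StringBuilder();
        while (sr.Peek() >= 0)
        {
            char c = (char)sr.Read();
            if (c == ' ' || c == '\t' || c == '\n' || c == '\r')
            {
                if (word.Length > 0)
                    break;      // whitespace after a word ends it
                continue;       // skip leading or repeated whitespace
            }
            word.Append(c);
        }
        return word.ToString();
    }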
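
    The loop above produces one word per call. As a rough sketch of how the same character-by-character idea could enumerate every word lazily, the following assumes a hypothetical helper, WordReader.ReadWords, which is not part of the original answer. It treats any whitespace character as a separator and accepts any TextReader, so it works with a StreamReader over a file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class WordReader
{
    // Hypothetical helper (an assumption, not from the original answer):
    // yields whitespace-delimited words one at a time, so only the
    // current word is held in memory.
    public static IEnumerable<string> ReadWords(TextReader reader)
    {
        var word = new StringBuilder();
        int next;
        while ((next = reader.Read()) >= 0)
        {
            char c = (char)next;
            if (char.IsWhiteSpace(c))
            {
                // Whitespace ends the current word, if there is one.
                if (word.Length > 0)
                {
                    yield return word.ToString();
                    word.Clear();
                }
            }
            else
            {
                word.Append(c);
            }
        }

        // The last word may not be followed by whitespace.
        if (word.Length > 0)
            yield return word.ToString();
    }
}
```

    It could be called as foreach (string w in WordReader.ReadWords(sr)), where sr is a StreamReader opened on the file inside a using block so that it is disposed afterwards.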
    
  • 2020-12-10 19:40

    If you need good performance even on very large files, have a look at the MemoryMappedFile class, introduced in .NET Framework 4.0.

    For example:

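    // Requires: using System.IO; using System.IO.MemoryMappedFiles; using System.Text;
    long length = new FileInfo(filePath).Length;
    using (var mappedFile1 = MemoryMappedFile.CreateFromFile(filePath))
    // Limit the view to the file length: a default view is rounded up
    // to a page boundary and would end in padding '\0' characters.
    using (Stream mmStream = mappedFile1.CreateViewStream(0, length))
    using (var sr = new StreamReader(mmStream, Encoding.ASCII))
    {
        while (!sr.EndOfStream)
        {
            var line = sr.ReadLine();
            var lineWords = line.Split(' ');
            // process lineWords here
        }
    }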
    

    From MSDN:

    A memory-mapped file maps the contents of a file to an application’s logical address space. Memory-mapped files enable programmers to work with extremely large files because memory can be managed concurrently, and they allow complete, random access to a file without the need for seeking. Memory-mapped files can also be shared across multiple processes.

    The CreateFromFile methods create a memory-mapped file from a specified path or a FileStream of an existing file on disk. Changes are automatically propagated to disk when the file is unmapped.

    The CreateNew methods create a memory-mapped file that is not mapped to an existing file on disk; and are suitable for creating shared memory for interprocess communication (IPC).

    A memory-mapped file is associated with a name.

    You can create multiple views of the memory-mapped file, including views of parts of the file. You can map the same part of a file to more than one address to create concurrent memory. For two views to remain concurrent, they have to be created from the same memory-mapped file. Creating two file mappings of the same file with two views does not provide concurrency.

  • 2020-12-10 19:46

    This code will extract words from a text file based on the Regex pattern. You can try playing with other patterns to see what works best for you.

        var pattern = new Regex(
                  @"( [^\W_\d]              # starting with a letter
                                            # followed by a run of either...
                      ( [^\W_\d] |          #   more letters or
                        [-'\d](?=[^\W_\d])  #   ', -, or digit followed by a letter
                      )*
                      [^\W_\d]              # and finishing with a letter
                    )",
                  RegexOptions.IgnorePatternWhitespace);

        // As written, the pattern needs a first and a last letter, so it
        // only matches words of two or more characters.
        string input;
        using (var reader = new StreamReader(fileName))
        {
            input = reader.ReadToEnd();   // loads the whole file into memory
        }

        foreach (Match m in pattern.Matches(input))
            Console.WriteLine(m.Groups[1].Value);
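
    Since the question describes a file holding only lowercase letters and spaces, a much simpler pattern may be enough. The sketch below is an assumption built on that description, not part of the original answer: a hypothetical LineWordMatcher.WordsIn helper applies the pattern [a-z]+ to one line at a time, so the whole file never has to be loaded with ReadToEnd.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LineWordMatcher
{
    // Assumption taken from the question: a word is a run of
    // lowercase ASCII letters.
    private static readonly Regex Word = new Regex("[a-z]+");

    // Hypothetical helper: yields every match, line by line.
    public static IEnumerable<string> WordsIn(IEnumerable<string> lines)
    {
        foreach (string line in lines)
            foreach (Match m in Word.Matches(line))
                yield return m.Value;
    }
}
```

    Passing File.ReadLines(fileName) as the argument streams the file lazily. This still assumes the file contains line breaks; a file that is one enormous line would be read into memory as a single string.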
    