How to split a huge file into words?

Question

How can I read a very long string from a text file and then process it (split it into words)?

I tried the StreamReader.ReadLine() method, but I get an OutOfMemoryException. Apparently, my lines are extremely long. This is my code for reading the file:

using (var streamReader = File.OpenText(_filePath))
{
    int lineNumber = 1;
    string currentString = String.Empty;
    while ((currentString = streamReader.ReadLine()) != null)
    {
        ProcessString(currentString, lineNumber);
        Console.WriteLine("Line {0}", lineNumber);
        lineNumber++;
    }
}

And the code that splits a line into words:

var wordPattern = @"\w+";
var matchCollection = Regex.Matches(text, wordPattern);
var words = (from Match word in matchCollection
             select word.Value.ToLowerInvariant()).ToList();

Answer 1:

You could read the file character by character, building up words as you go, and use yield to make the enumeration deferred so you don't have to read the entire file at once:

private static IEnumerable<string> ReadWords(string filename)
{
    using (var reader = new StreamReader(filename))
    {
        var builder = new StringBuilder();

        while (!reader.EndOfStream)
        {
            char c = (char)reader.Read();

            // Mimics the regex \w class - almost.
            if (char.IsLetterOrDigit(c) || c == '_')
            {
                builder.Append(c);
            }
            else
            {
                if (builder.Length > 0)
                {
                    yield return builder.ToString();
                    builder.Clear();
                }
            }
        }

        // Emit the last word, unless the file ended on a non-word character.
        if (builder.Length > 0)
        {
            yield return builder.ToString();
        }
    }
}

The code reads the file character by character. When it encounters a non-word character, it yields the word built up so far; because empty words are skipped, only the first of several consecutive non-word characters produces a word. A StringBuilder accumulates the characters of the current word.

Char.IsLetterOrDigit() accepts almost the same characters as the regex word class \w, but \w also matches underscores (among others), which is why the code checks for '_' separately. If your input contains more characters you wish to include, you'll have to alter the if().
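
As a rough sketch of how the if() could be altered: the .NET documentation describes \w (without the ECMAScript option) as equivalent to [\p{L}\p{Mn}\p{Nd}\p{Pc}], so a check on the Unicode category gets closer to it than IsLetterOrDigit alone. The class and method names below (WordChars, IsWordChar) are made up for illustration and are not part of the answer above.

```csharp
using System;
using System.Globalization;

public static class WordChars
{
    // Hypothetical helper: approximates the regex \w class by accepting
    // letters, nonspacing marks, decimal digits and connector punctuation
    // (the category that contains '_').
    public static bool IsWordChar(char c)
    {
        switch (char.GetUnicodeCategory(c))
        {
            case UnicodeCategory.UppercaseLetter:
            case UnicodeCategory.LowercaseLetter:
            case UnicodeCategory.TitlecaseLetter:
            case UnicodeCategory.ModifierLetter:
            case UnicodeCategory.OtherLetter:
            case UnicodeCategory.NonSpacingMark:
            case UnicodeCategory.DecimalDigitNumber:
            case UnicodeCategory.ConnectorPunctuation:
                return true;
            default:
                return false;
        }
    }
}
```

With such a helper, the condition in ReadWords would become if (WordChars.IsWordChar(c)).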

Answer 2:

Cut it into bite-size sections. Instead of trying to read 4 GB at once, which I believe is about the size of a page, try reading eight 500 MB chunks; that should help.
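
One way chunked reading might look, as a minimal sketch rather than a definitive implementation: read a fixed-size block of characters at a time and carry a partially built word over to the next block, so a word that straddles a chunk boundary is not cut in two. The class name ChunkedWordReader and the 64K-character default chunk size are assumptions chosen for illustration (far smaller than the 500 MB suggested above); the lower-casing mirrors the question's ToLowerInvariant() call.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class ChunkedWordReader
{
    // Hypothetical helper: reads the input in fixed-size character chunks
    // and lazily yields lower-cased words.
    public static IEnumerable<string> ReadWords(TextReader reader, int chunkSize = 64 * 1024)
    {
        var buffer = new char[chunkSize];
        var builder = new StringBuilder();
        int read;

        // Read returns 0 only once the end of the input is reached.
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                char c = buffer[i];
                if (char.IsLetterOrDigit(c) || c == '_')
                {
                    builder.Append(char.ToLowerInvariant(c));
                }
                else if (builder.Length > 0)
                {
                    yield return builder.ToString();
                    builder.Clear();
                }
            }
            // A partial word stays in the builder and continues in the next chunk.
        }

        // Emit the last word if the input did not end on a separator.
        if (builder.Length > 0)
        {
            yield return builder.ToString();
        }
    }
}
```

It could be called as foreach (var word in ChunkedWordReader.ReadWords(streamReader)) inside the using block from the question; memory use then depends on the chunk size and the longest word, not on the line length.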

Answer 3:

Garbage collection may be a solution, although I am not really sure that it is the source of the problem. If it is, a simple GC.Collect is often insufficient, and for performance reasons it should only be called when really required. Try the following procedure, which calls the garbage collector when the available memory is too low (below the threshold passed as the procedure's parameter).

int charReadSinceLastMemCheck = 0;
using (var streamReader = File.OpenText(_filePath))
{
    int lineNumber = 1;
    string currentString = String.Empty;
    while ((currentString = streamReader.ReadLine()) != null)
    {
        ProcessString(currentString, lineNumber);
        Console.WriteLine("Line {0}", lineNumber);
        lineNumber++;
        charReadSinceLastMemCheck += currentString.Length;
        if (charReadSinceLastMemCheck > 1000000)
        {
            // Check the memory left roughly every million characters read,
            // and collect garbage if required.
            CollectGarbage(100);
            charReadSinceLastMemCheck = 0;
        }
    }
}


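internal static void CollectGarbage(int SizeToAllocateInMo)
{
    long[,] TheArray;
    // Probe: try to allocate SizeToAllocateInMo megabytes (125,000 longs = 1 MB per row).
    try { TheArray = new long[SizeToAllocateInMo, 125000]; }
    // If the allocation fails, memory is low: force a full collection.
    catch { TheArray = null; GC.Collect(); GC.WaitForPendingFinalizers(); GC.Collect(); }
    TheArray = null;
}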

Source: https://stackoverflow.com/questions/31256036/how-to-split-a-huge-file-into-words

Tags

.net

file-io