How to split a huge file into words?

不想你离开。 提交于 2020-01-03 13:34:25

问题


How can I read a very long string from text file, and then process it (split into words)?

I tried the StreamReader.ReadLine() method, but I get an OutOfMemory exception. Apparently, my lines are extremely long. This is my code for reading file:

using (var streamReader = File.OpenText(_filePath))
    {

        int lineNumber = 1;
        string currentString = String.Empty;
        while ((currentString = streamReader.ReadLine()) != null)
        {

            ProcessString(currentString, lineNumber);
            Console.WriteLine("Line {0}", lineNumber);
            lineNumber++;
        }
    }

And the code which splits line into words:

var wordPattern = @"\w+";
var matchCollection = Regex.Matches(text, wordPattern);
var words = (from Match word in matchCollection
             select word.Value.ToLowerInvariant()).ToList();

回答1:


You could read by char, building up words as you go, using yield to make it deferred so you don't have to read the entire file at once:

private static IEnumerable<string> ReadWords(string filename)
{
    using (var reader = new StreamReader(filename))
    {
        var builder = new StringBuilder();

        while (!reader.EndOfStream)
        {
            char c = (char)reader.Read();

            // Mimics regex /w/ - almost.
            if (char.IsLetterOrDigit(c) || c == '_')
            {
                builder.Append(c);
            }
            else
            {
                if (builder.Length > 0)
                {
                    yield return builder.ToString();
                    builder.Clear();
                }
            }
        }

        yield return builder.ToString();
    }
}

The code reads the file by character, and when it encounters a non-word character it will yield return the word built up until then (only for the first non-letter character). The code uses a StringBuilder to build the word string.

Char.IsLetterOrDigit() behaves just as the regex word character w for characters, but underscores (amongst others) also fall into the latter category. If your input contains more characters you also wish to include, you'll have to alter the if().




回答2:


Cut it into bit size sections. So that instead of trying to read 4gb, which I believe is about the size of a page, try to read 8 500mb chunks and that should help.




回答3:


Garbage collection may be a solution. I am not really sure that it is the problem source. But if it is the case, a simple GC.Collect is often unsufficient and, for performance reason, it should only be called if really required. Try the following procedure that calls the garbage when the available memory is too low (below the threshold provided as procedure parameter).

int charReadSinceLastMemCheck = 0 ;
using (var streamReader = File.OpenText(_filePath))
{

    int lineNumber = 1;
    string currentString = String.Empty;
    while ((currentString = streamReader.ReadLine()) != null)
    {

        ProcessString(currentString, lineNumber);
        Console.WriteLine("Line {0}", lineNumber);
        lineNumber++;
        totalRead+=currentString.Length ;
        if (charReadSinceLastMemCheck>1000000) 
        { // Check memory left every Mb read, and collect garbage if required
          CollectGarbage(100) ;
          charReadSinceLastMemCheck=0 ;
        } 
    }
}


internal static void CollectGarbage(int SizeToAllocateInMo)
{
       long [,] TheArray ;
       try { TheArray =new long[SizeToAllocateInMo,125000]; }low function 
       catch { TheArray=null ; GC.Collect() ; GC.WaitForPendingFinalizers() ; GC.Collect() ; }
       TheArray=null ;
}


来源:https://stackoverflow.com/questions/31256036/how-to-split-a-huge-file-into-words

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!