Fastest way to search string in large text file [closed]

Submitted by 我怕爱的太早我们不能终老 on 2019-12-20 15:24:12

Question


What is the fastest technique/algorithm for looking up a list of strings in quite large text files (up to 1 GB each)? For starters, I'm using C# and have the logic working (simply matching a file against the list of strings, string by string, which means the file is read as many times as there are strings to match). But since I'm dealing with a lot of files, it is taking forever to run through them and get matches. I'm open to any suggestion, even if it is not C#.

To elaborate: I have a text file that contains many numbers (A), and I have a lot of large files (B). I'm trying to take every element in (A) and see if there is a match for it in (B), line by line. If there is a match, I write the whole line to a text file. The way I'm doing it is really traditional, and it takes a lot of time to finish a single file, while I have hundreds of them with sizes up to 1 GB. A sketch of the approach is shown below.
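For reference, the traditional approach described above looks roughly like this (fileA, filesB, and outputPath are placeholder names, not from the question); the nested loops are why it is slow, since every large file is re-read once per search string:

var searchStrings = File.ReadAllLines(fileA);   // the numbers in (A)
foreach (var bigFile in filesB)                 // the large files (B)
{
    foreach (var s in searchStrings)
    {
        // The whole file is scanned again for every single search string.
        foreach (var line in File.ReadLines(bigFile))
        {
            if (line.Contains(s))
                File.AppendAllText(outputPath, line + Environment.NewLine);
        }
    }
}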

Appreciate your time


Answer 1:


The standard way to do this is to implement the Aho-Corasick algorithm. It reads the file one time and finds all occurrences of all the strings you give it. See https://www.informit.com/guides/content.aspx?g=dotnet&seqNum=869 for an article that provides an implementation and some examples.

Update after more info

Assuming that the list of numbers in your file A is small enough to fit in memory, here's what you'd do, using the implementation in the above-linked article:

// Construct the automaton
AhoCorasickStringSearcher matcher = new AhoCorasickStringSearcher();
foreach (var searchWord in File.ReadLines(File_a))
{
    matcher.AddItem(searchWord);
}
matcher.CreateFailureFunction();

// And then do the search on each file
foreach (var fileName in listOfFiles)
{
    foreach (var line in File.ReadLines(fileName))
    {
        var matches = matcher.Search(line);
        foreach (var m in matches)
        {
            // output match
        }
    }
}

Note that it only makes one pass through each file, and it never has to load more than one line of the file into memory at any time. The limiting factor here is the memory it takes to build the automaton.

I've used this to search files that totaled over 100 gigabytes, for about 15 million different strings. It takes a few minutes to build the automaton, but then it searches very quickly. One really nice property of the algorithm is that its complexity is O(n + m), where n is the size of the input files, and m is the number of matched items. The number of strings it's searching for doesn't matter. It can search for a million different strings just as quickly as it can search for one or two.

100 gigabytes will take you something on the order of 40 minutes to read. I'd be really surprised if it took an hour for this to find all occurrences of 15 million different strings in 100 gigabytes of data.
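As a rough sanity check (assuming a sustained sequential read rate of about 40-50 MB/s, typical of a spinning disk; an SSD would be faster):

100 GB ≈ 100,000 MB ÷ ~42 MB/s ≈ 2,400 s ≈ 40 minutes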

Matching whole words

Another option, if you're searching for whole words, is to ditch the Aho-Corasick algorithm. Instead, load all of the numbers you're looking for into a HashSet<string>. Then read each line, use a regular expression to find all of the words in the line, and check whether each one exists in the hash set. For example:

Regex re = new Regex(@"\w+");
foreach (var line in File.ReadLines(filename))
{
    var matches = re.Matches(line);
    foreach (Match m in matches)
    {
        if (hashSetOfValues.Contains(m.Value))
        {
            // output match
        }
    }
}

This will likely be somewhat slower than the Aho-Corasick algorithm, but it still makes only one pass through the data. This assumes, of course, that you have enough memory to hold all of those numbers in a hash set.
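As a very rough sizing estimate (the per-entry figures are ballpark numbers for 64-bit .NET, not measurements): a short numeric string plus its HashSet<string> entry costs somewhere on the order of 50-100 bytes, so

15,000,000 entries × ~70 bytes ≈ 1 GB

which is large but workable on most modern machines.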

There are other options for whole words, as I mention in the comments.

Another option, if you know that the words you're looking for are always separated by spaces, is to add spaces to the start and end of the words that you add to the automaton. Or, with some modification to the implementation itself, you could force the matcher's Search method to only return matches that occur in whole words. That could more easily handle matches at the start and end of lines, and additional non-word characters.
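As an illustration of the space-padding idea (a sketch only, reusing the AddItem/Search calls from the code above rather than the article's actual code), you pad both the search words and the searched line, so that words at the very start or end of a line are still surrounded by spaces and can match:

// Pad each search word with spaces so only whole-word occurrences match.
foreach (var searchWord in File.ReadLines(File_a))
{
    matcher.AddItem(" " + searchWord + " ");
}
matcher.CreateFailureFunction();

// Pad the searched line the same way before searching it.
var matches = matcher.Search(" " + line + " ");

Note that this only works if the numbers in the files really are delimited by spaces; tabs or punctuation would still require the whole-word modification described above.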




Answer 2:


foreach (string _row in System.IO.File.ReadAllLines(AddressFilePath))
{
    if (_row.Contains(TextToSearch))
    {
        //Do something
        Console.WriteLine(_row);
    }
}
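
If the file is too large to hold in memory at once, a near-identical streaming variant (a minor variation, not part of the original answer) uses File.ReadLines, which yields one line at a time instead of loading the whole file:

foreach (string _row in System.IO.File.ReadLines(AddressFilePath))
{
    if (_row.Contains(TextToSearch))
    {
        //Do something
        Console.WriteLine(_row);
    }
}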


Source: https://stackoverflow.com/questions/37979457/fastest-way-to-search-string-in-large-text-file
