Search Large Text File for Thousands of strings

无人共我 2021-01-15 03:31

I have a large text file that is 20 GB in size. The file contains lines of text that are relatively short (40 to 60 characters per line). The file is unsorted.

I hav

3 Answers
  •  梦谈多话
    2021-01-15 04:20

    Algorithmically, I think the best way to approach this problem is to use a tree (a trie, or prefix tree) that stores the strings you want to search for one character at a time. For example, if you have the following patterns you would like to look for:

    hand, has, have, foot, file
    

    The resulting tree would look something like this:

    [Figure: tree generated by the list of search terms]

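    Building the tree takes O(n) time in the worst case, where n is the total number of characters across all search terms. It generally needs less memory than storing the terms separately, because shared prefixes are stored only once.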
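    As a rough illustration of the construction step, here is a minimal Python sketch. The nested-dict representation, the END marker key, and the helper count_nodes are assumptions made for this example; construct_tree is only a name for the builder, not an existing library function.

```python
# Hypothetical sketch: a trie stored as nested dicts, one level per character.
END = "$end"  # assumed marker key; holds the complete term at its last node


def construct_tree(terms):
    """Build a trie in time proportional to the total length of the terms."""
    root = {}
    for term in terms:
        node = root
        for ch in term:
            # Reuse the child if an earlier term shares this prefix.
            node = node.setdefault(ch, {})
        node[END] = term
    return root


def count_nodes(node):
    """Count trie nodes, excluding the root and the END markers."""
    return sum(1 + count_nodes(child)
               for key, child in node.items() if key != END)


tree = construct_tree(["hand", "has", "have", "foot", "file"])
print(sorted(tree))       # ['f', 'h']
print(count_nodes(tree))  # 14 nodes for 19 characters of search terms
```

    The node count is lower than the total character count because "ha" and "f" are shared between terms, which is where the memory saving comes from.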

    Using this structure, you can begin processing your file by reading one character at a time from the huge file and walking the tree.

    • If you get to a leaf node (the ones shown in red), you have found a match, and can store it.
    • If there is no child node corresponding to the letter you have read, you can discard the current line and begin checking the next line, starting from the root of the tree.

    This technique checks for matches in linear time, O(n) in the size of the file, and scans the huge 20 GB file only once.
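    The line-by-line walk described above could look roughly like the following Python sketch. The nested-dict trie, the END marker key, and the function scan_lines are assumptions made for illustration, and the sample lines are invented.

```python
END = "$end"  # assumed marker key for "a search term ends at this node"


def construct_tree(terms):
    """Build a nested-dict trie from the search terms."""
    root = {}
    for term in terms:
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node[END] = term
    return root


def scan_lines(lines, tree):
    """Yield (line_number, term) for each line that begins with a term."""
    for line_number, line in enumerate(lines, start=1):
        node = tree
        for ch in line:
            node = node.get(ch)
            if node is None:
                break  # no child for this character: discard the line
            if END in node:
                yield line_number, node[END]
                break  # a term ended here: record it and move on


sample = ["handle with care", "my hand", "file not found", "hat"]
tree = construct_tree(["hand", "has", "have", "foot", "file"])
print(list(scan_lines(sample, tree)))  # [(1, 'hand'), (3, 'file')]
```

    Because the walk always restarts at the beginning of a line, "my hand" produces no match here even though it contains "hand". An open text file can be passed as lines, since iterating over it yields one line at a time.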

    Edit

    The algorithm described above is certainly sound (it doesn't give false positives) but not complete (it can miss some results). However, with a few minor adjustments it can be made complete, assuming that we don't have search terms with common roots like go and gone, where one term is a prefix of another. The following is pseudocode of the complete version of the algorithm:

    tree = construct_tree(['hand', 'has', 'have', 'foot', 'file'])
    # Nodes reached by the partial matches that are still alive
    nodes = []
    matches = []
    for character in huge_file:
      # Every new character may also start a fresh match at the root
      nodes.append(tree)
      surviving = []
      for node in nodes:
        if node.has_child(character):
          child = node.get_child(character)
          if child.is_leaf():
            # You found a match!! Store it.
            matches.append(child)
          else:
            surviving.append(child)
        # Otherwise this partial match is dropped
      nodes = surviving
    

    Note that the list of nodes to check for each character holds at most as many entries as the longest search term has characters. Therefore it should not add much complexity.
