Replace a Long List of Words in a Big Text File

怎甘沉沦 submitted on 2019-12-06 04:54:53

Question


I need a fast method to work with a big text file.

I have two files: a big text file (~20 GB) and another text file that contains a list of ~12 million combo words.

I want to find every combo word in the first file and replace it with the same combo word joined by an underscore.

Example: "Computer Information" >Replace With> "Computer_Information"

I use the code below, but its performance is very poor (I tested it on an HP G7 server with 16 GB RAM and 16 cores):

public partial class Form1 : Form
{
    HashSet<string> wordlist = new HashSet<string>();

    private void loadComboWords()
    {
        using (StreamReader ff = new StreamReader(txtComboWords.Text))
        {
            string line;
            while ((line = ff.ReadLine()) != null)
            {
                wordlist.Add(line);
            }
        }
    }

    private void replacewords(ref string str)
    {
        foreach (string wd in wordlist)
        {
            // strings are immutable, so Replace returns a new string
            // that must be assigned back to str
            if (str.IndexOf(wd) > -1)
                str = str.Replace(wd, wd.Replace(" ", "_"));
        }
    }

    private void button3_Click(object sender, EventArgs e)
    {
        string line;
        using (StreamReader fread = new StreamReader(txtFirstFile.Text))
        {
            // GetFullPath returns the whole path including the file name,
            // so build the output path from the directory instead
            string writefile = Path.Combine(
                Path.GetDirectoryName(Path.GetFullPath(txtFirstFile.Text)),
                Path.GetFileNameWithoutExtension(txtFirstFile.Text) + "_ReplaceComboWords.txt");

            using (StreamWriter sw = new StreamWriter(writefile))
            {
                long intPercent;
                label3.Text = "initializing";
                loadComboWords();

                while ((line = fread.ReadLine()) != null)
                {
                    replacewords(ref line);
                    sw.WriteLine(line);

                    intPercent = (fread.BaseStream.Position * 100) / fread.BaseStream.Length;
                    Application.DoEvents();
                    label3.Text = intPercent.ToString();
                }
            }
            label3.Text = "Finished";
        }
    }
}

Any ideas on how to do this job in a reasonable time?

Thanks


Answer 1:


At first glance the approach you've taken looks fine - it should work OK, and there's nothing obvious that will cause e.g. lots of garbage collection.

The main thing I think is that you'll only be using one of those sixteen cores: there's nothing in place to share the load across the other fifteen.

I think the easiest way to do this is to split the large 20 GB file into sixteen chunks, analyse each of the chunks in parallel, and then merge the chunks back together again. The extra time taken splitting and reassembling the file should be minimal compared to the roughly sixteen-fold gain from scanning the chunks in parallel.

In outline, one way to do this might be:

    private List<string> SplitFileIntoChunks(string baseFile)
    {
        // Split the file into chunks, and return a list of the filenames.
    }

    private void AnalyseChunk(string filename)
    {
        // Analyses the file and performs replacements, 
        // perhaps writing to the same filename with a different
        // file extension
    }

    private void CreateOutputFileFromChunks(string outputFile, List<string> splitFileNames)
    {
        // Combines the rewritten chunks created by AnalyseChunk back into
        // one large file, outputFile.
    }

    public void AnalyseFile(string inputFile, string outputFile)
    {
        List<string> splitFileNames = SplitFileIntoChunks(inputFile);

        var tasks = new List<Task>();
        foreach (string chunkName in splitFileNames)
        {
            // copy the loop variable so each task captures its own chunk
            // (pre-C# 5 foreach variables are shared across closures)
            string chunk = chunkName;
            var task = Task.Factory.StartNew(() => AnalyseChunk(chunk));
            tasks.Add(task);
        }

        Task.WaitAll(tasks.ToArray());

        CreateOutputFileFromChunks(outputFile, splitFileNames);
    }
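
For illustration, SplitFileIntoChunks could split on line boundaries so that no combo word is ever cut in half at a chunk edge. This is only a sketch: the chunk count and the temp-file naming scheme are assumptions.

    private List<string> SplitFileIntoChunks(string baseFile)
    {
        const int chunkCount = 16;
        var chunkNames = new List<string>();
        long chunkSize = new FileInfo(baseFile).Length / chunkCount + 1;

        using (var reader = new StreamReader(baseFile))
        {
            int index = 0;
            while (!reader.EndOfStream)
            {
                string chunkName = baseFile + ".chunk" + index++;
                chunkNames.Add(chunkName);

                using (var writer = new StreamWriter(chunkName))
                {
                    long written = 0;
                    string line;
                    // copy whole lines until this chunk holds its share of the file
                    while (written < chunkSize && (line = reader.ReadLine()) != null)
                    {
                        writer.WriteLine(line);
                        written += line.Length + Environment.NewLine.Length;
                    }
                }
            }
        }
        return chunkNames;
    }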

One tiny nit: move the calculation of the stream's length out of the loop; you only need to get it once.
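
For example (a minimal sketch of that change, using the names from your code):

    long length = fread.BaseStream.Length;  // query the length once, up front
    while ((line = fread.ReadLine()) != null)
    {
        // ... read, replace and write as before ...
        intPercent = (fread.BaseStream.Position * 100) / length;
    }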

EDIT: also, include @Pavel Gatilov's idea to invert the logic of the inner loop: search for each word of the line in the 12-million-entry list.




Answer 2:


Several ideas:

  1. I think it will be more efficient to split each line into words and check whether each word appears in your word list. Ten lookups in a hash set are better than millions of substring searches. If you have composite keywords, build appropriate indexes: one that contains all the single words that occur in the real keywords, and another that contains all the real keywords.
  2. Perhaps loading the strings into a StringBuilder is better for replacing.
  3. Update the progress after, say, every 10,000 lines processed, not after each one (see the sketch after this list).
  4. Process in background threads. It won't make it much faster, but the app will stay responsive (also shown in the sketch below).
  5. Parallelize the code, as Jeremy has suggested.
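
A minimal sketch of ideas 3 and 4 together, reusing the names from the question (the ProcessFile helper and the 10,000-line interval are assumptions; requires using System.Threading.Tasks;):

    private void button3_Click(object sender, EventArgs e)
    {
        label3.Text = "initializing";
        loadComboWords();

        // idea 4: run the heavy loop off the UI thread so the form stays
        // responsive, then update the label once the work is done
        Task.Factory.StartNew(() => ProcessFile(txtFirstFile.Text))
            .ContinueWith(t => label3.Text = "Finished",
                          TaskScheduler.FromCurrentSynchronizationContext());
    }

    private void ProcessFile(string inputFile)
    {
        long lineCount = 0;
        using (var reader = new StreamReader(inputFile))
        using (var writer = new StreamWriter(inputFile + "_ReplaceComboWords.txt"))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                replacewords(ref line);
                writer.WriteLine(line);

                // idea 3: report progress every 10,000 lines, not on every line;
                // Invoke marshals the update back onto the UI thread
                if (++lineCount % 10000 == 0)
                {
                    long count = lineCount;
                    label3.Invoke((Action)(() => label3.Text = count.ToString()));
                }
            }
        }
    }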

UPDATE

Here is some sample code that demonstrates the by-word index idea:

static void ReplaceWords()
{
  string inputFileName = null;
  string outputFileName = null;

  // this dictionary maps each single word that can be found
  // in any keyphrase to a list of the keyphrases that contain it.
  IDictionary<string, IList<string>> singleWordMap = null;

  using (var source = new StreamReader(inputFileName))
  {
    using (var target = new StreamWriter(outputFileName))
    {
      string line;
      while ((line = source.ReadLine()) != null)
      {
        // first, we split the line into single words - the units of search
        var singleWords = SplitIntoWords(line);

        var result = new StringBuilder(line);
        // for each single word in the line
        foreach (var singleWord in singleWords)
        {
          // check if the word exists in any keyphrase we should replace
          // and if so, get the list of the related original keyphrases
          IList<string> interestingKeyPhrases;
          if (!singleWordMap.TryGetValue(singleWord, out interestingKeyPhrases))
            continue;

          Debug.Assert(interestingKeyPhrases != null && interestingKeyPhrases.Count > 0);

          // then process each of the keyphrases
          foreach (var interestingKeyphrase in interestingKeyPhrases)
          {
            // and replace it in the processed line if it exists
            result.Replace(interestingKeyphrase, GetTargetValue(interestingKeyphrase));
          }
        }

        // now, save the processed line
        target.WriteLine(result);
      }
    }
  }
}

private static string GetTargetValue(string interestingKeyword)
{
  throw new NotImplementedException();
}

static IEnumerable<string> SplitIntoWords(string keyphrase)
{
  throw new NotImplementedException();
}

The code shows the basic ideas:

  1. We split both keyphrases and processed lines into equivalent units which may be efficiently compared: the words.
  2. We store a dictionary that for any word quickly gives us references to all keyphrases that contain the word.
  3. Then we apply your original logic. However, we do not do it for all 12 million keyphrases, but rather for the very small subset of keyphrases that have at least a single-word intersection with the processed line.

I'll leave the rest of the implementation to you.
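
That said, a minimal sketch of the missing pieces might look like this (the whitespace splitting and the underscore rule from the question are assumptions; the index would be built once at startup):

static IDictionary<string, IList<string>> BuildSingleWordMap(IEnumerable<string> keyphrases)
{
  var map = new Dictionary<string, IList<string>>();
  foreach (var keyphrase in keyphrases)
  {
    foreach (var word in SplitIntoWords(keyphrase))
    {
      IList<string> phrases;
      if (!map.TryGetValue(word, out phrases))
      {
        phrases = new List<string>();
        map.Add(word, phrases);
      }
      phrases.Add(keyphrase);
    }
  }
  return map;
}

static IEnumerable<string> SplitIntoWords(string text)
{
  // simplest canonical form: split on whitespace; add lowercasing or
  // stemming here if matching must be less strict (see issue 1 below)
  return text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
}

static string GetTargetValue(string interestingKeyword)
{
  // per the question: join the words of the keyphrase with an underscore
  return interestingKeyword.Replace(" ", "_");
}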

The code however has several issues:

  1. SplitIntoWords must actually normalize the words to some canonical form. How depends on the required logic: in the simplest case you'll probably be fine with whitespace splitting and lowercasing, but you may find you need morphological matching, which is harder (it's very close to full-text-search tasks).
  2. For the sake of speed, it's likely better to call the GetTargetValue method once for each keyphrase before processing the input (see the sketch after this list).
  3. If a lot of your keyphrases have words in common, you'll still do a significant amount of extra work. In that case you'll need to keep the positions of the keywords within the keyphrases, so that you can use word-distance calculations to exclude irrelevant keyphrases while processing an input line.
  4. Also, I'm not sure whether StringBuilder is actually faster in this particular case. You should experiment with both StringBuilder and string to find out.
  5. It's a sample, after all. The design is not very good. I'd consider extracting some classes with consistent interfaces (e.g. a KeywordsIndex).
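
For example, the precomputation from issue 2 could be as simple as this sketch (allKeyphrases stands for the loaded 12-million-entry list and is an assumed name; requires using System.Linq;):

// built once, before the input is processed
IDictionary<string, string> targetByKeyphrase =
  allKeyphrases.ToDictionary(k => k, k => GetTargetValue(k));

// then, in the inner loop, look the replacement up instead of recomputing it:
result.Replace(interestingKeyphrase, targetByKeyphrase[interestingKeyphrase]);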


Source: https://stackoverflow.com/questions/8620238/replace-long-list-words-in-a-big-text-file
