Here is a streaming approach that recursively builds up groups of size N (3 in this example) from an enumerable of words. It doesn't matter how you tokenize your input into words (I've used a simple regex in this example).
//tokenize input (enumerable of string)
var words = Regex.Matches(input, @"\w+").Cast<Match>().Select(m => m.Value);
//get word groups (enumerable of string[])
var groups = GetWordGroups(words, 3);
//do what you want with your groups; suppose you want to count them
var counts = new Dictionary<string, int>(StringComparer.CurrentCultureIgnoreCase);
foreach (var group in groups.Select(g => string.Join(" ", g)))
{
    int count;
    counts.TryGetValue(group, out count);
    counts[group] = ++count;
}
IEnumerable<string[]> GetWordGroups(IEnumerable<string> words, int size)
{
    if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));
    if (size == 1)
    {
        foreach (var word in words)
        {
            yield return new string[] { word };
        }
        yield break;
    }
    var prev = new string[0];
    foreach (var next in GetWordGroups(words, size - 1))
    {
        yield return next;
        //the recursive stream includes all groups up to size - 1, but we only combine groups of exactly size - 1
        if (next.Length == size - 1)
        {
            if (prev.Length == size - 1)
            {
                //next overlaps prev in all but its last word, so prev plus that word forms a group of the requested size
                var group = new string[size];
                Array.Copy(prev, 0, group, 0, prev.Length);
                group[group.Length - 1] = next[next.Length - 1];
                yield return group;
            }
            prev = next;
        }
    }
}
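Once the dictionary is filled, you can do whatever reporting you like with it. For example, here is a small usage sketch that prints the most frequent groups (the Take(10) cutoff is arbitrary, and OrderByDescending requires System.Linq):
//print the ten most common word groups
foreach (var pair in counts.OrderByDescending(p => p.Value).Take(10))
{
    Console.WriteLine("{0}: {1}", pair.Key, pair.Value);
}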
One advantage of a streaming approach such as this is that it minimizes the number of strings that must be held in memory at any one time, which keeps memory use down for large bodies of text. Depending on how you receive your input, another optimization may be to use a TextReader to produce the enumeration of tokens as you read the input, rather than loading the whole text into a string first.
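As a minimal sketch of that idea (the ReadWords name and the letter-or-digit test for word characters are my assumptions; adjust the test to match whatever token definition you use):
//yields words lazily from a TextReader so the full text never sits in memory
//(requires System.IO and System.Text)
IEnumerable<string> ReadWords(TextReader reader)
{
    var sb = new StringBuilder();
    int c;
    while ((c = reader.Read()) != -1)
    {
        if (char.IsLetterOrDigit((char)c))
        {
            sb.Append((char)c); //still inside a word
        }
        else if (sb.Length > 0)
        {
            yield return sb.ToString(); //hit a delimiter; emit the pending word
            sb.Clear();
        }
    }
    if (sb.Length > 0) yield return sb.ToString(); //flush the final word
}
You could then pass ReadWords(new StreamReader(path)) to GetWordGroups in place of the regex-based words above.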
An example of the intermediate grouping output follows (each item is actually an array of tokens, joined with a space for display here):
The
green
The green
algae
green algae
The green algae
singular
algae singular
green algae singular
green
singular green
algae singular green
alga
green alga
singular green alga
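If you only want the full-size groups and not the intermediate sizes shown above, you can filter the stream; for example (assuming the groups variable from above and System.Linq):
var triples = groups.Where(g => g.Length == 3);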