Here is a streaming approach that recursively builds up groups of size N (3 in this example) from an enumerable of words. It doesn't matter how you tokenize your input into words (I've used a simple regex in this example).
//tokenize input (enumerable of string)
var words = Regex.Matches(input, @"\w+").Cast<Match>().Select(m => m.Value);
//get word groups (enumerable of string[])
var groups = GetWordGroups(words, 3);
//do what you want with your groups; suppose you want to count them
var counts = new Dictionary<string, int>(StringComparer.CurrentCultureIgnoreCase);
foreach (var group in groups.Select(g => string.Join(" ", g)))
{
    int count;
    counts.TryGetValue(group, out count);
    counts[group] = ++count;
}
IEnumerable<string[]> GetWordGroups(IEnumerable<string> words, int size)
{
    if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));
    if (size == 1)
    {
        foreach (var word in words)
        {
            yield return new string[] { word };
        }
        yield break;
    }
    var prev = new string[0];
    foreach (var next in GetWordGroups(words, size - 1))
    {
        yield return next;
        //the recursive stream includes all groups up to size - 1, but we only combine groups of exactly size - 1
        if (next.Length == size - 1)
        {
            if (prev.Length == size - 1)
            {
                //next overlaps prev in all but its last word, so prev plus that word forms a group of the requested size
                var group = new string[size];
                Array.Copy(prev, 0, group, 0, prev.Length);
                group[group.Length - 1] = next[next.Length - 1];
                yield return group;
            }
            prev = next;
        }
    }
}
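Once the dictionary is filled, you can do whatever reporting you like with it. For example, here is a small usage sketch that prints the most frequent groups (the Take(10) cutoff is arbitrary, and OrderByDescending requires System.Linq):
//print the ten most common word groups
foreach (var pair in counts.OrderByDescending(p => p.Value).Take(10))
{
    Console.WriteLine("{0}: {1}", pair.Key, pair.Value);
}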
One advantage of a streaming approach such as this is that it minimizes the number of strings that must be held in memory at any one time, which keeps memory use down for large bodies of text. Depending on how you receive your input, another optimization may be to use a TextReader to produce the enumeration of tokens as you read the input, rather than loading the whole text into a string first.
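As a minimal sketch of that idea (the ReadWords name and the letter-or-digit test for word characters are my assumptions; adjust the test to match whatever token definition you use):
//yields words lazily from a TextReader so the full text never sits in memory
//(requires System.IO and System.Text)
IEnumerable<string> ReadWords(TextReader reader)
{
    var sb = new StringBuilder();
    int c;
    while ((c = reader.Read()) != -1)
    {
        if (char.IsLetterOrDigit((char)c))
        {
            sb.Append((char)c); //still inside a word
        }
        else if (sb.Length > 0)
        {
            yield return sb.ToString(); //hit a delimiter; emit the pending word
            sb.Clear();
        }
    }
    if (sb.Length > 0) yield return sb.ToString(); //flush the final word
}
You could then pass ReadWords(new StreamReader(path)) to GetWordGroups in place of the regex-based words above.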
An example of the intermediate grouping output follows (each item is actually an array of tokens, joined with a space for display here):
The
green
The green
algae
green algae
The green algae
singular
algae singular
green algae singular
green
singular green
algae singular green
alga
green alga
singular green alga
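If you only want the full-size groups and not the intermediate sizes shown above, you can filter the stream; for example (assuming the groups variable from above and System.Linq):
var triples = groups.Where(g => g.Length == 3);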