I have around 100 megabytes of text, without any markup, divided into approximately 10,000 entries. I would like to automatically generate a 'tag' list. The problem is that
Build a matrix indexed by words. Whenever two words appear consecutively, add one to the corresponding cell.
For example you have this sentence.
mat['for']['example'] ++;
mat['example']['you'] ++;
mat['you']['have'] ++;
mat['have']['this'] ++;
mat['this']['sentence'] ++;
This will give you counts for every pair of consecutive words. You can do the same for three-word sequences. Beware that a dense three-word table requires O(n^3) memory, where n is the number of distinct words.
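A minimal Python sketch of the pair-counting idea, under a few assumptions that are not in the original answer: words are extracted with a simple lowercase regex, and a nested dictionary stands in for the matrix so that only pairs that actually occur take up memory. The helper name `count_pairs` is hypothetical.

```python
import re
from collections import Counter, defaultdict


def count_pairs(text):
    """Count how often each word is immediately followed by another word."""
    # Assumed tokenizer: lowercase runs of letters, digits and apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Sparse stand-in for the matrix: mat[first][second] -> count.
    mat = defaultdict(Counter)
    for first, second in zip(words, words[1:]):
        mat[first][second] += 1
    return mat


mat = count_pairs("For example you have this sentence.")
print(mat["for"]["example"])  # 1
```

For tagging, you would probably run this per entry (or over the whole corpus) and keep the pairs with the highest counts as candidates.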
You can also store the counts in a flat associative array (a hash map rather than a true heap), keyed by the whole phrase:
heap['for example']++;
heap['example you']++;
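The flat-key variant could look roughly like this in Python. This is only a sketch: the regex tokenizer, the `count_ngrams` name and the `n` parameter are assumptions made for illustration, and `collections.Counter` plays the role of the `heap` array. Because each key is the whole phrase, the same code handles two-word and three-word sequences.

```python
import re
from collections import Counter


def count_ngrams(text, n=2):
    """Count every run of n consecutive words, keyed by the joined phrase."""
    # Assumed tokenizer: lowercase runs of letters, digits and apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter()
    for i in range(len(words) - n + 1):
        counts[" ".join(words[i:i + n])] += 1
    return counts


heap = count_ngrams("For example you have this sentence.", n=2)
print(heap["for example"])  # 1
print(heap["example you"])  # 1
```

Only phrases that actually occur are stored, so memory grows with the number of distinct phrases in the text rather than with every possible word combination.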