I have around 100 megabytes of text, without any markup, divided into approximately 10,000 entries. I would like to automatically generate a 'tag' list. The problem is that
One way would be to build yourself an automaton, most likely a nondeterministic finite automaton (NFA). NFA
Another, simpler way would be to create a file that contains the words and/or word groups that you want to ignore, find, compare, etc. Load them into memory when the program starts, and then compare the text you are parsing against the words/word groups from that file.
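As a rough illustration of the word-list approach, here is a minimal Python sketch. It assumes the list file holds one word or word group per line; the function names (`parse_phrases`, `load_phrases`, `count_phrases`) and the simple `\w+` tokenization are my own choices for the example, not something the question specifies.

```python
import re
from collections import Counter


def parse_phrases(lines):
    """Turn lines (one word or word group each) into a set of token tuples."""
    phrases = set()
    for line in lines:
        tokens = tuple(re.findall(r"\w+", line.lower()))
        if tokens:  # skip blank lines
            phrases.add(tokens)
    return phrases


def load_phrases(path):
    """Read the word list file once, at program start (path is hypothetical)."""
    with open(path, encoding="utf-8") as fh:
        return parse_phrases(fh)


def count_phrases(text, phrases):
    """Count how often each known word or word group occurs in one entry."""
    tokens = re.findall(r"\w+", text.lower())
    max_len = max((len(p) for p in phrases), default=0)
    counts = Counter()
    for i in range(len(tokens)):
        # Try every window of 1..max_len tokens starting at position i.
        for n in range(1, max_len + 1):
            gram = tuple(tokens[i:i + n])
            if len(gram) < n:  # ran off the end of the entry
                break
            if gram in phrases:
                counts[" ".join(gram)] += 1
    return counts
```

You would call `load_phrases` once and then run `count_phrases` on each of the entries. Because the phrases sit in a set, each lookup is cheap, so the cost per entry grows with its length times the longest word group rather than with the size of the list.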