I have a large text file that is 20 GB in size. The file contains lines of text that are relatively short (40 to 60 characters per line). The file is unsorted.
I have a separate list of roughly 20,000 words, and for each word I am currently doing a full scan of the 20 GB file to find matching lines; the job has been running for about 4 days.
The problem you describe looks like an issue with the chosen algorithm rather than with the technology. Taking 4 days for 20,000 full scans of a 20 GB file isn't surprising; your target should be a single scan of the 20 GB file plus a single pass over the 20,000 words.
Have you considered a multi-pattern string matching algorithm? Aho–Corasick comes to mind: it builds one automaton from all 20,000 words and then finds every occurrence in a single pass over the file.
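A minimal sketch of that approach in Python, assuming the word list lives in `words.txt` and the large file in `big.txt` (both names are placeholders), and using the third-party `pyahocorasick` package for the automaton:

```python
# Single pass over the large file with an Aho-Corasick automaton.
# Requires the third-party pyahocorasick package (module name: ahocorasick).
import ahocorasick

def build_automaton(words_path):
    automaton = ahocorasick.Automaton()
    with open(words_path, encoding="utf-8") as f:
        for word in (line.strip() for line in f):
            if word:
                # Store the word itself as the payload so each match
                # reports which pattern was found.
                automaton.add_word(word, word)
    automaton.make_automaton()  # build the failure links
    return automaton

def scan(big_path, automaton):
    counts = {}
    with open(big_path, encoding="utf-8") as f:
        # One streaming pass over the 20 GB file, line by line.
        for line in f:
            for _end_index, word in automaton.iter(line):
                counts[word] = counts.get(word, 0) + 1
    return counts

if __name__ == "__main__":
    counts = scan("big.txt", build_automaton("words.txt"))
    for word, n in sorted(counts.items()):
        print(word, n)
```

Reading line by line keeps memory usage flat regardless of the file size, and the automaton itself only needs memory proportional to the total length of the 20,000 patterns, so the whole job comes down to one sequential read of the 20 GB.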