Text segmentation: dictionary-based word splitting [closed]
Background

Split database column names into equivalent English text to seed a data dictionary. The English dictionary is created from a corpus of corporate documents, wikis, and email. The dictionary (lexicon.csv) is a CSV file with words and probabilities. Thus, the more often someone writes the word "therapist" (in email or on a wiki page), the higher the chance that "therapistname" splits into "therapist name" as opposed to something else. (The lexicon probably won't even include the word "rapist".)

Source Code

TextSegmenter.java @ http://pastebin.com/taXyE03L
SortableValueMap.java @ http:/