Word frequency algorithm for natural language processing
Without getting a degree in information retrieval, I'd like to know whether there are any algorithms for counting how often words occur in a given body of text. The goal is to get a "general feel" for what people are saying across a set of textual comments, along the lines of Wordle.

What I'd like:

- ignore articles, pronouns, etc. ('a', 'an', 'the', 'him', 'them', etc.)
- preserve proper nouns
- ignore hyphenation, except for the soft kind

Reaching for the stars, these would be peachy:

- handling stemming & plurals (e.g. like, likes, liked, liking match the same result)
- grouping of adjectives (adverbs,