The general algorithm is going to go like this:
- Obtain the text
- Strip punctuation, special characters, etc.
- Strip "simple" (stop) words
- Split on spaces
- Loop over the split text
- Add each word to an array/hash table/etc. if it isn't there yet; if it is, increment that word's counter (a sketch of this follows the list)
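Here's a minimal sketch of those steps in Python. The stop-word list is just a placeholder for the "simple" words, and the sketch filters them after splitting, which is the easier order in practice; a real filter would use a much larger list (NLTK ships one, for example).

```python
import re
from collections import Counter

# Placeholder list of "simple" words; a real filter would be much larger.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on", "is", "it"}

def word_counts(text):
    """Return a Counter mapping each word in `text` to how often it appears."""
    # Strip punctuation/special characters and normalize case.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    # Split on whitespace and drop the "simple" (stop) words.
    words = [w for w in cleaned.split() if w not in STOP_WORDS]
    # Counter handles the "add if missing, otherwise increment" step.
    return Counter(words)
```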
The end result is a frequency count of every word in the text. You can then divide each count by the total number of words to get that word's frequency as a percentage. Any further processing is up to you.
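Building on the `word_counts` sketch above, the percentage step is just a division by the total:

```python
def word_frequencies(text):
    """Return each word's frequency as a percentage of all counted words."""
    counts = word_counts(text)
    total = sum(counts.values())
    if total == 0:  # nothing left after stripping
        return {}
    return {word: 100.0 * n / total for word, n in counts.items()}
```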
You're also going to want to look into stemming, which reduces words to their root: going => go, cars => car, and so on.
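For real work you'd reach for an existing stemmer (the Porter or Snowball algorithms, e.g. via NLTK) rather than rolling your own; the toy version below only shows the idea of suffix stripping:

```python
def naive_stem(word):
    """Toy stemmer that strips a few common suffixes. Real stemmers (Porter,
    Snowball, ...) handle far more suffixes and their exceptions."""
    for suffix in ("ing", "es", "s"):
        # Only strip if a reasonably sized stem would be left behind.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[:-len(suffix)]
    return word

# naive_stem("going") -> "go", naive_stem("cars") -> "car"
```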
An algorithm like this is common in spam filters, keyword indexing, and the like.