I have a large set of real-world text that I need to pull words out of to input into a spell checker. I'd like to extract as many meaningful words as possible with
Are you familiar with word boundaries (\b)? You can extract words by putting \b on either side of the pattern and matching letters in between:
\b([a-zA-Z]+)\b
For instance, this will grab whole words but stop at tokens such as hyphens, periods, semi-colons, etc.
You can read about the \b sequence, and others, over at the Python manual.
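To make that concrete, here is a minimal sketch using Python's `re` module. The helper name `extract_words` and the sample sentence are made up for illustration; only the pattern comes from the answer above.

```python
import re

# The pattern from the answer: a run of ASCII letters with a word
# boundary on each side.
WORD_RE = re.compile(r"\b([a-zA-Z]+)\b")


def extract_words(text):
    """Return every letters-only token found in text (hypothetical helper)."""
    return WORD_RE.findall(text)


sample = "The well-known cat; it sat."
print(extract_words(sample))
# → ['The', 'well', 'known', 'cat', 'it', 'sat']
```

Note that a hyphenated word comes back as two tokens, and an apostrophe splits a word the same way ("don't" gives "don" and "t"), so you may want to widen the character class if your spell checker should see those intact.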
EDIT Also, if you're looking to exclude matches that have a number following or preceding them, you can use a negative look-ahead/behind:
(?!\d) # negative look-ahead for numbers
(?
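One thing worth knowing: digits count as word characters, so with \b on both sides a digit touching the word is already rejected. The lookarounds earn their keep when you drop \b. The sketch below is one possible way to do that, not the only one: it uses Python's negative look-behind `(?<!...)` and negative look-ahead `(?!...)` to reject a letter or digit on either side. The helper name and sample text are hypothetical.

```python
import re

# Assumption for illustration: replace \b with explicit lookarounds.
# (?<![a-zA-Z\d])  no letter or digit immediately before the match
# (?![a-zA-Z\d])   no letter or digit immediately after the match
# Listing letters in the lookarounds stops the engine from backtracking
# into a partial match such as "ab" inside "abc123".
NO_DIGIT_RE = re.compile(r"(?<![a-zA-Z\d])[a-zA-Z]+(?![a-zA-Z\d])")


def extract_words_no_digits(text):
    """Return letter runs that do not touch a digit (hypothetical helper)."""
    return NO_DIGIT_RE.findall(text)


sample = "abc123 9lives foo_bar plain"
print(extract_words_no_digits(sample))
# → ['foo', 'bar', 'plain']
```

Compared with the \b version, this one still skips "abc123" and "9lives" but also picks up "foo" and "bar" from "foo_bar", because the underscore is no longer treated as part of the word.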