True definition of an English word?

Question

What would be the best definition of an English word?

Besides \w+, what other forms can an English word take? Some definitions include \w+-\w+ or \w+'\w+; some exclude cases like \b[0-9]+\b. But I haven't seen any general consensus on these cases. Is there a formal definition of an English word? Can any of you clarify?

(Edit: broaden the question so it doesn't depend on regexp only.)

Answer 1:

I really don't think a regex is going to help you here. The problem with English text (or text in any language, for that matter) is context. Without it you can't be sure whether what's between the word boundaries is text, a number, a random collection of characters, etc. For an NLP application, I think you are going to be selecting a subset of the language and looking for specific words rather than trying to extract all 'words' from a string.

Answer 2:

The best way to check whether a word is English is to look it up in a dictionary. If it's in a dictionary of English words, then it is an English word. A word can be in both an English dictionary and a French dictionary. For example, 'me' is both a French and an English word.

I'm sure you can find lots of downloadable dictionaries online. You can also make your own. For example, you could download the English version of Wikipedia and assume that all words found there are English words. You may or may not want to filter out numbers.

A regular expression will not tell you whether a word is English. For instance, xyvfg matches your pattern \w+ but is certainly not an English word.
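To make the contrast concrete, here is a minimal sketch of the lookup approach. The two tiny word sets and the function names are made up for illustration; a real check would load a full word list from whatever dictionary you choose.

```python
import re

# Hypothetical stand-in vocabularies; a real check would load full word lists.
ENGLISH_WORDS = {"me", "word", "dictionary", "though"}
FRENCH_WORDS = {"me", "mot", "dictionnaire"}

VOCABULARIES = {"English": ENGLISH_WORDS, "French": FRENCH_WORDS}


def matches_word_pattern(token):
    """True if the whole token consists of one or more regex word characters."""
    return re.fullmatch(r"\w+", token) is not None


def languages_of(token):
    """Return the sorted names of the vocabularies that contain the token."""
    lowered = token.lower()
    return sorted(name for name, words in VOCABULARIES.items() if lowered in words)


# 'xyvfg' passes the regex but is in neither vocabulary.
print(matches_word_pattern("xyvfg"), languages_of("xyvfg"))
# 'me' passes the regex and is found in both vocabularies.
print(matches_word_pattern("me"), languages_of("me"))
```

The regex only answers "does this look like a token?", while the lookup answers "is this token known?", and the answer to the second depends entirely on which word lists you load.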

Edit: In theory, using English phonology, it could be possible to tell whether a phonetic transcription of a word is pronounceable by an English speaker. There are lots of words pronounceable by English speakers that are not actually English words. This approach could take into account words that may enter the English language in the future. However, translating between a phonetic transcription and text is quite a challenging problem, as there can be many different spellings of the same phonetic transcription. I don't know if anyone has done anything like this. It could be an interesting theoretical exercise. I'm not sure this would be very useful in real-world NLP, though.

Answer 3:

Let's be concrete and try to solidify the ground by examples.

Is 'word' an English word?  YES

49th?  YES

NYSE?  YES

Résumé?  YES

Haight-Ashbury? YES/NO?

good-looking?  YES/NO?

P&G?  YES/NO?

1023?  YES/NO?

304-392-9999?  YES/NO?

3.14?  YES/NO?

Answer 4:

http://www.sussex.ac.uk/linguistics/documents/essay_-_what_is_a_word.pdf

Answer 5:

A true English word will almost never contain accents or foreign characters - so \w+ might capture more than you're after, although there are a number of words used in English that we've borrowed from other languages - most of us probably don't have the time or inclination to bother accenting them, tho'. I was even too lazy to write 'though' out in full there - \w+'\w+ wouldn't capture that. In general, so long as your \w+ is capturing your words correctly, I can't think of any other punctuation on top of - and ' that might be encountered mid-word.
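A quick sketch of these points in Python, assuming Python 3's `re` module, where \w is Unicode-aware for str patterns unless `re.ASCII` is set. The sample sentence and variable names are made up for illustration.

```python
import re

# Sample text; \u00e9 is the accented 'e' in "resume" written with accents.
text = "My r\u00e9sum\u00e9 is good-looking, tho' it isn't long"

# Plain \w+ keeps accented letters together but splits on - and '.
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, accented letters are no longer word characters,
# so the accented word is broken into pieces.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

# Allowing - and ' only between word characters keeps "good-looking"
# and "isn't" whole, but drops the trailing apostrophe of "tho'".
joined = re.findall(r"\w+(?:[-']\w+)*", text)

# Allowing an optional trailing apostrophe recovers "tho'" ...
with_trailing = re.findall(r"\w+(?:[-']\w+)*'?", text)

# ... but it also swallows a closing quotation mark.
quoted = re.findall(r"\w+(?:[-']\w+)*'?", "the 'dogs' bark")

for name, tokens in [("unicode", unicode_words), ("ascii", ascii_words),
                     ("joined", joined), ("trailing", with_trailing),
                     ("quoted", quoted)]:
    print(name, tokens)
```

The last case is the catch: a regex alone cannot tell a clipped word like tho' from a word followed by a closing single quote.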

Answer 6:

Your problem is called word tokenization. Take a look here:
http://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html

Stanford has a very well-known NLP group. They produce one of the most efficient parsers for English. The page outlines some common tokenization problems, such as:

Unusual domain-specific tokens: M*A*S*H, C++, IP addresses ...
Hyphenation: co-education, Hewlett-Packard
Collocation: San Francisco, Los Angeles
Specific syntax ...
- Advertisements for air fares "San Francisco-Los Angeles"
- Omitted spaces etc...
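To see why these cases are hard, here is a rough sketch that runs a simple regex tokenizer over a few of them. The pattern, the function name, and the sample IP address are my own assumptions for illustration, not anything taken from the Stanford page.

```python
import re

# Naive pattern: runs of word characters, optionally joined by - or '.
SIMPLE_WORD = re.compile(r"\w+(?:[-']\w+)*")


def simple_tokens(text):
    """Split text into tokens using the naive pattern above."""
    return SIMPLE_WORD.findall(text)


examples = [
    "C++",                        # loses the ++
    "192.168.0.1",                # hypothetical IP address, split at every dot
    "co-education",               # kept as one token
    "Hewlett-Packard",            # kept as one token
    "San Francisco-Los Angeles",  # hyphen glues the wrong pair together
]

for example in examples:
    print(repr(example), "->", simple_tokens(example))
```

The hyphen rule that keeps Hewlett-Packard together is the same rule that produces Francisco-Los, which is why a single pattern tends not to be enough.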

The Penn Treebank Project also provides a simple sed script for word tokenization "that does a decent enough job on most corpora" here.

Source: https://stackoverflow.com/questions/3690195/true-definition-of-an-english-word

Tags

regex

nlp