Determine the difficulty of an english word

五迷三道 提交于 2019-11-28 16:49:05

问题


I am working a word based game. My word database contains around 10,000 english words (sorted alphabetically). I am planning to have 5 difficulty levels in the game. Level 1 shows the easiest words and Level 5 shows the most difficult words, relatively speaking.

I need to divide the 10,000 long words list into 5 levels, starting from the easiest words to difficult ones. I am looking for a program to do this for me.

Can someone tell me if there is an algorithm or a method to quantitatively measure the difficulty of an english word?

I have some thoughts revolving around using the "word length" and "word frequency" as factors, and come up with a formula or something that accomplishes this.


回答1:


Get a large corpus of texts (e.g. from the Gutenberg archives), do a straight frequency analysis, and eyeball the results. If they don't look satisfying, weight each text with its Flesch-Kincaid score and run the analysis again - words that show up frequently, but in "difficult" texts will get a score boost, which is what you want.

If all you have is 10000 words, though, it will probably be quicker to just do the frequency sorting as a first pass and then tweak the results by hand.




回答2:


I agree that frequency of use is the most likely metric; there are studies supporting a high correlation between word frequency and difficulty (correct responses on tests, etc.). Check out the English Lexicon Project at http://elexicon.wustl.edu/ for some 70k(?) frequency-rated words.




回答3:


I'm not understanding how frequency is being used... if you were to scan a newspaper, I'm sure you would see the word "thoroughly" mentioned much more frequently than the word "bop" or "moo" but that doesn't mean it's an easier word; on the contrary 'thoroughly' is one of the most disgustingly absurd spelling anomalies that gives grade school children nightmares...

Try explaining to a sane human being learning english as a second language the subtle difference between slaughter and laughter.




回答4:


Difficulty is a pretty amorphus concept. If you've no clear idea of what you want, perhaps you could take a look at the Porter Stemming Algorithm (see for example the original paper). That contains a more advanced idea of 'length' by defining words as being of the form [C](VC){m}[V]; C means a block of consonants and V a block of vowels and this definition says a word is an optional C followed by m VC blocks and finally an optional V. The m value is this advanced 'length'.




回答5:


depending on the type of game the definition of "difficult" will change. If your game involves typing quickly (ztype-style...), "difficult" will have a different meaning than in a game where you need to define a word's meaning.

That said, Scrabble has a way to measure how "difficult" a word is which is also quite easy algoritmically.

Also you may look into defining "difficult" in terms of your game. You could beta test your game and classify words according to how "difficult" players find them in the context of your own game.




回答6:


Crowd-source the answer.

  • Create an online 'game' that lists 10 words at random.
  • Get the player to drag and drop them into easiest - hardest, and tick to indicate if the player has ever heard of the word.
  • Apply an ranking algorithm (e.g. ELO) on the result of each experiment.
  • Repeat.

It might even be fun to play, you could get a language proficiency score at the end.




回答7:


There are several factors that relate to word difficulty, including age at acquisition, imageability, concreteness, abstractness, syllables, frequency (spoken and written). There are also psycholinguistic databases that will search for word by at least some of these factors. (just do a search for "psycholinguistic database".




回答8:


Word length is a good indicator , for word frequency , you would need data as an algorithm can obviously not determine it by itself. You could also use some sort of scoring like the scrabble game does : each letter has a value and the final value would be the sum of the values. It would be imo easier to find frequency data about each letter in your language .




回答9:


In his article on spell correction Peter Norvig uses a dictionary to count the number of occurrences of each word (and thus determine their frequency).

You could use this as a stepping stone :)

Also, frequency should probably influence the difficulty more than length... you would have to beta-test the game for that.




回答10:


In addition to metrics such as Flesch-Kincaid, you could try an approach based on the Dale-Chall readability formula, using lists of words that are familiar to readers of a particular level of ability.

Implementations of many of the readability formulae contain code for estimating the number of syllables in a word, which may also be useful.




回答11:


I would guess that the grade at wich the word is introduced into normal students vocabulary is a measure of difficulty. Next would be how many standard rule violations it has. Meaning your words that have spellings or pronunciations that seem to violate the normal set off rules. Finally.. the meaning.. can be a tough concept. .. for example ... try explaining abstract to someone who's never heard the word.




回答12:


Word frequency is an obvious choice (of course not perfect). You can download Google n-grams V2 here, which is license under the Creative Commons Attribution 3.0 Unported License.

Format: ngram TAB year TAB match_count TAB page_count TAB volume_count NEWLINE

Example:

Corpus used (from Lin, Yuri, et al. "Syntactic annotations for the google books ngram corpus." Proceedings of the ACL 2012 system demonstrations. Association for Computational Linguistics, 2012.):




回答13:


Without claiming to know anything about their algorithm, there is an API that returns a 1-10 scale word difficulty: TwinWord API

I have never used it, myself, though.



来源:https://stackoverflow.com/questions/5141092/determine-the-difficulty-of-an-english-word

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!