get list of anagrams from a dictionary

生来不讨喜 2020-12-01 19:58

Basically, anagrams are permutations of a string. E.g. stack, sackt and stakc are all anagrams of stack (thought above words

5 answers

死守一世寂寞 (OP)

2020-12-01 20:22
One possible hash function (assuming English words only) is a list of the number of occurrences of each letter, sorted by letter. So for "anagram" you would generate [('a', 3), ('g', 1), ('m', 1), ('n', 1), ('r', 1)]. Two words are anagrams exactly when they produce the same list.

Alternatively, you could get an inexact grouping by generating a bitmask from each word, where each of bits 0-25 represents the presence or absence of one letter (bit 0 representing 'a' through to bit 25 representing 'z'). But then you'd have to do a bit more processing to split each hashed group further, to distinguish e.g. "to" from "too", which contain the same letters and so share a mask.

Do either of these ideas help? Any particular implementation language in mind (I could do C++, Python or Scala)?
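A minimal sketch of the bitmask idea in Scala, assuming words contain only the letters a-z (anything else is ignored). The names `letterMask` and `anagramGroups` are hypothetical, made up for this illustration; the second pass splits each mask bucket by sorted letters, which is one possible way to do the "bit more processing":

```
// Hypothetical helper: 26-bit presence mask, bit 0 = 'a' ... bit 25 = 'z'.
// Assumption: only English letters matter; other characters are skipped.
def letterMask(word: String): Int =
  word.toLowerCase.foldLeft(0) { (mask, c) =>
    if (c >= 'a' && c <= 'z') mask | (1 << (c - 'a')) else mask
  }

// Coarse pass: bucket words by mask (inexact, "to" and "too" collide).
// Fine pass: split each bucket by sorted letters to get true anagram sets.
def anagramGroups(words: Seq[String]): Seq[Seq[String]] =
  words
    .groupBy(letterMask)
    .values
    .flatMap(bucket => bucket.groupBy(_.toLowerCase.sorted).values)
    .filter(_.size > 1) // keep only words that have an anagram partner
    .toSeq
```

With this sketch, `letterMask("to")` and `letterMask("too")` are equal, so the two words land in the same coarse bucket and only the second pass separates them. The order of the returned groups is not specified, since it comes from a map.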

Edit: added some example Scala code and output:

OK: I'm in Scala mode at the moment, so I've knocked something up to do what you ask, but (ahem) it may not be very clear if you're not that familiar with Scala or functional programming.

Using a big list of English words from here: http://scrapmaker.com/data/wordlists/twelve-dicts/2of12.txt

I ran the following Scala code on them. It takes about 5 seconds using Scala 2.9 in script mode, including compile time, with a dictionary of about 40,000 words. It's not the most efficient code, just the first thing that came to mind.
```
// Hashing function to go from a word to a sorted list of letter counts
def toHash(b:String) = b.groupBy(x=>x).map(v => (v._1, v._2.size) ).toList.sortWith(_._1 < _._1)


// Read all words from file, one word per line
val lines = scala.io.Source.fromFile("2of12.txt").getLines

// Go from list of words to list of (hashed word, word)
val hashed = lines.map( l => (toHash(l), l) ).toList

// Group all the words by hash (hence group all anagrams together)
val grouped = hashed.groupBy( x => x._1 ).map( els => (els._1, els._2.map(_._2)) )

// Sort the resultant anagram sets so the largest come first
val sorted = grouped.toList.sortWith( _._2.size > _._2.size )

for ( set <- sorted.slice(0, 10) )
{
    println( set._2 )
}
```
This prints the first 10 sets of anagrams, with the largest sets first:
```
List(caret, cater, crate, react, trace)
List(reins, resin, rinse, risen, siren)
List(luster, result, rustle, sutler, ulster)
List(astir, sitar, stair, stria, tarsi)
List(latrine, ratline, reliant, retinal)
List(caper, crape, pacer, recap)
List(merit, miter, remit, timer)
List(notes, onset, steno, stone)
List(lair, liar, lira, rail)
List(drawer, redraw, reward, warder)
```
Note that this uses the first suggestion (list of counts of letters) not the more complicated bitmask method.

Edit 2: You can replace the hash function with a simple sort on the chars of each word (as suggested by JAB) and get the same result with clearer/faster code:
```
def toHash(b:String) = b.toList.sortWith(_<_)
```
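As a self-contained illustration of the sorted-letters key, here is a small sketch that runs on an in-memory list instead of the word file. The `sampleWords` list and the `anagramSets` name are hypothetical, and dropping single-word groups is an extra step not present in the code above:

```
// Hypothetical in-memory word list standing in for 2of12.txt
val sampleWords = List("caret", "cater", "crate", "lair", "liar", "stone", "zebra")

// Key each word by its sorted letters; anagrams share the same key
def anagramSets(words: Seq[String]): List[Seq[String]] =
  words
    .groupBy(_.sorted)
    .values
    .filter(_.size > 1) // drop words that have no anagram partner
    .toList
    .sortBy(-_.size)    // largest sets first

val sets = anagramSets(sampleWords)
```

Here `sets` holds two groups: the caret/cater/crate set first, followed by lair/liar. "stone" and "zebra" are dropped because nothing else in the sample shares their letters.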