问题
I'm reading in a long list of words, and I made a node for every word in the list. Each node has an attribute 'word' for their position in the list.
I am trying to connect a node to the next node if the next node is the previous node, with an addition of just one letter
I also alphabetically ordered each word per character, so that CAT -> ACT
I want to draw an edge from each unique starting word, to all of the possible chains, so I can see all the possible chains in the list.
For example
A -> AN -> TAN -> RANT
However A --x-> T
This is my attempt
for i in range(0, G.number_of_nodes()-1):
if ( ( (len(G.node[i]['word'])+1) == len(G.node[i+1]['word']) ) and (G.node[i]['word'] in G.node[i+1]['word'])):
print G.node[i]['word'], G.node[i+1]['word']
Gave me this,
DGO DGOS DGOS DGOSS I IN ELLMS ELLMSS AEPRS AEPRSS INW DINW DINW DINWY
What the word list and the alphabetical list looks like
Why do I not see IN INW?
Also, AGNRT AGNRST should be on there but I don't understand why, along with a lot of other pairs
Where do you think I went wrong?
回答1:
The problem is that you are only comparing words that appear right next to each other in the list, i.e. words i
and i+1
, e.g. I
and IN
appear next to each other, as do WIN
and WIND
, but IN
and WIND
are far apart. It seems you want to compare all possible words, which requires a more sophisticated algorithm. Here's an idea:
- Make a dictionary where they keys are sorted words and the values are lists of actual words, e.g.
{"ACT": ["CAT", "ACT", "TAC], ...}
. Acollections.defaultdict(list)
will be useful for this. - Sort the full input list of words by length. You can use
list.sort(key=len)
assuming you have just a list of words. - Iterate through the list sorted by length. For each word, go through every subset of length
n-1
. Something likefor i in range(len(word)): process(word[:i] + word[i+1:])
. You may want to be careful about duplicates here. - For each subset, sort the subset and look it up in the dictionary. Make a link from every word in the dictionary's value (a list of actual words) to the bigger word.
回答2:
Looks like a formal languages problem. How do you handle looping nodes?
IN INW is in the list you gave.
AGNRT AGNRST are not in the list, because you started out with a single letter, that letter has to be in the next word for example I -> IN, but IN is not in AGNRT or AGNRST
回答3:
You seem to be comparing each node with just one other node, so
"IN" directly follows "I" in your wordlist, but "INW" is not directly after "IN"
回答4:
You can use the 3rd party python library, python-levenshtein
, to calculate the Levenshtein Distance which is the string edit distance. In your case, the only allowed 'edit' is the 'insertion' of the character on the next string/word on your list, so you will also need to verify that the length of the next word is 1 plus the previous word.
Here is the sample code that would achieve our stuff:
import Levenshtein as lvst
if len(word2) - len(word1) == 1 and lvst.distance(word1, word2) == 1:
print(word1, word2)
You can install python-levenshtein
by either apt-get
(systemwide) or pip
:
sudo apt-get install python-levenshtein
or
sudo apt-get install python3-levenshtein
or
pip install python-levenshtein
来源:https://stackoverflow.com/questions/33114074/python-check-string-contains-all-characters