Replace with abbreviations from dictionary using Python

Question

I'm trying to replace words like 'rna' with 'ribonucleic acid' from a dictionary of abbreviations. I tried writing the following, but it doesn't replace the abbreviations.

import csv,re
outfile = open ("Dict.txt", "w")
with open('Dictionary.csv', mode='r') as infile:
    reader = csv.reader(infile)
    mydict = {rows[0]:rows[1] for rows in reader}
    print >> outfile, mydict
out = open ("out.txt", "w")
ss = open ("trial.csv", "r").readlines()
s = str(ss)
def process(s):
    da = ''.join( mydict.get( word, word ) for word in re.split( '(\W+)', s ) )
    print >> out, da
process(s)

A sample trial.csv file would be

A,B,C,D
RNA,lung cancer,15,biotin
RNA,lung cancer,15,biotin
RNA,breast cancer,15,biotin
RNA,breast cancer,15,biotin
RNA,lung cancer,15,biotin

Sample Dictionary.csv:

rna,ribonucleic acid
rnd,radical neck dissection
rni,recommended nutrient intake
rnp,ribonucleoprotein

My output file should have 'RNA' replaced by 'ribonucleic acid'

Answer 1:

I think the line s = str(ss) is causing the problem: it turns the list of lines you just read into a single string (the repr of that list).

Try this instead:

def process(ss):
    for line in ss:
        da = ''.join( mydict.get( word, word ) for word in re.split( r'(\W+)', line ) )
        # each line already ends with '\n', so write it as-is instead of using print
        out.write(da)

process(ss)
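
For readers on Python 3, where `print >> out` is no longer valid, the same per-line loop might look like the sketch below. It is not the asker's code: `expand_lines` and the inline sample data are hypothetical names chosen for illustration, and an in-memory buffer stands in for out.txt. The lookup is still case-sensitive here, so the sample row uses lowercase `rna`.

```python
import io
import re

# Hypothetical sample dictionary; in the question it is loaded from Dictionary.csv.
mydict = {"rna": "ribonucleic acid", "rnp": "ribonucleoprotein"}

def expand_lines(lines, out):
    """Write each line to `out` with exact-match abbreviations expanded."""
    for line in lines:
        # The capturing group keeps the separators, so join() rebuilds the line.
        pieces = re.split(r"(\W+)", line)
        out.write("".join(mydict.get(piece, piece) for piece in pieces))

# Lines as readlines() would return them, trailing newlines included.
sample = ["A,B,C,D\n", "rna,lung cancer,15,biotin\n"]
buffer = io.StringIO()
expand_lines(sample, buffer)
result = buffer.getvalue()
```

Because each line keeps its own trailing newline, `out.write` reproduces the original line structure without adding blank lines.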

Answer 2:

> I'm trying to replace 'RNA' but my dictionary has 'rna'. Is there a way I can ignore the case?

Sure. Just call casefold on each key while creating the dictionary, and again while looking up values:

mydict = {rows[0].casefold(): rows[1] for rows in reader}

# ...

da = ''.join( mydict.get(word.casefold(), word) for word in re.split( r'(\W+)', s ) )
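
Put together, the two casefold calls might be used as in the following sketch. It assumes Python 3; the `expand` function and the in-memory `dictionary_csv` (a stand-in for Dictionary.csv) are hypothetical names for illustration only.

```python
import csv
import io
import re

# Stand-in for Dictionary.csv so the sketch runs without any files.
dictionary_csv = io.StringIO("rna,ribonucleic acid\nrnd,radical neck dissection\n")

# Casefold every key once, when the dictionary is built.
mydict = {row[0].casefold(): row[1] for row in csv.reader(dictionary_csv)}

def expand(text):
    """Replace whole-word abbreviations in `text`, ignoring case."""
    pieces = re.split(r"(\W+)", text)
    # Casefold only for the lookup; unmatched pieces keep their original case.
    return "".join(mydict.get(piece.casefold(), piece) for piece in pieces)

expanded = expand("RNA,lung cancer,15,biotin")
mixed_case = expand("Rnd and corna")
```

Only the lookup key is casefolded, so words that are not abbreviations come through with their original capitalization.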

If you're using an older version of Python that doesn't have casefold (it was added in Python 3.3 and does not exist in Python 2, which the print >> statements in your code suggest you are using), use lower instead. It won't always do the right thing for non-English characters (e.g., 'ß'.casefold() is 'ss', while 'ß'.lower() is 'ß'), but it seems like that's OK for your application. (If it's not, you have to either write something more complicated with unicodedata, or find a third-party library.)
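
The difference between the two methods can be checked directly. The sketch below assumes Python 3; `fold` is a hypothetical helper that prefers casefold and falls back to lower where casefold is missing.

```python
# casefold() applies more aggressive folding than lower().
assert "ß".lower() == "ß"
assert "ß".casefold() == "ss"

# For ASCII-only abbreviations like those in the question, the two agree.
for word in ["RNA", "Rnd", "rNp"]:
    assert word.lower() == word.casefold()

def fold(word):
    """Use casefold() where available (Python 3.3+), else fall back to lower()."""
    folder = getattr(word, "casefold", word.lower)
    return folder()

folded_abbreviation = fold("RNA")
folded_german = fold("Straße")
```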

> Also, I don't want it to replace 'corna' (I know such a word doesn't exist, but I want to make sure it doesn't happen) with 'coribonucleic acid'.

Well, you're already doing that with your re.split, which splits on any "non-word" characters; you then look up each resulting word separately. Since corna won't be in the dict, it won't be replaced. (Note, though, that re's notion of "word" characters may not be what you want: it includes underscores and digits as part of a word, so rna2dna won't match, while a chunk of binary data like s1$_2(rNa/ might.)
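
The cases mentioned above can be tried out with a short sketch. The one-entry dictionary and the `expand` helper are hypothetical, and Python 3 is assumed for casefold.

```python
import re

# Hypothetical one-entry dictionary for illustration.
mydict = {"rna": "ribonucleic acid"}

def expand(text):
    """Expand whole-word abbreviations, ignoring case."""
    pieces = re.split(r"(\W+)", text)
    return "".join(mydict.get(piece.casefold(), piece) for piece in pieces)

whole_word = expand("rna corna")      # 'corna' is a separate token and not a key
with_digits = expand("rna2dna")       # digits are word characters: one token
with_underscore = expand("rna_seq")   # the underscore is a word character too
noisy = expand("s1$_2(rNa/")          # punctuation isolates 'rNa', so it matches
```

If digits or underscores should separate abbreviations in your data, the split pattern would need to change; that is outside what the original code does.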

You've also got another serious problem in your code:

ss = open ("trial.csv", "r").readlines()
s = str(ss)

Calling readlines means that ss is going to be a list of lines. Calling str on that list means that s is going to be a big string with [, then the repr of each line (with quotes around it, backslash escapes within it, etc.) separated by commas, then ]. You almost certainly don't want that. Just use read() if you want to read the whole file into a string as-is.
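
A small sketch makes the difference visible. It assumes Python 3 and uses an in-memory `io.StringIO` as a stand-in for trial.csv; the variable names are illustrative.

```python
import io

# Stand-in for trial.csv so the sketch needs no file on disk.
contents = "A,B,C,D\nRNA,lung cancer,15,biotin\n"

lines = io.StringIO(contents).readlines()   # a list of lines
as_repr = str(lines)                        # the repr of that list, brackets and quotes included
whole_text = io.StringIO(contents).read()   # the file contents, unchanged
```

Here `as_repr` starts with `[` and contains quote characters and literal backslash-n sequences, none of which are in the file, while `whole_text` is exactly what was read.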

And you appear to have a problem in your data, too:

rna,ibonucleic acid

If you replace rna with ibonucleic acid, and so forth, you're going to have some hard-to-read output. If this is really your dictionary format, and the dictionary's user is supposed to infer some logic, e.g., that the first letter gets copied from the abbreviation, you have to write that logic. For example:

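def lookup(word):
    try:
        # Look up first, so a non-matching (or empty) word falls through untouched.
        expansion = mydict[word.casefold()]
    except KeyError:
        return word
    # Restore the first letter that the dictionary entry leaves out.
    return word[0] + expansion
da = ''.join(lookup(word) for word in re.split(r'(\W+)', s))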

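Finally, it's a bad idea to use unescaped backslashes in a string literal. In this case, you get away with it, because Python happens to not have a meaning for \W, but that's not always going to be true: '\b', for example, is a backspace character rather than a regex word boundary, and newer Python versions emit a warning for unrecognized escapes like '\W'. The best way around this is to use raw string literals, like r'(\W+)'.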
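
The effect of the raw prefix can be seen in a few lines. This is an illustrative sketch; the sample text and variable names are made up.

```python
import re

# A raw literal keeps the backslash exactly as written.
assert r"(\W+)" == "(\\W+)"

# \b shows why relying on luck is risky: unescaped, it is a backspace character,
# while the raw form is the two-character regex word-boundary assertion.
assert "\b" == "\x08"
assert len(r"\b") == 2

text = "rna corna rna2"
matches_raw = re.findall(r"\brna\b", text)    # word boundaries on both sides
matches_plain = re.findall("\brna\b", text)   # looks for literal backspace characters
```

The raw pattern finds only the standalone word, while the non-raw one finds nothing, because the text contains no backspace characters.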

Source: https://stackoverflow.com/questions/26856001/replace-with-abbreviations-from-dictionary-using-python

Tags

python

csv

dictionary

replace

abbreviation