Using regex to transform data into a dictionary in Python

Submitted by 给你一囗甜甜゛ on 2019-11-29 15:25:21

\r is not a character class; it matches a literal carriage return. I think you meant to use \s instead. You can also drop any groups you don't use.

But most of all, you need to extract your groups correctly:

match = re.search(r'>(\w+)\s+(\w+)', line)
if match:
    tag, gene = match.groups()
    myDict[tag] = gene

By creating only two capturing groups, we can more simply extract those two with .groups() and directly assign them to two variables, tag and gene.
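To see the two-group extraction in isolation, here is a minimal sketch run against a single made-up line. The sample line and the name my_dict are assumptions for illustration; the pattern is the one shown above.

```python
import re

# Hypothetical single-line record: tag and gene separated by whitespace.
line = ">gene1 ATGCCGTA"

my_dict = {}
match = re.search(r'>(\w+)\s+(\w+)', line)
if match:
    # .groups() returns one string per capturing group, in order.
    tag, gene = match.groups()
    my_dict[tag] = gene

print(my_dict)
```

Because the pattern has exactly two capturing groups, the tuple from .groups() unpacks cleanly into two names.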

However, reading up on the FASTA format indicates that it is a multi-line format, with the tag on one line and the gene data on one or more lines after that. In that case your \r was presumably meant to match the newline. That won't work, because you read the file one line at a time.

It would be much simpler to read that format without regular expressions like so:

myDict = {}

with open('d.fasta') as fileData:
    tag = None
    for line in fileData:
        line = line.strip()
        if not line:
            continue
        if line[0] == '>':
            # A new record starts; remember its tag for the lines that follow.
            tag = line[1:]
            myDict[tag] = ''
        else:
            assert tag is not None, 'Invalid format, found gene without tag'
            myDict[tag] += line

print(myDict)

This reads the file line by line, detecting tags based on the starting > character, then reads multiple lines of gene information, collecting it into your dictionary under the most recently read tag.

Note the newline handling: in Python 3, text mode already uses universal newlines, so the file is read correctly whatever newline convention was used to create it. In Python 2 the same effect required opening the file with the rU mode, which was removed in Python 3.11.
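The line-by-line approach above can also be tried without a file on disk. The sketch below is a variant, not the code above verbatim: the function name parse_fasta and the sample text are made up for illustration, it accepts any iterable of lines, and it collects the pieces in lists before joining them rather than concatenating strings.

```python
def parse_fasta(lines):
    """Map each '>' tag to its concatenated sequence lines."""
    chunks = {}
    tag = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            tag = line[1:]
            chunks[tag] = []
        else:
            if tag is None:
                raise ValueError('sequence data found before any tag')
            chunks[tag].append(line)
    # Join each record's pieces once at the end.
    return {name: ''.join(parts) for name, parts in chunks.items()}


# Hypothetical sample with a two-line sequence and a blank line.
sample = """>gene1
ATGC
TTAA

>gene2
GGCC
"""

records = parse_fasta(sample.splitlines())
print(records)
```

An open file object can be passed in place of sample.splitlines(), since iterating a text file also yields lines.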

Last but not least, take a look at the Biopython project; its Bio.SeqIO module handles FASTA plus many other formats.

Two errors I see:

Your regex is probably wrong. It's unlikely your FASTA input actually contains a bare carriage return (\r), so your regex won't match anything. Hence the if match: test is always false, so nothing happens.

Further, when processing each match: You are adding the first character of the gene (which is whitespace) as a key and the second character as the value.

You probably meant to use groups 2 and 4 respectively:

myDict[match.group(2)] = match.group(4)
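The first point can be checked quickly. This is a small sketch with a made-up header line, assuming the line ends in a plain newline as it usually does when iterating over a file opened in text mode; the two patterns are simplified stand-ins, not the original regex.

```python
import re

# Hypothetical header line ending in "\n" rather than "\r".
line = ">gene1\n"

with_cr = re.search(r'>(\w+)\r', line)  # requires a literal carriage return
with_ws = re.search(r'>(\w+)\s', line)  # accepts any whitespace, including "\n"

print(with_cr)
print(with_ws.group(1))
```

The \r version finds nothing, so an if match: test guarding it never runs, while the \s version captures the tag.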

Don't use a regex for this.

class FASTA(object):
    def __init__(self, data):
        self.data = data.strip().splitlines()
        self.desc = self.data[0]
        # Join the sequence lines and get rid of spaces.
        self.sequence = "".join(self.data[1:]).replace(" ", "")

    def GetCodons(self):
        # Split the sequence into consecutive 3-character chunks.
        return [self.sequence[i:i+3] for i in range(0, len(self.sequence), 3)]

    def __str__(self):
        return "DESC:'%s'\nSEQ:'%s'" % (self.desc, self.sequence)

with open("data.fasta") as f:
    data = f.read()
parts = data.split(">")
# parts[0] is whatever precedes the first '>' (normally empty), so skip it.
for p in parts[1:]:
    record = FASTA(p)
    print(record)
    print(record.GetCodons())
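To make the split-on-">" idea concrete without a file, here is a rough sketch on an in-memory string. The sample data and the name records are assumptions for illustration; the slicing is the same 3-character chunking as in GetCodons.

```python
# Hypothetical FASTA text; the first record has an embedded space and
# a length that is not a multiple of three.
data = ">gene1\nATG GCC\nTA\n>gene2\nGGCCTT\n"

records = {}
# The piece before the first ">" is empty, so skip it.
for part in data.split(">")[1:]:
    lines = part.strip().splitlines()
    desc = lines[0]
    sequence = "".join(lines[1:]).replace(" ", "")
    # Chunk into groups of three; the last chunk may be shorter.
    records[desc] = [sequence[i:i + 3] for i in range(0, len(sequence), 3)]

print(records)
```

Two things are worth noticing: a sequence whose length is not a multiple of three leaves a short trailing chunk, and a description line that itself contains ">" would be split in the wrong place.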

Unless your file is too big to fit in memory (which I guess it is not), and provided each sequence sits on a single line, the whole thing is as simple as

import re

with open('d.fasta') as fp:
    myDict = dict(re.findall(r'(?m)^>(\w+)\s+^(\S+)', fp.read()))
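As a quick check of what that pattern captures, the sketch below runs it on two made-up strings instead of a file; the variable names and sample data are assumptions for illustration.

```python
import re

pattern = r'(?m)^>(\w+)\s+^(\S+)'

# Hypothetical input where every sequence is on one line.
single_line = ">gene1\nATGC\n>gene2\nGGCC\n"
# Hypothetical input where gene1's sequence spans two lines.
multi_line = ">gene1\nATGC\nTTAA\n>gene2\nGGCC\n"

single_result = dict(re.findall(pattern, single_line))
multi_result = dict(re.findall(pattern, multi_line))

print(single_result)
print(multi_result)
```

Both calls give the same dictionary, because (\S+) stops at the first whitespace: only the first sequence line after each tag is kept, and the second line of gene1 in the multi-line input is dropped. For wrapped sequences, a line-by-line reader is the safer choice.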