Grouping of CFG grammar rules sentencewise

Question

The rules below are generated for each sentence, and we have to group them per sentence. The input is in a file, and the output should also be written to a file.

sentenceid=2

NP--->N_NNP
NP--->N_NN_S_NU
NP--->N_NNP
NP--->N_NNP
NP--->N_NN_O_NU
VGF--->V_VM_VF

sentenceid=3

NP--->N_NN
VGNF--->V_VM_VNF
JJP--->JJ
NP--->N_NN_S_NU
NP--->N_NN
VGF--->V_VM_VF

sentenceid=4

NP--->N_NNP
NP--->N_NN_S_NU
NP--->N_NNP_O_M
VGF--->V_VM_VF

The section above is the input, which is the grammar for each sentence. I want to group adjacent rules sentence-wise. The output should look like this:

sentenceid=2

NP--->N_NNP N_NN_S_NU N_NNP N_NNP N_NN_O_NU
VGF--->V_VM_VF

sentenceid=3

NP--->N_NN
VGNF--->V_VM_VNF
JJP--->JJ
NP--->N_NN_S_NU N_NN
VGF--->V_VM_VF

sentenceid=4

NP--->N_NNP N_NN_S_NU N_NNP_O_M
VGF--->V_VM_VF

How can I implement this? I need the rules of almost 1000 sentences for a probability calculation.

Answer 1:

How about this, assuming the rules for all sentences are in a file named sentence:

#!/usr/bin/env python3

import re

marker = '--->'

def parse_it(sen):
    # Maps each 'sentenceid=N' header to a list of groups. Each group is a
    # list of (lhs, rhs) pairs for adjacent rules sharing the same lhs.
    # A plain dict keeps insertion order on Python 3.7+.
    total_dic = dict()
    marker_memory = ''
    with open(sen, 'r') as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            match = re.search(r'sentenceid=\d+', line)
            if match:
                # New sentence: start with an empty list of groups so that
                # nothing carries over from the previous sentence.
                marker_memory = match.group(0)
                total_dic[marker_memory] = []
                continue
            k, v = [x.strip() for x in line.split(marker, 1)]
            groups = total_dic[marker_memory]
            if groups and groups[-1][0][0] == k:
                # Same lhs as the previous rule: extend the current group.
                groups[-1].append((k, v))
            else:
                # Different lhs: open a new group.
                groups.append([(k, v)])
    return total_dic

dic = parse_it('sentence')
for kin, lol in dic.items():
    print()
    print(kin)
    for i in lol:
        k, v = zip(*i)
        print('%s%s%s' % (k[0], marker, ' '.join(v)))

Output:

sentenceid=2
NP--->N_NNP N_NN_S_NU N_NNP N_NNP N_NN_O_NU
VGF--->V_VM_VF

sentenceid=3
NP--->N_NN
VGNF--->V_VM_VNF
JJP--->JJ
NP--->N_NN_S_NU N_NN
VGF--->V_VM_VF

sentenceid=4
NP--->N_NNP N_NN_S_NU N_NNP_O_M
VGF--->V_VM_VF
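
Since the question asks for the result to be written to a file, here is a minimal alternative sketch of the same adjacent grouping using itertools.groupby from the standard library. The names group_rules and group_file, and the file paths passed to them, are hypothetical and chosen only for illustration. It assumes every rule line contains the ---> marker and that each sentence starts with a sentenceid=N line.

```python
import re
from itertools import groupby

MARKER = '--->'

def group_rules(lines):
    """Return output lines with adjacent same-LHS rules merged per sentence."""
    out = []
    rules = []  # (lhs, rhs) pairs collected for the current sentence

    def flush():
        # groupby merges only *adjacent* items that share a key (the LHS),
        # so a later NP block stays separate from an earlier one.
        for lhs, pairs in groupby(rules, key=lambda p: p[0]):
            out.append(lhs + MARKER + ' '.join(rhs for _, rhs in pairs))
        del rules[:]

    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if re.match(r'sentenceid=\d+$', line):
            flush()                  # finish the previous sentence, if any
            if out:
                out.append('')       # blank line between sentences
            out.extend([line, ''])
        else:
            lhs, rhs = line.split(MARKER, 1)
            rules.append((lhs.strip(), rhs.strip()))
    flush()                          # do not forget the last sentence
    return out

def group_file(in_path, out_path):
    # Read the rules from in_path and write the grouped rules to out_path.
    with open(in_path) as src:
        grouped = group_rules(src)
    with open(out_path, 'w') as dst:
        dst.write('\n'.join(grouped) + '\n')
```

Calling group_file('sentence', 'grouped.txt') would then read the input file and write the grouped rules to grouped.txt (both file names are placeholders). Collecting the rules of one sentence before grouping keeps the sentence boundary handling in a single place, the flush helper.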

Source: https://stackoverflow.com/questions/21472527/grouping-of-cfg-grammar-rules-sentencewise

Tags

tree

parse-tree