Grouping of CFG grammar rules sentencewise

Submitted by 别来无恙 on 2020-01-06 06:49:05

Question


The rules below are generated for each sentence, and we have to group them per sentence. The input is in a file, and the output should also be written to a file.

sentenceid=2

NP--->N_NNP
NP--->N_NN_S_NU
NP--->N_NNP
NP--->N_NNP
NP--->N_NN_O_NU
VGF--->V_VM_VF

sentenceid=3

NP--->N_NN
VGNF--->V_VM_VNF
JJP--->JJ
NP--->N_NN_S_NU
NP--->N_NN
VGF--->V_VM_VF

sentenceid=4

NP--->N_NNP
NP--->N_NN_S_NU
NP--->N_NNP_O_M
VGF--->V_VM_VF

The section above contains the input, which is the grammar for each sentence. I want to group adjacent rules sentence-wise. The output should look like this:

sentenceid=2

NP--->N_NNP N_NN_S_NU N_NNP N_NNP N_NN_O_NU
VGF--->V_VM_VF

sentenceid=3

NP--->N_NN
VGNF--->V_VM_VNF
JJP--->JJ
NP--->N_NN_S_NU N_NN
VGF--->V_VM_VF

sentenceid=4

NP--->N_NNP N_NN_S_NU N_NNP_O_M
VGF--->V_VM_VF

How can I implement this? I need the rules for almost 1000 sentences for a probability calculation. This is the CFG grammar for each sentence, and I want to group adjacent rules sentence-wise.


Answer 1:


How about this, assuming the rules for all sentences are stored in a file named 'sentence':

#!/usr/bin/env python3

import re

marker = '--->'

def parse_it(sen):
    # Maps each 'sentenceid=N' header to a list of groups. Each group is a
    # list of (lhs, rhs) pairs for adjacent rules with the same left-hand side.
    total_dic = dict()
    marker_memory = ''
    mem = None   # left-hand side of the group currently being built
    lo = list()  # rules collected for the current group
    with open(sen, 'r') as fh:
        for line in fh:
            if not line.strip():
                continue
            match = re.search(r'(sentenceid=\d+)', line)
            if match:
                # Flush the last group of the previous sentence, then reset
                # so that no rule leaks into the next sentence.
                if lo:
                    total_dic[marker_memory].append(lo)
                marker_memory = match.group(0)
                total_dic[marker_memory] = []
                mem, lo = None, []
            else:
                k, v = [x.strip() for x in line.strip().split(marker)]
                if mem is None or mem == k:
                    lo.append((k, v))
                else:
                    total_dic[marker_memory].append(lo)
                    lo = [(k, v)]
                mem = k
    # Flush the final group of the last sentence.
    if lo:
        total_dic[marker_memory].append(lo)
    return total_dic

dic = parse_it('sentence')
for kin, lol in dic.items():
    print()
    print(kin)
    for i in lol:
        k, v = zip(*i)
        print('%s%s%s' % (k[0], marker, ' '.join(v)))

Output:

sentenceid=2
NP--->N_NNP N_NN_S_NU N_NNP N_NNP N_NN_O_NU
VGF--->V_VM_VF

sentenceid=3
NP--->N_NN
VGNF--->V_VM_VNF
JJP--->JJ
NP--->N_NN_S_NU N_NN
VGF--->V_VM_VF

sentenceid=4
NP--->N_NNP N_NN_S_NU N_NNP_O_M
VGF--->V_VM_VF
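
Since the question asks for the result to be written to a file, here is a minimal alternative sketch that uses itertools.groupby, which merges only consecutive items with the same key. The names group_rules, format_groups and regroup_file, as well as the file paths passed in, are made up for this illustration and are not part of the original answer. It assumes every rule line contains '--->' and every sentence starts with a 'sentenceid=N' line.

```python
import re
from itertools import groupby

MARKER = '--->'
HEADER = re.compile(r'sentenceid=\d+')

def group_rules(lines):
    """Return {header: [(lhs, [rhs, ...]), ...]}, merging adjacent rules
    that share the same left-hand side within each sentence."""
    sentences = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = HEADER.search(line)
        if match:
            current = match.group(0)
            sentences[current] = []
        elif current is not None and MARKER in line:
            lhs, rhs = (part.strip() for part in line.split(MARKER, 1))
            sentences[current].append((lhs, rhs))
    # groupby only merges consecutive items with equal keys, so a later
    # NP run in the same sentence stays separate from an earlier one.
    return {
        header: [(lhs, [rhs for _, rhs in run])
                 for lhs, run in groupby(rules, key=lambda rule: rule[0])]
        for header, rules in sentences.items()
    }

def format_groups(grouped):
    """Render the grouped rules in the layout requested in the question."""
    blocks = []
    for header, groups in grouped.items():
        body = '\n'.join(lhs + MARKER + ' '.join(rhs_list)
                         for lhs, rhs_list in groups)
        blocks.append(header + '\n\n' + body)
    return '\n\n'.join(blocks) + '\n'

def regroup_file(in_path, out_path):
    """Read rules from in_path and write the grouped rules to out_path."""
    with open(in_path) as src:
        grouped = group_rules(src)
    with open(out_path, 'w') as dst:
        dst.write(format_groups(grouped))
```

Calling regroup_file('sentence', 'grouped.txt') would then read the input file and write the grouped rules to grouped.txt (both file names are placeholders). Keeping the parsing and formatting in separate functions makes it easy to reuse the grouped dictionary for later counting steps.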


Source: https://stackoverflow.com/questions/21472527/grouping-of-cfg-grammar-rules-sentencewise
