Pyparsing: extract variable length, variable content, variable whitespace substring

误落风尘 2020-12-11 10:58

I need to extract Gleason scores from a flat file of prostatectomy final diagnostic write-ups. These scores always have the word Gleason and two numbers that add up to anoth

3 answers
  •  不知归路
    2020-12-11 11:38

    Here is a sample to pull out the patient data and any matching Gleason data.

    from pyparsing import Combine, Group, Optional, Word, nums
    num = Word(nums)
    accessionDate = Combine(num + "/" + num + "/" + num)("accDate")
    accessionNumber = Combine("S" + num + "-" + num)("accNum")
    patMedicalRecordNum = Combine(num + "/" + num + "-" + num + "-" + num)("patientNum")
    gleason = Group("GLEASON" + Optional("SCORE:") + num("left") + "+" + num("right") + "=" + num("total"))
    assert 'GLEASON 5+4=9' == gleason
    assert 'GLEASON SCORE:  3 + 3 = 6' == gleason
    
    patientData = Group(accessionDate + accessionNumber + patMedicalRecordNum)
    assert '01/02/11  S11-4444 20/111-22-3333' == patientData
    
    partMatch = patientData("patientData") | gleason("gleason")
    
    # `data` is the text of the flat file, read in beforehand
    lastPatientData = None
    for match in partMatch.searchString(data):
        if match.patientData:
            lastPatientData = match
        elif match.gleason:
            if lastPatientData is None:
                # a Gleason score appeared before any patient record
                print("bad!")
                continue
            print("{0.accDate}: {0.accNum} {0.patientNum} Gleason({1.left}+{1.right}={1.total})".format(
                lastPatientData.patientData, match.gleason
            ))
    

    Prints:

    01/01/11: S11-55555 20/444-55-6666 Gleason(5+4=9)
    01/02/11: S11-4444 20/111-22-3333 Gleason(3+3=6)
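
The snippet above relies on a `data` variable that the answer never defines. Here is a minimal self-contained sketch of the same approach for Python 3. The sample text in `data` is hypothetical: it was made up to be consistent with the printed output, and the real file layout may differ. The helper `extract_scores` and the snake_case names are also illustrative and not part of the original answer.

```python
from pyparsing import Combine, Group, Optional, Word, nums

# Hypothetical sample input; the layout of the real flat file is an assumption.
data = """\
01/01/11  S11-55555 20/444-55-6666
FINAL DIAGNOSIS: PROSTATE, ADENOCARCINOMA, GLEASON 5+4=9
01/02/11  S11-4444 20/111-22-3333
FINAL DIAGNOSIS: PROSTATE, GLEASON SCORE:  3 + 3 = 6
"""

num = Word(nums)
# Combine requires the pieces of each identifier to be adjacent (no inner spaces).
accession_date = Combine(num + "/" + num + "/" + num)("accDate")
accession_number = Combine("S" + num + "-" + num)("accNum")
patient_num = Combine(num + "/" + num + "-" + num + "-" + num)("patientNum")
patient_data = Group(accession_date + accession_number + patient_num)

# Outside Combine, pyparsing skips whitespace between tokens, so "5+4=9" and
# "3 + 3 = 6" are matched by the same expression.
gleason = Group(
    "GLEASON" + Optional("SCORE:")
    + num("left") + "+" + num("right") + "=" + num("total")
)

part_match = patient_data("patientData") | gleason("gleason")


def extract_scores(text):
    """Pair each Gleason score with the most recent patient record before it."""
    rows = []
    last_patient = None
    for match in part_match.searchString(text):
        if match.patientData:
            last_patient = match.patientData
        elif match.gleason and last_patient is not None:
            g = match.gleason
            rows.append((
                last_patient.accDate, last_patient.accNum, last_patient.patientNum,
                int(g.left), int(g.right), int(g.total),
            ))
    return rows


results = extract_scores(data)
for row in results:
    print(row)
```

Returning tuples instead of printing makes it easy to add further checks, for example that `left + right == total` holds for every row.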
    
