Pyparsing: extract variable length, variable content, variable whitespace substring

误落风尘 2020-12-11 10:58

I need to extract Gleason scores from a flat file of prostatectomy final diagnostic write-ups. These scores always have the word Gleason and two numbers that add up to anoth

3 Answers
  • 2020-12-11 11:27
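    import re

    # Spaces are stripped before matching, so the pattern contains no whitespace.
    # "+" is escaped (it is a regex quantifier) and \D*? skips text such as "score:".
    gleason = re.compile(r"gleason\D*?(\d+\+\d+=\d+)")
    scores = set()
    for record in records:
        for line in record.lower().split("\n"):
            # search() finds the score anywhere in the line; match() only checks the start
            m = gleason.search(line.replace(" ", ""))
            if m:
                scores.add(m.group(1))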
    

    Or something
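
A slightly fuller sketch of the same regex idea, assuming the scores look like `Gleason ... N + N = N` with arbitrary spacing. The names `GLEASON_RE`, `find_scores`, and the sample text are made up for illustration; the sum check relies on the question's statement that the two numbers add up to the third.

```python
import re

# \s* tolerates variable whitespace; [^\d\n]*? skips words such as "score:"
# without crossing a line break.
GLEASON_RE = re.compile(
    r"gleason[^\d\n]*?(\d+)\s*\+\s*(\d+)\s*=\s*(\d+)", re.IGNORECASE
)

def find_scores(text):
    """Return (left, right, total) tuples whose grades add up to the total."""
    found = []
    for m in GLEASON_RE.finditer(text):
        left, right, total = (int(g) for g in m.groups())
        if left + right == total:  # drop matches with inconsistent arithmetic
            found.append((left, right, total))
    return found

# Hypothetical report text, not taken from the original flat file.
sample = "GLEASON 5+4=9\nGleason score:  3 + 3 = 6\nGleason 3+4=8\n"
scores = find_scores(sample)
print(scores)
```

Keeping the whitespace handling in the pattern, rather than stripping spaces first, leaves the original text intact in case the match position is needed later.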

  • 2020-12-11 11:31

    Take a look at the SkipTo parse element in pyparsing. If you define a pyparsing structure for the num+num=num part, you should be able to use SkipTo to skip anything between "Gleason" and that. Roughly like this (untested pseudo-pyparsing):

    from pyparsing import SkipTo, Word, nums
    num = Word(nums)
    score = num + "+" + num + "=" + num
    Gleason = "Gleason" + SkipTo(score) + score
    

    PyParsing by default skips whitespace anyway, and with SkipTo you can skip anything that doesn't match your desired format.
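
The SkipTo idea can be sketched as a small runnable example. This is only an illustration under assumptions: the result names (`left`, `right`, `total`), the `extract` helper, and the sample lines are hypothetical, and `CaselessKeyword` is used here so that both "GLEASON" and "Gleason" match.

```python
from pyparsing import CaselessKeyword, SkipTo, Word, nums

num = Word(nums)
# Two grades and their total; whitespace between tokens is skipped by default.
score = num("left") + "+" + num("right") + "=" + num("total")
# SkipTo(score) consumes any text between the keyword and the first score.
gleason = CaselessKeyword("gleason") + SkipTo(score).suppress() + score

# Hypothetical sample lines, not taken from the original data file.
samples = [
    "GLEASON 5+4=9",
    "Gleason score:  3 + 3 = 6",
    "adenocarcinoma, Gleason grade (primary + secondary) 4+3 = 7",
]

def extract(text):
    """Return (left, right, total) as ints for the first score in text, or None."""
    for match in gleason.searchString(text):
        return int(match["left"]), int(match["right"]), int(match["total"])
    return None

results = [extract(s) for s in samples]
print(results)
```

One caveat with this approach: SkipTo keeps scanning until it finds a score, so a "Gleason" mention with no score nearby could pick up a score much later in the text. Parsing line by line, as above, limits that.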

  • 2020-12-11 11:38

    Here is a sample to pull out the patient data and any matching Gleason data.

    from pyparsing import *
    num = Word(nums)
    accessionDate = Combine(num + "/" + num + "/" + num)("accDate")
    accessionNumber = Combine("S" + num + "-" + num)("accNum")
    patMedicalRecordNum = Combine(num + "/" + num + "-" + num + "-" + num)("patientNum")
    gleason = Group("GLEASON" + Optional("SCORE:") + num("left") + "+" + num("right") + "=" + num("total"))
    assert 'GLEASON 5+4=9' == gleason
    assert 'GLEASON SCORE:  3 + 3 = 6' == gleason
    
    patientData = Group(accessionDate + accessionNumber + patMedicalRecordNum)
    assert '01/02/11  S11-4444 20/111-22-3333' == patientData
    
    partMatch = patientData("patientData") | gleason("gleason")
    
    lastPatientData = None
    for match in partMatch.searchString(data):
        if match.patientData:
            lastPatientData = match
        elif match.gleason:
            if lastPatientData is None:
                print("bad!")
                continue
            print("{0.accDate}: {0.accNum} {0.patientNum} Gleason({1.left}+{1.right}={1.total})".format(
                            lastPatientData.patientData, match.gleason
                            ))
    

    Prints:

    01/01/11: S11-55555 20/444-55-6666 Gleason(5+4=9)
    01/02/11: S11-4444 20/111-22-3333 Gleason(3+3=6)
    