How to parse complex text files using Python?

北海茫月 2020-12-02 14:59

I'm looking for a simple way of parsing complex text files into a pandas DataFrame. Below is a sample file, what I want the result to look like after parsing, and my curren

4 answers
  •  暗喜 (OP)
    2020-12-02 15:34

    I would suggest using a parser combinator library like parsy. Compared to regexes, the result will not be as concise, but it will be much more readable and robust while still being relatively lightweight.

    Parsing is, in general, quite a hard task, and an approach that suits people who are still beginners at general programming may be hard to find.
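
    To make the idea of a parser combinator concrete, here is a minimal hand-rolled sketch using only the standard library. It is not how parsy is implemented; the helper names (`literal`, `fmap`, `many`, and the local `regex` and `seq`) are hypothetical stand-ins chosen for illustration. Each parser is a plain function that takes the text and a position and returns a value plus the new position, or `None` on failure, and bigger parsers are built by combining smaller ones.

```python
import re

# A parser is a function (text, pos) -> (value, new_pos), or None on failure.
# These helpers are illustrative stand-ins, not parsy's API or implementation.

def regex(pattern):
    compiled = re.compile(pattern)
    def parser(text, pos):
        m = compiled.match(text, pos)
        return (m.group(0), m.end()) if m else None
    return parser

def literal(s):
    def parser(text, pos):
        return (s, pos + len(s)) if text.startswith(s, pos) else None
    return parser

def fmap(p, fn):
    """Transform the value produced by parser p with fn."""
    def parser(text, pos):
        result = p(text, pos)
        if result is None:
            return None
        value, new_pos = result
        return fn(value), new_pos
    return parser

def seq(*parsers):
    """Run parsers one after another; fail as soon as one of them fails."""
    def parser(text, pos):
        values = []
        for p in parsers:
            result = p(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return parser

def many(p):
    """Apply p zero or more times and collect the results."""
    def parser(text, pos):
        values = []
        while True:
            result = p(text, pos)
            # Stop on failure, or on an empty match that would loop forever.
            if result is None or result[1] == pos:
                return values, pos
            value, pos = result
            values.append(value)
    return parser

# Small parsers combined into a parser for "number, name" lines.
integer = fmap(regex(r"\d+"), int)
name = regex(r"[^\n]+")
student_line = fmap(
    seq(integer, literal(", "), name, literal("\n")),
    lambda parts: {"number": parts[0], "name": parts[2]},
)
student_lines = many(student_line)

students, end = student_lines("0, Phoebe\n1, Rachel\n", 0)
```

    A library such as parsy packages the same kind of building blocks behind operators and methods, so the grammar reads more like a description of the file than like string-handling code.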

    EDIT: Some actual example code that does minimal parsing of your supplied example. It does not pass anything to pandas, or even match up names to scores, or students to grades etc. It just returns a hierarchy of objects starting with School at the top, with the relevant attributes as you would expect:

    from parsy import string, regex, seq
    import attr
    
    
    @attr.s
    class Student():
        name = attr.ib()
        number = attr.ib()
    
    
    @attr.s
    class Score():
        score = attr.ib()
        number = attr.ib()
    
    
    @attr.s
    class Grade():
        grade = attr.ib()
        students = attr.ib()
        scores = attr.ib()
    
    
    @attr.s
    class School():
        name = attr.ib()
        grades = attr.ib()
    
    
    # Primitive parsers: a run of digits converted to int, and a rest-of-line name.
    integer = regex(r"\d+").map(int)
    student_number = integer
    score = integer
    student_name = regex(r"[^\n]+")

    # One "number, name" line; the tagged pairs become Student(number=..., name=...).
    student_def = seq(student_number.tag('number') << string(", "),
                      student_name.tag('name') << string("\n")).combine_dict(Student)
    # The header line followed by zero or more student lines.
    student_def_list = string("Student number, Name\n") >> student_def.many()

    # One "number, score" line, built the same way.
    score_def = seq(student_number.tag('number') << string(", "),
                    score.tag('score') << string("\n")).combine_dict(Score)
    score_def_list = string("Student number, Score\n") >> score_def.many()

    # A grade: "Grade = N", then the student table and the score table,
    # each optionally followed by blank lines.
    grade_value = integer
    grade_def = string("Grade = ") >> grade_value << string("\n")
    school_grade = seq(grade_def.tag('grade'),
                       student_def_list.tag('students') << regex(r"\n*"),
                       score_def_list.tag('scores') << regex(r"\n*")
                       ).combine_dict(Grade)

    # A school: "School = <name>" followed by zero or more grades.
    school_name = regex(r"[^\n]+")
    school_def = string("School = ") >> school_name << string("\n")
    school = seq(school_def.tag('name'),
                 school_grade.many().tag('grades')
                 ).combine_dict(School)
    
    
    def parse(text):
        return school.many().parse(text)
    

    This is much more verbose than a regex solution, but much closer to a declarative definition of your file format.
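
    The code above stops at the object hierarchy. As a rough sketch of the missing step, the following flattens such a hierarchy into one row per student and matches scores to students by number. It assumes the field names used above; to stay self-contained it uses standard-library dataclasses as stand-ins for the attrs classes, and the school and student values are made up for illustration. The `to_rows` helper is hypothetical.

```python
from dataclasses import dataclass

# Stand-ins for the attrs classes in the answer, with the same field names.
@dataclass
class Student:
    name: str
    number: int

@dataclass
class Score:
    score: int
    number: int

@dataclass
class Grade:
    grade: int
    students: list
    scores: list

@dataclass
class School:
    name: str
    grades: list

def to_rows(schools):
    """Flatten parsed schools into one dict per student, joined to scores by number."""
    rows = []
    for school in schools:
        for grade in school.grades:
            # Look up each score by student number within this grade.
            score_by_number = {s.number: s.score for s in grade.scores}
            for student in grade.students:
                rows.append({
                    "School": school.name,
                    "Grade": grade.grade,
                    "Student number": student.number,
                    "Name": student.name,
                    # None when a student has no matching score line.
                    "Score": score_by_number.get(student.number),
                })
    return rows

# Made-up data shaped like the list that parse(text) returns.
schools = [
    School("Example High", [
        Grade(1,
              [Student("Ann", 0), Student("Bob", 1)],
              [Score(3, 0), Score(7, 1)]),
    ]),
]
rows = to_rows(schools)
```

    With pandas installed, `pandas.DataFrame(rows)` should then give a table with one row per student, and `set_index` can be applied if a hierarchical index is wanted.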
