I\'m looking for a simple way of parsing complex text files into a pandas DataFrame. Below is a sample file, what I want the result to look like after parsing, and my curren
I would suggest using a parser combinator library like parsy. Compared to using regexes, the result will not be as concise, but it will be much more readable and robust, while still being relatively light-weight.
Parsing is in general quite a hard task, and an approach that is good for people at beginner level for general programming might be hard to find.
EDIT:
Some actual example code that does minimal parsing of your supplied example. It does not pass to pandas, or even match up names to scores, or students to grades etc. - it just returns a hierarchy of objects starting with School at the top, with the relevant attributes as you would expect:
from parsy import string, regex, seq
import attr
@attr.s
class Student():
name = attr.ib()
number = attr.ib()
@attr.s
class Score():
score = attr.ib()
number = attr.ib()
@attr.s
class Grade():
grade = attr.ib()
students = attr.ib()
scores = attr.ib()
@attr.s
class School():
name = attr.ib()
grades = attr.ib()
integer = regex(r"\d+").map(int)
student_number = integer
score = integer
student_name = regex(r"[^\n]+")
student_def = seq(student_number.tag('number') << string(", "),
student_name.tag('name') << string("\n")).combine_dict(Student)
student_def_list = string("Student number, Name\n") >> student_def.many()
score_def = seq(student_number.tag('number') << string(", "),
score.tag('score') << string("\n")).combine_dict(Score)
score_def_list = string("Student number, Score\n") >> score_def.many()
grade_value = integer
grade_def = string("Grade = ") >> grade_value << string("\n")
school_grade = seq(grade_def.tag('grade'),
student_def_list.tag('students') << regex(r"\n*"),
score_def_list.tag('scores') << regex(r"\n*")
).combine_dict(Grade)
school_name = regex(r"[^\n]+")
school_def = string("School = ") >> school_name << string("\n")
school = seq(school_def.tag('name'),
school_grade.many().tag('grades')
).combine_dict(School)
def parse(text):
return school.many().parse(text)
This is much more verbose than a regex solution, but much closer to a declarative definition of your file format.