Choosing the right data structure to parse a file

Question

I have a CSV file with contents in the following format:

CSE110, Mon, 1:00 PM, Fri, 1:00 PM
CSE114, Mon, 8:00 AM, Wed, 8:00 AM, Fri, 8:00 AM

Each line is basically a course name followed by its meeting times.

What's the best data structure to parse and store this data?

I tried using named tuples as follows:

from collections import namedtuple
CourseTimes = namedtuple('CourseTimes', 'course_name, day, start_time')

But a single course can be scheduled on multiple days and times, as shown for CSE114 above, and the number of slots is only known at run time. How should I handle this?

Or could I use a dictionary or a list instead?

I am trying to solve a scheduling problem that assigns TAs to courses, so I might need to compare times later to check for collisions.

To complicate things, the input file contains other data that I also need to parse. The full format is as follows:

//Course times
CSE110, Mon, 1:00 PM, Fri, 1:00 PM
CSE114, Mon, 8:00 AM, Wed, 8:00 AM, Fri, 8:00 AM
....

//Course recitation times
CSE306, Mon, 2:30 PM
CSE307, Fri, 4:00 PM
...

//class strength
CSE101, 44, yes
CSE101, 115, yes
...

I suppose I need to store all of this in separate data structures. What would be the right regex patterns for each category?

Answer 1:

Start by noting a few things about your data:

You have a number of unique strings (the courses)
After each course, there are a number of strings (the days and times the class meets each week)

With that, you have a series of unique keys that each have a number of values.

Sounds like a dictionary to me.

To get that data into a dictionary, start by reading the file. Next, you can either use regular expressions to select each [day], [hour]:[minutes] [AM/PM] section, or plain old str.split() to break the line into fields at the commas. The course string is the key into the dictionary, with the rest of the line stored as a tuple or list of values. Then move on to the next line.
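
As a rough sketch of the approach described above (not the only way to do it), the following reads the two sample lines from the question and builds a dictionary mapping each course to a list of (day, time) pairs. The names SAMPLE and parse_course_times are made up for illustration, and the code assumes that every field after the course name comes in day/time pairs, as in the sample.

```python
import csv
import io
from collections import defaultdict

# Sample lines copied from the question; a real file would be opened instead.
SAMPLE = """\
CSE110, Mon, 1:00 PM, Fri, 1:00 PM
CSE114, Mon, 8:00 AM, Wed, 8:00 AM, Fri, 8:00 AM
"""


def parse_course_times(lines):
    """Map each course name to a list of (day, time) tuples."""
    schedule = defaultdict(list)
    # skipinitialspace drops the blank that follows each comma
    for row in csv.reader(lines, skipinitialspace=True):
        if not row:  # blank line
            continue
        course = row[0].strip()
        rest = [field.strip() for field in row[1:]]
        # Assumption: the remaining fields alternate day, time, day, time, ...
        for day, time in zip(rest[0::2], rest[1::2]):
            schedule[course].append((day, time))
    return dict(schedule)


schedule = parse_course_times(io.StringIO(SAMPLE))
print(schedule["CSE110"])
```

With a real file you would pass the open file object instead of the io.StringIO wrapper, since csv.reader accepts any iterable of lines.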

Answer 2:

{
    'CSE114': {'Mon': ['8:00 AM'], 'Wed': ['8:00 AM'], 'Fri': ['8:00 AM']},
    'CSE110': {'Mon': ['1:00 PM'], 'Fri': ['1:00 PM']}
}

Use a dictionary of this form, so that a course can have multiple slots on the same day.

When you read the CSV file, create an entry for the course and the day (if it doesn't already exist) and assign it a single-element list holding the time. If an entry for that course and day is already present, just append to the existing list; that means the course meets more than once on the same day.
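
One possible way to write the create-or-append step is with dict.setdefault, as in the sketch below. The helper names add_slot and parse_line are hypothetical, and the second input line (a 3:00 PM Monday slot for CSE110) is invented here purely to show a course with two slots on the same day; it does not come from the question's data.

```python
def add_slot(schedule, course, day, time):
    """Create the course/day entry if needed, then append the time."""
    schedule.setdefault(course, {}).setdefault(day, []).append(time)


def parse_line(schedule, line):
    """Add every (day, time) pair on one course-time line to schedule."""
    fields = [field.strip() for field in line.split(',')]
    course = fields[0]
    # Assumption: fields after the course name alternate day, time, ...
    for day, time in zip(fields[1::2], fields[2::2]):
        add_slot(schedule, course, day, time)


schedule = {}
parse_line(schedule, "CSE110, Mon, 1:00 PM, Fri, 1:00 PM")
# Made-up extra line to demonstrate a second slot on the same day
parse_line(schedule, "CSE110, Mon, 3:00 PM")
print(schedule)
```

Using setdefault keeps the "exists already or not" check in one expression; a collections.defaultdict would work just as well.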

You don't need a regex to determine the category of an input line. The course-time lines (whether they list a single day or multiple days) can be told apart from the class-strength lines by checking whether the second field is an integer:

fields = line.split(', ')
try:
    n = int(fields[1])  # n = class strength
except ValueError:
    # The second field is not an integer, so this is a course-time line;
    # continue adding it to the dictionary.
    pass
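
To make the idea concrete, here is a small self-contained sketch of that integer check wrapped in a function. The name classify and the shape of the returned tuples are my own choices for illustration. The meaning of the third field on a class-strength line ("yes") isn't explained in the question, so it is kept as a raw string.

```python
def classify(line):
    """Return a tagged tuple describing one data line from the file."""
    fields = [field.strip() for field in line.split(',')]
    try:
        count = int(fields[1])
    except (IndexError, ValueError):
        # Second field is missing or not an integer: treat as day/time pairs
        slots = list(zip(fields[1::2], fields[2::2]))
        return ('times', fields[0], slots)
    # Second field is an integer: a class-strength line; keep the rest raw
    return ('strength', fields[0], count, fields[2:])


print(classify("CSE306, Mon, 2:30 PM"))
print(classify("CSE101, 44, yes"))
```

Note that this check alone cannot tell course times from recitation times, because both kinds of line have the same layout. If you need them in separate structures, you would probably have to track which "//" header you last saw while reading the file.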

Source: https://stackoverflow.com/questions/29132845/choosing-right-data-structure-to-parse-a-file

Tags

python

regex

csv

data-structures

namedtuple