The Ruby gem rmail has methods to parse a mailbox file on local disk. Unfortunately, this gem is broken in Ruby 2.0.0. It might not get fixed, because folks
The mbox format is about as simple as you can get. It's simply the concatenation of all the messages, separated by a blank line. The first line of each message starts with the five characters "From " (including the trailing space); when messages are added to the file, any body line that starts with "From " has a > prefixed, so you can reliably use the fact that a line starts with "From " as an indicator that it is the start of a message.
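If you would rather not depend on a gem, the splitting rule above can be sketched in a few lines of plain Ruby. This is only a minimal sketch: split_mbox is a name I made up, it assumes the ">"-quoting variant, and it undoes the quoting by stripping one leading ">" from any line that starts with ">From ", which may not match what every mail agent wrote.

```ruby
require 'stringio'

# Hypothetical helper: split an mbox stream into an array of raw messages.
# Assumes the plain "From " quoting variant, not the Content-Length one.
def split_mbox(io)
  messages = []
  current = nil # lines of the message being collected, nil before the first one
  io.each_line do |line|
    if line.start_with?('From ')
      # A "From " line starts a new message; the separator line itself is dropped.
      messages << current.join if current
      current = []
    elsif current
      # Undo the ">" prefix that was added when the message was stored.
      current << line.sub(/\A>From /, 'From ')
    end
  end
  messages << current.join if current
  messages
end

# Usage with a file on disk (the path is an assumption):
#   messages = File.open('mailbox.mbox') { |f| split_mbox(f) }
```

Each returned string still ends with the blank line that separated it from the next message, so strip that if it matters to you.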
Of course, since this is an old format and it was never standardized, there are a number of variants. One variant uses the Content-Length header to determine the length of a message, and some implementations of this variant fail to insert the '>'. However, I think this is rare in practice.
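If you do run into the Content-Length variant, a reader has to trust the header instead of scanning for "From " lines. Here is a rough sketch under some loud assumptions: split_mbox_by_content_length is a hypothetical name, every message is assumed to carry a Content-Length header, and the value is assumed to count the bytes of the body that follow the blank line after the headers.

```ruby
require 'stringio'

# Hypothetical helper for the Content-Length variant of mbox.
# Assumes every message has a Content-Length header counting body bytes.
def split_mbox_by_content_length(io)
  messages = []
  while (line = io.gets)
    next unless line.start_with?('From ') # skip blank lines between messages

    # Collect header lines up to the blank line that ends the header block.
    headers = []
    while (header = io.gets) && !header.chomp.empty?
      headers << header
    end

    length_line = headers.find { |h| h =~ /\AContent-Length:/i }
    raise ArgumentError, 'message without Content-Length' unless length_line

    length = length_line.split(':', 2).last.strip.to_i
    # Read exactly that many bytes, even if they contain lines starting "From ".
    body = io.read(length) || ''
    messages << { headers: headers.join, body: body }
  end
  messages
end
```

When reading a real file, open it in binary mode (for example with File.open(path, 'rb')) so that the byte count is not disturbed by newline or encoding conversion.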
A big problem with mbox format is that the file needs to be modified in place by mail agents; consequently, every implementation has some locking procedure. Of course, there is no standardization there, so you need to watch out for other processes modifying the mailbox while you are reading it. In practice, many mail systems solved this problem by using maildir format instead, in which a mailbox is actually a directory and every message is a single file.
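If your mail happens to be in maildir format, reading it is mostly a matter of listing files. A minimal sketch, assuming the conventional layout where delivered messages live in the "new" and "cur" subdirectories (read_maildir is a name I made up):

```ruby
# Hypothetical helper: return the raw bytes of every message in a maildir.
# Assumes the conventional "new" and "cur" subdirectories; "tmp" is ignored
# because it only holds messages that are still being delivered.
def read_maildir(path)
  %w[new cur].flat_map do |sub|
    Dir.glob(File.join(path, sub, '*'))
       .sort
       .select { |file| File.file?(file) }
       .map { |file| File.binread(file) }
  end
end
```

Since each message is its own file, there is nothing to split and no "From " quoting to undo. A message can still move from "new" to "cur" between the directory listing and the read, so be prepared to rescue Errno::ENOENT if other processes are active.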
Other things you might want to do include MIME decoding, but you should be able to find utilities which do that.