How to parse a mailbox file in Ruby?

再見小時候 2021-01-20 22:30

The Ruby gem rmail has methods to parse a mailbox file on local disk. Unfortunately, this gem is broken in Ruby 2.0.0. It might not get fixed, because folks

3 Answers
  • 2021-01-20 22:51

    The good news is that the mbox format is really dead simple, though its simplicity is why it was eventually replaced. Parsing a large mailbox file to extract a single message is not especially efficient, since the file has to be scanned line by line to find where each message starts.

    If you can split apart the mailbox file into separate strings, you can pass these strings to the Mail library for parsing.

    An example starting point:

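    require 'mail'

    # Parse one raw message with the Mail gem, then hand it on for processing.
    def parse_message(message)
      mail = Mail.new(message)

      do_other_stuff!(mail) # placeholder: replace with your own processing
    end

    message = nil

    while (line = STDIN.gets)
      if line.start_with?('From ')
        # A "From " line begins a new message, so flush the previous one.
        parse_message(message) if message
        message = ''
      elsif message
        # Undo mbox quoting: ">From " in a body was originally "From ".
        message << line.sub(/\A>From /, 'From ')
      end
    end

    # The last message has no "From " line after it, so flush it here.
    parse_message(message) if message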
    

    The key is that each message starts with "From ", and the trailing space matters: it is what distinguishes the separator line from a From: header. Any line that starts with ">From " is to be treated as actually being "From ". It's things like this that make this encoding method really inadequate, but if Maildir isn't an option, this is what you've got to do.

  • 2021-01-20 22:52

    The mbox format is about as simple as you can get. It's simply the concatenation of all the messages, separated by a blank line. The first line of each message starts with the five characters "From "; when messages are added to the file, any line in the message which starts with "From " has a > prefixed, so you can reliably use the fact that a line starts with "From " as an indicator that it is the start of a message.
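
    To make the writing side of this concrete, here is a minimal sketch of how a message could be turned into an mbox entry before being appended. The method name mbox_entry and its sender: and time: parameters are made up for illustration, and the exact layout of the separator line differs between implementations.

    ```ruby
    # Formats one raw message as an mbox entry: a "From " separator line,
    # the message with every line starting "From " escaped, and a trailing
    # blank line. `sender` and `time` are hypothetical parameters used only
    # to build the separator line.
    def mbox_entry(raw_message, sender:, time: Time.now)
      # Header lines look like "From:" (no space), so only body lines match.
      escaped = raw_message.gsub(/^From /, '>From ')
      escaped += "\n" unless escaped.end_with?("\n")
      "From #{sender} #{time.utc.asctime}\n#{escaped}\n"
    end

    puts mbox_entry("Subject: hi\n\nFrom now on, use Maildir.\n",
                    sender: 'alice@example.com')
    ```

    In the printed entry, the body line comes out as ">From now on, use Maildir.", so the only line starting with "From " is the separator.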

    Of course, since this is an old format and it was never standardized, there are a number of variants. One variant uses the Content-Length header to determine the length of a message, and some implementations of this variant fail to insert the '>'. However, I think this is rare in practice.
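
    As a rough sketch of how a reader for the Content-Length variant might work: read the "From " line, collect headers up to the blank line, then read exactly as many bytes as the header announces instead of looking at body lines. The method name is hypothetical, and the sketch assumes every message carries a correct Content-Length header; a message without one would be yielded with an empty body.

    ```ruby
    require 'stringio'

    # Yields each raw message (headers, blank line, body) from an mbox stream
    # whose messages carry a Content-Length header. Body bytes are read in one
    # go, so an unescaped "From " line inside a body does not split a message.
    # Assumption: Content-Length is present and correct on every message.
    def each_message_by_content_length(io)
      while (line = io.gets)
        next unless line.start_with?('From ') # skips blank separator lines

        headers = +''
        length = nil
        while (header = io.gets) && header !~ /\A\r?\n\z/
          headers << header
          if (m = header.match(/\AContent-Length:\s*(\d+)/i))
            length = m[1].to_i
          end
        end

        body = length ? io.read(length).to_s : ''
        yield headers + "\n" + body
      end
    end

    body = "hello\nFrom here on out\n"
    mbox = "From a@example.com Wed Jan 20 22:30:00 2021\n" \
           "Content-Length: #{body.bytesize}\n\n#{body}\n"
    each_message_by_content_length(StringIO.new(mbox)) { |msg| puts msg }
    ```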

    A big problem with mbox format is that the file needs to be modified in place by mail agents; consequently, every implementation has some locking procedure. Of course, there is no standardization there, so you need to watch out for other processes modifying the mailbox while you are reading it. In practice, many mail systems solved this problem by using maildir format instead, in which a mailbox is actually a directory and every message is a single file.
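
    One possible way to guard a read, sketched here under the assumption that whatever writes to the mailbox also uses flock-style advisory locks (agents that rely on dot-lock files or fcntl locks would not be held off by this). The method name read_mbox_locked is made up for the example.

    ```ruby
    # Reads a whole mbox file while holding a shared advisory lock.
    # Assumption: writers take an exclusive flock on the same file; other
    # locking schemes (dot-lock files, fcntl) are not covered by this sketch.
    def read_mbox_locked(path)
      File.open(path, 'rb') do |file|
        file.flock(File::LOCK_SH) # waits while a writer holds LOCK_EX
        file.read                 # the lock is released when the file closes
      end
    end
    ```

    The returned string can then be split into messages on the "From " lines as described above.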

    Other things you might want to do include MIME decoding, but you should be able to find utilities which do that.

  • 2021-01-20 23:04

    You can use tmail to parse mailboxes. It has been replaced by mail, but I can't really find a class in mail that substitutes for it, so you might want to stick with tmail.

    EDIT: as @tadman pointed out, tmail probably won't work with Ruby 1.9. However, you can port this class (and put it on GitHub for everyone else to use :-) )
