Idiomatic Scala way of deserializing delimited strings into case classes

Suppose I was dealing with a simple colon-delimited text protocol that looked something like:

Event:005003:information:2013 12 06 12 37 55:n3.swmml20861:1:Full client swmml20861 registered [entry=280 PID=20864 queue=0x4ca9001b]
RSET:m3node:AUTRS:1-1-24:A:0:LOADSHARE:INHIBITED:0
M3UA_IP_LINK:m3node:AUT001LKSET1:AUT001LK1:r
OPC:m3node:1-10-2(P):A7:NAT0
....

I'd like to deserialize each line as an instance of a case class, but in a type-safe way. My first attempt uses type classes to define 'read' methods for each possible type that I can encounter, in addition to the 'tupled' method on the case class to get back a function that can be applied to a tuple of arguments, something like the following:

case class Foo(a: String, b: Integer)

trait Reader[T] {
  def read(s: String): T
}

object Reader {
  implicit object StringParser extends Reader[String] { def read(s: String): String = s }
  implicit object IntParser extends Reader[Integer] { def read(s: String): Integer = s.toInt }
}

def create[A1, A2, Ret](fs: Seq[String], f: ((A1, A2)) => Ret)(implicit A1Reader: Reader[A1], A2Reader: Reader[A2]): Ret = {
  f((A1Reader.read(fs(0)), A2Reader.read(fs(1))))
}

create(Seq("foo", "42"), Foo.tupled) // gives me a Foo("foo", 42)

The problem though is that I'd need to define the create method for each tuple and function arity, so that means up to 22 versions of create. Additionally, this doesn't take care of validation, or receiving corrupt data.

Since the question is tagged Shapeless, here is a possible solution using it. I'm not an expert, so I guess one can do better:

First, about the lack of validation: you should simply have read return a Try or a scalaz.Validation, or just an Option if you do not care about an error message.
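
To make the failure case concrete, here is a minimal sketch of a reader whose read returns a Try, using only the standard library. The names TryReader and TryReaderDemo are made up for this illustration and do not come from the question.

```scala
import scala.util.{Try, Success}

// Hypothetical type class for this sketch: read reports failure as a value
// instead of throwing.
trait TryReader[A] {
  def read(s: String): Try[A]
}

object TryReader {
  implicit val stringReader: TryReader[String] = new TryReader[String] {
    def read(s: String): Try[String] = Success(s)
  }

  implicit val intReader: TryReader[Int] = new TryReader[Int] {
    // Try captures the NumberFormatException that toInt throws on bad input
    def read(s: String): Try[Int] = Try(s.toInt)
  }

  def read[A](s: String)(implicit r: TryReader[A]): Try[A] = r.read(s)
}

object TryReaderDemo {
  def main(args: Array[String]): Unit = {
    println(TryReader.read[Int]("42"))             // Success(42)
    println(TryReader.read[Int]("4x2").isFailure)  // true
    println(TryReader.read[Int]("4x2").toOption)   // None, if the message is not needed
  }
}
```

A caller can then pattern match on Success and Failure, or chain several reads in a for comprehension so that the first failure stops the rest.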

Then, about the boilerplate: you may try to use an HList. This way you don't need to write a version for every arity.

import scala.util._
import shapeless._

trait Reader[+A] { self =>
  def read(s: String) : Try[A]
  def map[B](f: A => B): Reader[B] = new Reader[B] {
    def read(s: String) = self.read(s).map(f)
  }
}    

object Reader {
  // convenience
  def apply[A: Reader] : Reader[A] = implicitly[Reader[A]]
  def read[A: Reader](s: String): Try[A] = implicitly[Reader[A]].read(s)

  // base types
  implicit object StringReader extends Reader[String] {
    def read(s: String) = Success(s)
  }
  implicit object IntReader extends Reader[Int] {
    def read(s: String) = Try {s.toInt}
  }

  // HLists, parts separated by ":"
  implicit object HNilReader extends Reader[HNil] {
    def read(s: String) = 
      if (s.isEmpty()) Success(HNil) 
      else Failure(new Exception("Expect empty"))
  }
  implicit def HListReader[A : Reader, H <: HList : Reader] : Reader[A :: H] 
  = new Reader[A :: H] {
    def read(s: String) = {
      val (before, colonAndBeyond) = s.span(_ != ':')
      val after = if (colonAndBeyond.isEmpty()) "" else colonAndBeyond.tail
      for {
        a <- Reader.read[A](before)
        b <- Reader.read[H](after)
      } yield a :: b
    }
  }

}

Given that, you have a reasonably short reader for Foo:

case class Foo(a: Int, s: String) 

object Foo {
  implicit val FooReader : Reader[Foo] = 
    Reader[Int :: String :: HNil].map(Generic[Foo].from _)
}

It works:

println(Reader.read[Foo]("12:text"))
Success(Foo(12,text))
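
The splitting step inside HListReader can be tried on its own, without Shapeless. The helper name splitFirst and the object SplitDemo are made up for this sketch; the body is the same span-based logic as in the reader above.

```scala
object SplitDemo {
  // Splits off the first colon-delimited field and returns it together with
  // the remainder (without the separating colon).
  def splitFirst(s: String): (String, String) = {
    val (before, colonAndBeyond) = s.span(_ != ':')
    val after = if (colonAndBeyond.isEmpty) "" else colonAndBeyond.tail
    (before, after)
  }

  def main(args: Array[String]): Unit = {
    println(splitFirst("12:text")) // (12,text)
    println(splitFirst("text"))    // (text,)
    println(splitFirst(""))        // (,)
  }
}
```

One consequence, as far as I can tell from the code above: a missing tail is read as the empty string, so an input such as "12" would still parse and give Foo(12, ""). If empty fields are not acceptable, an extra check in the String reader is probably needed.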

Without scalaz and shapeless, I think the idiomatic Scala way to parse input like this is to use Scala parser combinators. For your example, I would try something like this:

import org.joda.time.DateTime
import scala.util.parsing.combinator.JavaTokenParsers

val input =
  """Event:005003:information:2013 12 06 12 37 55:n3.swmml20861:1:Full client swmml20861 registered [entry=280 PID=20864 queue=0x4ca9001b]
    |RSET:m3node:AUTRS:1-1-24:A:0:LOADSHARE:INHIBITED:0
    |M3UA_IP_LINK:m3node:AUT001LKSET1:AUT001LK1:r
    |OPC:m3node:1-10-2(P):A7:NAT0""".stripMargin

trait LineContent
case class Event(number : Int, typ : String, when : DateTime, stuff : List[String]) extends LineContent
case class Reset(node : String, stuff : List[String]) extends LineContent
case class Other(typ : String, stuff : List[String]) extends LineContent

object LineContentParser extends JavaTokenParsers {
  override val whiteSpace=""":""".r

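  val space="""\s+""".r
  // Line separator; the original pattern had a stray leading quote
  val lineEnd = """\r?\n""".r
  // A non-empty field; excluding line breaks keeps a field from running into the next line
  val field = """[^:\r\n]+""".r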

  def stuff : Parser[List[String]] = rep(field)
  def integer : Parser[Int] = log(wholeNumber ^^ {_.toInt})("integer")

  def date : Parser[DateTime] = log((repsep(integer, space)  filter (_.length == 6))  ^^ (l =>
      new DateTime(l(0), l(1), l(2), l(3), l(4), l(5), 0)
    ))("date")

  def event : Parser[Event] = "Event" ~> integer ~ field ~ date ~ stuff ^^ {
    case number~typ~when~stuff => Event(number, typ, when, stuff)}

  def reset : Parser[Reset] = "RSET" ~> field ~ stuff ^^ { case node~stuff =>
    Reset(node, stuff)
  }

  def other : Parser[Other] = ("M3UA_IP_LINK" | "OPC") ~ stuff ^^ { case typ~stuff =>
    Other(typ, stuff)
  }

  def line : Parser[LineContent] = event | reset | other
  def lines = repsep(line, lineEnd)

  def parseLines(s : String) = parseAll(lines, s)
}

LineContentParser.parseLines(input)

The patterns in the parser combinators are largely self-explanatory. I always convert each successfully parsed chunk to a typed partial result as early as possible. The partial results are then combined into the final result.
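
As a smaller illustration of that convert-early, combine-late pattern, here is a sketch with a two-field record. Pair and PairParser are hypothetical names for this example only, and it assumes the scala-parser-combinators library is on the classpath.

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical two-field record, used only for this illustration.
case class Pair(id: Int, name: String)

object PairParser extends RegexParsers {
  // Turn off whitespace skipping so the ":" separator is matched explicitly.
  override val skipWhitespace = false

  // Each chunk is converted to its target type right where it is parsed.
  def id: Parser[Int] = """\d+""".r ^^ (_.toInt)
  def name: Parser[String] = """[^:\n]+""".r

  // The typed partial results are then combined into the final value.
  def pair: Parser[Pair] = id ~ (":" ~> name) ^^ { case i ~ n => Pair(i, n) }

  def parsePair(s: String): ParseResult[Pair] = parseAll(pair, s)
}

object PairParserDemo {
  def main(args: Array[String]): Unit = {
    println(PairParser.parsePair("42:foo").get)       // Pair(42,foo)
    println(PairParser.parsePair("x:foo").successful) // false
  }
}
```

Unlike the larger parser above, this sketch matches the colon as an explicit token instead of treating it as whitespace. That is only one possible design choice; it makes malformed separators show up as parse failures.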

A hint for debugging: you can always wrap a rule in the log parser. It prints a message before and after the rule is applied. Together with the given name (e.g. "date"), it also prints the current position in the input where the rule is applied and, when applicable, the parsed partial result.

An example output looks like this:

trying integer at scala.util.parsing.input.CharSequenceReader@108589b
integer --> [1.13] parsed: 5003
trying date at scala.util.parsing.input.CharSequenceReader@cec2e3
trying integer at scala.util.parsing.input.CharSequenceReader@cec2e3
integer --> [1.30] parsed: 2013
trying integer at scala.util.parsing.input.CharSequenceReader@14da3
integer --> [1.33] parsed: 12
trying integer at scala.util.parsing.input.CharSequenceReader@1902929
integer --> [1.36] parsed: 6
trying integer at scala.util.parsing.input.CharSequenceReader@17e4dce
integer --> [1.39] parsed: 12
trying integer at scala.util.parsing.input.CharSequenceReader@1747fd8
integer --> [1.42] parsed: 37
trying integer at scala.util.parsing.input.CharSequenceReader@1757f47
integer --> [1.45] parsed: 55
date --> [1.45] parsed: 2013-12-06T12:37:55.000+01:00

I think this is an easy and maintainable way to parse input into well-typed Scala objects. It is all in the Scala standard distribution (from Scala 2.11 on, as the separate scala-parser-combinators module), hence I would call it "idiomatic". When typing the example code in an IDEA Scala worksheet, completion and type information worked very well, so this approach seems to be well supported by IDEs.

Source: https://stackoverflow.com/questions/20939716/idiomatic-scala-way-of-deserializing-delimited-strings-into-case-classes

Tags: scala, scalaz, shapeless