Using Scala parser combinators to parse CSV files

陌清茗 2020-11-30 21:09

I'm trying to write a CSV parser using Scala parser combinators. The grammar is based on RFC 4180. I came up with the following code. It almost works, but I cannot get it to

3 Answers
  • 2020-11-30 21:11

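    What you missed is whitespace: RegexParsers skips whitespace, including line breaks, before every token by default, so the skipped set has to be narrowed to spaces and tabs. I also threw in a couple of bonus improvements.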

    import scala.util.parsing.combinator._
    
    object CSV extends RegexParsers {
      // Skip only spaces and tabs; the default pattern (\s+) would also swallow CR and LF.
      override protected val whiteSpace = """[ \t]""".r
    
      def COMMA   = ","
      def DQUOTE  = "\""
      def DQUOTE2 = "\"\"" ^^ { case _ => "\"" }
      def CR      = "\r"
      def LF      = "\n"
      def CRLF    = "\r\n"
      def TXT     = "[^\",\r\n]".r
    
      def file: Parser[List[List[String]]] = repsep(record, CRLF) <~ opt(CRLF)
      def record: Parser[List[String]] = rep1sep(field, COMMA)
      def field: Parser[String] = (escaped|nonescaped)
      def escaped: Parser[String] = (DQUOTE~>((TXT|COMMA|CR|LF|DQUOTE2)*)<~DQUOTE) ^^ { case ls => ls.mkString("")}
      def nonescaped: Parser[String] = (TXT*) ^^ { case ls => ls.mkString("") }
    
      // Returns the parsed rows, or an empty list if the input does not parse.
      def parse(s: String) = parseAll(file, s) match {
        case Success(res, _) => res
        case _ => List[List[String]]()
      }
    }
    
  • 2020-11-30 21:18

    Since the Scala Parser Combinators library was split out of the Scala standard library as of 2.11, there is no good reason not to use the much more performant Parboiled2 library instead. Here is a version of the CSV parser in Parboiled2's DSL:

    /*  based on comments in https://github.com/sirthias/parboiled2/issues/61 */
    import org.parboiled2._
    import scala.util.Try
    case class Parboiled2CsvParser(input: ParserInput, delimeter: String) extends Parser {
      def DQUOTE = '"'
      def DELIMITER_TOKEN = rule(capture(delimeter))
      def DQUOTE2 = rule("\"\"" ~ push("\""))
      def CRLF = rule(capture("\r\n" | "\n"))
      def NON_CAPTURING_CRLF = rule("\r\n" | "\n")
    
      val delims = s"$delimeter\r\n" + DQUOTE
      def TXT = rule(capture(!anyOf(delims) ~ ANY))
      val WHITESPACE = CharPredicate(" \t")
      def SPACES: Rule0 = rule(oneOrMore(WHITESPACE))
    
      def escaped = rule(optional(SPACES) ~
        DQUOTE ~ (zeroOrMore(DELIMITER_TOKEN | TXT | CRLF | DQUOTE2) ~ DQUOTE ~
        optional(SPACES)) ~> (_.mkString("")))
      def nonEscaped = rule(zeroOrMore(TXT | capture(DQUOTE)) ~> (_.mkString("")))
    
      def field = rule(escaped | nonEscaped)
      def row: Rule1[Seq[String]] = rule(oneOrMore(field).separatedBy(delimeter))
      def file = rule(zeroOrMore(row).separatedBy(NON_CAPTURING_CRLF))
    
      def parsed() : Try[Seq[Seq[String]]] = file.run()
    }
    
  • 2020-11-30 21:29

    The default whitespace pattern for RegexParsers is \s+, which includes newlines. So CR, LF and CRLF never get a chance to be processed, because the parser skips them automatically as whitespace.
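
    The effect can be seen in a minimal sketch. The objects DefaultWs and NarrowWs and their rules line and lines are hypothetical names made up for this illustration, not part of the question's code; the sketch assumes the scala-parser-combinators module is on the classpath. Both grammars are identical except for the whiteSpace override.

```scala
import scala.util.parsing.combinator._

// Hypothetical grammar using the default whiteSpace (\s+).
// The newline is skipped as whitespace before the "\n" separator is tried,
// so the separator never matches and the second line is left unconsumed.
object DefaultWs extends RegexParsers {
  def line: Parser[String] = "[a-z]+".r
  def lines: Parser[List[String]] = repsep(line, "\n")
}

// Same grammar, but only spaces and tabs count as skippable whitespace,
// so "\n" stays visible to the separator parser.
object NarrowWs extends RegexParsers {
  override protected val whiteSpace = """[ \t]+""".r
  def line: Parser[String] = "[a-z]+".r
  def lines: Parser[List[String]] = repsep(line, "\n")
}

object WhitespaceDemo {
  def main(args: Array[String]): Unit = {
    println(DefaultWs.parseAll(DefaultWs.lines, "ab\ncd").successful) // false
    println(NarrowWs.parseAll(NarrowWs.lines, "ab\ncd").successful)   // true
  }
}
```

    With the override in place, NarrowWs returns List("ab", "cd") for the same input, which is the behaviour a record separator such as CRLF needs.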
