Splitting complex String Pattern (without regex) in a functional approach

爷,独闯天下 提交于 2019-12-11 09:54:33

问题


I am trying to split a string without regex in a more idiomatic functional approach.

case class Parsed(blocks: Vector[String], block: String, depth: Int)

def processChar(parsed: Parsed, c: Char): Parsed = {
  import parsed._
  c match {

    case '|'  if depth == 0
                =>  parsed.copy(block = "", blocks = blocks :+ block ,
                                  depth = depth)                          
    case '['  => parsed.copy(block = block + c,
                                  depth = depth + 1)
    case ']'  if depth == 1
                => parsed.copy( block = "", blocks = blocks :+ (block + c),
                                depth = depth - 1)
    case ']'  => parsed.copy(block = block + c,
                                  depth = depth - 1)
    case _    => parsed.copy(block = block + c)
  }
}

val s = "Str|[ts1:tssub2|ts1:tssub2]|BLANK|[INT1|X.X.X.X|INT2|BLANK |BLANK | |X.X.X.X|[INT3|s1]]|[INT3|INT4|INT5|INT6|INT7|INT8|INT9|INT10|INT11|INT12|INT13|INT14|INT15]|BLANK |BLANK |[s2|s3|s4|INT16|INT17];[s5|s6|s7|INT18|INT19]|[[s8|s9|s10|INT20|INT21]|ts3:tssub3| | ];[[s11|s12|s13|INT21|INT22]|INT23:INT24|BLANK |BLANK ]|BLANK |BLANK |[s14|s15]"
val parsed = s.foldLeft(Parsed(Vector(), "", 0))(processChar)
parsed.blocks.size //20 
parsed.blocks foreach println

I would expect to get the following result (parsed.blocks.size should be 12).

Str
[ts1:tssub2|ts1:tssub2]
BLANK|
[INT1|X.X.X.X|INT2|BLANK |BLANK | |X.X.X.X|[INT3|s1]]
[INT3|INT4|INT5|INT6|INT7|INT8|INT9|INT10|INT11|INT12|INT13|INT14|INT15]
BLANK 
BLANK 
[s2|s3|s4|INT16|INT17];[s5|s6|s7|INT18|INT19]
[[s8|s9|s10|INT20|INT21]|ts3:tssub3| | ];[[s11|s12|s13|INT21|INT22]|INT23:INT24|BLANK |BLANK ]
BLANK 
BLANK 
[s14|s15]

However result I am getting is (parsed.blocks.size is 20)

Str
[ts1:tssub2|ts1:tssub2]

BLANK
[INT1|X.X.X.X|INT2|BLANK|BLANK||X.X.X.X|[INT3|s1]]

[INT3|INT4|INT5|INT6|INT7|INT8|INT9|INT10|INT11|INT12|INT13|INT14|INT15]

BLANK
BLANK
[s2|s3|s4|INT16|INT17]
;[s5|s6|s7|INT18|INT19]

[[s8|s9|s10|INT20|INT21]|ts1:tssub2||]
;[[s11|s12|s13|INT21|INT22]|INT23:INT24|BLANK|BLANK]

BLANK
BLANK
[s14|s15]

To my understanding this is slight variation of parenthesis balancing problem. However in this case ; would mean kind of continuation.

I have two questions in this case

1) How the extra entry /space after [ts1:tssub2|ts1:tssub2] came, also after

[INT1|X.X.X.X|INT2|BLANK|BLANK||X.X.X.X|[INT3|s1]] , [INT3|INT4|INT5|INT6|INT7|INT8|INT9|INT10|INT11|INT12|INT13|INT14|INT15] and ;[[s11|s12|s13|INT21|INT22]|INT23:INT24|BLANK|BLANK]

in my result as well ?

2) Here at the moment [s2|s3|s4|INT16|INT17] and ;[s5|s6|s7|INT18|INT19]

go in as two different entries. However this should be merged as [s2|s3|s4|INT16|INT17];[s5|s6|s7|INT18|INT19]

a single entry[So does

[[s8|s9|s10|INT20|INT21]|ts1:tssub2||]

and

;[[s11|s12|s13|INT21|INT22]|INT23:INT24|BLANK|BLANK])

as well]. Any clues to how to do so ?


回答1:


Question 1

The extra empty string block appears because the very previous case each time is

case ']'  if depth == 1

It adds an empty block and decreases the depth. Then we have

case '|'  if depth == 0

which also adds another empty block, pushing the previous empty one into the resulting Vector.


Before answering the second question, I would like to suggest another approach to the implementation of this parser, which is slightly more idiomatic. My major criticism about the current one is the usage of an intermediate object (Parsed) to wrap state and copying it in each and every case. Indeed, we do not need it: more frequent approach is to use a recursive function, especially when depth is involved.

So, without modifying significantly the processing of your cases, it can be represented as follows:

def parse(blocks: Seq[String],
          currentBlock: String,
          remaining: String,
          depth: Int): Seq[String] =
  if (remaining.isEmpty) {
    blocks
  } else {
    val curChar = remaining.head
    curChar match {
      case '|' if depth == 0 =>
        parse(blocks :+ currentBlock, "", remaining.tail, depth)
      case '[' =>
        parse(blocks, currentBlock + curChar, remaining.tail, depth + 1)
      case ']' =>
        if (depth == 1)
          parse(blocks :+ (currentBlock + curChar), "", remaining.tail, depth - 1)
        else
          parse(blocks, currentBlock + curChar, remaining.tail, depth - 1)
      case _ =>
        parse(blocks, currentBlock + curChar, remaining.tail, depth)
    }
  }

It produces exactly the same output as the original solution.

To fix the issue with empty blocks, we need to change case '|':

case '|' if depth == 0 =>
  val updatedBlocks = if (currentBlock.isEmpty) blocks
                      else blocks :+ currentBlock
  parse(updatedBlocks, "", remaining.tail, depth)

We just skip the current block if it contains an empty string.


Question 2

To merge the two blocks between ; char, we need to bring back one parsed block and return it into the currentBlock reference. This represents an additional case:

case ';' =>
  parse(blocks.init, blocks.last + curChar, remaining.tail, depth)

Now, after

val result = parse(Seq(), "", s, 0)
result.foreach(println)

The output is

Str
[ts1:tssub2|ts1:tssub2]
BLANK
[INT1|X.X.X.X|INT2|BLANK |BLANK | |X.X.X.X|[INT3|s1]]
[INT3|INT4|INT5|INT6|INT7|INT8|INT9|INT10|INT11|INT12|INT13|INT14|INT15]
BLANK 
BLANK 
[s2|s3|s4|INT16|INT17];[s5|s6|s7|INT18|INT19]
[[s8|s9|s10|INT20|INT21]|ts3:tssub3| | ];[[s11|s12|s13|INT21|INT22]|INT23:INT24|BLANK |BLANK ]
BLANK 
BLANK 
[s14|s15]

And it looks very similar to what you were looking for.



来源:https://stackoverflow.com/questions/49782126/splitting-complex-string-pattern-without-regex-in-a-functional-approach

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!