Scanning a HUGE JSON file for deserializable data in Scala

做~自己de王妃 submitted on 2019-12-03 07:13:43

I have not done this with JSON (and I hope someone will come up with a turnkey solution for you), but I have done it with XML, and here is one way of handling it.

It is basically a simple Map->Reduce process built on a streaming parser.

Map (your advanceTo)

Use a streaming parser such as JSON Simple (not tested). When a callback matches your "path", collect everything below it by writing it to a stream (file-backed or in-memory, depending on your data). That will be the foo array in your example. If your mapper is sophisticated enough, you may want to collect multiple paths during the map step.
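As a rough illustration of the map step, here is a minimal sketch that does not use JSON Simple or any other library. It is a hand-rolled character scanner over a java.io.Reader that lazily yields the raw text of each object in the array stored under a given key. The names FooArrayScanner, objectsUnder, skipPast and readObject are made up for this example. It assumes the array elements are JSON objects and that the quoted key does not appear earlier in the file as a string value.

```scala
import java.io.Reader

object FooArrayScanner {

  // Consumes characters until `needle` has just been read; false if the input ends first.
  private def skipPast(in: Reader, needle: String): Boolean = {
    val window = new java.lang.StringBuilder
    var c = in.read()
    while (c != -1) {
      window.append(c.toChar)
      if (window.length() > needle.length) window.deleteCharAt(0)
      if (window.toString == needle) return true
      c = in.read()
    }
    false
  }

  // Reads the next {...} element of the array, or None once the closing ']' (or EOF) is reached.
  private def readObject(in: Reader): Option[String] = {
    val sb = new java.lang.StringBuilder
    var depth = 0          // nesting depth of '{' inside the current element
    var inString = false   // true while inside a JSON string literal
    var escaped = false    // true right after a backslash inside a string
    var c = in.read()
    while (c != -1) {
      val ch = c.toChar
      if (depth == 0) {
        if (ch == ']') return None            // end of the array
        else if (ch == '{') { depth = 1; sb.append(ch) }
        // anything else here is whitespace or a comma between elements
      } else {
        sb.append(ch)
        if (inString) {
          if (escaped) escaped = false
          else if (ch == '\\') escaped = true
          else if (ch == '"') inString = false
        } else if (ch == '"') inString = true
        else if (ch == '{') depth += 1
        else if (ch == '}') {
          depth -= 1
          if (depth == 0) return Some(sb.toString)
        }
      }
      c = in.read()
    }
    None
  }

  // Lazily yields the raw JSON text of each object in the array under `key`.
  // Simplification: the first '[' after the key is taken to open the array.
  def objectsUnder(key: String, in: Reader): Iterator[String] = {
    val found = skipPast(in, "\"" + key + "\"") && skipPast(in, "[")
    new Iterator[String] {
      private var pending: Option[String] = if (found) readObject(in) else None
      def hasNext: Boolean = pending.isDefined
      def next(): String = {
        val out = pending.getOrElse(throw new NoSuchElementException("no more objects"))
        pending = readObject(in)
        out
      }
    }
  }
}
```

For a large file you would pass something like new java.io.BufferedReader(new java.io.FileReader(path)) and hand each yielded string to whatever in-memory JSON parser you use for the reduce step. Only one element is held in memory at a time, which is the point of the exercise.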

Reduce (your stream[Data])

Since the streams you collected above look pretty small, you probably do not need to map/split them again; you can parse them directly in memory as JSON objects/arrays and manipulate them (transform, recombine, etc.).

Ryan Delucchi

Here is the current way I am solving the problem:

import collection.immutable.PagedSeq
import util.parsing.input.PagedSeqReader
import com.codahale.jerkson.Json
import collection.mutable

private def fileContent = new PagedSeqReader(PagedSeq.fromFile("/home/me/data.json"))
private val clearAndStop = ']'

private def takeUntil(readerInitial: PagedSeqReader, text: String) : Taken = {
  val str = new StringBuilder()
  var readerFinal = readerInitial

  while(!readerFinal.atEnd && !str.endsWith(text)) {
    str += readerFinal.first
    readerFinal = readerFinal.rest
  }

  if (!str.endsWith(text) || str.contains(clearAndStop))
    Taken(readerFinal, None)
  else
    Taken(readerFinal, Some(str.toString))
}

private def takeUntil(readerInitial: PagedSeqReader, chars: Char*) : Taken = {
  var taken = Taken(readerInitial, None)
  chars.foreach(ch => taken = takeUntil(taken.reader, ch.toString))

  taken
}

def getJsonData() : Seq[Data] = {
  val data = mutable.ListBuffer[Data]()
  var taken = takeUntil(fileContent, "\"foo\"")
  taken = takeUntil(taken.reader, ':', '[')

  var doneFirst = false
  while(taken.text != None) {
    if (!doneFirst)
      doneFirst = true
    else
      taken = takeUntil(taken.reader, ',')

    taken = takeUntil(taken.reader, '}')
    if (taken.text != None) {
      print(taken.text.get)
      // Append to the buffer declared above (the original referenced an undefined `places`)
      data += Json.parse[Data](taken.text.get)
    }
  }

  data
}

case class Taken(reader: PagedSeqReader, text: Option[String])
case class Data(a: Int, b: Int, c: Int)

Granted, this code does not handle malformed JSON cleanly, and supporting multiple top-level keys ("foo", "bar" and "qux") would require looking ahead (or matching against a list of possible top-level keys). In general, though, I believe it does the job. It is not as functional as I would like and it is not especially robust, but PagedSeqReader keeps it from getting too messy.
