I have as input a set of files formatted as a single JSON object per line. The problem, however, is that one field on these JSON objects is a JSON-escaped String. Example
Would that be acceptable solution?
val sc: SparkContext = ...
val sqlContext = new SQLContext(sc)
val escapedJsons: RDD[String] = sc.parallelize(Seq("""{"id":1,"name":"some name","problem_field":"{\"height\":180,\"weight\":80}"}"""))
val unescapedJsons: RDD[String] = escapedJsons.map(_.replace("\"{", "{").replace("\"}", "}").replace("\\\"", "\""))
val dfJsons: DataFrame = sqlContext.read.json(unescapedJsons)
dfJsons.printSchema()
// Output
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- problem_field: struct (nullable = true)
| |-- height: long (nullable = true)
| |-- weight: long (nullable = true)