Flattening JSON into Tabular Structure using Spark-Scala RDD only function

闹比i 2021-01-07 06:01

I have nested JSON and would like to have the output in a tabular structure. I am able to parse the JSON values individually, but am having some problems tabularizing it. I am able to …

3 Answers
  •  独厮守ぢ
    2021-01-07 06:17

    DataFrames and Datasets are much more optimized than RDDs, and they offer many options for reaching the solution we desire.

    In my opinion, the DataFrame API was developed to let developers view data comfortably in tabular form, so that logic can be implemented with ease. That is why I always suggest working with DataFrames or Datasets.

    Without further ado, I am posting the solution below using a DataFrame. Once you have a DataFrame, switching to an RDD is very easy.

    Your desired solution is below. (You will have to find a way to read the JSON from a file, since here it is parsed from a JSON string: that's an assignment for you :) good luck.)
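
    As a hint for that assignment, a minimal sketch, not part of the original answer: the path is hypothetical, it assumes Spark 2.2+ with a SparkSession named `spark`, and the multiLine option is needed because the sample JSON spans multiple lines rather than one record per line.

    // Hypothetical path; multiLine handles multi-line (pretty-printed) JSON.
    val dfFromFile = spark.read
      .option("multiLine", true)
      .json("/path/to/input.json")

    The full solution follows: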

    import org.apache.spark.sql.functions._

    // Sample nested JSON as a string.
    val json = """{
       "level":{
          "productReference":{
             "prodID":"1234",
             "unitOfMeasure":"EA"
          },
          "states":[
             {
                "state":"SELL",
                "effectiveDateTime":"2015-10-09T00:55:23.6345Z",
                "stockQuantity":{
                   "quantity":1400.0,
                   "stockKeepingLevel":"A"
                }
             },
             {
                "state":"HELD",
                "effectiveDateTime":"2015-10-09T00:55:23.6345Z",
                "stockQuantity":{
                   "quantity":800.0,
                   "stockKeepingLevel":"B"
                }
             }
          ]
       }
    }"""

    // In spark-shell, sparkContext and sqlContext are already in scope
    // (as sc and sqlContext); with a SparkSession, use spark.sparkContext
    // and spark.sqlContext instead.
    val rddJson = sparkContext.parallelize(Seq(json))
    var df = sqlContext.read.json(rddJson)

    // Lift the nested struct fields into top-level columns and explode
    // the states array so that each array element becomes its own row.
    df = df.withColumn("prodID", df("level.productReference.prodID"))
      .withColumn("unitOfMeasure", df("level.productReference.unitOfMeasure"))
      .withColumn("states", explode(df("level.states")))
      .drop("level")

    // Flatten the exploded struct into plain columns.
    df = df.withColumn("state", df("states.state"))
      .withColumn("effectiveDateTime", df("states.effectiveDateTime"))
      .withColumn("quantity", df("states.stockQuantity.quantity"))
      .withColumn("stockKeepingLevel", df("states.stockQuantity.stockKeepingLevel"))
      .drop("states")

    df.show(false)
    

    This will give output as:

    +------+-------------+-----+-------------------------+--------+-----------------+
    |prodID|unitOfMeasure|state|effectiveDateTime        |quantity|stockKeepingLevel|
    +------+-------------+-----+-------------------------+--------+-----------------+
    |1234  |EA           |SELL |2015-10-09T00:55:23.6345Z|1400.0  |A                |
    |1234  |EA           |HELD |2015-10-09T00:55:23.6345Z|800.0   |B                |
    +------+-------------+-----+-------------------------+--------+-----------------+
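
    As a side note, not part of the original answer: the same flattening can be written as two select calls instead of chained withColumn calls, which some find easier to read. A sketch, reusing rddJson from above:

    val flat = sqlContext.read.json(rddJson)
      .select(
        col("level.productReference.prodID").as("prodID"),
        col("level.productReference.unitOfMeasure").as("unitOfMeasure"),
        explode(col("level.states")).as("states"))
      .select(
        col("prodID"),
        col("unitOfMeasure"),
        col("states.state").as("state"),
        col("states.effectiveDateTime").as("effectiveDateTime"),
        col("states.stockQuantity.quantity").as("quantity"),
        col("states.stockQuantity.stockKeepingLevel").as("stockKeepingLevel"))
    flat.show(false)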
    

    Now that you have the desired output as a DataFrame, converting it to an RDD is just a matter of calling .rdd:

    df.rdd.foreach(println)
    

    will give output as below

    [1234,EA,SELL,2015-10-09T00:55:23.6345Z,1400.0,A]
    [1234,EA,HELD,2015-10-09T00:55:23.6345Z,800.0,B]
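
    If you need typed records rather than generic Row objects, here is a minimal sketch; the StockState case class is my own addition, not from the question, and the getAs calls use the column names of the DataFrame built above:

    case class StockState(prodID: String, unitOfMeasure: String, state: String,
                          effectiveDateTime: String, quantity: Double,
                          stockKeepingLevel: String)

    // Map each Row to a typed record by reading its fields by name.
    val typedRdd = df.rdd.map { row =>
      StockState(
        row.getAs[String]("prodID"),
        row.getAs[String]("unitOfMeasure"),
        row.getAs[String]("state"),
        row.getAs[String]("effectiveDateTime"),
        row.getAs[Double]("quantity"),
        row.getAs[String]("stockKeepingLevel"))
    }
    typedRdd.foreach(println)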
    

    I hope this is helpful.
