DataFrame explode list of JSON objects

故里飘歌 2020-12-16 07:35

I have JSON data in the following format:

{
     "date": 100,
     "userId": 1,
     "data": [
         {
             "timeStamp": 101,
             ...


        
1 answer
  • 2020-12-16 07:56

    "The resulting schema is correct, but I get every value twice"

    While the schema is correct, the output you've provided doesn't reflect the actual result. In practice you'll get the Cartesian product of timeStamp and reading for each input row.

    "My feeling is that there is something about the lazy evaluation"

    No, it has nothing to do with lazy evaluation. The way you use explode is simply wrong. To understand what is going on, let's trace the execution for date equal to 100:

    val df100 = df.where($"date" === 100)
    

    Going step by step, the first explode generates two rows, one for reading 1 and one for reading 2:

    val df100WithReading = df100.withColumn("reading", explode(df("data.reading")))
    
    df100WithReading.show
    // +------------------+----+------+-------+
    // |              data|date|userId|reading|
    // +------------------+----+------+-------+
    // |[[1,101], [2,102]]| 100|     1|      1|
    // |[[1,101], [2,102]]| 100|     1|      2|
    // +------------------+----+------+-------+
    

    The second explode generates two rows (timeStamp equal to 101 and 102) for each row from the previous step:

    val df100WithReadingAndTs = df100WithReading
      .withColumn("timeStamp", explode(df("data.timeStamp")))
    
    df100WithReadingAndTs.show
    // +------------------+----+------+-------+---------+
    // |              data|date|userId|reading|timeStamp|
    // +------------------+----+------+-------+---------+
    // |[[1,101], [2,102]]| 100|     1|      1|      101|
    // |[[1,101], [2,102]]| 100|     1|      1|      102|
    // |[[1,101], [2,102]]| 100|     1|      2|      101|
    // |[[1,101], [2,102]]| 100|     1|      2|      102|
    // +------------------+----+------+-------+---------+
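
    The two steps traced above can also be modelled without Spark, using plain Scala collections. This is only an illustrative sketch: Sample, InputRow and ExplodeModel are made-up names, and explodeTwice / explodeOnce imitate the row multiplication of explode rather than calling it.

```scala
// Plain-Scala model of the trace above (no Spark needed).
// Sample, InputRow and ExplodeModel are hypothetical names for illustration only.
final case class Sample(reading: Int, timeStamp: Int)
final case class InputRow(date: Int, userId: Int, data: Seq[Sample])

object ExplodeModel {
  // The input row with date equal to 100, as in the tables above.
  val row100: InputRow =
    InputRow(date = 100, userId = 1, data = Seq(Sample(1, 101), Sample(2, 102)))

  // Two independent explodes: every reading is paired with every timeStamp.
  def explodeTwice(row: InputRow): Seq[(Int, Int)] =
    for {
      r  <- row.data.map(_.reading)   // first explode: one row per reading
      ts <- row.data.map(_.timeStamp) // second explode: applied to each of those rows
    } yield (r, ts)

  // A single explode over the array of structs keeps each element's fields together.
  def explodeOnce(row: InputRow): Seq[(Int, Int)] =
    row.data.map(d => (d.reading, d.timeStamp))

  def main(args: Array[String]): Unit = {
    println(explodeTwice(row100)) // List((1,101), (1,102), (2,101), (2,102))
    println(explodeOnce(row100))  // List((1,101), (2,102))
  }
}
```

    The nested for-comprehension is what produces four pairs instead of two, which is the Cartesian product described earlier.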
    

    If you want correct results, explode data first and select the fields afterwards:

    val exploded = df.withColumn("data", explode($"data"))
      .select($"userId", $"date",
        $"data".getItem("reading"),  $"data".getItem("timestamp"))
    
    exploded.show
    // +------+----+-------------+---------------+
    // |userId|date|data[reading]|data[timestamp]|
    // +------+----+-------------+---------------+
    // |     1| 100|            1|            101|
    // |     1| 100|            2|            102|
    // |     1| 200|            3|            201|
    // |     1| 200|            4|            202|
    // +------+----+-------------+---------------+
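
    As a variation on the snippet above, here is a self-contained sketch, assuming Spark 2.x or later running in local mode. The case classes Reading and Record, the object ExplodeOnce, its method names and the alias d are all made up for this example; the sample values are the ones shown in the output above. Aliasing the exploded struct and the selected fields should give plain column names (reading, timeStamp) instead of data[reading] and data[timestamp].

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical case classes mirroring the schema used in the answer.
final case class Reading(reading: Long, timeStamp: Long)
final case class Record(date: Long, userId: Long, data: Seq[Reading])

object ExplodeOnce {
  // Sample rows taken from the output tables above.
  def sampleData(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      Record(100L, 1L, Seq(Reading(1L, 101L), Reading(2L, 102L))),
      Record(200L, 1L, Seq(Reading(3L, 201L), Reading(4L, 202L)))
    ).toDF()
  }

  // Explode the array of structs once, then pull the fields out of each struct.
  def explodeReadings(df: DataFrame): DataFrame =
    df.select(col("userId"), col("date"), explode(col("data")).as("d"))
      .select(
        col("userId"),
        col("date"),
        col("d.reading").as("reading"),
        col("d.timeStamp").as("timeStamp"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local mode, for illustration only
      .appName("explode-once")
      .getOrCreate()

    explodeReadings(sampleData(spark)).show() // one row per element of data
    spark.stop()
  }
}
```

    Because data is exploded exactly once, each output row carries the reading and timeStamp of the same array element.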
    