How to specify only particular fields using read.schema in JSON: Spark Scala


The JSON can be loaded with the following code using a predefined schema, so Spark does not need to scan the gzipped file to infer one. The code in the question is ambiguous.

import org.apache.spark.sql.types._

// Schema for each element of the "inputs" array.
val input = StructType(
    Array(
        StructField("inputType", StringType, true),
        StructField("originalRating", LongType, true),
        StructField("processed", BooleanType, true),
        StructField("rating", LongType, true),
        StructField("score", DoubleType, true),
        StructField("methodId", StringType, true)
    )
)

// Top-level schema; "inputs" is an array of the struct defined above.
val schema = StructType(
    Array(
        StructField("requestId", StringType, true),
        StructField("siteName", StringType, true),
        StructField("model", StringType, true),
        StructField("inputs", ArrayType(input, true), true)
    )
)

val records = sqlContext.read.schema(schema).json("s3://testData/test2.gz")

Not all fields need to be provided: with an explicit schema, Spark reads only the declared fields and drops the rest. That said, it is good to declare all of them when possible.
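As a minimal sketch of this (reusing the same S3 path from above), a schema declaring only two of the top-level fields yields a DataFrame with just those two columns:

// A partial schema: only the fields declared here are read.
val partialSchema = StructType(
    Array(
        StructField("requestId", StringType, true),
        StructField("model", StringType, true)
    )
)

val partial = sqlContext.read.schema(partialSchema).json("s3://testData/test2.gz")
partial.printSchema()  // only requestId and model appear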

Spark tries its best to parse every row. If a row is not valid and the input is a plain line-delimited JSON file, Spark adds a _corrupt_record column that contains the whole raw row.
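A sketch of how to capture those rows, assuming the schema defined above: when you supply a schema yourself, the _corrupt_record column usually has to be added to it explicitly, and some Spark versions disallow filtering on that column without caching first.

// Append the corrupt-record column to the user-supplied schema so that,
// in the default PERMISSIVE mode, unparseable rows are kept as raw text.
val schemaWithCorrupt = StructType(schema.fields :+ StructField("_corrupt_record", StringType, true))

val parsed = sqlContext.read
    .schema(schemaWithCorrupt)
    .json("s3://testData/test2.gz")
    .cache()  // some Spark versions require caching before querying _corrupt_record

parsed.filter("_corrupt_record is not null").show(false)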
