Config file to define JSON Schema Structure in PySpark

盖世英雄少女心 2020-12-07 01:55

I have created a PySpark application that reads a JSON file into a DataFrame using a predefined schema. Code sample below:

schema = StructType([
    StructFiel         


        
2 answers
  •  悲哀的现实
    2020-12-07 02:13

    You can create a JSON file named schema.json in the following format:

    {
      "fields": [
        {
          "metadata": {},
          "name": "first_fields",
          "nullable": true,
          "type": "string"
        },
        {
          "metadata": {},
          "name": "double_field",
          "nullable": true,
          "type": "double"
        }
      ],
      "type": "struct"
    }
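
If the schema is assembled by hand, a small helper can generate a file in this layout. The sketch below is only an illustration using the standard library: `build_schema_dict` and the list of `(name, type)` pairs are hypothetical names introduced here, not part of PySpark, and it assumes every field is nullable with empty metadata, as in the example above.

```python
import json


def build_schema_dict(columns):
    """Build a dict in the layout shown above from (name, type) pairs.

    Hypothetical helper: assumes every field is nullable and has no metadata.
    """
    return {
        "fields": [
            {"metadata": {}, "name": name, "nullable": True, "type": type_name}
            for name, type_name in columns
        ],
        "type": "struct",
    }


schema_dict = build_schema_dict(
    [("first_fields", "string"), ("double_field", "double")]
)

# Serialize the dict; this text can be saved as schema.json unchanged.
schema_text = json.dumps(schema_dict, indent=2, sort_keys=True)
print(schema_text)
```

Keeping the column list in one place means the JSON file never has to be edited by hand when a field is added.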
    

    Then read this file and build a StructType schema from it:

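    import json
    from pyspark.sql.types import StructType

    # wholeTextFiles returns (path, content) pairs; take the content of the first file
    rdd = spark.sparkContext.wholeTextFiles("s3:///schema.json")
    text = rdd.collect()[0][1]
    schema_dict = json.loads(text)  # avoid shadowing the built-in dict
    custom_schema = StructType.fromJson(schema_dict)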
    

    After that, you can pass this struct as the schema when reading the JSON file:

    df = spark.read.json("path", schema=custom_schema)
    
