Spark writing Parquet array<string> converts to a different datatype when loading into BigQuery

Submitted by 让人想犯罪 on 2021-01-29 07:31:20

Question


Spark DataFrame Schema:

    StructType([
        StructField("a", StringType(), False),
        StructField("b", StringType(), True),
        StructField("c", BinaryType(), False),
        StructField("d", ArrayType(StringType(), False), True),
        StructField("e", TimestampType(), True)
    ])

When I write the DataFrame to Parquet and load it into BigQuery, BigQuery interprets the schema differently. The job is a simple read from JSON followed by a write to Parquet using a Spark DataFrame.
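
For reference, a minimal sketch of the pipeline described above, assuming PySpark; the session setup and the input/output paths are placeholders, not taken from the question:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StructType, StructField, StringType,
                                   BinaryType, ArrayType, TimestampType)

    spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

    schema = StructType([
        StructField("a", StringType(), False),
        StructField("b", StringType(), True),
        StructField("c", BinaryType(), False),
        StructField("d", ArrayType(StringType(), False), True),
        StructField("e", TimestampType(), True)
    ])

    # Read JSON with the explicit schema, then write it out as Parquet.
    df = spark.read.schema(schema).json("gs://my-bucket/input/")    # placeholder path
    df.write.mode("overwrite").parquet("gs://my-bucket/output/")    # placeholder path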

BigQuery Schema:

    [
        {
            "type": "STRING",
            "name": "a",
            "mode": "REQUIRED"
        },
        {
            "type": "STRING",
            "name": "b",
            "mode": "NULLABLE"
        },
        {
            "type": "BYTES",
            "name": "c",
            "mode": "REQUIRED"
        },
        {
            "fields": [
                {
                    "fields": [
                        {
                            "type": "STRING",
                            "name": "element",
                            "mode": "NULLABLE"
                        }
                    ],
                    "type": "RECORD",
                    "name": "list",
                    "mode": "REPEATED"
                }
            ],
            "type": "RECORD",
            "name": "d",
            "mode": "NULLABLE"
        },
        {
            "type": "TIMESTAMP",
            "name": "e",
            "mode": "NULLABLE"
        }
    ]
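
The extra "list"/"element" nesting is visible in the Parquet file itself: Spark writes arrays using Parquet's standard three-level LIST encoding (an outer group annotated LIST, a repeated inner group named "list", and a field named "element"), and BigQuery here surfaces those wrapper groups as RECORDs. A quick way to confirm what Spark actually wrote, assuming pyarrow is installed (the file name is a placeholder):

    import pyarrow.parquet as pq

    # Prints the Parquet schema; "d" should show up as a list<element: string>
    # backed by the list/element group structure described above.
    print(pq.read_schema("part-00000.parquet"))  # placeholder file name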

Is this something to do with the way Spark writes Parquet, or the way BigQuery reads it? Any idea how I can fix this?
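
One direction worth checking, not confirmed in the question itself, is BigQuery's Parquet list inference, which collapses the list/element wrapper groups back into a plain REPEATED field on load. A sketch using the google-cloud-bigquery Python client; the project, dataset, table, and URI are placeholders:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Ask BigQuery to treat the list/element groups as an ARRAY column.
    parquet_options = bigquery.format_options.ParquetOptions()
    parquet_options.enable_list_inference = True

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        parquet_options=parquet_options,
    )

    load_job = client.load_table_from_uri(
        "gs://my-bucket/output/*.parquet",     # placeholder URI
        "my-project.my_dataset.my_table",      # placeholder table
        job_config=job_config,
    )
    load_job.result()  # wait for the load to finish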

Source: https://stackoverflow.com/questions/53674838/spark-writing-parquet-arraystring-converts-to-a-different-datatype-when-loadin
