Spark read parquet with custom schema

问题

I'm trying to import data with parquet format with custom schema but it returns : TypeError: option() missing 1 required positional argument: 'value'

   ProductCustomSchema = StructType([
        StructField("id_sku", IntegerType(), True),
        StructField("flag_piece", StringType(), True),
        StructField("flag_weight", StringType(), True),
        StructField("ds_sku", StringType(), True),
        StructField("qty_pack", FloatType(), True)])

def read_parquet_(path, schema) : 
    return spark.read.format("parquet")\
                             .option(schema)\
                             .option("timestampFormat", "yyyy/MM/dd HH:mm:ss")\
                             .load(path)

product_nomenclature = 'C:/Users/alexa/Downloads/product_nomenc'
product_nom = read_parquet_(product_nomenclature, ProductCustomSchema)

回答1:

As mentioned in the comments you should change .option(schema) to .schema(schema). option() requires you to specify a key (the name of the option you're setting) and a value (what value you want to assign to that option). You are getting the TypeError because you were just passing a variable called schema to option without specifying what that option you were actually trying to set with that variable.

The QueryExecutionException you posted in the comments is being raised because the schema you've defined in your schema variable does not match the data in your DataFrame. If you're going to specify a custom schema you must make sure that schema matches the data you are reading. In your example the column id_sku is stored as a BinaryType, but in your schema you're defining the column as an IntegerType. pyspark will not try to reconcile differences between the schema you provide and what the actual types are in the data and an exception will be thrown.

To fix your error make sure the schema you're defining correctly represents your data as it is stored in the parquet file (i.e. change the datatype of id_sku in your schema to be BinaryType). The benefit to doing this is you get a slight performance gain by not having to infer the file schema each time the parquet file is read.

来源：https://stackoverflow.com/questions/52387021/spark-read-parquet-with-custom-schema

标签

apache-spark

pyspark

apache-spark-sql