Assign SQL schema to Spark DataFrame

Question

I'm converting my team's legacy Redshift SQL code to Spark SQL. All the Spark examples I've seen define the schema in a non-SQL way, using StructType and StructField. I'd prefer to define the schema in SQL, since most of my users know SQL but not Spark.

This is the ugly workaround I'm using now. Is there a more elegant way that doesn't require creating an empty table just so that I can pull the SQL schema from it?

create_table_sql = '''
CREATE TABLE public.example (
  id LONG,
  example VARCHAR(80)
)'''
spark.sql(create_table_sql)
schema = spark.sql("DESCRIBE public.example").collect()
s3_data = spark.read.\
option("delimiter", "|")\
.csv(
    path="s3a://"+s3_bucket_path,
    schema=schema
)\
.saveAsTable('public.example')

Answer 1:

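Yes, you can create a schema from a string, although it doesn't look exactly like SQL. PySpark has a helper for this, _parse_datatype_string. Note that the leading underscore marks it as a private function, and it needs an active SparkContext to run: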

from pyspark.sql.types import _parse_datatype_string

_parse_datatype_string("id: long, example: string")

This produces the following schema:

  StructType(List(StructField(id,LongType,true),StructField(example,StringType,true)))

It also handles complex, nested schemas:

schema = _parse_datatype_string("customers array<struct<id: long, name: string, address: string>>")

StructType(
  List(StructField(
    customers,ArrayType(
      StructType(
        List(
          StructField(id,LongType,true),
          StructField(name,StringType,true),
          StructField(address,StringType,true)
        )
      ),true),true)
  )
)

More examples can be found in the docstring of _parse_datatype_string in pyspark/sql/types.py.
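Since the goal in the question is to reuse an existing CREATE TABLE statement, a minimal sketch of one way to do that follows. It relies on DataFrameReader.schema accepting a DDL-formatted string such as "id LONG, example STRING" as well as a StructType. The helpers ddl_columns and read_with_sql_schema are hypothetical names introduced here for illustration, not part of PySpark, and the parsing only handles a simple statement with a single parenthesized column list. How VARCHAR(n) is treated may depend on the Spark version; if the parser rejects a type, replacing it with STRING should be a safe fallback.

```python
import re


def ddl_columns(create_table_sql: str) -> str:
    """Return the column list of a simple CREATE TABLE statement as one DDL string.

    Hypothetical helper: takes everything between the first "(" and the last ")"
    and collapses whitespace. It does not handle constraints, comments, or
    clauses after the column list.
    """
    start = create_table_sql.index("(")
    end = create_table_sql.rindex(")")
    body = create_table_sql[start + 1:end]
    # Collapse newlines and runs of spaces into single spaces.
    return re.sub(r"\s+", " ", body).strip()


def read_with_sql_schema(spark, path, create_table_sql):
    """Read a pipe-delimited CSV using the columns of a CREATE TABLE statement.

    `spark` is assumed to be an existing SparkSession. The schema is passed
    as a DDL-formatted string instead of a StructType.
    """
    return (
        spark.read
        .option("delimiter", "|")
        .schema(ddl_columns(create_table_sql))
        .csv(path)
    )


create_table_sql = """
CREATE TABLE public.example (
  id LONG,
  example VARCHAR(80)
)"""

print(ddl_columns(create_table_sql))  # id LONG, example VARCHAR(80)
```

With a running session, something like read_with_sql_schema(spark, "s3a://" + s3_bucket_path, create_table_sql).write.saveAsTable("public.example") would then load the data without creating an empty table first or importing a private function.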

Source: https://stackoverflow.com/questions/55972609/assign-sql-schema-to-spark-dataframe

Tags

pyspark

apache-spark-sql