Assign SQL schema to Spark DataFrame

百般思念 提交于 2019-12-23 03:27:16

问题


I'm converting my team's legacy Redshift SQL code to Spark SQL code. All the Spark examples I've seen define the schema in a non-SQL way using StructType and StructField and I'd prefer to define the schema in SQL, since most of my users know SQL but not Spark.

This is the ugly workaround I'm doing now. Is there a more elegant way that doesn't require defining an empty table just so that I can pull the SQL schema?

create_table_sql = '''
CREATE TABLE public.example (
  id LONG,
  example VARCHAR(80)
)'''
spark.sql(create_table_sql)
schema = spark.sql("DESCRIBE public.example").collect()
s3_data = spark.read.\
option("delimiter", "|")\
.csv(
    path="s3a://"+s3_bucket_path,
    schema=schema
)\
.saveAsTable('public.example')

回答1:


Yes there is a way to create schema from string although I am not sure if it really looks like SQL! So you can use:

from pyspark.sql.types import _parse_datatype_string

_parse_datatype_string("id: long, example: string")

This will create the next schema:

  StructType(List(StructField(id,LongType,true),StructField(example,StringType,true)))

Or you may have a complex schema as well:

schema = _parse_datatype_string("customers array<struct<id: long, name: string, address: string>>")

StructType(
  List(StructField(
    customers,ArrayType(
      StructType(
        List(
          StructField(id,LongType,true),
          StructField(name,StringType,true),
          StructField(address,StringType,true)
        )
      ),true),true)
  )
)

You can check for more examples here



来源:https://stackoverflow.com/questions/55972609/assign-sql-schema-to-spark-dataframe

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!