Spark: what options can be passed with DataFrame.saveAsTable or DataFrameWriter.options?

Submitted by 自闭症网瘾萝莉.ら on 2020-02-26 06:53:45

Question


Neither the developer docs nor the API documentation includes any reference to what options can be passed to DataFrame.saveAsTable or DataFrameWriter.options, or to how they affect the saving of a Hive table.

My hope is that in the answers to this question we can aggregate information that would be helpful to Spark developers who want more control over how Spark saves tables and, perhaps, provide a foundation for improving Spark's documentation.


Answer 1:


The reason you don't see options documented anywhere is that they are format-specific, and developers can keep creating custom write formats with new sets of options.

However, for a few built-in formats I have listed the option classes as they appear in the Spark source code itself:

  • CSVOptions
  • JDBCOptions
  • JSONOptions
  • ParquetOptions
  • TextOptions
  • OrcOptions
  • AvroOptions



Answer 2:


Take a look at the class "DeltaOptions": https://github.com/delta-io/delta/blob/master/src/main/scala/org/apache/spark/sql/delta/DeltaOptions.scala

Currently, supported options are:

  • replaceWhere
  • mergeSchema
  • overwriteSchema
  • maxFilesPerTrigger
  • excludeRegex
  • ignoreFileDeletion
  • ignoreChanges
  • ignoreDeletes
  • optimizeWrite
  • dataChange
  • queryName
  • checkpointLocation
  • path
  • timestampAsOf
  • versionAsOf
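A quick sketch of passing a couple of these Delta options (option names from DeltaOptions above; the predicate and output path are made up):

```python
# Delta writer options are passed the same way as any other format's options.
delta_options = {
    "replaceWhere": "date >= '2020-01-01'",  # overwrite only matching rows
    "mergeSchema": "true",                   # allow new columns on write
}

# With delta-core on the classpath this would be (not run here):
# df.write.format("delta").mode("overwrite").options(**delta_options) \
#     .save("/tmp/delta/events")
```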



Answer 3:


According to the source code, you can specify the path option (it indicates where to store the Hive external table's data in HDFS, and is translated to LOCATION in Hive DDL). I'm not sure whether other options are associated with saveAsTable, but I'll keep searching.
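To illustrate the path option: passing it makes saveAsTable create an external table whose data lives at the given location. The HDFS path and table name here are hypothetical.

```python
# "path" turns the saved table into an external table; the value becomes
# the LOCATION clause in the generated Hive DDL.
external_opts = {"path": "hdfs:///user/hive/external/events"}

# With an active SparkSession (not run here):
# df.write.format("parquet").options(**external_opts).saveAsTable("events_ext")
```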




Answer 4:


As per the latest Spark documentation, the following are the options that can be passed while writing a DataFrame to external storage using the .saveAsTable(name, format=None, mode=None, partitionBy=None, **options) API.

If you click the source hyperlink on the right-hand side of the documentation, you can traverse to the details of the less obvious arguments, e.g. format and options, which are described under the DataFrameWriter class.

So when the document reads options – all other string options, it is referring to options, which as of Spark 2.4.4 gives you the following option:

timeZone: sets the string that indicates a timezone to be used to format timestamps in the JSON/CSV datasources or partition values. If it isn’t set, it uses the default value, session local timezone.

And when it reads format – the format used to save, it is referring to format(source):

Specifies the underlying output data source.

Parameters: source – string, name of the data source, e.g. ‘json’, ‘parquet’.

Hope this was helpful.
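Putting the documented arguments together, a call might look like the sketch below: format selects the data source, and extra keyword options (here timeZone, the one documented for Spark 2.4.4) are passed through as string options. The table name and column are hypothetical.

```python
# Keyword arguments beyond format/mode/partitionBy flow through **options
# as string key/value pairs.
kwargs = {
    "format": "json",
    "mode": "overwrite",
    "partitionBy": "date",   # hypothetical partition column
    "timeZone": "UTC",       # documented string option as of Spark 2.4.4
}

# With an active SparkSession (not run here):
# df.write.saveAsTable("my_db.events", **kwargs)
```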




Answer 5:


The difference is between the versions.

We have the following in Spark 2:

createOrReplaceTempView()
createTempView()
createOrReplaceGlobalTempView()
createGlobalTempView()

The old DataFrame.saveAsTable was deprecated; in Spark 2 you use df.write.saveAsTable (on DataFrameWriter) instead.

Basically, these are divided depending on the scope and lifetime of the view. Please refer to the link.
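To summarize the scope differences among the methods above (method names from pyspark.sql.DataFrame; the descriptions are a sketch):

```python
# How the Spark 2 view-creation methods differ in scope.
view_scopes = {
    "createTempView": "current session; raises if the view already exists",
    "createOrReplaceTempView": "current session; replaces an existing view",
    "createGlobalTempView": "all sessions, under the global_temp database",
    "createOrReplaceGlobalTempView": "all sessions; replaces an existing view",
}

# Unlike saveAsTable, none of these persist data to the metastore;
# they only register the DataFrame's logical plan under a name.
```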




Answer 6:


saveAsTable(String tableName)

Saves the content of the DataFrame as the specified table.

FYI -> https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/DataFrameWriter.html



Source: https://stackoverflow.com/questions/31487254/spark-what-options-can-be-passed-with-dataframe-saveastable-or-dataframewriter
