Question
Neither the developer nor the API documentation includes any reference to what options can be passed in DataFrame.saveAsTable or DataFrameWriter.options, or to how they affect the saving of a Hive table.
My hope is that in the answers to this question we can aggregate information that would be helpful to Spark developers who want more control over how Spark saves tables and, perhaps, provide a foundation for improving Spark's documentation.
Answer 1:
The reason you don't see options documented anywhere is that they are format-specific, and developers can keep creating custom write formats with new sets of options.
However, for a few of the supported formats, I have listed the option classes as they appear in the Spark code itself:
- CSVOptions
- JDBCOptions
- JSONOptions
- ParquetOptions
- TextOptions
- OrcOptions
- AvroOptions
Answer 2:
Take a look at the DeltaOptions class: https://github.com/delta-io/delta/blob/master/src/main/scala/org/apache/spark/sql/delta/DeltaOptions.scala
Currently, supported options are:
- replaceWhere
- mergeSchema
- overwriteSchema
- maxFilesPerTrigger
- excludeRegex
- ignoreFileDeletion
- ignoreChanges
- ignoreDeletes
- optimizeWrite
- dataChange
- queryName
- checkpointLocation
- path
- timestampAsOf
- versionAsOf
Answer 3:
According to the source code, you can specify the path option (it indicates where to store the Hive external table's data in HDFS and is translated to LOCATION in Hive DDL).
I'm not sure whether there are other options associated with saveAsTable, but I'll keep searching.
Answer 4:
As per the latest Spark documentation, the following are the options that can be passed while writing a DataFrame to external storage using the .saveAsTable(name, format=None, mode=None, partitionBy=None, **options) API.
If you click the source hyperlink on the right-hand side of the documentation, you can traverse to the details of the less obvious arguments, e.g. format and options, which are described under the DataFrameWriter class.
So when the documentation reads "options – all other string options", it is referring to options, which as of Spark 2.4.4 gives you the following option:
timeZone: sets the string that indicates a timezone to be used to format timestamps in the JSON/CSV datasources or partition values. If it isn’t set, it uses the default value, session local timezone.
And when it reads "format – the format used to save", it is referring to format(source), which specifies the underlying output data source.
Parameters: source – string, name of the data source, e.g. 'json', 'parquet'.
Hope this was helpful.
Answer 5:
The difference is between the versions.
We have the following in Spark 2:
createOrReplaceTempView()
createTempView()
createOrReplaceGlobalTempView()
createGlobalTempView()
The Spark 1.x DataFrame.saveAsTable method is gone in Spark 2; use DataFrameWriter.saveAsTable (i.e. df.write.saveAsTable) instead.
Basically, these are divided depending on the availability of the table. Please refer to the link.
Answer 6:
saveAsTable(String tableName)
Saves the content of the DataFrame as the specified table.
FYI -> https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/DataFrameWriter.html
Source: https://stackoverflow.com/questions/31487254/spark-what-options-can-be-passed-with-dataframe-saveastable-or-dataframewriter