Question
Neither the developer nor the API documentation includes any reference to what options can be passed in DataFrame.saveAsTable or DataFrameWriter.options, or to how they affect the saving of a Hive table.
My hope is that in the answers to this question we can aggregate information that would be helpful to Spark developers who want more control over how Spark saves tables and, perhaps, provide a foundation for improving Spark's documentation.
Answer 1:
The reason you don't see options documented anywhere is that they are format-specific, and developers can keep creating custom write formats with new sets of options.
However, for a few of the supported formats, I have listed the option classes as they appear in the Spark code itself:
- CSVOptions
- JDBCOptions
- JSONOptions
- ParquetOptions
- TextOptions
- OrcOptions
- AvroOptions
Answer 2:
Take a look at the DeltaOptions class: https://github.com/delta-io/delta/blob/master/src/main/scala/org/apache/spark/sql/delta/DeltaOptions.scala
Currently, supported options are:
- replaceWhere
- mergeSchema
- overwriteSchema
- maxFilesPerTrigger
- excludeRegex
- ignoreFileDeletion
- ignoreChanges
- ignoreDeletes
- optimizeWrite
- dataChange
- queryName
- checkpointLocation
- path
- timestampAsOf
- versionAsOf
Answer 3:
According to the source code, you can specify the path option (it indicates where to store the Hive external table's data in HDFS and is translated to LOCATION in Hive DDL).
I'm not sure whether there are other options associated with saveAsTable, but I'll keep searching.
Answer 4:
As per the latest Spark documentation, the following are the options that can be passed while writing a DataFrame to external storage using the .saveAsTable(name, format=None, mode=None, partitionBy=None, **options) API.
If you click the source hyperlink on the right-hand side of the documentation, you can traverse to the details of the less obvious arguments, e.g. format and options, which are described under the DataFrameWriter class.
So when the documentation reads "options – all other string options", it is referring to options, which as of Spark 2.4.4 gives you the following option:
timeZone: sets the string that indicates a timezone to be used to format timestamps in the JSON/CSV datasources or partition values. If it isn’t set, it uses the default value, session local timezone.
And when it reads "format – the format used to save", it is referring to format(source), which specifies the underlying output data source.
Parameters: source – string, name of the data source, e.g. 'json', 'parquet'.
Hope this was helpful.
Answer 5:
The difference is between the versions.
We have the following in Spark 2:
createOrReplaceTempView()
createTempView()
createOrReplaceGlobalTempView()
createGlobalTempView()
The Spark 1.x DataFrame.saveAsTable method is gone in Spark 2; use DataFrameWriter.saveAsTable (i.e. df.write.saveAsTable) instead.
Basically, these are divided depending on the availability of the table. Please refer to the link.
Answer 6:
saveAsTable(String tableName)
Saves the content of the DataFrame as the specified table.
FYI -> https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/DataFrameWriter.html
Source: https://stackoverflow.com/questions/31487254/spark-what-options-can-be-passed-with-dataframe-saveastable-or-dataframewriter