How to know the file formats supported by Databricks?

Submitted by 坚强是说给别人听的谎言 on 2019-12-14 03:26:45

Question


I have a requirement to load various files (of different types) into a Spark data frame. Are all of these file formats supported by Databricks? If yes, where can I get the list of options supported for each file format?

delimited
csv
parquet
avro
excel
json

Thanks


Answer 1:


I don't know exactly what Databricks offers out of the box (pre-installed), but you can do some reverse engineering using the org.apache.spark.sql.execution.datasources.DataSource object, which is (quoting the scaladoc):

The main class responsible for representing a pluggable Data Source in Spark SQL

Data sources usually register themselves using the DataSourceRegister interface (and use shortName to provide their alias):

Data sources should implement this trait so that they can register an alias to their data source.

Reading further in the scaladoc of DataSourceRegister, you'll find that:

This allows users to give the data source alias as the format type over the fully qualified class name.

So, YMMV.

Unless you find an authoritative answer on Databricks, you may want to (follow DataSource.lookupDataSource and) use Java's ServiceLoader.load method to find all registered implementations of the DataSourceRegister interface.

// start a Spark application with an external module that registers its own data source
$ ./bin/spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0-SNAPSHOT

import java.util.ServiceLoader
import scala.collection.JavaConverters._
import org.apache.spark.sql.sources.DataSourceRegister

val formats = ServiceLoader.load(classOf[DataSourceRegister])
formats.asScala.map(_.shortName).foreach(println)
orc
hive
libsvm
csv
jdbc
json
parquet
text
console
socket
kafka

Where can I get the list of options supported for each file format?

That's not possible, as there is no common API (like in Spark MLlib) for defining options. Every format declares its own set, unfortunately, so your best bet is to read the documentation or (more authoritatively) the source code of each format.
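To give a feel for how per-format options look in practice, here is a non-exhaustive sample of commonly used option names, as documented for the built-in Spark SQL sources (the values and the usage comment are illustrative only):

```scala
// Non-exhaustive sample of per-format read options for the built-in
// Spark SQL data sources; every format defines its own set.
val commonOptions: Map[String, Map[String, String]] = Map(
  "csv" -> Map(
    "header"      -> "true",  // first line holds column names
    "sep"         -> ",",     // field delimiter ("delimited" files are csv with a custom sep)
    "inferSchema" -> "true"   // sample the data to derive column types
  ),
  "json" -> Map(
    "multiLine" -> "true"     // a single JSON record may span several lines
  ),
  "parquet" -> Map(
    "mergeSchema" -> "true"   // reconcile schemas across part files
  )
)

// Such a map plugs straight into DataFrameReader, e.g.:
//   spark.read.options(commonOptions("csv")).csv("/path/to/file.csv")
```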




Answer 2:


All these formats are supported by Spark; for Excel files you can use the spark-excel library.
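A minimal sketch of what loading each format from the question might look like (not runnable standalone: it assumes a running SparkSession, and for the last two lines that the spark-avro and spark-excel packages are on the classpath; the paths are placeholders):

```scala
// Sketch only: requires a SparkSession plus the spark-avro and
// spark-excel packages; paths are hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("formats").getOrCreate()

val csv       = spark.read.option("header", "true").csv("/data/input.csv")
val delimited = spark.read.option("sep", "\t").csv("/data/input.tsv") // "delimited" = csv with a custom separator
val parquet   = spark.read.parquet("/data/input.parquet")
val json      = spark.read.json("/data/input.json")
val avro      = spark.read.format("avro").load("/data/input.avro")
val excel     = spark.read.format("com.crealytics.spark.excel")       // third-party spark-excel library
  .option("header", "true")
  .load("/data/input.xlsx")
```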



Source: https://stackoverflow.com/questions/44300499/how-to-know-the-file-formats-supported-by-databricks
