I am able to write a DataFrame into ORC and PARQUET directly, and into TEXTFILE and AVRO using these additional dependencies from Databricks:
    <dependency>
        <groupId>com.databricks</groupId>
        <artifactId>spark-csv_2.10</artifactId>
        <version>1.5.0</version>
    </dependency>
    <dependency>
        <groupId>com.databricks</groupId>
        <artifactId>spark-avro_2.10</artifactId>
        <version>2.0.1</version>
    </dependency>
Sample code:
    import org.apache.spark.SparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.DataFrameWriter;
    import org.apache.spark.sql.hive.HiveContext;

    SparkContext sc = new SparkContext(conf);
    HiveContext hc = new HiveContext(sc);

    // Load the Hive table and repartition to 1 so the output lands in a single file.
    DataFrame df = hc.table(hiveTableName);
    df.printSchema();
    DataFrameWriter writer = df.repartition(1).write();

    // ORC and Parquet are supported natively; text (CSV) and Avro go through
    // the Databricks data sources declared above.
    if ("ORC".equalsIgnoreCase(hdfsFileFormat)) {
        writer.orc(outputHdfsFile);
    } else if ("PARQUET".equalsIgnoreCase(hdfsFileFormat)) {
        writer.parquet(outputHdfsFile);
    } else if ("TEXTFILE".equalsIgnoreCase(hdfsFileFormat)) {
        writer.format("com.databricks.spark.csv").option("header", "true").save(outputHdfsFile);
    } else if ("AVRO".equalsIgnoreCase(hdfsFileFormat)) {
        writer.format("com.databricks.spark.avro").save(outputHdfsFile);
    }
Is there any way to write a DataFrame into a Hadoop SequenceFile or RCFile?
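The only route I can see for SequenceFile is dropping down to the RDD API and writing Hadoop Writable pairs, and for RCFile going back through Hive itself. A minimal sketch of what I mean, assuming Spark 1.x with Java 8 (the NullWritable/Text pairing, the CSV-style encoding of each Row, and the table name my_rcfile_copy are my own illustrative choices, not something I have verified):

    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.sql.Row;
    import scala.Tuple2;

    // SequenceFile: encode each Row as a (NullWritable, Text) pair and save
    // through the old Hadoop output format API. No shuffle happens between
    // mapToPair and the save, so the Writables never need to be serialized.
    JavaPairRDD<NullWritable, Text> pairs = df.toJavaRDD().mapToPair(
        (Row row) -> new Tuple2<>(NullWritable.get(), new Text(row.mkString(","))));
    pairs.saveAsHadoopFile(outputHdfsFile, NullWritable.class, Text.class,
        SequenceFileOutputFormat.class);

    // RCFile: the only idea I have is to let Hive do the writing, e.g. via CTAS.
    hc.sql("CREATE TABLE my_rcfile_copy STORED AS RCFILE AS SELECT * FROM " + hiveTableName);

But this loses the uniform DataFrameWriter-based approach used for the other formats, so I am hoping there is something more direct.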