How to write a DataFrame (obtained from a Hive table) into Hadoop SequenceFile and RCFile?


Question:

I am able to write it into

  • ORC
  • PARQUET

directly, and into

  • TEXTFILE
  • AVRO

using additional dependencies from Databricks:

    <dependency>
        <groupId>com.databricks</groupId>
        <artifactId>spark-csv_2.10</artifactId>
        <version>1.5.0</version>
    </dependency>
    <dependency>
        <groupId>com.databricks</groupId>
        <artifactId>spark-avro_2.10</artifactId>
        <version>2.0.1</version>
    </dependency>

Sample code:

    SparkContext sc = new SparkContext(conf);
    HiveContext hc = new HiveContext(sc);
    DataFrame df = hc.table(hiveTableName);
    df.printSchema();
    DataFrameWriter writer = df.repartition(1).write();

    if ("ORC".equalsIgnoreCase(hdfsFileFormat)) {
        writer.orc(outputHdfsFile);
    } else if ("PARQUET".equalsIgnoreCase(hdfsFileFormat)) {
        writer.parquet(outputHdfsFile);
    } else if ("TEXTFILE".equalsIgnoreCase(hdfsFileFormat)) {
        writer.format("com.databricks.spark.csv").option("header", "true").save(outputHdfsFile);
    } else if ("AVRO".equalsIgnoreCase(hdfsFileFormat)) {
        writer.format("com.databricks.spark.avro").save(outputHdfsFile);
    }

Is there any way to write a DataFrame into Hadoop SequenceFile and RCFile?

Answer 1:

You can use void saveAsObjectFile(String path) to save an RDD as a SequenceFile of serialized objects. So in your case you have to retrieve the RDD from the DataFrame:

    JavaRDD<Row> rdd = df.javaRDD();
    rdd.saveAsObjectFile(outputHdfsFile);
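Note that saveAsObjectFile writes NullWritable keys and BytesWritable values holding Java-serialized batches of rows, so the resulting SequenceFile is mainly useful for reading back with objectFile. If you need a SequenceFile with explicit key/value types instead, here is a minimal sketch, assuming the df and outputHdfsFile variables from the question; the choice of the first column as key and the comma-joined row as value is purely illustrative:

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.function.PairFunction;
    import org.apache.spark.sql.Row;
    import scala.Tuple2;

    // Map each Row to a (Text, Text) pair: first column as key,
    // comma-joined row as value (both choices are illustrative).
    JavaPairRDD<Text, Text> pairs = df.javaRDD().mapToPair(
            new PairFunction<Row, Text, Text>() {
                @Override
                public Tuple2<Text, Text> call(Row row) {
                    return new Tuple2<Text, Text>(
                            new Text(String.valueOf(row.get(0))),
                            new Text(row.mkString(",")));
                }
            });

    // Write a SequenceFile with Text keys and Text values.
    pairs.saveAsNewAPIHadoopFile(outputHdfsFile, Text.class, Text.class,
            SequenceFileOutputFormat.class);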
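The answer above does not cover RCFile, and DataFrameWriter has no built-in RCFile format. One possible workaround, sketched here and not part of the original answer, is to let Hive write the data through a CREATE TABLE ... AS SELECT statement; the table name my_rcfile_table is hypothetical, and the output lands in the Hive warehouse directory rather than at an arbitrary HDFS path:

    // Hypothetical target table; Hive stores its data in RCFile format.
    hc.sql("CREATE TABLE my_rcfile_table STORED AS RCFILE AS SELECT * FROM " + hiveTableName);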

