How to save a partitioned parquet file in Spark 2.1?


Interesting since...well..."it works for me".

As you describe your dataset with the SimpleTest case class, in Spark 2.1 you are just an import spark.implicits._ away from having a typed Dataset.

In my case, spark is the SparkSession (what the question calls sql).

In other words, you don't have to create testDataP and testDf (using sql.createDataFrame).

import spark.implicits._
...
// toDS turns the local collection of SimpleTest into a typed Dataset
val testDf = testData.toDS
testDf.write.partitionBy("id", "key").parquet("/path/to/file")
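
Expanded into a self-contained sketch for reference (the SimpleTest fields and sample values here are assumptions reconstructed from the partition layout below, not the question's exact definitions):

import org.apache.spark.sql.SparkSession

// Hypothetical case class: fields inferred from the id=/key= directories below.
case class SimpleTest(id: String, key: Int, value: String)

val spark = SparkSession.builder()
  .appName("partitioned-parquet")
  .getOrCreate()
import spark.implicits._

val testData = Seq(
  SimpleTest("test", 1, "a"), SimpleTest("test", 2, "b"), SimpleTest("test", 3, "c"),
  SimpleTest("simple", 1, "d"), SimpleTest("simple", 2, "e"), SimpleTest("simple", 3, "f"))

// toDS yields a typed Dataset[SimpleTest]; partitionBy moves id and key out of
// the data files and into the directory structure (id=.../key=...).
val testDf = testData.toDS
testDf.write.partitionBy("id", "key").parquet("/tmp/testDf")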

In another terminal (after saving to the /tmp/testDf directory):

$ tree /tmp/testDf/
/tmp/testDf/
├── _SUCCESS
├── id=simple
│   ├── key=1
│   │   └── part-00003-35212fd3-44cf-4091-9968-d9e2e05e5ac6.c000.snappy.parquet
│   ├── key=2
│   │   └── part-00004-35212fd3-44cf-4091-9968-d9e2e05e5ac6.c000.snappy.parquet
│   └── key=3
│       └── part-00005-35212fd3-44cf-4091-9968-d9e2e05e5ac6.c000.snappy.parquet
└── id=test
    ├── key=1
    │   └── part-00000-35212fd3-44cf-4091-9968-d9e2e05e5ac6.c000.snappy.parquet
    ├── key=2
    │   └── part-00001-35212fd3-44cf-4091-9968-d9e2e05e5ac6.c000.snappy.parquet
    └── key=3
        └── part-00002-35212fd3-44cf-4091-9968-d9e2e05e5ac6.c000.snappy.parquet

8 directories, 7 files
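
When you read the directory back, Spark rebuilds id and key as columns from the directory names, and filters on them prune whole subdirectories. A quick check, assuming the session and path from the sketch above:

// Partition discovery turns the id=.../key=... directories back into columns.
val readBack = spark.read.parquet("/tmp/testDf")
readBack.printSchema()

// A filter on a partition column only scans the matching key=2 directories.
readBack.filter($"key" === 2).show()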

I found a solution! According to Cloudera, it is a mapred-site.xml configuration problem (check the link below). Also, instead of writing the DataFrame as: testDf.write.partitionBy("id", "key").parquet("/path/to/file")

I did it as follows: testDf.write.partitionBy("id", "key").parquet("hdfs://<namenode>:<port>/path/to/file"). Substitute <namenode> and <port> with the hostname and port of the HDFS NameNode, respectively.
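
As an alternative to hard-coding the URI in every path, you can point Spark's Hadoop configuration at the cluster once, so scheme-less paths resolve to HDFS. A minimal sketch with placeholder host and port (spark.hadoop.* settings are copied into the underlying Hadoop Configuration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partitioned-parquet")
  // Placeholder values: use your cluster's actual NameNode host and port.
  .config("spark.hadoop.fs.defaultFS", "hdfs://<namenode>:<port>")
  .getOrCreate()

// With fs.defaultFS set, a bare path resolves to HDFS, not the local filesystem.
testDf.write.partitionBy("id", "key").parquet("/path/to/file")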

Special thanks to @jacek-laskowski for his valuable contribution.

References:

https://community.cloudera.com/t5/Batch-SQL-Apache-Hive/MKDirs-failed-to-create-file/m-p/36363#M1090

Writing to HDFS in Spark/Scala
