How to rename my JSON generated by pyspark?

我们两清 提交于 2020-12-25 05:48:02

问题


When i write my JSON file with

dataframe.coalesce(1).write.format('json')

on pyspark im not able to change the name of file in the partition

Im writing my JSON like that:

dataframe.coalesce(1).write.format('json').mode('overwrite').save('path')

but im not able to change the name of file in the partition

I want the path like that:

/folder/my_name.json

where 'my_name.json' is a json file


回答1:


In spark we can't control name of the file written to the directory.

First write the data to the HDFS directory then For changing the name of file we need to use HDFS api.

Example:

In Pyspark:

l=[("a",1)]
ll=["id","sa"]
df=spark.createDataFrame(l,ll)

hdfs_dir = "/folder/" #your hdfs directory
new_filename="my_name.json" #new filename

df.coalesce(1).write.format("json").mode("overwrite").save(hdfs_dir)

fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())

#list files in the directory

list_status = fs.listStatus(spark._jvm.org.apache.hadoop.fs.Path(hdfs_dir))

#filter name of the file starts with part-

file_name = [file.getPath().getName() for file in list_status if file.getPath().getName().startswith('part-')][0]

#rename the file

fs.rename(Path(hdfs_dir+''+file_name),Path(hdfs_dir+''+new_filename))

In case if you want to delete success files in the directory use fs.delete to delete _Success files.

In Scala:

val df=Seq(("a",1)).toDF("id","sa")
df.show(false)

import org.apache.hadoop.fs._

val hdfs_dir = "/folder/"
val new_filename="new_json.json"

df.coalesce(1).write.mode("overwrite").format("json").save(hdfs_dir)

val fs=FileSystem.get(sc.hadoopConfiguration)
val f=fs.globStatus(new Path(s"${hdfs_dir}" + "*")).filter(x => x.getPath.getName.toString.startsWith("part-")).map(x => x.getPath.getName).mkString

fs.rename(new Path(s"${hdfs_dir}${f}"),new Path(s"${hdfs_dir}${new_filename}"))

fs.delete(new Path(s"${hdfs_dir}" + "_SUCCESS"))


来源:https://stackoverflow.com/questions/58171365/how-to-rename-my-json-generated-by-pyspark

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!