How to rename S3 files (not HDFS) in Spark Scala


Question


I have approximately 1 million text files stored in S3. I want to rename all of the files based on their folder names.

How can I do that in Spark Scala?

I am looking for some sample code .

I am using Zeppelin to run my Spark script.

I have tried the code below, as suggested in the answer:

import org.apache.hadoop.fs._

val src = new Path("s3://trfsmallfffile/FinancialLineItem/MAIN")
val dest = new Path("s3://trfsmallfffile/FinancialLineItem/MAIN/dest")
val conf = sc.hadoopConfiguration   // assuming sc = spark context
val fs = Path.getFileSystem(conf)
fs.rename(src, dest)

But I am getting the error below:

<console>:110: error: value getFileSystem is not a member of object org.apache.hadoop.fs.Path
       val fs = Path.getFileSystem(conf)

Answer 1:


You can use the normal Hadoop FileSystem APIs. Note that getFileSystem is an instance method on Path (there is no static Path.getFileSystem), so call it on the source path. Something like this (typed in, not tested):

val src = new Path("s3a://bucket/data/src")
val dest = new Path("s3a://bucket/data/dest")
val conf = sc.hadoopConfiguration   // assuming sc = spark context
val fs = src.getFileSystem(conf)
fs.rename(src, dest)

The way the S3A client fakes a rename is a copy + delete of every file, so the time it takes is proportional to the number of files and the amount of data. S3 also throttles you: if you try to do this in parallel, it can actually slow you down. Don't be surprised if it takes "a while".

You are also billed per COPY call, at $0.005 per 1,000 calls, so renaming roughly 1 million files works out to about $5. Test on a small directory until you are sure everything is working.
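
To connect this back to the original question ("rename all files based on their folder name"), a minimal sketch building on the rename above might look like the following. It assumes the intent is to prefix each file name with its parent folder's name while keeping the file in the same folder; the bucket/prefix is taken from the question, and the naming scheme in dest is only an illustration you should adapt. Like the answer's snippet, it is typed in, not tested against your data.

import org.apache.hadoop.fs.{FileSystem, Path}

// Prefix from the question; adjust to your own layout.
val root = new Path("s3a://trfsmallfffile/FinancialLineItem/MAIN")

val conf = sc.hadoopConfiguration          // assuming sc = spark context (e.g. in Zeppelin)
val fs   = root.getFileSystem(conf)        // instance method on Path, not a static

// Walk the tree and rename each file to <folderName>_<fileName> in place.
val files = fs.listFiles(root, true)       // recursive listing
while (files.hasNext) {
  val status = files.next()
  if (status.isFile) {
    val src        = status.getPath
    val folderName = src.getParent.getName
    val dest       = new Path(src.getParent, s"${folderName}_${src.getName}")
    if (!fs.exists(dest)) fs.rename(src, dest)   // each rename is copy + delete on S3
  }
}

Because each fs.rename is itself a copy + delete on S3, a serial loop over ~1 million files will be slow and billed per COPY as described above, so try it on a small subfolder first.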



Source: https://stackoverflow.com/questions/48200035/how-rename-s3-files-not-hdfs-in-spark-scala
