Question
I have approximately 1 million text files stored in S3. I want to rename all of the files based on their folder names.
How can I do that in Spark/Scala?
I am looking for some sample code.
I am using Zeppelin to run my Spark script.
Below is the code I have tried, as suggested in the answer:
import org.apache.hadoop.fs._
val src = new Path("s3://trfsmallfffile/FinancialLineItem/MAIN")
val dest = new Path("s3://trfsmallfffile/FinancialLineItem/MAIN/dest")
val conf = sc.hadoopConfiguration // assuming sc = spark context
val fs = Path.getFileSystem(conf)
fs.rename(src, dest)
But I am getting the error below:
<console>:110: error: value getFileSystem is not a member of object org.apache.hadoop.fs.Path
val fs = Path.getFileSystem(conf)
Answer 1:
You can use the normal Hadoop FileSystem APIs; note that getFileSystem() is an instance method on a Path, not a static method on the Path object, which is why the code in the question fails to compile. Something like (typed in, not tested):
val src = new Path("s3a://bucket/data/src")
val dest = new Path("s3a://bucket/data/dest")
val conf = sc.hadoopConfiguration // assuming sc = spark context
val fs = src.getFileSystem(conf)
fs.rename(src, dest)
The way the S3A client fakes a rename is a copy + delete of every file, so the time it takes is proportional to the number of files and the amount of data. S3 also throttles you: if you try to do this in parallel, it will potentially slow you down. Don't be surprised if it takes "a while".
You also get billed per COPY call, at $0.005 per 1,000 calls, so one million files works out to roughly $5 for the attempt. Test on a small directory until you are sure everything is working.
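For the original goal of renaming every file based on its parent folder's name, a minimal sketch along the same lines might look like the following. It uses only the standard Hadoop FileSystem API; the s3a://bucket/prefix path and the "folderName_fileName" naming scheme are assumptions to adapt, and the renames run sequentially because of the throttling mentioned above:

import org.apache.hadoop.fs._
import scala.collection.mutable.ArrayBuffer

val root = new Path("s3a://bucket/prefix")   // hypothetical; use your own bucket/prefix
val conf = sc.hadoopConfiguration            // sc = Spark context (available in Zeppelin)
val fs   = root.getFileSystem(conf)

// Collect all file paths first, then rename, so the listing is not
// being mutated while we iterate over it.
val files = ArrayBuffer[Path]()
val it = fs.listFiles(root, true)            // recursive listing
while (it.hasNext) files += it.next().getPath

// Assumed naming scheme: prefix each file name with its parent folder's name.
files.foreach { path =>
  val folder = path.getParent.getName
  val target = new Path(path.getParent, s"${folder}_${path.getName}")
  fs.rename(path, target)                    // on S3A this is a copy + delete per file
}

Checking the boolean returned by fs.rename (and logging failures) is worth adding before running this over a million files.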
Source: https://stackoverflow.com/questions/48200035/how-rename-s3-files-not-hdfs-in-spark-scala