How can one list all csv files in an HDFS location within the Spark Scala shell?

温柔的废话 2021-01-05 06:27

The purpose of this is to manipulate and save a copy of each data file in a second location in HDFS. I will be using

RddName.coalesce(1).saveAsTextFile(pathName)

to save the result into HDFS.


        
3 Answers
  •  日久生厌
    2021-01-05 07:02

    This is what ultimately worked for me:

    import org.apache.hadoop.fs._
    import org.apache.spark.deploy.SparkHadoopUtil
    import java.net.URI
    
    val hdfs_conf = SparkHadoopUtil.get.newConfiguration(sc.getConf)
    val hdfs = FileSystem.get(hdfs_conf)
    // source data in HDFS
    val sourcePath = new Path("//")
    
    hdfs.globStatus( sourcePath ).foreach{ fileStatus =>
       val filePathName = fileStatus.getPath().toString()
       val fileName = fileStatus.getPath().getName()
    
       // < DO STUFF HERE>
    
    } // end foreach loop
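
    If the end goal is a copy of each matched file in a second HDFS location (as described in the question), the loop body could be filled in roughly as below. This is a minimal sketch: targetDir is a placeholder, and saveAsTextFile writes a directory (holding a single part file after coalesce(1)) per input file.

    // hypothetical target directory in HDFS
    val targetDir = "/<target_location>"
    
    hdfs.globStatus( sourcePath ).foreach{ fileStatus =>
       val filePathName = fileStatus.getPath().toString()
       val fileName = fileStatus.getPath().getName()
    
       // read the file, transform as needed, then write a single-part copy
       val fileRdd = sc.textFile(filePathName)
       fileRdd.coalesce(1).saveAsTextFile(targetDir + "/" + fileName)
    
    } // end foreach loop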
    
