How to read a file from HDFS in map() quickly with Spark

Asked by 无人及你 on 2020-12-18 11:37

I need to read a different file in every map(); the files are in HDFS:

  val rdd = sc.parallelize(1 to 10000)
  val rdd2 = rdd.map { x =>
    val hdfs = org.apache.
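
Roughly, this is what I am trying to do (a sketch; the hdfs://ITS-Hadoop10:9000 URI and the per-element file names are placeholders). Opening the HDFS client once per partition with mapPartitions, instead of once per element inside map, keeps it from being slow:

  import java.net.URI
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}
  import scala.io.Source

  val rdd = sc.parallelize(1 to 10000)
  val rdd2 = rdd.mapPartitions { iter =>
    // Create the HDFS client once per partition, since FileSystem.get is expensive
    val hdfs = FileSystem.get(new URI("hdfs://ITS-Hadoop10:9000/"), new Configuration())
    iter.map { x =>
      val in = hdfs.open(new Path(s"/data/file-$x.txt")) // placeholder path
      try Source.fromInputStream(in).mkString
      finally in.close()
    }
  }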


        
1 Answer
  • Answered 2020-12-18 12:04

    In your case, I recommend the wholeTextFiles method, which returns a pair RDD where the key is each file's full path and the value is the file's contents as a string.

    val filesPairRDD = sc.wholeTextFiles("hdfs://ITS-Hadoop10:9000/")
    // Pair each file's full path with its number of lines; any other
    // function could be applied to the file contents instead.
    val filesLineCount = filesPairRDD.map( x => (x._1, x._2.split("\n").length) )
    filesLineCount.collect()
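
    Since you need a different file per element of your original RDD, one option (just a sketch, assuming all the files together fit in memory, and using a hypothetical element-to-path mapping) is to collect the pair RDD as a map and broadcast it, so each map() task can look up the file it needs:

    // Assumes the combined file contents fit in driver/executor memory
    val byPath = sc.broadcast(filesPairRDD.collectAsMap())
    val rdd2 = rdd.map { x =>
      // hypothetical mapping from element x to a file path
      byPath.value(s"hdfs://ITS-Hadoop10:9000/data/file-$x.txt")
    }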
    

    Edit

    If your files are in subdirectories of the same parent directory (as mentioned in the comments), you can use a glob pattern:

    val filesPairRDD = sc.wholeTextFiles("hdfs://ITS-Hadoop10:9000/*/")
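
    The glob can also be more specific. For example (hypothetical layout), to read only .txt files one level down and hint at the number of partitions:

    // minPartitions is an optional hint for how many partitions to create
    val txtFiles = sc.wholeTextFiles("hdfs://ITS-Hadoop10:9000/*/*.txt", minPartitions = 8)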
    

    Hope this is clear and helpful.
