Spark: Reading files using different delimiter than new line

醉酒成梦 2020-11-28 08:01

I'm using Apache Spark 1.0.1. I have many files delimited with the UTF-8 character \u0001 rather than the usual newline \n. How can I read such files in Spark?

5 Answers
  •  予麋鹿 (OP)
     2020-11-28 08:27

    You can use textinputformat.record.delimiter to set the record delimiter for TextInputFormat, e.g.:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // Override the record delimiter on a copy of the existing Hadoop configuration
    // (sc is the SparkContext, e.g. the one provided by spark-shell).
    val conf = new Configuration(sc.hadoopConfiguration)
    conf.set("textinputformat.record.delimiter", "X")
    val input = sc.newAPIHadoopFile("file_path", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
    // Hadoop reuses the Text object across records, so copy it out with toString.
    val lines = input.map { case (_, text) => text.toString }
    println(lines.collect.mkString("Array(", ", ", ")"))
    

    For example, if the input is a file containing the single line aXbXcXd, the above code will output

    Array(a, b, c, d)
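
    To handle the \u0001 delimiter from the question, the same approach applies with the delimiter string swapped in. A minimal sketch, assuming a spark-shell session (so sc is in scope) and a hypothetical input path:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    val conf = new Configuration(sc.hadoopConfiguration)
    // \u0001 (Ctrl-A) as the record separator; Scala resolves the escape inside the string literal.
    conf.set("textinputformat.record.delimiter", "\u0001")

    val records = sc.newAPIHadoopFile(
      "hdfs:///path/to/input",  // hypothetical path
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text],
      conf
    ).map { case (_, text) => text.toString }

    // Spot-check a few records without collecting everything to the driver.
    records.take(5).foreach(println)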
    
