Spark: Reading files using a different delimiter than newline

醉酒成梦 · 2020-11-28 08:01

I'm using Apache Spark 1.0.1. I have many files delimited with UTF-8 \u0001 and not with the usual newline \n. How can I read such files in Spark?

5 Answers
  •  感情败类 · 2020-11-28 08:07

    In the Spark shell, I extracted the data following the approach described in "Setting textinputformat.record.delimiter in spark":

    $ spark-shell
    ...
    scala> import org.apache.hadoop.io.LongWritable
    import org.apache.hadoop.io.LongWritable
    
    scala> import org.apache.hadoop.io.Text
    import org.apache.hadoop.io.Text
    
    scala> import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.conf.Configuration
    
    scala> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    
    scala> val conf = new Configuration
    conf: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml
    
    scala> conf.set("textinputformat.record.delimiter", "\u0001")
    
    scala> val data = sc.newAPIHadoopFile("mydata.txt", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf).map(_._2.toString)
    data: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at map at <console>:19
    

    sc.newAPIHadoopFile("mydata.txt", ...) returns an RDD[(LongWritable, Text)], where the key of each element is the byte offset at which the record starts in the file and the value is the record's text, delimited by "\u0001". The trailing .map(_._2.toString) discards the offsets, leaving an RDD[String] of the records themselves.
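
    Outside the shell, the same approach can be packaged into a standalone application. The following is a minimal sketch, assuming a file named mydata.txt on the default filesystem; the object name CustomDelimiterApp is hypothetical:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    object CustomDelimiterApp {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("custom-delimiter"))

        // Copy the context's Hadoop configuration and override the record
        // delimiter so TextInputFormat splits on \u0001 instead of \n.
        val conf = new Configuration(sc.hadoopConfiguration)
        conf.set("textinputformat.record.delimiter", "\u0001")

        val records = sc
          .newAPIHadoopFile("mydata.txt", classOf[TextInputFormat],
            classOf[LongWritable], classOf[Text], conf)
          .map { case (_, text) => text.toString } // keep only the record text

        // Print a few records to verify the delimiter was applied.
        records.take(5).foreach(println)
        sc.stop()
      }
    }

    Copying sc.hadoopConfiguration instead of mutating it in place keeps the \u0001 delimiter from leaking into other reads performed through the same context.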
