How to process multi-line input records in Spark

盖世英雄少女心 2020-11-28 12:00

I have each record spread across multiple lines in the input file (it is a very large file).

Example:

Id:   2
ASIN: 0738700123
  title: Test tile for this product
2 Answers
  •  予麋鹿 2020-11-28 12:47

    If the multi-line data has a defined record separator, you can use Hadoop's support for multi-line records, providing the separator through a Hadoop Configuration object:

    Something like this should do:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    val conf = new Configuration
    // The delimiter must match the input's casing exactly ("Id:" in the example above).
    conf.set("textinputformat.record.delimiter", "Id:")
    val dataset = sc.newAPIHadoopFile("/path/to/data",
      classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
    // Keep only the record text (x._2), dropping the byte-offset keys (x._1).
    val data = dataset.map(x => x._2.toString)
    

    This will give you an RDD[String] in which each element corresponds to one record. Afterwards, you need to parse each record according to your application's requirements.
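
    For instance, here is a minimal parsing sketch, assuming each field sits on its own "key: value" line as in the example record. The field names and the variable name records are only illustrative; adapt the logic to your actual schema:

    // Because "Id:" is the record delimiter, it is stripped from each record,
    // so the first line of a record holds only the id value itself.
    val records = data
      .map(_.trim)
      .filter(_.nonEmpty)            // the split before the first "Id:" is empty
      .map { record =>
        val lines  = record.split("\n").map(_.trim)
        val id     = lines.head      // e.g. "2"
        val fields = lines.tail
          .filter(_.contains(":"))   // drops continuation lines; merge them first if values wrap
          .map { line =>
            val Array(k, v) = line.split(":", 2)   // split on the first ':' only
            k.trim -> v.trim
          }
          .toMap
        fields + ("Id" -> id)
      }
    // records: RDD[Map[String, String]], e.g.
    // Map("Id" -> "2", "ASIN" -> "0738700123", "title" -> "Test tile for this product")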
