Spark: Reading files using a delimiter other than newline

醉酒成梦 2020-11-28 08:01

I'm using Apache Spark 1.0.1. I have many files delimited with UTF-8 \u0001 rather than the usual newline \n. How can I read such files in Spark?

5 Answers
  •  鱼传尺愫
    2020-11-28 08:15

    In Python, this can be achieved using:

    # Use the new Hadoop API so the record delimiter can be overridden
    # via textinputformat.record.delimiter (the default is "\n").
    rdd = sc.newAPIHadoopFile(
        YOUR_FILE,
        "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text",
        conf={"textinputformat.record.delimiter": YOUR_DELIMITER},
    ).map(lambda pair: pair[1])  # keep the text value, drop the byte-offset key
    
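    As a quick sanity check outside of Spark, you can verify how a \u0001-delimited file splits into records with plain Python. This is a minimal sketch (the `split_records` helper and the sample string are hypothetical, not part of the Spark or Hadoop API), mimicking how the Hadoop text reader treats each delimiter-terminated chunk as one record:

    ```python
    # Plain-Python sanity check, no Spark required: split a \u0001-delimited
    # payload into records, the way the Hadoop reader would with
    # textinputformat.record.delimiter set to "\u0001".
    DELIMITER = "\u0001"

    def split_records(raw: str) -> list:
        """Split raw text on the custom delimiter, dropping a trailing empty record."""
        records = raw.split(DELIMITER)
        # A file ending in the delimiter yields a trailing empty string; discard it.
        if records and records[-1] == "":
            records.pop()
        return records

    sample = "first record\u0001second record\u0001third record\u0001"
    print(split_records(sample))  # ['first record', 'second record', 'third record']
    ```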
