Spark sorting of delimited data

Submitted by 纵饮孤独 on 2019-12-24 04:33:53

Question


I am new to Spark. Can you tell me what is wrong with the code below?

val rawData="""USA | E001 | ABC DE | 19850607 | IT | $100
UK | E005 | CHAN CL | 19870512 | OP | $200
USA | E003 | XYZ AB | 19890101 | IT | $250
USA | E002 | XYZ AB | 19890705 | IT | $200"""
val sc = ...     
val data= rawData.split("\n")
val rdd= sc.parallelize(data)
val data1=rdd.flatMap(line=> line.split(" | "))
val data2 = data1.map(arr => (arr(2), arr.mkString(""))).sortByKey(false)
data2.saveAsTextFile("./sample_data1_output")

Here, .sortByKey(false) does not compile; the compiler gives me this error:

[error] /home/admin/scala/spark-poc/src/main/scala/SparkApp.scala:26: value sortByKey is not a member of org.apache.spark.rdd.RDD[(String, String)]
[error] val data2 = data1.map(arr => (arr(2), arr.mkString(""))).sortByKey(false) 

The question is: how do I get a MappedRDD? Or, on what object should I call sortByKey()?


Answer 1:


Spark provides additional operations, such as sortByKey(), on RDDs of key-value pairs. These operations are available through a class called PairRDDFunctions, and Spark uses implicit conversions to perform the RDD -> PairRDDFunctions wrapping automatically.
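To see how that wrapping works, here is a minimal, self-contained sketch of the same "enrichment" pattern in plain Scala (no Spark needed). All names here (MyRDD, MyPairRDDFunctions) are hypothetical stand-ins: the extra operation is defined on a wrapper class, and an implicit conversion applies the wrapper whenever the element type is a pair, just as Spark's import brings its RDD -> PairRDDFunctions conversion into scope.

```scala
import scala.language.implicitConversions

object ImplicitDemo {
  // Hypothetical stand-in for Spark's RDD: sortByKey is NOT defined here.
  class MyRDD[T](val data: Seq[T])

  // Stand-in for PairRDDFunctions: extra operations for pair RDDs only.
  class MyPairRDDFunctions[K: Ordering, V](self: MyRDD[(K, V)]) {
    def sortByKey(ascending: Boolean = true): MyRDD[(K, V)] = {
      val ord    = implicitly[Ordering[K]]
      val sorted = self.data.sortBy(_._1)(if (ascending) ord else ord.reverse)
      new MyRDD(sorted)
    }
  }

  object MyRDD {
    // The implicit conversion. In Spark this lives in SparkContext's
    // companion, which is why `import org.apache.spark.SparkContext._`
    // makes sortByKey() compile.
    implicit def toPairFunctions[K: Ordering, V](
        rdd: MyRDD[(K, V)]): MyPairRDDFunctions[K, V] =
      new MyPairRDDFunctions(rdd)
  }

  def demo(): Seq[(String, Int)] = {
    val pairs = new MyRDD(Seq(("b", 2), ("a", 1), ("c", 3)))
    // sortByKey is not a member of MyRDD, but the implicit conversion
    // wraps `pairs` in MyPairRDDFunctions, so this compiles.
    pairs.sortByKey(ascending = false).data
  }
}
```

Calling ImplicitDemo.demo() returns the pairs sorted by key in descending order. Without the implicit conversion in scope, the call to sortByKey fails with exactly the kind of "value sortByKey is not a member of ..." error shown in the question.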

To import the implicit conversions, add the following lines to the top of your program:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

This is discussed in the Spark programming guide's section on Working with key-value pairs.
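Putting it together, below is a sketch of a corrected program, assuming spark-core is on the classpath and a local[*] master for a quick run. Two further issues in the posted code, separate from the compile error, are also worth fixing: String.split takes a regular expression, so split(" | ") actually splits on every single space (| is regex alternation, and the pipe must be escaped); and flatMap flattens the per-line field arrays into one RDD of individual strings, where map is what is wanted.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // RDD -> PairRDDFunctions implicits

object SparkApp {
  // Split a record on the literal "|" delimiter and trim each field.
  // Note the escaped pipe: "\\|", since split() takes a regex.
  def parseLine(line: String): Array[String] =
    line.split("\\|").map(_.trim)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sort-demo").setMaster("local[*]"))

    val rawData = """USA | E001 | ABC DE | 19850607 | IT | $100
UK | E005 | CHAN CL | 19870512 | OP | $200
USA | E003 | XYZ AB | 19890101 | IT | $250
USA | E002 | XYZ AB | 19890705 | IT | $200"""

    val rdd = sc.parallelize(rawData.split("\n"))

    // map, not flatMap: one Array of fields per record.
    val pairs = rdd.map(parseLine).map(arr => (arr(2), arr.mkString(" | ")))

    // Compiles because the imported implicit wraps the pair RDD
    // in PairRDDFunctions; false = descending order by name.
    val sorted = pairs.sortByKey(ascending = false)
    sorted.saveAsTextFile("./sample_data1_output")
    sc.stop()
  }
}
```

With the corrected delimiter handling, arr(2) is the name field ("ABC DE", "CHAN CL", ...), so the output is the records sorted by employee name in descending order.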



Source: https://stackoverflow.com/questions/24685162/spark-sorting-of-delimited-data
