rdd

Spark RDD mapping one row of data into multiple rows

Submitted by 吃可爱长大的小学妹 on 2019-12-22 10:59:58
Question: I have a text file with data that looks like this:

Type1 1 3 5 9
Type2 4 6 7 8
Type3 3 6 9 10 11 25

I'd like to transform it into an RDD with rows like this:

1 Type1
3 Type1
3 Type3
......

I started with a case class: case class MyData(uid: Int, gid: String). I'm new to Spark and Scala, and I can't seem to find an example that does this.

Answer 1: It seems you want something like this?

rdd.flatMap(line => {
  val splitLine = line.split(' ').toList
  splitLine match {
    case (gid: String) :: rest => rest.map((x: String) =
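A minimal runnable sketch of the same flatMap approach the answer starts to show, assuming the type label is always the first whitespace-separated token and using the case class from the question (the input path is hypothetical):

case class MyData(uid: Int, gid: String)

val rdd = sc.textFile("data.txt")          // hypothetical input path

val result = rdd.flatMap { line =>
  line.split("\\s+").toList match {
    case gid :: rest => rest.map(v => MyData(v.toInt, gid))
    case Nil         => Nil
  }
}

result.collect().foreach(println)          // MyData(1,Type1), MyData(3,Type1), ...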

Spark RDD partition by key in exclusive way

Submitted by 你说的曾经没有我的故事 on 2019-12-22 10:45:16
Question: I would like to partition an RDD by key so that each partition contains only values of a single key. For example, if I have 100 different key values and I call repartition(102), the RDD should have 2 empty partitions and 100 partitions each containing a single key's values. I tried groupByKey(k).repartition(102), but this does not guarantee that each partition holds only one key: I see some partitions containing values of more than one key, and more than 2 empty partitions. Is
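One common way to get this exclusivity (a sketch, not taken from an answer) is a custom Partitioner that maps every distinct key to its own partition index, assuming the set of keys is small enough to collect to the driver:

import org.apache.spark.Partitioner

// One partition per distinct key; any extra partitions simply stay empty.
class ExclusivePartitioner(keys: Array[String], extra: Int) extends Partitioner {
  private val index = keys.zipWithIndex.toMap
  override def numPartitions: Int = keys.length + extra
  override def getPartition(key: Any): Int = index(key.asInstanceOf[String])
}

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3), ("c", 4)))
val keys = pairs.keys.distinct().collect()
val partitioned = pairs.partitionBy(new ExclusivePartitioner(keys, extra = 2))
// partitioned.glom().collect() should show each non-empty partition holding exactly one key.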

Spark DataFrame column names not passed to slave nodes?

Submitted by 若如初见. on 2019-12-22 08:27:48
Question: I'm applying a function, let's say f(), via the map method to the rows of a DataFrame (call it df), but I see a NullPointerException when calling collect on the resulting RDD if df.columns is passed as an argument to f(). The following Scala code, which can be pasted into a spark-shell, shows a minimal example of the issue (see the function prepRDD_buggy()). I've also posted my current workaround in the function prepRDD(), where the only difference is that the column names are passed as a val
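A condensed sketch of the pattern the question describes (not the exact prepRDD_buggy/prepRDD functions, which are truncated above), with df standing in for the question's DataFrame:

// Buggy variant: df.columns is evaluated inside the task closure, so the whole
// DataFrame reference gets pulled into the serialized closure and fails on executors.
val buggy = df.rdd.map(row => df.columns.zip(row.toSeq).toMap)

// Workaround from the question: copy the column names into a local val first,
// so only a plain Array[String] is shipped to the executors.
val colNames = df.columns
val fixed = df.rdd.map(row => colNames.zip(row.toSeq).toMap)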

Convert JSON objects to RDD

Submitted by 元气小坏坏 on 2019-12-22 08:24:51
Question: I don't know if this question is a duplicate, but somehow none of the answers I came across seem to work for me (maybe I'm doing something wrong). I have a class defined as:

case class myRec(
  time: String,
  client_title: String,
  made_on_behalf: Double,
  country: String,
  email_address: String,
  phone: String)

and a sample JSON file that contains records or objects in the form [{...}{...}{...}...], i.e.

[{"time": "2015-05-01 02:25:47", "client_title": "Mr.", "made_on_behalf": 0, "country":
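A sketch of one standard route, assuming Spark 2.2+ (where the JSON reader has a multiLine option for files that are one big JSON array rather than JSON Lines) and a hypothetical file path; it reads the objects with the built-in JSON reader and maps them onto the question's case class:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("json-to-rdd").getOrCreate()
import spark.implicits._

// multiLine is needed because the file is one JSON array, not one object per line.
val recordsDF = spark.read.option("multiLine", true).json("records.json")  // hypothetical path

val recordsRDD = recordsDF.as[myRec].rdd    // RDD[myRec]
recordsRDD.take(3).foreach(println)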

How to use an RDD inside another RDD's map method?

Submitted by 旧巷老猫 on 2019-12-22 08:13:32
Question: I have an RDD named index: RDD[(String, String)], and I want to use index to process my file. This is the code:

val get = file.map({ x =>
  val tmp = index.lookup(x).head
  tmp
})

The problem is that I cannot use index inside the file.map function. When I run this program it gives me feedback like this:

14/12/11 16:22:27 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 602, spark2): scala.MatchError: null
org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:770)
com.ynu.App$$anonfun$12
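Nested RDD access like this cannot work: an RDD is a driver-side handle and cannot be used inside another RDD's tasks. A common fix (a sketch, assuming index is small enough to fit in driver memory) is to collect it into a Map and broadcast it:

// Collect the small lookup RDD to the driver once, then broadcast it to all executors.
val indexMap = sc.broadcast(index.collectAsMap())

val get = file.map { x =>
  indexMap.value(x)    // ordinary Map lookup inside the task; no nested RDD involved
}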

How to resolve scala.MatchError when creating a Data Frame

Submitted by |▌冷眼眸甩不掉的悲伤 on 2019-12-22 08:12:46
Question: I have a text file whose rows have a complex structure. I am using a custom converter that converts a given string (line) into a POJO class (countryInfo). After converting, I build a DataFrame. The POJO class has a field that is a List of a custom type (GlobalizedPlayTimeWindows). I created a StructType that matches GlobalizedPlayTimeWindows and am trying to convert the existing custom type to that struct, but I keep getting an error. The StructType I created:

import org.apache.spark.sql.types._
val PlayTimeWindow
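The usual cause of a scala.MatchError in this situation is handing Spark a custom class where the schema expects a struct: Catalyst can match Rows (or case classes), not arbitrary POJOs. A rough sketch of the Row-based fix, with simplified hypothetical fields standing in for GlobalizedPlayTimeWindows and countryInfo:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

// Hypothetical, simplified stand-ins for the real classes in the question.
case class PlayWindow(region: String, startHour: Int)
case class CountryInfo(country: String, windows: Seq[PlayWindow])

val playTimeWindow = StructType(Seq(
  StructField("region", StringType, nullable = true),
  StructField("startHour", IntegerType, nullable = false)))

val schema = StructType(Seq(
  StructField("country", StringType, nullable = true),
  StructField("windows", ArrayType(playTimeWindow), nullable = true)))

val spark = SparkSession.builder().appName("matcherror-sketch").getOrCreate()

val data = Seq(CountryInfo("US", Seq(PlayWindow("east", 9), PlayWindow("west", 12))))

// Convert every custom object into a Row that mirrors the StructType exactly.
val rows = spark.sparkContext.parallelize(data).map { c =>
  Row(c.country, c.windows.map(w => Row(w.region, w.startHour)))
}

val df = spark.createDataFrame(rows, schema)
df.show(truncate = false)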

How to convert RDD to DataFrame in Spark Streaming, not just Spark

Submitted by 时间秒杀一切 on 2019-12-22 06:35:06
Question: How can I convert an RDD to a DataFrame in Spark Streaming, not just Spark? I saw this example, but it requires a SparkContext:

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
rdd.toDF()

In my case I have a StreamingContext. Should I then create a SparkContext inside foreach? That looks too crazy... So, how do I deal with this issue? My final goal (in case it is useful) is to save the DataFrame to Amazon S3 using rdd.toDF.write.format("json").saveAsTextFile("s3://iiiii/ttttt.json");
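A sketch of the usual pattern, assuming Spark 1.x streaming as in the question: inside foreachRDD you already have an RDD, and its sparkContext can hand you a singleton SQLContext, so nothing new has to be created from the StreamingContext (the element type Record and the S3 path are hypothetical):

import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.dstream.DStream

case class Record(id: Int, value: String)        // hypothetical element type of the stream

def saveStream(stream: DStream[Record]): Unit = {
  stream.foreachRDD { rdd =>
    // Reuse (or lazily create) a singleton SQLContext from the RDD's own SparkContext.
    val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
    import sqlContext.implicits._

    if (!rdd.isEmpty()) {
      // Each micro-batch is appended as JSON; the S3 path here is hypothetical.
      rdd.toDF().write.mode("append").json("s3://bucket/output")
    }
  }
}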

A 5-Minute Illustrated Guide to 《Spark快速大数据分析》 (Learning Spark), Step 6: RDD Basic Concepts, Condensed Edition

Submitted by 末鹿安然 on 2019-12-22 05:11:41
Step 1: What is an RDD?
An RDD is simply a distributed collection of elements. As a data collection it feels much like Array, List, and other collections, except that it is a bit more complex: the data in the collection is spread across different machines.

Step 2: The RDD computation flow in plain language (Spark shell edition):
1. Open the Windows DOS command line (Start ---> Run ---> cmd).
2. Start the Spark shell. (The Spark shell is a typical built-in Spark driver program, a Driver Program; it can issue all kinds of RDD operations.)
3. The Spark shell creates a SparkContext object by default; we call it sc.
4. Type xxxxxx (code) and press Enter: the RDD performs a transformation (Transformation).
5. Type xxxxxx (code) and press Enter: the RDD performs an action (Action).
6. The screen displays the final computed data set, for example Array(xxx,xxx,xxx). (A concrete spark-shell session is sketched at the end of this section.)

Step 3: The Spark RDD computation flow diagram is shown below.
[Figure: Spark RDD computation flow diagram]

Step 4: Key points about RDDs (plain-language version):
1. During an RDD computation, the SparkContext object (sc) created by the Spark shell starts interacting with the following two components:
(1) the cluster manager (Cluster Manager); classic examples are YARN, Mesos, etc.
(2) the executors (Executor) inside the worker nodes (Worker Node).
[Figures: English version / Chinese version of the interaction diagram]
2. Lazy evaluation: we should not
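To make steps 4 to 6 of the plain-language flow concrete, here is a minimal spark-shell session (the input values are made up):

// Step 4: transformation — nothing is computed yet (lazy evaluation).
val lines = sc.parallelize(Seq("spark", "rdd", "spark"))
val upper = lines.map(_.toUpperCase)

// Step 5: action — this is what actually triggers the computation on the cluster.
val result = upper.collect()

// Step 6: the final collected data set is shown, e.g. Array(SPARK, RDD, SPARK)
println(result.mkString("Array(", ", ", ")"))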

Spark:executor.CoarseGrainedExecutorBackend: Driver Disassociated disassociated

Submitted by 冷暖自知 on 2019-12-22 00:48:44
Question: I am learning how to use Spark and I have a simple program. When I run the jar file it gives me the right result, but there are some errors in the stderr file, like these:

15/05/18 18:19:52 ERROR executor.CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@localhost:51976] -> [akka.tcp://sparkDriver@172.31.34.148:60060] disassociated! Shutting down.
15/05/18 18:19:52 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@172.31.34

How to specify only particular fields using read.schema in JSON: Spark Scala

Submitted by Deadly on 2019-12-21 23:32:24
Question: I am trying to programmatically enforce a schema (JSON) on a textFile that looks like JSON. I tried jsonFile, but the issue is that to create a DataFrame from a list of JSON files, Spark has to do one pass over the data to infer a schema for the DataFrame. So it needs to parse all the data, which takes a long time (4 hours, since my data is zipped and terabytes in size). So I want to try reading it as a textFile and enforce a schema to pick out only the fields of interest, to later query the resulting
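A sketch of the explicit-schema approach the title refers to (the field names and path here are hypothetical): building a StructType with only the fields of interest and passing it to the reader skips the expensive schema-inference pass, and fields not listed in the schema are simply dropped:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("read-schema-sketch").getOrCreate()

// Only the fields we care about; everything else in the JSON is ignored.
val schema = StructType(Seq(
  StructField("time", StringType, nullable = true),
  StructField("country", StringType, nullable = true)))

// No inference pass over terabytes of data: the reader trusts this schema.
val df = spark.read.schema(schema).json("s3://bucket/data/*.json.gz")  // hypothetical path

df.select("time", "country").show()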