rdd

Spark RDD mapping one row of data into multiple rows

Submitted by 吃可爱长大的小学妹 on 2019-12-22 10:59:58
Question: I have a text file with data that looks like this:

Type1 1 3 5 9
Type2 4 6 7 8
Type3 3 6 9 10 11 25

I'd like to transform it into an RDD with rows like this:

1 Type1
3 Type1
3 Type3
......

I started with a case class: case class MyData(uid: Int, gid: String). I'm new to Spark and Scala, and I can't seem to find an example that does this.

Answer 1: It seems you want something like this?

rdd.flatMap(line => {
  val splitLine = line.split(' ').toList
  splitLine match {
    case (gid: String) :: rest => rest.map((x: String) =
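A minimal runnable sketch of the same flatMap approach the answer starts to show, assuming the type label is always the first whitespace-separated token and using the case class from the question (the input path is hypothetical):

case class MyData(uid: Int, gid: String)

val rdd = sc.textFile("data.txt")          // hypothetical input path

val result = rdd.flatMap { line =>
  line.split("\\s+").toList match {
    case gid :: rest => rest.map(v => MyData(v.toInt, gid))
    case Nil         => Nil
  }
}

result.collect().foreach(println)          // MyData(1,Type1), MyData(3,Type1), ...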

Spark RDD partition by key in exclusive way

Submitted by 你说的曾经没有我的故事 on 2019-12-22 10:45:16
Question: I would like to partition an RDD by key so that each partition contains only values of a single key. For example, if I have 100 different key values and I call repartition(102), the RDD should have 2 empty partitions and 100 partitions each containing a single key's values. I tried groupByKey(k).repartition(102), but this does not guarantee that each partition holds only one key: I see some partitions containing values of more than one key, and more than 2 empty partitions. Is
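One common way to get this exclusivity (a sketch, not taken from an answer) is a custom Partitioner that maps every distinct key to its own partition index, assuming the set of keys is small enough to collect to the driver:

import org.apache.spark.Partitioner

// One partition per distinct key; any extra partitions simply stay empty.
class ExclusivePartitioner(keys: Array[String], extra: Int) extends Partitioner {
  private val index = keys.zipWithIndex.toMap
  override def numPartitions: Int = keys.length + extra
  override def getPartition(key: Any): Int = index(key.asInstanceOf[String])
}

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3), ("c", 4)))
val keys = pairs.keys.distinct().collect()
val partitioned = pairs.partitionBy(new ExclusivePartitioner(keys, extra = 2))
// partitioned.glom().collect() should show each non-empty partition holding exactly one key.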

Spark DataFrame column names not passed to slave nodes?

Submitted by 若如初见. on 2019-12-22 08:27:48
Question: I'm applying a function, let's say f(), via the map method to the rows of a DataFrame (call it df), but I see a NullPointerException when calling collect on the resulting RDD if df.columns is passed as an argument to f(). The following Scala code, which can be pasted into a spark-shell, shows a minimal example of the issue (see the function prepRDD_buggy()). I've also posted my current workaround in the function prepRDD(), where the only difference is that the column names are passed as a val
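A condensed sketch of the pattern the question describes (not the exact prepRDD_buggy/prepRDD functions, which are truncated above), with df standing in for the question's DataFrame:

// Buggy variant: df.columns is evaluated inside the task closure, so the whole
// DataFrame reference gets pulled into the serialized closure and fails on executors.
val buggy = df.rdd.map(row => df.columns.zip(row.toSeq).toMap)

// Workaround from the question: copy the column names into a local val first,
// so only a plain Array[String] is shipped to the executors.
val colNames = df.columns
val fixed = df.rdd.map(row => colNames.zip(row.toSeq).toMap)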

Convert JSON objects to RDD

Submitted by 元气小坏坏 on 2019-12-22 08:24:51
Question: I don't know if this question is a duplicate, but somehow none of the answers I came across seem to work for me (maybe I'm doing something wrong). I have a class defined as:

case class myRec(
  time: String,
  client_title: String,
  made_on_behalf: Double,
  country: String,
  email_address: String,
  phone: String)

and a sample JSON file that contains records or objects in the form [{...}{...}{...}...], i.e.

[{"time": "2015-05-01 02:25:47", "client_title": "Mr.", "made_on_behalf": 0, "country":
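A sketch of one standard route, assuming Spark 2.2+ (where the JSON reader has a multiLine option for files that are one big JSON array rather than JSON Lines) and a hypothetical file path; it reads the objects with the built-in JSON reader and maps them onto the question's case class:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("json-to-rdd").getOrCreate()
import spark.implicits._

// multiLine is needed because the file is one JSON array, not one object per line.
val recordsDF = spark.read.option("multiLine", true).json("records.json")  // hypothetical path

val recordsRDD = recordsDF.as[myRec].rdd    // RDD[myRec]
recordsRDD.take(3).foreach(println)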

How to use an RDD inside another RDD's map method?

Submitted by 旧巷老猫 on 2019-12-22 08:13:32
Question: I have an RDD named index: RDD[(String, String)], and I want to use index to process my file. This is the code:

val get = file.map({ x =>
  val tmp = index.lookup(x).head
  tmp
})

The problem is that I cannot use index inside the file.map function. When I run this program it gives me feedback like this:

14/12/11 16:22:27 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 602, spark2): scala.MatchError: null
org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:770)
com.ynu.App$$anonfun$12
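Nested RDD access like this cannot work: an RDD is a driver-side handle and cannot be used inside another RDD's tasks. A common fix (a sketch, assuming index is small enough to fit in driver memory) is to collect it into a Map and broadcast it:

// Collect the small lookup RDD to the driver once, then broadcast it to all executors.
val indexMap = sc.broadcast(index.collectAsMap())

val get = file.map { x =>
  indexMap.value(x)    // ordinary Map lookup inside the task; no nested RDD involved
}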

How to resolve scala.MatchError when creating a Data Frame

Submitted by |▌冷眼眸甩不掉的悲伤 on 2019-12-22 08:12:46
Question: I have a text file whose rows have a complex structure. I am using a custom converter that converts a given string (line) into a POJO class (countryInfo). After converting, I build a DataFrame. The POJO class has a field that is a List of a custom type (GlobalizedPlayTimeWindows). I created a StructType that matches GlobalizedPlayTimeWindows and am trying to convert the existing custom type to that struct, but I keep getting an error. The StructType I created:

import org.apache.spark.sql.types._
val PlayTimeWindow
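The usual cause of a scala.MatchError in this situation is handing Spark a custom class where the schema expects a struct: Catalyst can match Rows (or case classes), not arbitrary POJOs. A rough sketch of the Row-based fix, with simplified hypothetical fields standing in for GlobalizedPlayTimeWindows and countryInfo:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

// Hypothetical, simplified stand-ins for the real classes in the question.
case class PlayWindow(region: String, startHour: Int)
case class CountryInfo(country: String, windows: Seq[PlayWindow])

val playTimeWindow = StructType(Seq(
  StructField("region", StringType, nullable = true),
  StructField("startHour", IntegerType, nullable = false)))

val schema = StructType(Seq(
  StructField("country", StringType, nullable = true),
  StructField("windows", ArrayType(playTimeWindow), nullable = true)))

val spark = SparkSession.builder().appName("matcherror-sketch").getOrCreate()

val data = Seq(CountryInfo("US", Seq(PlayWindow("east", 9), PlayWindow("west", 12))))

// Convert every custom object into a Row that mirrors the StructType exactly.
val rows = spark.sparkContext.parallelize(data).map { c =>
  Row(c.country, c.windows.map(w => Row(w.region, w.startHour)))
}

val df = spark.createDataFrame(rows, schema)
df.show(truncate = false)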

How to convert RDD to DataFrame in Spark Streaming, not just Spark

Submitted by 时间秒杀一切 on 2019-12-22 06:35:06
Question: How can I convert an RDD to a DataFrame in Spark Streaming, not just Spark? I saw this example, but it requires a SparkContext:

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
rdd.toDF()

In my case I have a StreamingContext. Should I then create a SparkContext inside foreach? That looks too crazy... So, how do I deal with this issue? My final goal (in case it is useful) is to save the DataFrame to Amazon S3 using rdd.toDF.write.format("json").saveAsTextFile("s3://iiiii/ttttt.json");
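A sketch of the usual pattern, assuming Spark 1.x streaming as in the question: inside foreachRDD you already have an RDD, and its sparkContext can hand you a singleton SQLContext, so nothing new has to be created from the StreamingContext (the element type Record and the S3 path are hypothetical):

import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.dstream.DStream

case class Record(id: Int, value: String)        // hypothetical element type of the stream

def saveStream(stream: DStream[Record]): Unit = {
  stream.foreachRDD { rdd =>
    // Reuse (or lazily create) a singleton SQLContext from the RDD's own SparkContext.
    val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
    import sqlContext.implicits._

    if (!rdd.isEmpty()) {
      // Each micro-batch is appended as JSON; the S3 path here is hypothetical.
      rdd.toDF().write.mode("append").json("s3://bucket/output")
    }
  }
}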

A 5-Minute Illustrated Guide to 《Spark快速大数据分析》 (Learning Spark), Step 6: RDD Basic Concepts, Condensed Edition

Submitted by 末鹿安然 on 2019-12-22 05:11:41
Step 1: What is an RDD?
An RDD is simply a distributed collection of elements. As a data collection it feels much like Array, List, and other collections, except that it is a bit more complex: the data in the collection is spread across different machines.

Step 2: The RDD computation flow in plain language (Spark shell edition):
1. Open the Windows DOS command line (Start ---> Run ---> cmd).
2. Start the Spark shell. (The Spark shell is a typical built-in Spark driver program, a Driver Program; it can issue all kinds of RDD operations.)
3. The Spark shell creates a SparkContext object by default; we call it sc.
4. Type xxxxxx (code) and press Enter: the RDD performs a transformation (Transformation).
5. Type xxxxxx (code) and press Enter: the RDD performs an action (Action).
6. The screen displays the final computed data set, for example Array(xxx,xxx,xxx). (A concrete spark-shell session is sketched at the end of this section.)

Step 3: The Spark RDD computation flow diagram is shown below.
[Figure: Spark RDD computation flow diagram]

Step 4: Key points about RDDs (plain-language version):
1. During an RDD computation, the SparkContext object (sc) created by the Spark shell starts interacting with the following two components:
(1) the cluster manager (Cluster Manager); classic examples are YARN, Mesos, etc.
(2) the executors (Executor) inside the worker nodes (Worker Node).
[Figures: English version / Chinese version of the interaction diagram]
2. Lazy evaluation: we should not
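To make steps 4 to 6 of the plain-language flow concrete, here is a minimal spark-shell session (the input values are made up):

// Step 4: transformation — nothing is computed yet (lazy evaluation).
val lines = sc.parallelize(Seq("spark", "rdd", "spark"))
val upper = lines.map(_.toUpperCase)

// Step 5: action — this is what actually triggers the computation on the cluster.
val result = upper.collect()

// Step 6: the final collected data set is shown, e.g. Array(SPARK, RDD, SPARK)
println(result.mkString("Array(", ", ", ")"))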

Spark:executor.CoarseGrainedExecutorBackend: Driver Disassociated disassociated

Submitted by 冷暖自知 on 2019-12-22 00:48:44
Question: I am learning how to use Spark and I have a simple program. When I run the jar file it gives me the right result, but there are some errors in the stderr file, like these:

15/05/18 18:19:52 ERROR executor.CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@localhost:51976] -> [akka.tcp://sparkDriver@172.31.34.148:60060] disassociated! Shutting down.
15/05/18 18:19:52 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@172.31.34

How to specify only particular fields using read.schema in JSON: Spark Scala

Submitted by Deadly on 2019-12-21 23:32:24
Question: I am trying to programmatically enforce a schema (JSON) on a textFile that looks like JSON. I tried jsonFile, but the issue is that to create a DataFrame from a list of JSON files, Spark has to do one pass over the data to infer a schema for the DataFrame. So it needs to parse all the data, which takes a long time (4 hours, since my data is zipped and terabytes in size). So I want to try reading it as a textFile and enforce a schema to pick out only the fields of interest, to later query the resulting
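A sketch of the explicit-schema approach the title refers to (the field names and path here are hypothetical): building a StructType with only the fields of interest and passing it to the reader skips the expensive schema-inference pass, and fields not listed in the schema are simply dropped:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("read-schema-sketch").getOrCreate()

// Only the fields we care about; everything else in the JSON is ignored.
val schema = StructType(Seq(
  StructField("time", StringType, nullable = true),
  StructField("country", StringType, nullable = true)))

// No inference pass over terabytes of data: the reader trusts this schema.
val df = spark.read.schema(schema).json("s3://bucket/data/*.json.gz")  // hypothetical path

df.select("time", "country").show()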