rdd

Why does RDD.foreach fail with “SparkException: This RDD lacks a SparkContext”?

馋奶兔 posted on 2019-11-29 12:26:24
I have a dataset (as an RDD) that I divide into 4 RDDs by using different filter operators.

val RSet = datasetRdd.flatMap(x => RSetForAttr(x, alLevel, hieDict)).map(x => (x, 1)).reduceByKey((x, y) => x + y)
val Rp: RDD[(String, Int)] = RSet.filter(x => x._1.split(",")(0).equals("Rp"))
val Rc: RDD[(String, Int)] = RSet.filter(x => x._1.split(",")(0).equals("Rc"))
val RpSv: RDD[(String, Int)] = RSet.filter(x => x._1.split(",")(0).equals("RpSv"))
val RcSv: RDD[(String, Int)] = RSet.filter(x => x._1.split(",")(0).equals("RcSv"))

I send Rp and RpSv to the following function calculateEntropy: def
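The usual cause of this exception is referencing one RDD inside a transformation or foreach on another RDD: the closure runs on executors, where the deserialized RDD has no SparkContext. Assuming calculateEntropy touches another of the four RDDs inside such a closure, here is a minimal Scala sketch of the failing pattern and two common workarounds (broadcast the small side, or rewrite the logic as a join); the data and names below are made up, not the asker's code.

import org.apache.spark.sql.SparkSession

object NestedRddSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("nested-rdd-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val rp   = sc.parallelize(Seq(("Rp,a", 3), ("Rp,b", 5)))
    val rcSv = sc.parallelize(Seq(("RcSv,a", 2), ("RcSv,b", 4)))

    // FAILS: rcSv is captured inside a transformation of rp; on the executor
    // the deserialized RDD has no SparkContext, hence the exception.
    // rp.foreach { x => println(rcSv.count()) }

    // Workaround 1: materialise the small side on the driver and broadcast it.
    val rcSvMap = sc.broadcast(rcSv.collectAsMap())
    rp.foreach { case (k, v) => println(s"$k -> ${rcSvMap.value.size}") }

    // Workaround 2: if both sides are large, express the logic as a join instead.
    val joined = rp.map { case (k, v) => (k.split(",")(1), v) }
      .join(rcSv.map { case (k, v) => (k.split(",")(1), v) })
    joined.collect().foreach(println)

    spark.stop()
  }
}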

Apache Spark: dealing with case statements

大城市里の小女人 posted on 2019-11-29 12:07:29
Question: I am transforming SQL code to PySpark code and came across some SQL CASE statements. I don't know how to approach CASE statements in PySpark. I am planning on creating an RDD, then using rdd.map and doing some logic checks. Is that the right approach? Please help! Basically I need to go through each line in the RDD or DataFrame and, based on some logic, edit one of the column values.

case when (e."a" Like 'a%' Or e."b" Like 'b%') And e."aa"='BW' And cast(e."abc" as decimal(10,4))
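A common way to express SQL CASE logic without dropping down to rdd.map is the when/otherwise functions on DataFrame columns; PySpark mirrors this with pyspark.sql.functions.when. Below is a minimal Scala sketch under the assumption that e is a DataFrame with string columns a, b, aa, abc (column names taken from the snippet; the sample data, the threshold and the "category" output column are made up).

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

object CaseWhenSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("case-when-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    val e = Seq(("a1", "x", "BW", "12.5"), ("z1", "b2", "BW", "0.0"))
      .toDF("a", "b", "aa", "abc")

    // Rough equivalent of:
    // CASE WHEN (a LIKE 'a%' OR b LIKE 'b%') AND aa = 'BW' AND cast(abc as decimal(10,4)) > 1
    //      THEN 'matched' ELSE 'unmatched' END
    val result = e.withColumn(
      "category",
      when((col("a").like("a%") || col("b").like("b%")) &&
           col("aa") === "BW" &&
           col("abc").cast("decimal(10,4)") > 1.0, "matched")
        .otherwise("unmatched"))

    result.show()
    spark.stop()
  }
}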

How to partition Spark RDD when importing Postgres using JDBC?

不想你离开。 posted on 2019-11-29 11:24:24
I am importing a Postgres database into Spark. I know that I can partition on import, but that requires a numeric column (I don't want to use the value column because it's all over the place and doesn't maintain order):

df = spark.read.format('jdbc').options(url=url, dbtable='tableName', properties=properties).load()
df.printSchema()

root
 |-- id: string (nullable = false)
 |-- timestamp: timestamp (nullable = false)
 |-- key: string (nullable = false)
 |-- value: double (nullable = false)

Instead, I am converting the dataframe into an RDD (of enumerated tuples) and trying to partition
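If no existing column is suitable for JDBC partitioning, one possible approach after the load is to attach a synthetic index with zipWithIndex and repartition by it. A sketch of that idea follows; the connection details, table name and partition count are placeholders, not values from the question.

import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

object JdbcRepartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("jdbc-repartition-sketch").master("local[*]").getOrCreate()

    // Hypothetical connection details; replace with the real url/table/credentials.
    val df = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/mydb")
      .option("dbtable", "tableName")
      .option("user", "user")
      .option("password", "password")
      .load()

    val numPartitions = 16

    // Give every row a stable synthetic index, then spread rows evenly by that index.
    val indexed = df.rdd.zipWithIndex().map { case (row, idx) => (idx, row) }
    val repartitioned = indexed.partitionBy(new HashPartitioner(numPartitions))

    println(repartitioned.getNumPartitions)
    spark.stop()
  }
}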

Initialize an RDD to empty

醉酒当歌 posted on 2019-11-29 11:17:50
Question: I have an RDD declared as JavaPairRDD<String, List<String>> existingRDD; Now I need to initialize this existingRDD to empty, so that when I get the actual RDDs I can do a union with this existingRDD. How do I initialize existingRDD to an empty RDD other than initializing it to null? Here is my code:

JavaPairRDD<String, List<String>> existingRDD;
if(ai.get()%10==0) {
existingRDD.saveAsNewAPIHadoopFile("s3://manthan-impala-test/kinesis-dump/" + startTime + "/" + k + "/" + System.currentTimeMillis() + "
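In Scala the usual answer is SparkContext.emptyRDD, which avoids null entirely; the Java API exposes the same idea through JavaSparkContext.emptyRDD and JavaPairRDD.fromJavaRDD. A minimal Scala sketch of the pattern (the sample batch data is made up):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object EmptyRddSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("empty-rdd-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Start from an empty RDD of the right element type instead of null.
    var existingRDD: RDD[(String, List[String])] = sc.emptyRDD[(String, List[String])]

    // Later batches can be folded in with union, without any null checks.
    val batch = sc.parallelize(Seq(("k1", List("a", "b"))))
    existingRDD = existingRDD.union(batch)

    println(existingRDD.count())  // 1
    spark.stop()
  }
}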

RDD to LabeledPoint conversion

有些话、适合烂在心里 posted on 2019-11-29 10:56:51
I have an RDD with about 500 columns and 200 million rows, and RDD.columns.indexOf("target", 0) shows Int = 77, which tells me that my target dependent variable is at column number 77. But I don't have enough knowledge of how to select the desired (partial) columns as features (say I want columns 23 to 59, 111 to 357, and 399 to 489). I am wondering whether I can do something like:

val data = rdd.map(col => new LabeledPoint( col(77).toDouble, Vectors.dense(??.map(x => x.toDouble).toArray))

Any suggestions or guidance will be much appreciated. Maybe I mixed up RDD with DataFrame; I can convert the RDD to
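One possible shape for this, assuming each row of the RDD is an Array[String] with the label at index 77 and the stated index ranges as features, is to precompute the feature indices once and map each row to a LabeledPoint. This is a sketch under those assumptions, not the asker's actual schema.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

object LabeledPointSketch {
  // Assumes each row is an array of string-encoded numbers,
  // with the label at index 77 and features taken from three index ranges.
  def toLabeledPoints(rows: RDD[Array[String]]): RDD[LabeledPoint] = {
    val featureIdx = (23 to 59) ++ (111 to 357) ++ (399 to 489)
    rows.map { row =>
      val label    = row(77).toDouble
      val features = featureIdx.map(i => row(i).toDouble).toArray
      LabeledPoint(label, Vectors.dense(features))
    }
  }
}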

reduceByKey method not being found in Scala Spark

烈酒焚心 posted on 2019-11-29 09:05:46
Attempting to run http://spark.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala from source. This line:

val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)

is throwing the error

value reduceByKey is not a member of org.apache.spark.rdd.RDD[(String, Int)]
val wordCounts = logData.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)

logData.flatMap(line => line.split(" ")).map(word => (word, 1)) returns a MappedRDD, but I cannot find this type in http://spark.apache.org/docs/0.9.1/api/core
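In Spark 0.9.x, reduceByKey lives in PairRDDFunctions and only becomes visible after importing the implicit conversions from SparkContext; from Spark 1.3 onward the import is no longer needed. A minimal sketch of the standalone app with that import added (the input path is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}
// In Spark 0.9.x this import provides the implicit conversion from
// RDD[(K, V)] to PairRDDFunctions, which is where reduceByKey is defined.
import org.apache.spark.SparkContext._

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("word-count").setMaster("local[*]"))
    val textFile = sc.textFile("README.md")  // hypothetical input path
    val wordCounts = textFile.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey((a, b) => a + b)
    wordCounts.take(10).foreach(println)
    sc.stop()
  }
}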

Spark Notes: DStream

空扰寡人 posted on 2019-11-29 07:24:10
3.1 What is a DStream? Discretized Stream is the basic abstraction of Spark Streaming. It represents a continuous stream of data, as well as the result stream obtained after applying various Spark operators. Internally, a DStream is represented as a series of consecutive RDDs, each of which holds the data for one time interval, as shown in the figure. Operations on the data are likewise performed at the granularity of RDDs. Spark Streaming creates a DStream from the data stream produced by a data source, and new DStreams can also be created by applying operations to existing DStreams. Its workflow is as shown in the diagram: after receiving real-time data, it splits the data into batches, hands them to the Spark Engine for processing, and finally produces the result for each batch. Source: https://blog.51cto.com/14473726/2435677
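A minimal DStream word-count sketch illustrating the batching described above; the socket source, host, port and 5-second batch interval are arbitrary choices for illustration, not part of the original note.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dstream-sketch").setMaster("local[2]")
    // Each 5-second batch of input becomes one RDD inside the DStream.
    val ssc = new StreamingContext(conf, Seconds(5))

    // A DStream backed by a socket source (hypothetical host/port).
    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()  // prints the result RDD of each batch

    ssc.start()
    ssc.awaitTermination()
  }
}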

Not able to declare String type accumulator

孤人 posted on 2019-11-29 07:23:19
I am trying to define an accumulator variable of type String in the Scala shell (driver), but I keep getting the following error:

scala> val myacc = sc.accumulator("Test")
<console>:21: error: could not find implicit value for parameter param: org.apache.spark.AccumulatorParam[String]
       val myacc = sc.accumulator("Test")
                      ^

This seems to be no issue for Int or Double accumulators. Thanks.

That's because Spark by default provides only accumulators of type Int, Long, Double and Float. If you need something else you have to extend AccumulatorParam.

import org.apache.spark.AccumulatorParam object
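One way the answer's suggestion might be completed (a sketch, assuming concatenation is the desired merge behaviour; note that Spark 2.x replaces this API with AccumulatorV2):

import org.apache.spark.AccumulatorParam

// One possible AccumulatorParam for Strings: concatenation with "," as separator.
object StringAccumulatorParam extends AccumulatorParam[String] {
  def zero(initialValue: String): String = ""
  def addInPlace(s1: String, s2: String): String =
    if (s1.isEmpty) s2 else if (s2.isEmpty) s1 else s"$s1,$s2"
}

// Usage in the shell (assuming sc is the SparkContext):
// val myacc = sc.accumulator("Test")(StringAccumulatorParam)
// sc.parallelize(Seq("a", "b", "c")).foreach(x => myacc += x)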

Notes on Spark Basics

此生再无相见时 posted on 2019-11-29 06:54:54
1. Spark: Apache Spark is a fast, general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, as well as an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
2. Basic concepts
3. RDDs and how RDD tasks are divided into stages: The figure shows seven RDDs in total: A, B, C, D, E, F and G. Each small square inside an RDD represents a partition, and one Task will process the data of that partition. RDD A becomes RDD B through a groupByKey transformation. RDD C becomes RDD D through a map transformation. RDD D and RDD E become RDD F through a union transformation. RDD B and RDD F become RDD G through a join transformation. As the figure shows, the dependency between the RDDs produced by map and union and their upstream RDDs is a NarrowDependency, while the dependency between the RDDs produced by groupByKey and join and their upstream RDDs is a ShuffleDependency. Because the DAGScheduler uses ShuffleDependency as the criterion for dividing stages, A is placed in ShuffleMapStage 1; C, D, E and F are placed in ShuffleMapStage 2; and B and G are placed in ResultStage 3.
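A small Scala sketch that builds the same lineage (groupByKey, map, union, join) on toy data; rdd.toDebugString then shows the shuffle boundaries where the DAGScheduler cuts stages. The data below is made up for illustration.

import org.apache.spark.sql.SparkSession

object StageDagSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dag-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val a = sc.parallelize(Seq(("k1", 1), ("k2", 2)))       // RDD A
    val b = a.groupByKey()                                   // ShuffleDependency -> new stage
    val c = sc.parallelize(Seq(("k1", "x"), ("k2", "y")))    // RDD C
    val d = c.map(identity)                                  // NarrowDependency
    val e = sc.parallelize(Seq(("k3", "z")))                 // RDD E
    val f = d.union(e)                                       // NarrowDependency
    val g = b.join(f)                                        // ShuffleDependency -> result stage

    // The lineage shows where the DAGScheduler will cut stages (at the shuffles).
    println(g.toDebugString)
    g.collect()
    spark.stop()
  }
}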

Spark Partitioners

烈酒焚心 posted on 2019-11-29 05:36:12
Spark currently supports hash partitioning and range partitioning, and users can also define custom partitioners; hash partitioning is the current default. In Spark, the partitioner directly determines the number of partitions in an RDD, which partition each record belongs to after the shuffle, and the number of reducers. Only RDDs of key-value type have a partitioner; for non-key-value RDDs the partitioner is None. The partition IDs of each RDD range from 0 to numPartitions-1, and this ID determines which partition a record belongs to. Source: https://www.cnblogs.com/xiangyuguan/p/11456801.html
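A short sketch of the points above: a plain key-value RDD has no partitioner (None), while partitionBy with a HashPartitioner or RangePartitioner fixes both the partitioner and the number of partitions. The toy data and partition count are arbitrary.

import org.apache.spark.{HashPartitioner, RangePartitioner}
import org.apache.spark.sql.SparkSession

object PartitionerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("partitioner-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3), ("d", 4)))
    println(pairs.partitioner)        // None: no partitioner yet

    val hashed = pairs.partitionBy(new HashPartitioner(4))
    println(hashed.partitioner)       // Some(HashPartitioner)
    println(hashed.getNumPartitions)  // 4

    val ranged = pairs.partitionBy(new RangePartitioner(4, pairs))
    println(ranged.partitioner)       // Some(RangePartitioner)

    spark.stop()
  }
}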