rdd

Spark Accumulator value not read by task

Submitted by 我的未来我决定 on 2019-12-04 17:13:44
I am initializing an accumulator with final Accumulator<Integer> accum = sc.accumulator(0); and then, inside a map function, I try to increment the accumulator and use its value to set a field:

JavaRDD<UserSetGet> UserProfileRDD1 = temp.map(new Function<String, UserSetGet>() {
    @Override
    public UserSetGet call(String arg0) throws Exception {
        UserSetGet usg = new UserSetGet();
        accum.add(1);
        usg.setPid(accum.value().toString());
        return usg;
    }
});

But I'm getting the following error: 16/03/14 09:12:58 ERROR executor.Executor: Exception in task 0.0 in stage 2.0 (TID 2) java.lang
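The truncated error is most likely Spark complaining that an accumulator's value cannot be read inside a task: value() may only be called on the driver, while tasks may only add() to it. If the goal is simply a unique id per record, zipWithUniqueId() sidesteps the accumulator entirely. A minimal PySpark sketch of that pattern, with made-up input and a plain dict standing in for the UserSetGet class from the question:

from pyspark import SparkContext

sc = SparkContext(appName="accumulator-vs-unique-id")

temp = sc.parallelize(["user-a", "user-b", "user-c"])  # stand-in for the real input RDD

# Tasks may only add() to an accumulator; reading .value inside map() fails,
# so derive per-record ids from zipWithUniqueId() instead.
with_ids = temp.zipWithUniqueId()  # -> (record, unique_id) pairs

# Build the output records, the generated id playing the role of setPid(...).
user_profiles = with_ids.map(lambda pair: {"pid": str(pair[1]), "raw": pair[0]})

print(user_profiles.collect())
sc.stop()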

How to perform Standard Deviation and Mean operations on a Java Spark RDD?

Submitted by 柔情痞子 on 2019-12-04 16:54:36
I have a JavaRDD which looks like this:

[ [A,8] [B,3] [C,5] [A,2] [B,8] ... ... ]

I want my result to be the mean:

[ [A,5] [B,5.5] [C,5] ]

How do I do this using Java RDDs only? P.S.: I want to avoid the groupBy operation, so I am not using DataFrames.

Here you go:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.util.StatCounter;
import scala.Tuple2;
import scala.Tuple3;
import java.util.Arrays;
import java.util.List;

public class
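The answer is cut off above, but its imports point at Spark's StatCounter. One way to get the per-key mean (and standard deviation) without any groupBy is aggregateByKey with a StatCounter as the accumulator. A minimal PySpark sketch of that pattern on the question's sample data; the truncated Java answer presumably follows the same shape with the classes it imports:

from pyspark import SparkContext
from pyspark.statcounter import StatCounter

sc = SparkContext(appName="mean-stdev-by-key")

pairs = sc.parallelize([("A", 8.0), ("B", 3.0), ("C", 5.0), ("A", 2.0), ("B", 8.0)])

# Fold each key's values into a StatCounter; no groupByKey/groupBy is involved.
stats = pairs.aggregateByKey(
    StatCounter(),
    lambda acc, v: acc.merge(v),                # add one value within a partition
    lambda acc1, acc2: acc1.mergeStats(acc2))   # combine counters across partitions

means = stats.mapValues(lambda s: (s.mean(), s.stdev()))
print(sorted(means.collect()))
# [('A', (5.0, 3.0)), ('B', (5.5, 2.5)), ('C', (5.0, 0.0))]
sc.stop()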

Spark study notes

Submitted by 五迷三道 on 2019-12-04 15:38:12
Limitations of MapReduce: it only suits "one-pass" computations; operators are hard to compose and nest; iterative computation cannot be expressed. MapReduce is slow because of replication, serialization, and disk I/O; complex applications, stream processing, and interactive queries are all slow because MapReduce lacks efficient data sharing. Every pass of an iterative job, and every interactive or online query, has to go back to disk.

Spark's goals: keep more data in memory to improve performance; extend the MapReduce model to better support two common classes of analytics, (1) iterative algorithms (machine learning, graphs) and (2) interactive data mining; and improve programmability through (1) rich API libraries and (2) less code.

Spark components: Spark SQL, Spark Streaming (real-time), GraphX, MLlib (machine learning).

Spark can run in any of the following modes: its own standalone cluster mode, on Hadoop YARN, on Apache Mesos, on Kubernetes, or in the cloud.

Data sources (see the read sketch below): 1. local files, e.g. file:///opt/httpd/logs/access_log; 2. Amazon S3; 3. Hadoop Distributed File System; 4. HBase, Cassandra, etc.

Spark cluster
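To make the data-source list concrete, here is a minimal PySpark sketch that reads from each kind of source; the bucket name, namenode host, and port are placeholders, not real endpoints:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("data-sources-demo")
sc = SparkContext(conf=conf)

# The same textFile() call works for every source; only the URI scheme changes.
local_logs = sc.textFile("file:///opt/httpd/logs/access_log")      # local filesystem
s3_logs    = sc.textFile("s3a://my-bucket/logs/access_log")        # Amazon S3 (placeholder bucket)
hdfs_logs  = sc.textFile("hdfs://namenode:8020/logs/access_log")   # HDFS (placeholder namenode)

print(local_logs.count())
sc.stop()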

Getting error in Spark: Executor lost

Submitted by 独自空忆成欢 on 2019-12-04 15:32:06
I have one master and two slaves, each running with 32 GB of RAM, and I'm reading a csv file with around 18 million records (the first row contains the column headers). This is the command I am using to run the job:

./spark-submit --master yarn --deploy-mode client --executor-memory 10g <path/to/.py file>

I did the following:

rdd = sc.textFile("<path/to/file>")
h = rdd.first()
header_rdd = rdd.map(lambda l: h in l)
data_rdd = rdd.subtract(header_rdd)
data_rdd.first()

I'm getting the following error message - 15/10/12 13:52:03 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint:
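Whatever ultimately kills the executors, subtract() forces a full shuffle of all 18 million rows just to drop a single header line. A cheaper, shuffle-free way to skip the header is to filter on the first line directly; a minimal PySpark sketch with a placeholder path:

from pyspark import SparkContext

sc = SparkContext(appName="skip-csv-header")

rdd = sc.textFile("hdfs:///data/input.csv")  # placeholder path
header = rdd.first()

# Keep every line except the header; no subtract(), so no shuffle.
data_rdd = rdd.filter(lambda line: line != header)

print(data_rdd.first())
sc.stop()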

2. SparkCore

Submitted by 安稳与你 on 2019-12-04 13:56:12
Chapter 1: RDD Overview

1.1 What is an RDD
RDD (Resilient Distributed Dataset) is the most basic data (and computation) abstraction in Spark. In the code it is an abstract class; it represents an immutable, partitionable collection whose elements can be computed in parallel. Distributed: the source of the data. Dataset: an encapsulation of the data type and the computation type (the data model). Resilient. Immutable: the computation logic cannot be changed. Partitionable: improves data-processing capacity. Parallel computation: multiple tasks execute at the same time.

1.2 Properties of an RDD
A list of partitions, the basic units the dataset is composed of; a function for computing each partition; dependencies on other RDDs; a Partitioner, i.e. the RDD's partitioning function; and a list storing the preferred location of each partition.

1.3 Characteristics of an RDD
An RDD is a read-only, partitioned dataset. The only way to "change" an RDD is through a transformation, which derives a new RDD from an existing one; the new RDD carries all the information needed to derive it from its parents. RDDs therefore depend on one another, and execution is deferred (lazy) along this lineage. If the lineage grows long, it can be cut by persisting the RDD (see the sketch below).

1.3.1 Partitions
Logically an RDD is partitioned; the data of each partition exists only abstractly, and at computation time a compute function produces each partition's data. If the RDD was built from an existing file system, the compute function reads the data from that file system,
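A minimal PySpark sketch of the lineage and persistence behaviour described above, on made-up numbers: transformations only record lineage, an action triggers the computation, and persist() keeps the computed partitions in memory so later actions reuse them:

from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="rdd-lineage-demo")

numbers = sc.parallelize(range(1, 11), numSlices=4)   # 4 partitions

# Transformations only build lineage; nothing is computed yet.
squares = numbers.map(lambda x: x * x)
evens   = squares.filter(lambda x: x % 2 == 0)

# Persist the intermediate RDD so repeated actions reuse the cached partitions.
evens.persist(StorageLevel.MEMORY_ONLY)

print(evens.count())            # first action: lineage is executed, result cached
print(evens.collect())          # second action: served from the cached partitions
print(evens.toDebugString())    # prints the lineage (dependency) chain
sc.stop()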

How to access an individual element of a tuple in an RDD in pyspark?

Submitted by 守給你的承諾、 on 2019-12-04 13:48:51
Let's say I have an RDD like:

[(u'Some1', (u'ABC', 9989)), (u'Some2', (u'XYZ', 235)), (u'Some3', (u'BBB', 5379)), (u'Some4', (u'ABC', 5379))]

I am using map to get one tuple at a time, but how can I access an individual element of a tuple, for example to see whether a tuple contains some string? Actually I want to filter the tuples that contain some string; here, the tuples that contain ABC. I was trying to do something like this, but it's not helping:

def foo(line):
    if(line[1]=="ABC"):
        return (line)

new_data = data.map(foo)

I am new to Spark and Python as well, please help!!

RDDs can be filtered directly.
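Picking up the hint at the end ("RDDs can be filtered directly"): line[1] is itself a tuple, so the string to compare against is line[1][0]. A minimal sketch with the sample data from the question:

from pyspark import SparkContext

sc = SparkContext(appName="filter-nested-tuples")

data = sc.parallelize([
    (u'Some1', (u'ABC', 9989)),
    (u'Some2', (u'XYZ', 235)),
    (u'Some3', (u'BBB', 5379)),
    (u'Some4', (u'ABC', 5379)),
])

# line[1] is the inner tuple; line[1][0] is its first element ('ABC', 'XYZ', ...).
abc_only = data.filter(lambda line: line[1][0] == u'ABC')

print(abc_only.collect())
# [('Some1', ('ABC', 9989)), ('Some4', ('ABC', 5379))]
sc.stop()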

Huge memory consumption in Map Task in Spark

Submitted by ぐ巨炮叔叔 on 2019-12-04 12:52:39
I have a lot of files, each containing roughly 60,000,000 lines. All of my files are formatted as {timestamp}#{producer}#{messageId}#{data_bytes}\n. I walk through my files one by one and also want to build one output file per input file. Because some of the lines depend on previous lines, I grouped them by their producer; whenever a line depends on one or more previous lines, their producer is always the same. After grouping all of the lines, I hand them to my Java parser. The parser then holds all parsed data objects in memory and outputs them as JSON afterwards. To visualize
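A minimal PySpark sketch of the grouping step as described, with a placeholder input path and a trivial stand-in for the Java parser; note that groupByKey() materializes all of one producer's lines on a single executor, which is where the memory pressure described in the question tends to come from:

from pyspark import SparkContext

sc = SparkContext(appName="group-by-producer")

lines = sc.textFile("hdfs:///logs/input-file-001")  # placeholder path

# {timestamp}#{producer}#{messageId}#{data_bytes} -> (producer, whole line)
def key_by_producer(line):
    parts = line.split("#", 3)
    return (parts[1], line)

# groupByKey pulls every line of one producer onto a single executor.
grouped = lines.map(key_by_producer).groupByKey()

def parse_group(producer_and_lines):
    producer, producer_lines = producer_and_lines
    # stand-in for the real Java parser: just count the lines per producer
    return (producer, sum(1 for _ in producer_lines))

print(grouped.map(parse_group).take(5))
sc.stop()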

Recursive method call in Apache Spark

Submitted by 时光总嘲笑我的痴心妄想 on 2019-12-04 12:52:34
I'm building a family tree from a database on Apache Spark, using a recursive search to find the ultimate parent (i.e. the person at the top of the family tree) for each person in the DB. For the purposes of this, it's assumed that the first person returned when searching for their id is the correct parent.

val peopleById = peopleRDD.keyBy(f => f.id)

def findUltimateParentId(personId: String) : String = {
  if((personId == null) || (personId.length() == 0))
    return "-1"
  val personSeq = peopleById.lookup(personId)
  val person = personSeq(0)
  if(person.personId == "0" || person.id == person.parentId) {
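One caveat about the code above: lookup() is a driver-side operation, so findUltimateParentId cannot be called from inside another RDD transformation. A common alternative (not the poster's approach) is to resolve the ultimate parent by repeatedly joining the RDD with itself until no parent pointer changes. A minimal PySpark sketch of that iterative-join idea on made-up (id, parent_id) pairs, where "0" marks a root:

from pyspark import SparkContext

sc = SparkContext(appName="ultimate-parent")

# (person_id, parent_id); "0" means the person has no parent (tree root).
people = sc.parallelize([
    ("1", "0"),   # 1 is a root
    ("2", "1"),
    ("3", "2"),
    ("4", "0"),   # 4 is a root
    ("5", "4"),
])

# (person, ancestor resolved so far); start with the direct parent, roots point at themselves.
current = people.map(lambda p: (p[0], p[0] if p[1] == "0" else p[1]))

changed = True
while changed:
    # Follow one more hop: replace each resolved ancestor by its own parent, if it has one.
    step = (current.map(lambda kv: (kv[1], kv[0]))           # (ancestor, person)
                   .leftOuterJoin(people)                     # attach the ancestor's parent
                   .map(lambda kv: (kv[1][0],
                                    kv[0] if kv[1][1] in (None, "0") else kv[1][1])))
    changed = step.join(current).filter(lambda kv: kv[1][0] != kv[1][1]).count() > 0
    current = step

print(sorted(current.collect()))
# [('1', '1'), ('2', '1'), ('3', '1'), ('4', '4'), ('5', '4')]
sc.stop()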

Spark throws java.io.IOException: Failed to rename when saving part-xxxxx.gz

Submitted by 喜欢而已 on 2019-12-04 12:51:53
New Spark user here. I'm extracting features from many .tif images stored on AWS S3, each with an identifier like 02_R4_C7. I'm using Spark 2.2.1 and Hadoop 2.7.2, with all default configurations, like so:

conf = SparkConf().setAppName("Feature Extraction")
sc = SparkContext(conf=conf)
sc.setLogLevel("ERROR")
sqlContext = SQLContext(sc)

And here is the call that fails after some features have already been saved into an image-id folder as part-xxxx.gz files:

features_labels_rdd.saveAsTextFile(text_rdd_direct,"org.apache.hadoop.io.compress.GzipCodec")

See error below. When I
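The "Failed to rename" comes from the Hadoop file output committer, whose rename-based commit step maps poorly onto S3. Two commonly suggested mitigations, sketched below with placeholder paths and sample data (verify them against your own Spark/Hadoop setup), are to switch to commit algorithm version 2, which does less renaming, or to write the output to HDFS or local disk first and copy it to S3 once the job finishes:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("Feature Extraction")
        # Mitigation 1: commit algorithm v2 moves task output into place with fewer renames.
        .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2"))
sc = SparkContext(conf=conf)

features_labels_rdd = sc.parallelize(["a,1", "b,2"])  # stand-in for the real RDD

# Mitigation 2: save to HDFS (or local disk) instead of writing straight to S3,
# then copy the finished directory to S3 with the AWS CLI or distcp.
features_labels_rdd.saveAsTextFile(
    "hdfs:///tmp/features/02_R4_C7",                 # placeholder output path
    "org.apache.hadoop.io.compress.GzipCodec")

sc.stop()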

Can anyone explain RDD blocks in executors?

Submitted by 半城伤御伤魂 on 2019-12-04 12:15:04
Can anyone explain why RDD blocks increase when I run the Spark code a second time, even though they were stored in Spark memory during the first run? I am supplying the input from a thread. What is the exact meaning of "RDD blocks"?

I have been researching this today, and it seems the "RDD Blocks" figure is the sum of RDD blocks and non-RDD blocks. Check out the code at: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/exec/ExecutorsPage.scala

val rddBlocks = status.numBlocks

And if you go to the below link of the Apache Spark repo on GitHub: https://github.com/apache/spark
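For intuition about where those numbers come from: cached RDD partitions are stored as blocks, and the executors page aggregates the block counts. A minimal PySpark sketch you can run while watching the web UI (port 4040 is the default for a local, single-application setup):

from pyspark import SparkContext

sc = SparkContext(appName="rdd-blocks-demo")

rdd = sc.parallelize(range(1000), numSlices=8).cache()

# The action materializes and caches the 8 partitions; each cached partition
# is stored as a block, which the web UI rolls up into the executors' counts.
print(rdd.count())

input("Check the Spark UI (port 4040 by default), then press Enter to stop...")
sc.stop()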