rdd

Spark Accumulator value not read by task

Submitted by 我的未来我决定 on 2019-12-04 17:13:44
I am initializing an accumulator with final Accumulator<Integer> accum = sc.accumulator(0); and then, inside a map function, I try to increment the accumulator and use its value to set a field:

JavaRDD<UserSetGet> UserProfileRDD1 = temp.map(new Function<String, UserSetGet>() {
    @Override
    public UserSetGet call(String arg0) throws Exception {
        UserSetGet usg = new UserSetGet();
        accum.add(1);
        usg.setPid(accum.value().toString());
        return usg;
    }
});

But I'm getting the following error: 16/03/14 09:12:58 ERROR executor.Executor: Exception in task 0.0 in stage 2.0 (TID 2) java.lang
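The truncated error is most likely Spark complaining that an accumulator's value cannot be read inside a task: value() may only be called on the driver, while tasks may only add() to it. If the goal is simply a unique id per record, zipWithUniqueId() sidesteps the accumulator entirely. A minimal PySpark sketch of that pattern, with made-up input and a plain dict standing in for the UserSetGet class from the question:

from pyspark import SparkContext

sc = SparkContext(appName="accumulator-vs-unique-id")

temp = sc.parallelize(["user-a", "user-b", "user-c"])  # stand-in for the real input RDD

# Tasks may only add() to an accumulator; reading .value inside map() fails,
# so derive per-record ids from zipWithUniqueId() instead.
with_ids = temp.zipWithUniqueId()  # -> (record, unique_id) pairs

# Build the output records, the generated id playing the role of setPid(...).
user_profiles = with_ids.map(lambda pair: {"pid": str(pair[1]), "raw": pair[0]})

print(user_profiles.collect())
sc.stop()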

How to perform Standard Deviation and Mean operations on a Java Spark RDD?

Submitted by 柔情痞子 on 2019-12-04 16:54:36
I have a JavaRDD which looks like this:

[ [A,8] [B,3] [C,5] [A,2] [B,8] ... ... ]

I want my result to be the mean:

[ [A,5] [B,5.5] [C,5] ]

How do I do this using Java RDDs only? P.S.: I want to avoid the groupBy operation, so I am not using DataFrames.

Here you go:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.util.StatCounter;
import scala.Tuple2;
import scala.Tuple3;
import java.util.Arrays;
import java.util.List;

public class
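The answer is cut off above, but its imports point at Spark's StatCounter. One way to get the per-key mean (and standard deviation) without any groupBy is aggregateByKey with a StatCounter as the accumulator. A minimal PySpark sketch of that pattern on the question's sample data; the truncated Java answer presumably follows the same shape with the classes it imports:

from pyspark import SparkContext
from pyspark.statcounter import StatCounter

sc = SparkContext(appName="mean-stdev-by-key")

pairs = sc.parallelize([("A", 8.0), ("B", 3.0), ("C", 5.0), ("A", 2.0), ("B", 8.0)])

# Fold each key's values into a StatCounter; no groupByKey/groupBy is involved.
stats = pairs.aggregateByKey(
    StatCounter(),
    lambda acc, v: acc.merge(v),                # add one value within a partition
    lambda acc1, acc2: acc1.mergeStats(acc2))   # combine counters across partitions

means = stats.mapValues(lambda s: (s.mean(), s.stdev()))
print(sorted(means.collect()))
# [('A', (5.0, 3.0)), ('B', (5.5, 2.5)), ('C', (5.0, 0.0))]
sc.stop()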

Spark study notes

Submitted by 五迷三道 on 2019-12-04 15:38:12
Limitations of MapReduce: it only suits "one-pass" computations; operators are hard to compose and nest; iterative computation cannot be expressed. MapReduce is slow because of replication, serialization, and disk I/O; complex applications, stream processing, and interactive queries are all slow because MapReduce lacks efficient data sharing. Every pass of an iterative job, and every interactive or online query, has to go back to disk.

Spark's goals: keep more data in memory to improve performance; extend the MapReduce model to better support two common classes of analytics, (1) iterative algorithms (machine learning, graphs) and (2) interactive data mining; and improve programmability through (1) rich API libraries and (2) less code.

Spark components: Spark SQL, Spark Streaming (real-time), GraphX, MLlib (machine learning).

Spark can run in any of the following modes: its own standalone cluster mode, on Hadoop YARN, on Apache Mesos, on Kubernetes, or in the cloud.

Data sources (see the read sketch below): 1. local files, e.g. file:///opt/httpd/logs/access_log; 2. Amazon S3; 3. Hadoop Distributed File System; 4. HBase, Cassandra, etc.

Spark cluster
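To make the data-source list concrete, here is a minimal PySpark sketch that reads from each kind of source; the bucket name, namenode host, and port are placeholders, not real endpoints:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("data-sources-demo")
sc = SparkContext(conf=conf)

# The same textFile() call works for every source; only the URI scheme changes.
local_logs = sc.textFile("file:///opt/httpd/logs/access_log")      # local filesystem
s3_logs    = sc.textFile("s3a://my-bucket/logs/access_log")        # Amazon S3 (placeholder bucket)
hdfs_logs  = sc.textFile("hdfs://namenode:8020/logs/access_log")   # HDFS (placeholder namenode)

print(local_logs.count())
sc.stop()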

Getting error in Spark: Executor lost

Submitted by 独自空忆成欢 on 2019-12-04 15:32:06
I have one master and two slaves, each running with 32 GB of RAM, and I'm reading a csv file with around 18 million records (the first row contains the column headers). This is the command I am using to run the job:

./spark-submit --master yarn --deploy-mode client --executor-memory 10g <path/to/.py file>

I did the following:

rdd = sc.textFile("<path/to/file>")
h = rdd.first()
header_rdd = rdd.map(lambda l: h in l)
data_rdd = rdd.subtract(header_rdd)
data_rdd.first()

I'm getting the following error message - 15/10/12 13:52:03 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint:
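Whatever ultimately kills the executors, subtract() forces a full shuffle of all 18 million rows just to drop a single header line. A cheaper, shuffle-free way to skip the header is to filter on the first line directly; a minimal PySpark sketch with a placeholder path:

from pyspark import SparkContext

sc = SparkContext(appName="skip-csv-header")

rdd = sc.textFile("hdfs:///data/input.csv")  # placeholder path
header = rdd.first()

# Keep every line except the header; no subtract(), so no shuffle.
data_rdd = rdd.filter(lambda line: line != header)

print(data_rdd.first())
sc.stop()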

2. SparkCore

Submitted by 安稳与你 on 2019-12-04 13:56:12
Chapter 1: RDD Overview

1.1 What is an RDD
RDD (Resilient Distributed Dataset) is the most basic data (and computation) abstraction in Spark. In the code it is an abstract class; it represents an immutable, partitionable collection whose elements can be computed in parallel. Distributed: the source of the data. Dataset: an encapsulation of the data type and the computation type (the data model). Resilient. Immutable: the computation logic cannot be changed. Partitionable: improves data-processing capacity. Parallel computation: multiple tasks execute at the same time.

1.2 Properties of an RDD
A list of partitions, the basic units the dataset is composed of; a function for computing each partition; dependencies on other RDDs; a Partitioner, i.e. the RDD's partitioning function; and a list storing the preferred location of each partition.

1.3 Characteristics of an RDD
An RDD is a read-only, partitioned dataset. The only way to "change" an RDD is through a transformation, which derives a new RDD from an existing one; the new RDD carries all the information needed to derive it from its parents. RDDs therefore depend on one another, and execution is deferred (lazy) along this lineage. If the lineage grows long, it can be cut by persisting the RDD (see the sketch below).

1.3.1 Partitions
Logically an RDD is partitioned; the data of each partition exists only abstractly, and at computation time a compute function produces each partition's data. If the RDD was built from an existing file system, the compute function reads the data from that file system,
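A minimal PySpark sketch of the lineage and persistence behaviour described above, on made-up numbers: transformations only record lineage, an action triggers the computation, and persist() keeps the computed partitions in memory so later actions reuse them:

from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="rdd-lineage-demo")

numbers = sc.parallelize(range(1, 11), numSlices=4)   # 4 partitions

# Transformations only build lineage; nothing is computed yet.
squares = numbers.map(lambda x: x * x)
evens   = squares.filter(lambda x: x % 2 == 0)

# Persist the intermediate RDD so repeated actions reuse the cached partitions.
evens.persist(StorageLevel.MEMORY_ONLY)

print(evens.count())            # first action: lineage is executed, result cached
print(evens.collect())          # second action: served from the cached partitions
print(evens.toDebugString())    # prints the lineage (dependency) chain
sc.stop()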

How to access an individual element of a tuple in an RDD in pyspark?

Submitted by 守給你的承諾、 on 2019-12-04 13:48:51
Let's say I have an RDD like:

[(u'Some1', (u'ABC', 9989)), (u'Some2', (u'XYZ', 235)), (u'Some3', (u'BBB', 5379)), (u'Some4', (u'ABC', 5379))]

I am using map to get one tuple at a time, but how can I access an individual element of a tuple, for example to see whether a tuple contains some string? Actually I want to filter the tuples that contain some string; here, the tuples that contain ABC. I was trying to do something like this, but it's not helping:

def foo(line):
    if(line[1]=="ABC"):
        return (line)

new_data = data.map(foo)

I am new to Spark and Python as well, please help!!

RDDs can be filtered directly.
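Picking up the hint at the end ("RDDs can be filtered directly"): line[1] is itself a tuple, so the string to compare against is line[1][0]. A minimal sketch with the sample data from the question:

from pyspark import SparkContext

sc = SparkContext(appName="filter-nested-tuples")

data = sc.parallelize([
    (u'Some1', (u'ABC', 9989)),
    (u'Some2', (u'XYZ', 235)),
    (u'Some3', (u'BBB', 5379)),
    (u'Some4', (u'ABC', 5379)),
])

# line[1] is the inner tuple; line[1][0] is its first element ('ABC', 'XYZ', ...).
abc_only = data.filter(lambda line: line[1][0] == u'ABC')

print(abc_only.collect())
# [('Some1', ('ABC', 9989)), ('Some4', ('ABC', 5379))]
sc.stop()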

Huge memory consumption in Map Task in Spark

Submitted by ぐ巨炮叔叔 on 2019-12-04 12:52:39
I have a lot of files, each containing roughly 60,000,000 lines. All of my files are formatted as {timestamp}#{producer}#{messageId}#{data_bytes}\n. I walk through my files one by one and also want to build one output file per input file. Because some of the lines depend on previous lines, I grouped them by their producer; whenever a line depends on one or more previous lines, their producer is always the same. After grouping all of the lines, I hand them to my Java parser. The parser then holds all parsed data objects in memory and outputs them as JSON afterwards. To visualize
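A minimal PySpark sketch of the grouping step as described, with a placeholder input path and a trivial stand-in for the Java parser; note that groupByKey() materializes all of one producer's lines on a single executor, which is where the memory pressure described in the question tends to come from:

from pyspark import SparkContext

sc = SparkContext(appName="group-by-producer")

lines = sc.textFile("hdfs:///logs/input-file-001")  # placeholder path

# {timestamp}#{producer}#{messageId}#{data_bytes} -> (producer, whole line)
def key_by_producer(line):
    parts = line.split("#", 3)
    return (parts[1], line)

# groupByKey pulls every line of one producer onto a single executor.
grouped = lines.map(key_by_producer).groupByKey()

def parse_group(producer_and_lines):
    producer, producer_lines = producer_and_lines
    # stand-in for the real Java parser: just count the lines per producer
    return (producer, sum(1 for _ in producer_lines))

print(grouped.map(parse_group).take(5))
sc.stop()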

Recursive method call in Apache Spark

Submitted by 时光总嘲笑我的痴心妄想 on 2019-12-04 12:52:34
I'm building a family tree from a database on Apache Spark, using a recursive search to find the ultimate parent (i.e. the person at the top of the family tree) for each person in the DB. For the purposes of this, it's assumed that the first person returned when searching for their id is the correct parent.

val peopleById = peopleRDD.keyBy(f => f.id)

def findUltimateParentId(personId: String) : String = {
  if((personId == null) || (personId.length() == 0))
    return "-1"
  val personSeq = peopleById.lookup(personId)
  val person = personSeq(0)
  if(person.personId == "0" || person.id == person.parentId) {
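One caveat about the code above: lookup() is a driver-side operation, so findUltimateParentId cannot be called from inside another RDD transformation. A common alternative (not the poster's approach) is to resolve the ultimate parent by repeatedly joining the RDD with itself until no parent pointer changes. A minimal PySpark sketch of that iterative-join idea on made-up (id, parent_id) pairs, where "0" marks a root:

from pyspark import SparkContext

sc = SparkContext(appName="ultimate-parent")

# (person_id, parent_id); "0" means the person has no parent (tree root).
people = sc.parallelize([
    ("1", "0"),   # 1 is a root
    ("2", "1"),
    ("3", "2"),
    ("4", "0"),   # 4 is a root
    ("5", "4"),
])

# (person, ancestor resolved so far); start with the direct parent, roots point at themselves.
current = people.map(lambda p: (p[0], p[0] if p[1] == "0" else p[1]))

changed = True
while changed:
    # Follow one more hop: replace each resolved ancestor by its own parent, if it has one.
    step = (current.map(lambda kv: (kv[1], kv[0]))           # (ancestor, person)
                   .leftOuterJoin(people)                     # attach the ancestor's parent
                   .map(lambda kv: (kv[1][0],
                                    kv[0] if kv[1][1] in (None, "0") else kv[1][1])))
    changed = step.join(current).filter(lambda kv: kv[1][0] != kv[1][1]).count() > 0
    current = step

print(sorted(current.collect()))
# [('1', '1'), ('2', '1'), ('3', '1'), ('4', '4'), ('5', '4')]
sc.stop()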

Spark throws java.io.IOException: Failed to rename when saving part-xxxxx.gz

Submitted by 喜欢而已 on 2019-12-04 12:51:53
New Spark user here. I'm extracting features from many .tif images stored on AWS S3, each with an identifier like 02_R4_C7. I'm using Spark 2.2.1 and Hadoop 2.7.2, with all default configurations, like so:

conf = SparkConf().setAppName("Feature Extraction")
sc = SparkContext(conf=conf)
sc.setLogLevel("ERROR")
sqlContext = SQLContext(sc)

And here is the call that fails after some features have already been saved into an image-id folder as part-xxxx.gz files:

features_labels_rdd.saveAsTextFile(text_rdd_direct,"org.apache.hadoop.io.compress.GzipCodec")

See error below. When I
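The "Failed to rename" comes from the Hadoop file output committer, whose rename-based commit step maps poorly onto S3. Two commonly suggested mitigations, sketched below with placeholder paths and sample data (verify them against your own Spark/Hadoop setup), are to switch to commit algorithm version 2, which does less renaming, or to write the output to HDFS or local disk first and copy it to S3 once the job finishes:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("Feature Extraction")
        # Mitigation 1: commit algorithm v2 moves task output into place with fewer renames.
        .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2"))
sc = SparkContext(conf=conf)

features_labels_rdd = sc.parallelize(["a,1", "b,2"])  # stand-in for the real RDD

# Mitigation 2: save to HDFS (or local disk) instead of writing straight to S3,
# then copy the finished directory to S3 with the AWS CLI or distcp.
features_labels_rdd.saveAsTextFile(
    "hdfs:///tmp/features/02_R4_C7",                 # placeholder output path
    "org.apache.hadoop.io.compress.GzipCodec")

sc.stop()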

Can anyone explain RDD blocks in executors?

Submitted by 半城伤御伤魂 on 2019-12-04 12:15:04
Can anyone explain why RDD blocks increase when I run the Spark code a second time, even though they were stored in Spark memory during the first run? I am supplying the input from a thread. What is the exact meaning of "RDD blocks"?

I have been researching this today, and it seems the "RDD Blocks" figure is the sum of RDD blocks and non-RDD blocks. Check out the code at: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/exec/ExecutorsPage.scala

val rddBlocks = status.numBlocks

And if you go to the below link of the Apache Spark repo on GitHub: https://github.com/apache/spark
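For intuition about where those numbers come from: cached RDD partitions are stored as blocks, and the executors page aggregates the block counts. A minimal PySpark sketch you can run while watching the web UI (port 4040 is the default for a local, single-application setup):

from pyspark import SparkContext

sc = SparkContext(appName="rdd-blocks-demo")

rdd = sc.parallelize(range(1000), numSlices=8).cache()

# The action materializes and caches the 8 partitions; each cached partition
# is stored as a block, which the web UI rolls up into the executors' counts.
print(rdd.count())

input("Check the Spark UI (port 4040 by default), then press Enter to stop...")
sc.stop()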