rdd

Mapping of elements gone bad

拟墨画扇 submitted on 2019-12-21 21:25:20
Question: I am implementing k-means and I want to create the new centroids, but the mapping leaves one element out! However, when K has a smaller value, like 15, it works fine. This is the code I have so far:

    val K = 25 // number of clusters
    val data = sc.textFile("dense.txt").map(
      t => (t.split("#")(0), parseVector(t.split("#")(1)))).cache()
    val count = data.count()
    println("Number of records " + count)
    var centroids = data.takeSample(false, K, 42).map(x => x._2)
    do {
      var closest = data.map(p =>
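For context, a self-contained sketch of the step this loop is building up to: assign each point to its nearest centroid, then average per centroid to get new centroids. This is not the poster's missing code; the sample data, `distance`, and `closestCentroid` are assumptions standing in for whatever `parseVector` produces.

```scala
// Points are Array[Double] stand-ins for the question's parsed vectors.
def distance(a: Array[Double], b: Array[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

def closestCentroid(p: Array[Double], centroids: Array[Array[Double]]): Int =
  centroids.indices.minBy(i => distance(p, centroids(i)))

val points = sc.parallelize(Seq(
  Array(1.0, 1.0), Array(1.2, 0.8), Array(8.0, 8.0), Array(7.5, 8.5)))
val centroids = points.takeSample(false, 2, 42)

// Assign each point to its nearest centroid index, then average per index.
// If a centroid attracts no points, its index simply never appears in the result,
// which is one common way a cluster "goes missing" when K is large relative to the data.
val newCentroids = points
  .map(p => (closestCentroid(p, centroids), (p, 1)))
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1.zip(s2).map(t => t._1 + t._2), c1 + c2) }
  .mapValues { case (sum, count) => sum.map(_ / count) }
  .collectAsMap()
```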

Can anyone explain RDD blocks in executors?

回眸只為那壹抹淺笑 submitted on 2019-12-21 19:59:08
Question: Can anyone explain why the RDD blocks keep increasing when I run the Spark code a second time, even though they were stored in Spark memory during the first run? I am providing input using a thread. What is the exact meaning of "RDD blocks"?

Answer 1: I have been researching this today, and it seems the "RDD Blocks" figure is actually the sum of RDD blocks and non-RDD blocks. Check out the code at: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/exec/ExecutorsPage.scala

    val rddBlocks = status

Update collection in MongoDb via Apache Spark using Mongo-Hadoop connector

ⅰ亾dé卋堺 submitted on 2019-12-21 17:53:29
Question: I would like to update a specific collection in MongoDB via Spark in Java. I am using the MongoDB Connector for Hadoop to retrieve information from and save it to MongoDB from Apache Spark in Java. After following Sampo Niskanen's excellent post on retrieving and saving collections to MongoDB via Spark, I got stuck on updating collections. MongoOutputFormat.java includes a constructor taking String[] updateKeys, which I am guessing refers to a possible list of keys to compare on
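For reference, the basic (non-updating) save path that this question builds on looks roughly like the sketch below. It is shown in Scala for brevity even though the question is about Java, and the output URI, field name, and sample pairs are placeholders, not values from the question.

```scala
import org.apache.hadoop.conf.Configuration
import org.bson.BasicBSONObject
import com.mongodb.hadoop.MongoOutputFormat

// Placeholder output URI; point it at the collection to write to.
val outputConfig = new Configuration()
outputConfig.set("mongo.output.uri", "mongodb://127.0.0.1:27017/mydb.mycollection")

// Pairs of (key, BSON document) to persist. MongoOutputFormat writes to the URI
// above, so the path argument is not used and examples typically pass a dummy value.
val rdd = sc.parallelize(Seq(("a", 1), ("b", 2)))
val toSave = rdd.map { case (id, value) =>
  val doc = new BasicBSONObject()
  doc.put("value", value)
  (id, doc)
}
toSave.saveAsNewAPIHadoopFile(
  "file:///not-used",
  classOf[Object],
  classOf[Object],
  classOf[MongoOutputFormat[Object, Object]],
  outputConfig)
```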

How to add a new column to a Spark RDD?

守給你的承諾、 submitted on 2019-12-21 10:14:24
Question: I have an RDD with MANY columns (e.g., hundreds). How do I add one more column at the end of this RDD? For example, if my RDD is like the below:

    123, 523, 534, ..., 893
    536, 98, 1623, ..., 98472
    537, 89, 83640, ..., 9265
    7297, 98364, 9, ..., 735
    ......
    29, 94, 956, ..., 758

how can I add a column to it whose value is the sum of the second and the third columns? Thank you very much.

Answer 1: You do not have to use Tuple* objects at all to add a new column to an RDD. It can be done by mapping
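The answer is cut off above; a minimal sketch of the map-based approach it is describing (not necessarily the answerer's exact code, and the sample rows are shortened versions of the ones in the question):

```scala
// Parse each line into numbers, append the sum of the 2nd and 3rd columns, re-join.
val lines = sc.parallelize(Seq(
  "123, 523, 534, 893",
  "536, 98, 1623, 98472"
))
val withExtraColumn = lines.map { line =>
  val cols = line.split(",").map(_.trim.toLong)
  (cols :+ (cols(1) + cols(2))).mkString(", ")
}
withExtraColumn.collect().foreach(println)
// 123, 523, 534, 893, 1057
// 536, 98, 1623, 98472, 1721
```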

A First Look at RDDs

三世轮回 submitted on 2019-12-21 09:10:39
An RDD (Resilient Distributed Dataset) is the distributed data structure underlying Spark's storage layer and can be considered the core of Spark; every operation in the Spark API is based on RDDs. The data is not stored on a single machine but spread across many machines, so computation over it can be parallelized, and "resilient" means that lost data can be rebuilt. Since Spark 1.5 there is an additional data structure, the Spark DataFrame, modeled on the SQL-like DataFrame of R and Python; it is backed by RDDs and makes it easier for data practitioners to work with them. In Spark's design, a new fault-tolerance mechanism was needed to reduce network and disk I/O overhead, and that is how the RDD data structure came about. An RDD is a read-only block of data that can be created from external data, and you can apply functional operations (Operations) to it, namely Transformations and Actions. "Read-only" here means that when you operate on an RDD the result is a new RDD, even if the code uses the same variable name before and after the transformation. An RDD does not hold the real data itself but only metadata recording which Transformations produced it; this ancestry is called the lineage, and the lineage forms a directed acyclic graph (DAG). Throughout the whole computation
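A minimal sketch of the Transformation/Action distinction described above (the file name is just the one used elsewhere on this page):

```scala
// Transformations only record lineage and return a new RDD; nothing is computed
// until an action such as count() or collect() is called.
val lines = sc.textFile("dense.txt")      // RDD defined from external data
val lengths = lines.map(_.length)         // transformation: a new RDD, lazily defined
val longOnes = lengths.filter(_ > 100)    // another transformation, extends the lineage
println(longOnes.count())                 // action: triggers the actual computation
```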

Recursive method call in Apache Spark

时间秒杀一切 submitted on 2019-12-21 05:25:08
Question: I'm building a family tree from a database on Apache Spark, using a recursive search to find the ultimate parent (i.e. the person at the top of the family tree) for each person in the DB. For this purpose, it is assumed that the first person returned when searching for their id is the correct parent.

    val peopleById = peopleRDD.keyBy(f => f.id)
    def findUltimateParentId(personId: String) : String = {
      if((personId == null) || (personId.length() == 0))
        return "-1"
      val personSeq = peopleById
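As an aside, lookups on one RDD cannot be nested inside transformations of another, so patterns like the above usually have to be restructured. One hedged sketch, assuming the parent links fit in memory, is to broadcast them and walk the chain with ordinary recursion; this is not the poster's code, and `parentLinks` plus the self-parenting convention for roots are assumptions.

```scala
// Hypothetical RDD[(personId, parentId)]; a person at the top of a tree either has
// an empty parent or points at themselves.
val parentLinks: org.apache.spark.rdd.RDD[(String, String)] =
  sc.parallelize(Seq(("c", "b"), ("b", "a"), ("a", "a")))

// Collect the links to the driver and broadcast them so executors can walk the chain
// (works as written in spark-shell; in compiled code the enclosing object must serialize).
val links = sc.broadcast(parentLinks.collectAsMap())

def ultimateParentId(id: String): String = links.value.get(id) match {
  case Some(parent) if parent != null && parent.nonEmpty && parent != id =>
    ultimateParentId(parent)
  case _ => id
}

val ultimateParents = parentLinks.keys.map(id => (id, ultimateParentId(id)))
ultimateParents.collect()  // Array((c,a), (b,a), (a,a))
```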

How can I compute the average from a Spark RDD?

时间秒杀一切 submitted on 2019-12-21 04:43:08
Question: I have a problem with Spark Scala: I want to compute the average from RDD data. I create a new RDD like this,

    [(2,110),(2,130),(2,120),(3,200),(3,206),(3,206),(4,150),(4,160),(4,170)]

I want to combine the values like this,

    [(2,(110+130+120)/3),(3,(200+206+206)/3),(4,(150+160+170)/3)]

and then get the result like this,

    [(2,120),(3,204),(4,160)]

How can I do this in Scala from the RDD? I am using Spark version 1.6.

Answer 1: You can use aggregateByKey.

    val rdd = sc.parallelize(Seq((2,110),(2,130),(2,120),(3,200),(3
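The answer is truncated above; here is a self-contained sketch of the aggregateByKey approach it names (the (0, 0) zero value and the integer division are my choices, not necessarily the answerer's):

```scala
// Accumulate a (sum, count) pair per key, then divide.
val rdd = sc.parallelize(Seq(
  (2,110),(2,130),(2,120),(3,200),(3,206),(3,206),(4,150),(4,160),(4,170)))
val averages = rdd
  .aggregateByKey((0, 0))(
    (acc, v) => (acc._1 + v, acc._2 + 1),        // fold a value into (sum, count)
    (a, b) => (a._1 + b._1, a._2 + b._2))        // merge partial (sum, count) pairs
  .mapValues { case (sum, count) => sum / count }
averages.collect()  // Array((2,120), (3,204), (4,160))
```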

Convert a simple one line string to RDD in Spark

杀马特。学长 韩版系。学妹 submitted on 2019-12-21 03:14:18
Question: I have a simple line:

    line = "Hello, world"

I would like to convert it to an RDD with only one element. I have tried

    sc.parallelize(line)

but I get:

    sc.parallelize(line).collect()
    ['H', 'e', 'l', 'l', 'o', ',', ' ', 'w', 'o', 'r', 'l', 'd']

Any ideas?

Answer 1: Try using a List as the parameter:

    sc.parallelize(List(line)).collect()

It returns

    res1: Array[String] = Array(hello,world)

Answer 2: The code below works fine in Python:

    sc.parallelize([line]).collect()
    ['Hello, world']

Here we are passing the
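The reason the first attempt splits into characters is that parallelize distributes the elements of whatever collection it receives, and a bare string is iterated character by character. A small Scala check, reusing the question's variable name:

```scala
val line = "Hello, world"

// The string's characters become the elements (made explicit here with .toSeq);
// the question's PySpark call behaves the same way: an RDD of twelve characters.
val chars = sc.parallelize(line.toSeq)

// Wrapping the string in a one-element collection keeps the whole line together.
val whole = sc.parallelize(Seq(line))
println(whole.count())   // 1
println(whole.first())   // Hello, world
```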

RDD Monitoring

谁都会走 submitted on 2019-12-21 00:06:32
Contents
1. Spark UI
2. Spark History UI
3. REST API

In day-to-day work you need to monitor how Spark jobs are running, spot problems, and tune accordingly; see "Monitoring and Instrumentation" in the Spark documentation.
Monitoring metrics:
1) Launch Time
2) Duration
3) GC Time
4) Shuffle Read Size / Records, etc.
There are three ways to monitor a Spark application:
1) Spark UI
2) Spark History UI
3) REST API

1. Spark UI
Address: http://hadoop001:4040
Spark UI tabs:
1) Jobs: submitted jobs, stage information, the DAG graph, and so on.
2) Stages: stage and task information.
3) Storage: information about data cached in memory.
4) Environment: environment, configuration, jars, and so on.
5) Executors: information about the executors and the driver.
Issues:
1) If several SparkContexts run on the same machine, the Spark UI ports increase one by one: 4040, 4041, 4042, ...
2) To be clear: once the Spark application finishes, this UI can no longer be viewed because its lifecycle has ended, so the Spark History service should be started.

2. Spark History UI
The Spark application's activity needs to be recorded
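As a small illustration of option 3, the monitoring REST API can be queried from plain code. The sketch below assumes the driver UI on hadoop001:4040 used in this post and the documented /api/v1 endpoints; picking out the application id and parsing the JSON are left to the caller.

```scala
import scala.io.Source

// A running driver exposes the monitoring REST API under /api/v1 on the UI port.
val base = "http://hadoop001:4040/api/v1"

// List the applications known to this UI (for a live driver, the current app).
val apps = Source.fromURL(s"$base/applications").mkString
println(apps)

// With an <app-id> taken from the response above, per-executor metrics
// (including RDD block counts) are available at:
//   GET <base>/applications/<app-id>/executors
// and storage information for cached RDDs at:
//   GET <base>/applications/<app-id>/storage/rdd
```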

Spark JSON text field to RDD

天涯浪子 submitted on 2019-12-20 10:48:32
Question: I've got a Cassandra table with a field of type text named snapshot containing JSON objects:

    [identifier, timestamp, snapshot]

I understood that to be able to do transformations on that field with Spark, I need to convert that field of that RDD into another RDD in order to do transformations on the JSON schema. Is that correct? How should I proceed to do that?

Edit: For now I managed to create an RDD from a single text field:

    val conf = new SparkConf().setAppName("signal-aggregation")
    val sc = new
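For context, once the snapshot column is available as an RDD[String] of JSON documents, Spark SQL can infer a schema from it. A minimal sketch in the Spark 1.x API implied by the question, assuming the spark-shell's sqlContext; the field names in the sample documents are invented:

```scala
// Two made-up snapshot documents standing in for the Cassandra column values.
val snapshots = sc.parallelize(Seq(
  """{"sensor": "a", "value": 1.5}""",
  """{"sensor": "b", "value": 2.0}"""
))

// sqlContext.read.json accepts an RDD[String] and infers the schema.
val snapshotDF = sqlContext.read.json(snapshots)
snapshotDF.printSchema()
snapshotDF.registerTempTable("snapshots")
sqlContext.sql("SELECT sensor, value FROM snapshots").show()
```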