rdd

Spark: DB connection per Spark RDD partition and do mapPartition

删除回忆录丶 submitted on 2019-11-27 22:37:46
I want to do a mapPartitions on my Spark RDD:

    val newRd = myRdd.mapPartitions(partition => {
      val connection = new DbConnection // creates a db connection per partition
      val newPartition = partition.map(record => {
        readMatchingFromDB(record, connection)
      })
      connection.close()
      newPartition
    })

But this gives me a "connection already closed" exception, as expected, because my connection is closed before the work inside .map() actually runs. I want to create one connection per RDD partition and close it properly. How can I achieve this? Thanks!

Tzach Zohar: As mentioned in the discussion here - the issue
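
The excerpt is cut off above. One way to fix the problem (a minimal sketch using the DbConnection and readMatchingFromDB names from the question, not necessarily how the quoted answer continues) is to force the lazily-mapped partition to be evaluated before closing the connection:

    val newRd = myRdd.mapPartitions { partition =>
      val connection = new DbConnection   // one connection per partition
      // map on an Iterator is lazy, so materialize the results with toList
      // while the connection is still open
      val processed = partition.map(record => readMatchingFromDB(record, connection)).toList
      connection.close()
      processed.iterator
    }

Note that toList buffers an entire partition in memory; for very large partitions, a wrapping iterator that closes the connection once it is exhausted is the usual alternative.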

58. DStream output operations and foreachRDD in detail

核能气质少年 submitted on 2019-11-27 21:17:55
I. Output operations

1. Output operations

All computation in a DStream is triggered by its output operations, such as print(). If there are no output operations at all, the computation logic you defined will never run. Furthermore, even if you use the foreachRDD output operation, you must perform an action on the RDD inside it to trigger the computation for each batch. Otherwise, a foreachRDD output operation with no action on the RDD inside it will not trigger any logic either.

2. Overview of output operations

II. foreachRDD

1. foreachRDD in detail

Typically, inside foreachRDD you create a Connection, for example a JDBC Connection, and then write data to external storage through that Connection.

Pitfall 1: creating the Connection outside the RDD's foreach operation. This is wrong, because it causes the Connection object to be serialized and shipped to every task, and such Connection objects generally do not support serialization and therefore cannot be shipped.

    dstream.foreachRDD { rdd =>
      val connection = createNewConnection()
      rdd.foreach { record =>
        connection.send(record)
      }
    }

Pitfall 2: inside the RDD's foreach operation
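
The text is cut off above. The pattern usually recommended instead (a minimal sketch reusing the hypothetical createNewConnection() and connection.send() helpers from the excerpt) is to open one connection per partition on the executor side:

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { partitionOfRecords =>
        // one connection per partition, created on the executor that processes it
        val connection = createNewConnection()
        partitionOfRecords.foreach(record => connection.send(record))
        connection.close()
      }
    }

This avoids serializing the connection (it is created on the executor) and also avoids the cost of opening a new connection for every record.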

How to get a sample with an exact sample size in Spark RDD?

為{幸葍}努か submitted on 2019-11-27 20:43:55
Why does the rdd.sample() function on a Spark RDD return a different number of elements even though the fraction parameter is the same? For example, with the code below:

    val a = sc.parallelize(1 to 10000, 3)
    a.sample(false, 0.1).count

Every time I run the second line it returns a different number, never exactly 1000. I actually expect to see 1000 every time, although the 1000 elements themselves might differ. Can anyone tell me how I can get a sample whose size is exactly 1000? Thank you very much.

If you want an exact sample, try doing

    a.takeSample(false, 1000)

But note
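
The answer is truncated above. As a short sketch of the difference (assuming the same SparkContext sc): sample() keeps each element independently with probability fraction, so the returned count only averages out to 1000, whereas takeSample() returns exactly the requested number of elements, collected to the driver as an array.

    val a = sc.parallelize(1 to 10000, 3)

    // Bernoulli sampling: each element is kept with probability 0.1,
    // so the count fluctuates around 1000 from run to run
    val approxCount = a.sample(withReplacement = false, fraction = 0.1).count()

    // exact sample size, but the result is an Array[Int] on the driver,
    // not a distributed RDD
    val exact: Array[Int] = a.takeSample(withReplacement = false, num = 1000)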

Would Spark unpersist the RDD itself when it realizes it won't be used anymore?

你离开我真会死。 submitted on 2019-11-27 20:28:15
We can persist an RDD in memory and/or on disk when we want to use it more than once. However, do we have to unpersist it ourselves later on, or does Spark do some kind of garbage collection and unpersist the RDD when it is no longer needed? I notice that if I call the unpersist function myself, I get slower performance.

Yes, Apache Spark will unpersist the RDD when it's garbage collected. In RDD.persist you can see:

    sc.cleaner.foreach(_.registerRDDForCleanup(this))

This puts a WeakReference to the RDD in a ReferenceQueue, leading to ContextCleaner.doCleanupRDD when the RDD is garbage collected.
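
In practice the choice looks like the sketch below (my own illustration, assuming an existing SparkContext sc): an explicit unpersist frees the cached blocks immediately, while relying on the ContextCleaner frees them asynchronously once the RDD object becomes unreachable and is garbage collected.

    import org.apache.spark.storage.StorageLevel

    val cached = sc.textFile("data.txt").persist(StorageLevel.MEMORY_AND_DISK)
    cached.count()   // first action materializes the cache
    cached.count()   // second action reads from the cache

    // explicit, immediate cleanup; if omitted, the ContextCleaner will
    // unpersist the RDD in the background after it is garbage collected
    cached.unpersist(blocking = true)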

How do I select a range of elements in Spark RDD?

南笙酒味 submitted on 2019-11-27 20:06:24
I'd like to select a range of elements in a Spark RDD. For example, I have an RDD with a hundred elements, and I need to select the elements from 60 to 80. How do I do that? I see that RDD has a take(i: Int) method, which returns the first i elements, but there is no corresponding method to take the last i elements, or i elements from the middle starting at a certain index.

aaronman: I don't think there is an efficient method to do this yet. But the easy way is using filter(). Let's say you have an RDD, pairs, with key-value pairs and you only want elements from 60 to 80 inclusive; just do

    val
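
The answer breaks off above. One common way to express the idea (a sketch of my own, not necessarily how the quoted answer continues) is to attach positions with zipWithIndex and filter on the index range:

    val rdd = sc.parallelize(1 to 100)

    // zipWithIndex assigns 0-based positions, so elements 60..80 (1-based)
    // correspond to indices 59..79
    val range = rdd.zipWithIndex()
      .filter { case (_, idx) => idx >= 59 && idx <= 79 }
      .map { case (value, _) => value }

    range.collect()   // Array(60, 61, ..., 80)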

How to calculate the best numberOfPartitions for coalesce?

二次信任 submitted on 2019-11-27 19:29:19
So, I understand that in general one should use coalesce() when the number of partitions decreases due to a filter or some other operation that reduces the original dataset (RDD, DF); coalesce() is useful for running operations more efficiently after filtering down a large dataset. I also understand that it is less expensive than repartition, as it reduces shuffling by moving data only when necessary. My problem is how to define the parameter that coalesce takes (idealPartionionNo). I am working on a project that was handed over to me by another engineer, and he was using the below
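
A short sketch of the pattern in question (my own illustration; the data, the filter, and the 2x heuristic are assumptions, not values from the excerpt). Instead of a hard-coded idealPartionionNo, a common starting point is a small multiple of the cluster's default parallelism:

    // illustrative only: a selective filter followed by coalesce
    val bigRdd   = sc.parallelize(1 to 10000000, 1000)
    val filtered = bigRdd.filter(_ % 100 == 0)        // keeps roughly 1% of the rows

    // heuristic starting point: 2x the cluster's default parallelism
    val targetPartitions = sc.defaultParallelism * 2

    // coalesce only merges existing partitions, so it avoids a full shuffle
    val compacted = filtered.coalesce(targetPartitions)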

Spark Summary: RDD

安稳与你 submitted on 2019-11-27 19:25:01
I. Overview of RDDs

1.1 What is an RDD?

An RDD (Resilient Distributed Dataset) is the most basic data abstraction in Spark. It represents an immutable, partitionable collection whose elements can be computed in parallel. RDDs have the characteristics of a dataflow model: automatic fault tolerance, locality-aware scheduling, and scalability. RDDs let users explicitly cache a working set in memory across multiple queries, so that subsequent queries can reuse it, which greatly speeds up queries.

1.2 Properties of an RDD

(1) A set of partitions, the basic units that make up the dataset. Each partition of an RDD is processed by one compute task, and the number of partitions determines the granularity of parallelism. Users can specify the number of partitions when creating an RDD; if none is specified, a default is used, namely the number of CPU cores allocated to the program.

(2) A function for computing each partition. Computation on an RDD in Spark is done partition by partition, and every RDD implements a compute function for this purpose. The compute function composes iterators, so intermediate results do not need to be stored.

(3) Dependencies between RDDs. Every transformation of an RDD produces a new RDD, so RDDs form pipeline-like dependencies on one another. When data in some partitions is lost, Spark can use these dependencies to recompute only the lost partitions instead of recomputing all partitions of the RDD.

(4) A Partitioner, i.e., the RDD's partitioning function
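
A short sketch tying these properties to the RDD API (my own illustration, assuming a local SparkContext named sc): the partition count, the lineage created by transformations, and the partitioner are all visible on the RDD itself.

    // (1) explicit partition count at creation time
    val nums = sc.parallelize(1 to 1000, numSlices = 8)
    println(nums.getNumPartitions)     // 8

    // (3) each transformation creates a new RDD that depends on its parent
    val byKey = nums.map(x => (x % 4, x)).reduceByKey(_ + _)
    println(byKey.toDebugString)       // prints the lineage

    // (4) key-value RDDs produced by a shuffle carry a Partitioner
    println(byKey.partitioner)         // Some(org.apache.spark.HashPartitioner@...)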

How to check if Spark RDD is in memory?

别等时光非礼了梦想. submitted on 2019-11-27 18:13:12
Question: I have an instance of org.apache.spark.rdd.RDD[MyClass]. How can I programmatically check whether the instance is persisted / in memory?

Answer 1: You want RDD.getStorageLevel. It will return StorageLevel.NONE if nothing is set. However, that only tells you whether the RDD is marked for caching or not. If you want the actual status, you can use the developer API sc.getRDDStorageInfo or sc.getPersistentRDDs.

Answer 2: You can call rdd.getStorageLevel.useMemory to check whether it is in memory, as follows:

    scala> myrdd.getStorageLevel
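
Putting the two answers together as a small sketch (assuming an already-created RDD named myrdd and a SparkContext sc): getStorageLevel only reports what the RDD is marked for, while getRDDStorageInfo reflects what is actually cached once an action has run.

    import org.apache.spark.storage.StorageLevel

    myrdd.persist(StorageLevel.MEMORY_ONLY)
    println(myrdd.getStorageLevel.useMemory)                 // true: marked for in-memory caching
    println(sc.getRDDStorageInfo.exists(_.id == myrdd.id))   // false: nothing materialized yet

    myrdd.count()                                            // an action materializes the cache
    println(sc.getRDDStorageInfo.exists(_.id == myrdd.id))   // true: blocks are now cached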

Spark parquet partitioning: Large number of files

自作多情 submitted on 2019-11-27 18:01:28
I am trying to leverage Spark partitioning. I was trying to do something like

    data.write.partitionBy("key").parquet("/location")

The issue here is that each partition creates a huge number of parquet files, which results in slow reads if I try to read from the root directory. To avoid that I tried

    data.coalesce(numPart).write.partitionBy("key").parquet("/location")

This, however, creates numPart parquet files in each partition. Now my partition sizes are different, so I would ideally like a separate coalesce per partition. That, however, doesn't look like an easy thing. I need to visit all
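
The question is cut off above. A commonly suggested remedy (a sketch of my own, not the accepted answer) is to repartition by the same column used in partitionBy, so that each output directory is written by a single task and therefore contains a single file:

    import org.apache.spark.sql.functions.col

    // `data` is the DataFrame from the question, with a column named "key"
    data.repartition(col("key"))
        .write
        .partitionBy("key")
        .parquet("/location")

With heavily skewed keys this still yields one very large file for each hot key, which is why the repartition expression is sometimes combined with an additional salt or bucket column.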

reduceByKey: How does it work internally?

微笑、不失礼 submitted on 2019-11-27 17:41:23
I am new to Spark and Scala, and I am confused about the way the reduceByKey function works in Spark. Suppose we have the following code:

    val lines = sc.textFile("data.txt")
    val pairs = lines.map(s => (s, 1))
    val counts = pairs.reduceByKey((a, b) => a + b)

The map function is clear: s is the key, it refers to the line from data.txt, and 1 is the value. However, I don't get how reduceByKey works internally. Does "a" point to the key? Alternatively, does "a" point to "s"? And what does a + b represent? How are they filled?

Justin Pihony: Let's break it down to discrete methods and types. That
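
The answer is truncated above. As a small sketch of the semantics (my own illustration), a and b are both values that share the same key, never the key itself; Spark repeatedly folds pairs of values with the same key, first within each partition and then across partitions:

    val pairs  = sc.parallelize(Seq(("apple", 1), ("pear", 1), ("apple", 1)))

    // a and b are both Int counts belonging to the same key;
    // the key ("apple" or "pear") is handled by Spark and never passed to the function
    val counts = pairs.reduceByKey((a, b) => a + b)

    counts.collect()   // Array((apple, 2), (pear, 1))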