rdd

Is it possible to create nested RDDs in Apache Spark?

Submitted by 我是研究僧i on 2019-11-26 21:43:09
Question: I am trying to implement the K-nearest neighbor algorithm in Spark. I was wondering if it is possible to work with nested RDDs; that would make my life a lot easier. Consider the following code snippet:

public static void main (String[] args){
    //blah blah code
    JavaRDD<Double> temp1 = testData.map(
        new Function<Vector,Double>(){
            public Double call(final Vector z) throws Exception{
                JavaRDD<Double> temp2 = trainData.map(
                    new Function<Vector, Double>() {
                        public Double call(Vector vector) throws
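Nested RDDs are not supported in Spark: an RDD transformation cannot reference another RDD inside its closure, so temp2 cannot be built inside temp1's map. A common workaround for k-NN, when the training set is small enough, is to broadcast it and compute distances locally on the executors. A minimal sketch in Scala (trainData and testData are assumed to be RDD[Vector], and k is assumed to be defined):

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Broadcast the (smaller) training set so each executor has a local copy.
val localTrain: Array[Vector] = trainData.collect()
val trainBc = sc.broadcast(localTrain)

// For every test point, keep the k smallest squared Euclidean distances
// against the broadcast training set -- no nested RDD needed.
val kNearest = testData.map { z =>
  trainBc.value
    .map(v => Vectors.sqdist(z, v))
    .sorted
    .take(k)
}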

How to read from hbase using spark

Submitted by ﹥>﹥吖頭↗ on 2019-11-26 21:30:34
The code below reads from HBase, converts the result to a JSON structure, and then converts that to a SchemaRDD. The problem is that I am using a List to store the JSON strings and then pass it to a JavaRDD; for data of about 100 GB, the master ends up loading all of it in memory. What is the right way to load the data from HBase, perform the manipulation, and then convert it to a JavaRDD?

package hbase_reader;

import java.io.IOException;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org
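A common way to avoid collecting the table into a driver-side List is to hand HBase's TableInputFormat to Spark directly, so the scan itself runs on the executors, and then transform the resulting RDD. A minimal sketch in Scala, assuming a hypothetical table name "my_table" and the HBase MapReduce integration classes on the classpath:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes

// Point the Hadoop InputFormat at the HBase table; the scan runs on the executors.
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")   // hypothetical table name

val hbaseRDD = sc.newAPIHadoopRDD(
  hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

// Transform row by row without materialising anything on the driver.
val rowKeys = hbaseRDD.map { case (_, result) =>
  Bytes.toString(result.getRow)                           // e.g. keep just the row key
}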

Apache Spark RDD filter into two RDDs

Submitted by 一曲冷凌霜 on 2019-11-26 21:29:32
Question: I need to split an RDD into two parts: one part that satisfies a condition, and another part that does not. I can call filter twice on the original RDD, but that seems inefficient. Is there a way to do what I'm after? I can't find anything in the API or in the literature.

Answer 1: Spark doesn't support this out of the box. Filtering the same data twice isn't that bad if you cache it beforehand, and the filtering itself is quick. If it's really just two different types, you can use a helper method:
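The original answer's code is not shown here; a minimal sketch of the helper-method idea (cache once, then filter with the predicate and its negation) could look like this:

import org.apache.spark.rdd.RDD

// Cache the input so the two filter passes read the same materialised data.
def splitBy[T](rdd: RDD[T])(p: T => Boolean): (RDD[T], RDD[T]) = {
  val cached = rdd.cache()
  (cached.filter(p), cached.filter(x => !p(x)))
}

// Usage: elements that satisfy the condition vs. the rest.
val (evens, odds) = splitBy(sc.parallelize(1 to 10))(_ % 2 == 0)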

Default Partitioning Scheme in Spark

Submitted by 巧了我就是萌 on 2019-11-26 21:01:06
When I execute the command below:

scala> val rdd = sc.parallelize(List((1,2),(3,4),(3,6)),4).partitionBy(new HashPartitioner(10)).persist()
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[10] at partitionBy at <console>:22

scala> rdd.partitions.size
res9: Int = 10

scala> rdd.partitioner.isDefined
res10: Boolean = true

scala> rdd.partitioner.get
res11: org.apache.spark.Partitioner = org.apache.spark.HashPartitioner@a

it says that there are 10 partitions and that partitioning is done using a HashPartitioner. But when I execute the command below:

scala> val rdd = sc.parallelize(List((1,2),(3,4),(3,6))
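The second command is cut off above, but the contrast the question is after can be shown directly: partitionBy attaches an explicit HashPartitioner, while a plain parallelize only splits the data into the requested number of partitions and sets no partitioner at all. A minimal sketch (spark-shell, with sc available):

import org.apache.spark.HashPartitioner

// Explicitly partitioned: 10 partitions, partitioner is defined.
val explicit = sc.parallelize(List((1, 2), (3, 4), (3, 6)), 4)
  .partitionBy(new HashPartitioner(10))
explicit.partitions.size          // 10
explicit.partitioner.isDefined    // true

// Plain parallelize: 4 partitions as requested, but no partitioner is set.
val plain = sc.parallelize(List((1, 2), (3, 4), (3, 6)), 4)
plain.partitions.size             // 4
plain.partitioner.isDefined       // false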

Spark Core: RDD operations

Submitted by 社会主义新天地 on 2019-11-26 20:44:50
RDD operations fall into two broad categories: transformations and actions.
Transformation: an operation that turns one RDD into another RDD.
Action: an operation that evaluates an RDD or produces output.
All of these operations mainly target two kinds of RDD: (1) numeric RDDs and (2) key-value (pair) RDDs. Pair RDDs are the kind we use most, and they are also the source of problems such as data imbalance (skew).
All RDD transformations are lazily executed: Spark only actually runs anything when an action appears. No matter how many transformations you have defined beforehand, nothing executes at the moment you define them; only when you trigger an action does everything you defined earlier actually run. Why is it done this way? For optimization: not everyone writes an optimally arranged chain of operations, so Spark analyzes the execution path you defined and, if it is not optimal, helps you optimize it. Because nothing has been triggered yet, Spark is free to decide which parts need to run and which do not. For example, if you define many transformations but never set an action at the end, then as far as Spark is concerned those earlier steps are useless and it never touches them at all. That, in short, is how RDD operations work.
Transformation: goes from one RDD to another RDD; in other words, its output is another RDD.
Action
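A minimal sketch of the lazy evaluation described above (Scala): the transformations only record lineage, and nothing runs until the action at the end.

// Transformations: each line only records how to compute the result.
val nums     = sc.parallelize(1 to 1000000)
val squared  = nums.map(x => x.toLong * x)       // nothing has run yet
val filtered = squared.filter(_ % 2 == 0)        // still nothing has run

// Action: only now does Spark build and execute the whole chain.
val howMany = filtered.count()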

How do I get a SQL row_number equivalent for a Spark RDD?

Submitted by a 夏天 on 2019-11-26 20:26:28
I need to generate a full list of row_numbers for a data table with many columns. In SQL, this would look like this:

select
    key_value,
    col1, col2, col3,
    row_number() over (partition by key_value order by col1, col2 desc, col3)
from temp;

Now, let's say in Spark I have an RDD of the form (K, V), where V = (col1, col2, col3), so my entries are like

(key1, (1,2,3))
(key1, (1,4,7))
(key1, (2,2,3))
(key2, (5,5,5))
(key2, (5,5,9))
(key2, (7,5,5))

etc. I want to order these using commands like sortBy(), sortWith(), sortByKey(), zipWithIndex, etc. and have a new RDD with the correct row_number (key1,
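A common RDD-only approach to this is to group by the key, sort each key's values locally, and zip them with an index; a minimal sketch, assuming the values for any single key fit in one executor's memory:

val data = sc.parallelize(Seq(
  ("key1", (1, 2, 3)), ("key1", (1, 4, 7)), ("key1", (2, 2, 3)),
  ("key2", (5, 5, 5)), ("key2", (5, 5, 9)), ("key2", (7, 5, 5))))

val numbered = data.groupByKey().flatMap { case (key, values) =>
  values.toSeq
    .sortBy { case (c1, c2, c3) => (c1, -c2, c3) }   // order by col1, col2 desc, col3
    .zipWithIndex
    .map { case (v, i) => (key, v, i + 1) }          // row_number() starts at 1
}

numbered.collect().foreach(println)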

Would Spark unpersist the RDD itself when it realizes it won't be used anymore?

Submitted by 牧云@^-^@ on 2019-11-26 20:21:27
Question: We can persist an RDD into memory and/or to disk when we want to use it more than once. However, do we have to unpersist it ourselves later on, or does Spark do some kind of garbage collection and unpersist the RDD when it is no longer needed? I notice that if I call the unpersist function myself, I get slower performance.

Answer 1: Yes, Apache Spark will unpersist the RDD when it's garbage collected. In RDD.persist you can see:

sc.cleaner.foreach(_.registerRDDForCleanup(this))

This puts a
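The quoted answer refers to Spark's ContextCleaner (sc.cleaner), which unpersists an RDD once the RDD object is garbage collected on the driver. You can still call unpersist yourself to release the memory deterministically once you are done with the cached data; a minimal sketch (the input path is hypothetical):

import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("hdfs:///tmp/data.txt")          // hypothetical path
  .persist(StorageLevel.MEMORY_AND_DISK)

val total    = lines.count()                             // first action materialises the cache
val nonEmpty = lines.filter(_.nonEmpty).count()          // second action reuses the cached data

lines.unpersist()                                        // release explicitly instead of waiting for GC-driven cleanup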

Matrix Multiplication in Apache Spark [closed]

Submitted by 纵饮孤独 on 2019-11-26 20:09:58
I am trying to perform matrix multiplication using Apache Spark and Java. I have two main questions:

How do I create an RDD that can represent a matrix in Apache Spark?
How do I multiply two such RDDs?

It all depends on the input data and dimensions, but generally speaking what you want is not an RDD but one of the distributed data structures from org.apache.spark.mllib.linalg.distributed. At the moment it provides four different implementations of DistributedMatrix:

IndexedRowMatrix - can be created directly from an RDD[IndexedRow], where an IndexedRow consists of a row index and org.apache.spark.mllib.linalg
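Of the distributed types the answer lists, BlockMatrix is the one that supports multiplying two distributed matrices, and an IndexedRowMatrix can be converted to it. A minimal sketch that builds two small matrices from RDD[IndexedRow] and multiplies them:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

// Two 2x2 matrices, one IndexedRow per matrix row.
val rowsA = sc.parallelize(Seq(
  IndexedRow(0L, Vectors.dense(1.0, 2.0)),
  IndexedRow(1L, Vectors.dense(3.0, 4.0))))
val rowsB = sc.parallelize(Seq(
  IndexedRow(0L, Vectors.dense(5.0, 6.0)),
  IndexedRow(1L, Vectors.dense(7.0, 8.0))))

// Convert to BlockMatrix, which supports a distributed multiply.
val a = new IndexedRowMatrix(rowsA).toBlockMatrix()
val b = new IndexedRowMatrix(rowsB).toBlockMatrix()
val product = a.multiply(b)

println(product.toLocalMatrix())   // fine for this toy example; only safe for small results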

RDD

Submitted by 时间秒杀一切 on 2019-11-26 20:01:46
1. Overview
RDD is the core abstraction. Conceptually, an RDD represents something like an HDFS file.
Distributed dataset: a collection of elements that holds the data; in practice it is partitioned, split into multiple partitions scattered across different nodes of the Spark cluster (a batch of data on a batch of nodes is an RDD).
Most important property: fault tolerance, with automatic recovery from node failures. By default the data is kept in memory; when memory is insufficient, it is written to disk.

2. Creating an RDD
First way: load from a file

>>> lines = sc.textFile("file:///usr/local/spark/mycode/pairrdd/word.txt")
>>> pairRDD = lines.flatMap(lambda line : line.split(" ")).map(lambda word : (word,1))
>>> pairRDD.foreach(print)

Second way: create an RDD from a parallelized collection (a list)

>>> list = ["Hadoop","Spark","Hive","Spark"]
>>> rdd = sc.parallelize(list)
>>> pairRDD = rdd.map(lambda word : (word,1))
>>> pairRDD.foreach(print)

3. Common key-value transformations

>>> pairRDD.reduceByKey(lambda a,b : a+b).foreach
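The last snippet above is cut off at reduceByKey; the same per-key aggregation, sketched in Scala to match the other examples in this collection (the PySpark call is the same operation):

// Build the same pair RDD as above and sum the counts per key.
val words   = sc.parallelize(Seq("Hadoop", "Spark", "Hive", "Spark"))
val pairRDD = words.map(word => (word, 1))

pairRDD.reduceByKey(_ + _).collect().foreach(println)
// e.g. (Spark,2), (Hive,1), (Hadoop,1)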

How do you perform basic joins of two RDD tables in Spark using Python?

Submitted by 无人久伴 on 2019-11-26 19:56:42
Question: How would you perform basic joins in Spark using Python? In R you could use merge() to do this. What is the syntax, using Python on Spark, for:

Inner join
Left outer join
Cross join

with two tables (RDDs), each with a single column, that share a common key?

RDD(1): (key, U)
RDD(2): (key, V)

I think an inner join is something like this:

rdd1.join(rdd2).map(case (key, u, v) => (key, ls ++ rs));

Is that right? I have searched the internet and can't find a good example of joins. Thanks in advance.

Answer 1: It
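The RDD API exposes these joins directly, and the method names are the same in PySpark and Scala: join, leftOuterJoin, and cartesian. A minimal sketch, written in Scala for consistency with the other examples here:

val rdd1 = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
val rdd2 = sc.parallelize(Seq(("a", "x"), ("b", "y"), ("d", "z")))

val inner = rdd1.join(rdd2)            // (key, (U, V)) for keys present in both RDDs
val left  = rdd1.leftOuterJoin(rdd2)   // (key, (U, Option[V])) for every key in rdd1
val cross = rdd1.cartesian(rdd2)       // every pairing of the two RDDs' elements

inner.collect().foreach(println)       // e.g. (a,(1,x)), (b,(2,y))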