rdd

A summary of Spark operators

Submitted by 拈花ヽ惹草 on 2019-12-10 16:20:45
I. Transformation operators 1: Value type 1.1 map(func) 1. Purpose: returns a new RDD composed of the results of passing every input element through the function func 2. Task: create an RDD from the array 1 to 10 and multiply every element by 2 to form a new RDD (1) Create scala> var source = sc.parallelize(1 to 10) source: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at parallelize at <console>:24 (2) Print scala> source.collect() res7: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) (3) Multiply every element by 2 scala> val mapadd = source.map(_ * 2) mapadd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[9] at map at <console>:26 (4) Print the final result scala> mapadd.collect() res8: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20) 1.2 mapPartitions(func) 1. Purpose
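The excerpt above cuts off right at 1.2 mapPartitions(func). Purely as an illustrative sketch (not part of the original post), mapPartitions behaves like map but runs the function once per partition over an iterator, assuming the same 1-to-10 RDD from the map example:

// mapPartitions(func): func is called once per partition and receives an Iterator
// over that partition's elements, which avoids per-element call overhead.
val source = sc.parallelize(1 to 10)
val doubled = source.mapPartitions(iter => iter.map(_ * 2))
doubled.collect()   // Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)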

How to map filenames to RDD using sc.textFile(“s3n://bucket/*.csv”)?

Submitted by 家住魔仙堡 on 2019-12-10 14:51:02
Question: Please note, I must use sc.textFile, but I would accept any other answers. What I want to do is simply add the filename that is being processed to the RDD... something like: var rdd = sc.textFile("s3n://bucket/*.csv").map(line => filename + "," + line) Much appreciated! EDIT2: The solution to EDIT1 is to use Hadoop 2.4 or above. However, I have not tested it using the slaves... etc. Also, some of the mentioned solutions work only for small data sets. If you want to use big data, you
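One workaround often suggested for this kind of question (an alternative to sc.textFile, not the asker's preferred call) is sc.wholeTextFiles, which yields (path, contents) pairs; a minimal sketch, assuming the CSV files are small enough to be read whole:

// wholeTextFiles reads each file as a single (path, contents) record,
// so the filename is available to prepend to every line.
val withNames = sc.wholeTextFiles("s3n://bucket/*.csv").flatMap {
  case (filename, contents) =>
    contents.split("\n").map(line => filename + "," + line)
}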

Spark- Saving JavaRDD to Cassandra

Submitted by 风流意气都作罢 on 2019-12-10 14:47:13
Question: http://www.datastax.com/dev/blog/accessing-cassandra-from-spark-in-java The link above shows a way to save a JavaRDD to Cassandra in this way: import static com.datastax.spark.connector.CassandraJavaUtil.*; JavaRDD<Product> productsRDD = sc.parallelize(products); javaFunctions(productsRDD, Product.class).saveToCassandra("java_api", "products"); But com.datastax.spark.connector.CassandraJavaUtil.* seems deprecated. The updated API should be: import static com.datastax.spark.connector.japi
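The excerpt stops mid-import, so the updated Java call is not reproduced here. For comparison only, the Scala side of the spark-cassandra-connector writes an RDD directly; a minimal sketch, assuming the same java_api.products table and a matching case class:

import com.datastax.spark.connector._   // brings saveToCassandra into scope on RDDs

case class Product(id: Int, name: String)

val productsRDD = sc.parallelize(Seq(Product(1, "phone"), Product(2, "laptop")))
// Case-class fields are mapped to the table's columns by name
productsRDD.saveToCassandra("java_api", "products")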

How to turn a known structured RDD to Vector

Submitted by 落爺英雄遲暮 on 2019-12-10 13:46:07
Question: Assume I have an RDD containing (Int, Int) tuples. I wish to turn it into a Vector where the first Int in the tuple is the index and the second is the value. Any idea how I can do that? I updated my question and added my solution to clarify: my RDD is already reduced by key, and the number of keys is known. I want a vector in order to update a single accumulator instead of multiple accumulators. Therefore my final solution was: reducedStream.foreachRDD(rdd => rdd.collect({case (x: Int, y: Int) => { val v =
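Since the RDD is already reduced by key and the number of keys is known, one option is to collect the pairs to the driver and build a sparse vector there; a minimal sketch, where numKeys is a hypothetical value holding that known key count:

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// rdd: RDD[(Int, Int)], first Int = index, second Int = value
val entries = rdd.collect().map { case (idx, value) => (idx, value.toDouble) }

// numKeys (assumed known up front) gives the vector its length
val vec: Vector = Vectors.sparse(numKeys, entries)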

How to name file when saveAsTextFile in spark?

Submitted by 回眸只為那壹抹淺笑 on 2019-12-10 12:56:57
Question: When saving as a text file in Spark version 1.5.1 I use: rdd.saveAsTextFile('<directory>'). But if I want to find the file in that directory, how do I name it what I want? Currently, I think it is named part-00000, which must be some default. How do I give it a name? Answer 1: As I said in my comment above, the documentation with examples can be found here. And quoting the description of the method saveAsTextFile: Save this RDD as a text file, using string representations of elements. In the
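saveAsTextFile always writes a directory of part-NNNNN files (one per partition), so the name cannot be set directly; a common workaround is to coalesce to a single partition and rename the part file afterwards with the Hadoop FileSystem API. A rough sketch, with purely illustrative paths:

import org.apache.hadoop.fs.{FileSystem, Path}

rdd.coalesce(1).saveAsTextFile("/tmp/output")   // produces /tmp/output/part-00000

val fs = FileSystem.get(sc.hadoopConfiguration)
fs.rename(new Path("/tmp/output/part-00000"), new Path("/tmp/my-result.txt"))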

Kafka Spark Stream throws Exception:No current assignment for partition

Submitted by 我们两清 on 2019-12-10 12:12:50
Question: Below is my Scala code to create the Spark Kafka stream: val kafkaParams = Map[String, Object]( "bootstrap.servers" -> "server110:2181,server110:9092", "zookeeper" -> "server110:2181", "key.deserializer" -> classOf[StringDeserializer], "value.deserializer" -> classOf[StringDeserializer], "group.id" -> "example", "auto.offset.reset" -> "latest", "enable.auto.commit" -> (false: java.lang.Boolean) ) val topics = Array("ABTest") val stream = KafkaUtils.createDirectStream[String, String]( ssc,
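This exception usually points at the consumer configuration rather than the stream code: in the snippet above, bootstrap.servers mixes the ZooKeeper port (2181) with the broker port (9092), but the direct stream talks to Kafka brokers only. A hedged sketch of cleaned-up parameters, assuming the broker really listens on server110:9092:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "server110:9092",   // brokers only; no ZooKeeper address here
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Array("ABTest"), kafkaParams))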

Combining files

Submitted by 大兔子大兔子 on 2019-12-10 12:06:14
Question: I am new to Scala. I have two RDDs and I need to separate out my training and testing data. In one file I have all the data and in another just the testing data. I need to remove the testing data from my complete data set. The complete data file is of the format (userID, MovID, Rating, Timestamp): res8: Array[String] = Array(1, 31, 2.5, 1260759144) The test data file is of the format (userID, MovID): res10: Array[String] = Array(1, 1172) How do I generate ratings_train that will not have the cases
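One way to drop the test rows from the complete set is to key both RDDs on (userID, MovID) and use subtractByKey; a minimal sketch, assuming full and test are the two RDD[String]s of comma-separated lines:

// Key the complete data on (userID, MovID), keeping the whole line as the value
val fullKeyed = full.map { line =>
  val f = line.split(",")
  ((f(0), f(1)), line)
}

// The test file only contains userID,MovID, so the value can be a placeholder
val testKeyed = test.map { line =>
  val f = line.split(",")
  ((f(0), f(1)), ())
}

// Remove every row whose (userID, MovID) also appears in the test set
val ratings_train = fullKeyed.subtractByKey(testKeyed).values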

Spark - Sort Double values in an RDD and ignore NaNs

Submitted by 烈酒焚心 on 2019-12-10 11:55:50
Question: I want to sort the Double values in an RDD and I want my sort function to ignore the Double.NaN values. The Double.NaN values should appear at either the bottom or the top of the sorted RDD. I was not able to achieve this using sortBy. scala> res13.sortBy(r => r, ascending = true) res21: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[10] at sortBy at <console>:26 scala> res21.collect.foreach(println) 0.656 0.99 0.998 1.0 NaN 5.6 7.0 scala> res13.sortBy(r => r, ascending = false) res23: org
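Two straightforward workarounds: filter the NaNs out before sorting, or sort on a surrogate key that maps NaN to an extreme value so the NaNs land at one end. A minimal sketch, reusing res13 as the RDD[Double] from the question:

// Option 1: ignore NaN values entirely
val withoutNaN = res13.filter(d => !d.isNaN).sortBy(identity)

// Option 2: keep the NaN values but force them to the end of the ascending sort
val nanLast = res13.sortBy(d => if (d.isNaN) Double.PositiveInfinity else d, ascending = true)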

How can I group pairRDD by keys and turn the values into RDD

Submitted by 核能气质少年 on 2019-12-10 11:48:15
Question: What I have is an RDD[(String, Int)] and I need to convert it into a Map[String, RDD[Int]]. For example, my input looks like this: RDD[("a", 1), ("a", 2), ("b", 1), ("c", 3)] And the output I'm trying to get is: Map["a" -> RDD[1, 2], "b" -> RDD[1], "c" -> RDD[3]] Thanks in advance! Source: https://stackoverflow.com/questions/48402479/how-can-i-group-pairrdd-by-keys-and-turn-the-values-into-rdd
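There is no built-in operator that returns an RDD nested inside a Map, so the usual approach is to collect the distinct keys to the driver and build one filtered RDD per key. A minimal sketch, assuming pairs is the RDD[(String, Int)]; note that each resulting RDD re-scans the original data, so this only makes sense for a small number of keys:

import org.apache.spark.rdd.RDD

val keys: Array[String] = pairs.keys.distinct().collect()

// One lazily-evaluated, filtered view of the original RDD per key
val byKey: Map[String, RDD[Int]] = keys.map { k =>
  k -> pairs.filter { case (key, _) => key == k }.values
}.toMap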

How does lineage get passed down in RDDs in Apache Spark

Submitted by 夙愿已清 on 2019-12-10 11:04:55
Question: Does each RDD point to the same lineage graph? Or when a parent RDD gives its lineage to a new RDD, is the lineage graph copied by the child as well, so that the parent and child have different graphs? In that case isn't it memory intensive? Answer 1: Each RDD maintains a pointer to one or more parents along with metadata about what type of relationship it has with each parent. For example, when we call val b = a.map() on an RDD, the RDD b just keeps a reference to its parent a (and never copies it),
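The lineage can be inspected directly: each child RDD stores references to its parents together with the dependency type, and toDebugString prints that chain. A small illustrative sketch:

val a = sc.parallelize(1 to 10)
val b = a.map(_ * 2)      // b records a narrow dependency on a; nothing is copied
val c = b.filter(_ > 5)   // c records a narrow dependency on b

// Prints this RDD's lineage, e.g. filter <- map <- parallelize
println(c.toDebugString)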