rdd

A summary of Spark operators

Submitted by 拈花ヽ惹草 on 2019-12-10 16:20:45
I. Transformation operators 1: Value type 1.1 map(func) 1. Purpose: returns a new RDD composed of the results of passing every input element through the function func 2. Task: create an RDD from the array 1 to 10 and multiply every element by 2 to form a new RDD (1) Create scala> var source = sc.parallelize(1 to 10) source: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at parallelize at <console>:24 (2) Print scala> source.collect() res7: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) (3) Multiply every element by 2 scala> val mapadd = source.map(_ * 2) mapadd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[9] at map at <console>:26 (4) Print the final result scala> mapadd.collect() res8: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20) 1.2 mapPartitions(func) 1. Purpose
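The excerpt above cuts off right at 1.2 mapPartitions(func). Purely as an illustrative sketch (not part of the original post), mapPartitions behaves like map but runs the function once per partition over an iterator, assuming the same 1-to-10 RDD from the map example:

// mapPartitions(func): func is called once per partition and receives an Iterator
// over that partition's elements, which avoids per-element call overhead.
val source = sc.parallelize(1 to 10)
val doubled = source.mapPartitions(iter => iter.map(_ * 2))
doubled.collect()   // Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)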

How to map filenames to RDD using sc.textFile(“s3n://bucket/*.csv”)?

Submitted by 家住魔仙堡 on 2019-12-10 14:51:02
Question: Please note, I must use sc.textFile, but I would accept any other answers. What I want to do is simply add the filename that is being processed to the RDD... something like: var rdd = sc.textFile("s3n://bucket/*.csv").map(line => filename + "," + line) Much appreciated! EDIT2: The solution to EDIT1 is to use Hadoop 2.4 or above. However, I have not tested it using the slaves... etc. Also, some of the mentioned solutions work only for small data sets. If you want to use big data, you
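One workaround often suggested for this kind of question (an alternative to sc.textFile, not the asker's preferred call) is sc.wholeTextFiles, which yields (path, contents) pairs; a minimal sketch, assuming the CSV files are small enough to be read whole:

// wholeTextFiles reads each file as a single (path, contents) record,
// so the filename is available to prepend to every line.
val withNames = sc.wholeTextFiles("s3n://bucket/*.csv").flatMap {
  case (filename, contents) =>
    contents.split("\n").map(line => filename + "," + line)
}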

Spark- Saving JavaRDD to Cassandra

Submitted by 风流意气都作罢 on 2019-12-10 14:47:13
Question: http://www.datastax.com/dev/blog/accessing-cassandra-from-spark-in-java The link above shows a way to save a JavaRDD to Cassandra in this way: import static com.datastax.spark.connector.CassandraJavaUtil.*; JavaRDD<Product> productsRDD = sc.parallelize(products); javaFunctions(productsRDD, Product.class).saveToCassandra("java_api", "products"); But com.datastax.spark.connector.CassandraJavaUtil.* seems deprecated. The updated API should be: import static com.datastax.spark.connector.japi
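The excerpt stops mid-import, so the updated Java call is not reproduced here. For comparison only, the Scala side of the spark-cassandra-connector writes an RDD directly; a minimal sketch, assuming the same java_api.products table and a matching case class:

import com.datastax.spark.connector._   // brings saveToCassandra into scope on RDDs

case class Product(id: Int, name: String)

val productsRDD = sc.parallelize(Seq(Product(1, "phone"), Product(2, "laptop")))
// Case-class fields are mapped to the table's columns by name
productsRDD.saveToCassandra("java_api", "products")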

How to turn a known structured RDD to Vector

Submitted by 落爺英雄遲暮 on 2019-12-10 13:46:07
Question: Assume I have an RDD containing (Int, Int) tuples. I wish to turn it into a Vector where the first Int in the tuple is the index and the second is the value. Any idea how I can do that? I updated my question and added my solution to clarify: my RDD is already reduced by key, and the number of keys is known. I want a vector in order to update a single accumulator instead of multiple accumulators. Therefore my final solution was: reducedStream.foreachRDD(rdd => rdd.collect({case (x: Int, y: Int) => { val v =
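Since the RDD is already reduced by key and the number of keys is known, one option is to collect the pairs to the driver and build a sparse vector there; a minimal sketch, where numKeys is a hypothetical value holding that known key count:

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// rdd: RDD[(Int, Int)], first Int = index, second Int = value
val entries = rdd.collect().map { case (idx, value) => (idx, value.toDouble) }

// numKeys (assumed known up front) gives the vector its length
val vec: Vector = Vectors.sparse(numKeys, entries)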

How to name file when saveAsTextFile in spark?

Submitted by 回眸只為那壹抹淺笑 on 2019-12-10 12:56:57
Question: When saving as a text file in Spark version 1.5.1 I use: rdd.saveAsTextFile('<directory>'). But if I want to find the file in that directory, how do I name it what I want? Currently, I think it is named part-00000, which must be some default. How do I give it a name? Answer 1: As I said in my comment above, the documentation with examples can be found here. And quoting the description of the method saveAsTextFile: Save this RDD as a text file, using string representations of elements. In the
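saveAsTextFile always writes a directory of part-NNNNN files (one per partition), so the name cannot be set directly; a common workaround is to coalesce to a single partition and rename the part file afterwards with the Hadoop FileSystem API. A rough sketch, with purely illustrative paths:

import org.apache.hadoop.fs.{FileSystem, Path}

rdd.coalesce(1).saveAsTextFile("/tmp/output")   // produces /tmp/output/part-00000

val fs = FileSystem.get(sc.hadoopConfiguration)
fs.rename(new Path("/tmp/output/part-00000"), new Path("/tmp/my-result.txt"))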

Kafka Spark Stream throws Exception:No current assignment for partition

Submitted by 我们两清 on 2019-12-10 12:12:50
Question: Below is my Scala code to create the Spark Kafka stream: val kafkaParams = Map[String, Object]( "bootstrap.servers" -> "server110:2181,server110:9092", "zookeeper" -> "server110:2181", "key.deserializer" -> classOf[StringDeserializer], "value.deserializer" -> classOf[StringDeserializer], "group.id" -> "example", "auto.offset.reset" -> "latest", "enable.auto.commit" -> (false: java.lang.Boolean) ) val topics = Array("ABTest") val stream = KafkaUtils.createDirectStream[String, String]( ssc,
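This exception usually points at the consumer configuration rather than the stream code: in the snippet above, bootstrap.servers mixes the ZooKeeper port (2181) with the broker port (9092), but the direct stream talks to Kafka brokers only. A hedged sketch of cleaned-up parameters, assuming the broker really listens on server110:9092:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "server110:9092",   // brokers only; no ZooKeeper address here
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Array("ABTest"), kafkaParams))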

Combining files

Submitted by 大兔子大兔子 on 2019-12-10 12:06:14
Question: I am new to Scala. I have two RDDs and I need to separate out my training and testing data. In one file I have all the data and in another just the testing data. I need to remove the testing data from my complete data set. The complete data file is of the format (userID, MovID, Rating, Timestamp): res8: Array[String] = Array(1, 31, 2.5, 1260759144) The test data file is of the format (userID, MovID): res10: Array[String] = Array(1, 1172) How do I generate ratings_train that will not have the cases
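One way to drop the test rows from the complete set is to key both RDDs on (userID, MovID) and use subtractByKey; a minimal sketch, assuming full and test are the two RDD[String]s of comma-separated lines:

// Key the complete data on (userID, MovID), keeping the whole line as the value
val fullKeyed = full.map { line =>
  val f = line.split(",")
  ((f(0), f(1)), line)
}

// The test file only contains userID,MovID, so the value can be a placeholder
val testKeyed = test.map { line =>
  val f = line.split(",")
  ((f(0), f(1)), ())
}

// Remove every row whose (userID, MovID) also appears in the test set
val ratings_train = fullKeyed.subtractByKey(testKeyed).values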

Spark - Sort Double values in an RDD and ignore NaNs

Submitted by 烈酒焚心 on 2019-12-10 11:55:50
Question: I want to sort the Double values in an RDD and I want my sort function to ignore the Double.NaN values. The Double.NaN values should appear at either the bottom or the top of the sorted RDD. I was not able to achieve this using sortBy. scala> res13.sortBy(r => r, ascending = true) res21: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[10] at sortBy at <console>:26 scala> res21.collect.foreach(println) 0.656 0.99 0.998 1.0 NaN 5.6 7.0 scala> res13.sortBy(r => r, ascending = false) res23: org
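Two straightforward workarounds: filter the NaNs out before sorting, or sort on a surrogate key that maps NaN to an extreme value so the NaNs land at one end. A minimal sketch, reusing res13 as the RDD[Double] from the question:

// Option 1: ignore NaN values entirely
val withoutNaN = res13.filter(d => !d.isNaN).sortBy(identity)

// Option 2: keep the NaN values but force them to the end of the ascending sort
val nanLast = res13.sortBy(d => if (d.isNaN) Double.PositiveInfinity else d, ascending = true)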

How can I group pairRDD by keys and turn the values into RDD

Submitted by 核能气质少年 on 2019-12-10 11:48:15
Question: What I have is an RDD[(String, Int)] and I need to convert it into a Map[String, RDD[Int]]. For example, my input looks like this: RDD[("a", 1), ("a", 2), ("b", 1), ("c", 3)] And the output I'm trying to get is: Map["a" -> RDD[1, 2], "b" -> RDD[1], "c" -> RDD[3]] Thanks in advance! Source: https://stackoverflow.com/questions/48402479/how-can-i-group-pairrdd-by-keys-and-turn-the-values-into-rdd
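There is no built-in operator that returns an RDD nested inside a Map, so the usual approach is to collect the distinct keys to the driver and build one filtered RDD per key. A minimal sketch, assuming pairs is the RDD[(String, Int)]; note that each resulting RDD re-scans the original data, so this only makes sense for a small number of keys:

import org.apache.spark.rdd.RDD

val keys: Array[String] = pairs.keys.distinct().collect()

// One lazily-evaluated, filtered view of the original RDD per key
val byKey: Map[String, RDD[Int]] = keys.map { k =>
  k -> pairs.filter { case (key, _) => key == k }.values
}.toMap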

How does lineage get passed down in RDDs in Apache Spark

Submitted by 夙愿已清 on 2019-12-10 11:04:55
Question: Does each RDD point to the same lineage graph? Or when a parent RDD gives its lineage to a new RDD, is the lineage graph copied by the child as well, so that the parent and child have different graphs? In that case isn't it memory intensive? Answer 1: Each RDD maintains a pointer to one or more parents along with metadata about what type of relationship it has with each parent. For example, when we call val b = a.map() on an RDD, the RDD b just keeps a reference to its parent a (and never copies it),
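The lineage can be inspected directly: each child RDD stores references to its parents together with the dependency type, and toDebugString prints that chain. A small illustrative sketch:

val a = sc.parallelize(1 to 10)
val b = a.map(_ * 2)      // b records a narrow dependency on a; nothing is copied
val c = b.filter(_ > 5)   // c records a narrow dependency on b

// Prints this RDD's lineage, e.g. filter <- map <- parallelize
println(c.toDebugString)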