rdd

reduceByKey: How does it work internally?

跟風遠走 posted on 2019-11-26 19:02:00
Question: I am new to Spark and Scala, and I am confused about how the reduceByKey function works in Spark. Suppose we have the following code:

    val lines = sc.textFile("data.txt")
    val pairs = lines.map(s => (s, 1))
    val counts = pairs.reduceByKey((a, b) => a + b)

The map function is clear: s is the key and it points to the line from data.txt, and 1 is the value. However, I don't understand how reduceByKey works internally. Does "a" point to the key? Alternatively, does "a" point to "s"? Then what does
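
A minimal sketch of my own (not taken from the question) that makes the contract visible: the function passed to reduceByKey only ever receives two values that already share the same key, so a and b are accumulated counts, never the key s itself.

    import org.apache.spark.{SparkConf, SparkContext}

    object ReduceByKeyDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("ReduceByKeyDemo").setMaster("local[*]"))

        // Each distinct line becomes a key; the value 1 marks one occurrence of that line.
        val pairs = sc.parallelize(Seq(("hello", 1), ("world", 1), ("hello", 1), ("hello", 1)))

        // For the key "hello", Spark repeatedly merges two values: (1, 1) => 2, then (2, 1) => 3.
        // The key itself is never passed to this merge function.
        val counts = pairs.reduceByKey((a, b) => a + b)

        counts.collect().foreach(println)  // (hello,3) and (world,1), in some order
        sc.stop()
      }
    }

Internally, Spark applies the same merge function twice: first map-side within each partition, and again after the shuffle to combine the per-partition partial results.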

What is the difference between Spark DataSet and RDD

三世轮回 posted on 2019-11-26 18:24:36
Question: I'm still struggling to understand the full power of the recently introduced Spark Datasets. Are there best practices for when to use RDDs and when to use Datasets? In their announcement, Databricks explains that staggering reductions in both runtime and memory can be achieved by using Datasets. Still, it is claimed that Datasets are designed "to work alongside the existing RDD API". Is this just a reference to backward compatibility, or are there scenarios where one would prefer to use RDDs
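
As a rough, hypothetical illustration of my own (not from the Databricks announcement), the same word count written against both APIs shows the trade-off being discussed: the RDD version hands Spark opaque lambdas and generically serialized objects, while the Dataset version goes through encoders and operators the Catalyst optimizer can reason about.

    import org.apache.spark.sql.SparkSession

    object DatasetVsRdd {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("DatasetVsRdd").master("local[*]").getOrCreate()
        import spark.implicits._

        // RDD API: opaque functions, no query optimization, Java/Kryo serialization.
        val rddCounts = spark.sparkContext
          .parallelize(Seq("a", "b", "a"))
          .map(w => (w, 1L))
          .reduceByKey(_ + _)

        // Dataset API: still typed, but expressed through operators Catalyst understands
        // and stored via compact Tungsten encoders.
        val dsCounts = Seq("a", "b", "a").toDS()
          .groupByKey(identity)
          .count()

        rddCounts.collect().foreach(println)
        dsCounts.collect().foreach(println)
        spark.stop()
      }
    }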

PySpark DataFrames - way to enumerate without converting to Pandas?

心不动则不痛 posted on 2019-11-26 17:55:43
I have a very big pyspark.sql.dataframe.DataFrame named df. I need some way of enumerating records, so that I can access a record by a certain index (or select a group of records within an index range). In pandas, I could simply do:

    indexes = [2, 3, 6, 7]
    df[indexes]

Here I want something similar, without converting the dataframe to pandas. The closest I can get is: enumerating all the objects in the original dataframe by

    indexes = np.arange(df.count())
    df_indexed = df.withColumn('index', indexes)

and then searching for the values I need using the where() function. QUESTIONS: Why doesn't it work, and how to make it
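
The question is about PySpark, but as a hedged sketch of the usual workaround (written in Scala here; the names and data are my own): withColumn expects a Column expression rather than a local NumPy array, which is why the attempt above fails, and a common alternative is zipWithIndex on the underlying RDD followed by rebuilding a DataFrame with the extra index column.

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    object IndexDataFrame {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("IndexDataFrame").master("local[*]").getOrCreate()
        import spark.implicits._

        val df = Seq("a", "b", "c", "d").toDF("value")

        // zipWithIndex assigns a stable 0-based index to every row without collecting to the driver.
        val indexedRows = df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
        val indexedDf = spark.createDataFrame(
          indexedRows,
          StructType(df.schema.fields :+ StructField("index", LongType, nullable = false)))

        // Rows can now be selected by an index range, e.g. indexes 1 to 2.
        indexedDf.where($"index" >= 1 && $"index" <= 2).show()
        spark.stop()
      }
    }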

Save a spark RDD to the local file system using Java

情到浓时终转凉″ posted on 2019-11-26 17:51:30
Question: I have an RDD that is generated using Spark. If I write this RDD out as a CSV file, I am provided with methods like saveAsTextFile(), which outputs a CSV file to HDFS. I want to write the file to my local file system so that my SSIS process can pick the files up from there and load them into the DB. I am currently unable to use Sqoop. Is this possible in Java, other than writing shell scripts to do it? If any clarification is needed, please let me know.

Answer 1: saveAsTextFile is able to
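
The answer excerpt is cut off above. As a hedged sketch of the two usual options (written in Scala here; the Java RDD API is analogous, and the paths are placeholders of mine): pass a file:// URI to saveAsTextFile, or collect a small result to the driver and write it with ordinary file I/O.

    import java.nio.file.{Files, Paths}
    import scala.collection.JavaConverters._
    import org.apache.spark.{SparkConf, SparkContext}

    object SaveRddLocally {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("SaveRddLocally").setMaster("local[*]"))
        val rdd = sc.parallelize(Seq("a,1", "b,2", "c,3"))

        // A file:// URI writes to the local file system instead of HDFS; each partition becomes
        // a part-xxxxx file inside the target directory. On a real cluster the parts land on the
        // executors' local disks, so this is only safe in local mode or after coalescing/collecting.
        rdd.saveAsTextFile("file:///tmp/spark-output")

        // Alternative: pull a small result to the driver and write a single local file.
        Files.write(Paths.get("/tmp/output.csv"), rdd.collect().toSeq.asJava)

        sc.stop()
      }
    }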

Apache Spark: What is the equivalent implementation of RDD.groupByKey() using RDD.aggregateByKey()?

南楼画角 posted on 2019-11-26 17:45:40
Question: The Apache Spark pyspark.RDD API docs mention that groupByKey() is inefficient; it is recommended to use reduceByKey(), aggregateByKey(), combineByKey(), or foldByKey() instead. These do some of the aggregation in the workers prior to the shuffle, thus reducing the shuffling of data across workers. Given the following data set and groupByKey() expression, what is an equivalent and efficient implementation (reduced cross-worker data shuffling) that does not utilize
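
The data set from the question is cut off above, so here is a hedged sketch of mine on made-up pairs showing the usual shape of the answer: aggregateByKey with a collection as the zero value reproduces groupByKey's output, while the same pattern only truly reduces shuffling when the combiner shrinks the data (sums, counts, top-N) instead of merely collecting it.

    import org.apache.spark.{SparkConf, SparkContext}

    object GroupByKeyViaAggregate {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("GroupByKeyViaAggregate").setMaster("local[*]"))
        val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

        // groupByKey ships every raw value across the network.
        val grouped = pairs.groupByKey()

        // aggregateByKey builds a per-partition list first (seqOp), then merges the partial
        // lists across partitions (combOp), so the aggregation starts map-side.
        val aggregated = pairs.aggregateByKey(List.empty[Int])(
          (acc, v) => v :: acc,          // seqOp: add one value to the partition-local list
          (acc1, acc2) => acc1 ::: acc2  // combOp: merge the lists built on different partitions
        )

        grouped.collect().foreach(println)
        aggregated.collect().foreach(println)
        sc.stop()
      }
    }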

StackOverflowError due to long RDD lineage

谁都会走 posted on 2019-11-26 17:38:39
I have thousands of small files in HDFS and need to process a slightly smaller subset of them (again in the thousands). fileList contains the list of file paths that need to be processed:

    // fileList == list of filepaths in HDFS
    var masterRDD: org.apache.spark.rdd.RDD[(String, String)] = sparkContext.emptyRDD

    for (i <- 0 to fileList.size() - 1) {
      val filePath = fileList.get(i)
      val fileRDD = sparkContext.textFile(filePath)
      val sampleRDD = fileRDD.filter(line => line.startsWith("#####")).map(line => (filePath, line))
      masterRDD = masterRDD.union(sampleRDD)
    }

    masterRDD.first()  // Once out of loop,
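
The excerpt breaks off before the error, but the usual remedies for a lineage that grows by one union per loop iteration are a single sc.union over all the per-file RDDs and periodic checkpointing to truncate the lineage. A hedged sketch of mine, with placeholder paths:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD

    object ShortLineageUnion {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("ShortLineageUnion").setMaster("local[*]"))
        sc.setCheckpointDir("/tmp/spark-checkpoints")

        val fileList: Seq[String] = Seq("hdfs:///data/f1.txt", "hdfs:///data/f2.txt")  // placeholders

        // One RDD per file, then a single union: sc.union builds one flat UnionRDD
        // instead of a deeply nested chain of binary unions.
        val perFile: Seq[RDD[(String, String)]] = fileList.map { filePath =>
          sc.textFile(filePath)
            .filter(_.startsWith("#####"))
            .map(line => (filePath, line))
        }
        val masterRDD = sc.union(perFile)

        // For very long chains, checkpointing materializes the RDD and drops its lineage.
        masterRDD.checkpoint()
        println(masterRDD.count())
        sc.stop()
      }
    }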

How to find Spark RDD/DataFrame size?

为君一笑 posted on 2019-11-26 17:35:36
Question: I know how to find a file size in Scala, but how do I find the size of an RDD/DataFrame in Spark?

Scala:

    object Main extends App {
      val file = new java.io.File("hdfs://localhost:9000/samplefile.txt").toString()
      println(file.length)
    }

Spark:

    val distFile = sc.textFile(file)
    println(distFile.length)

but when I process it this way I do not get the file size. How do I find the RDD size?

Answer 1: If you are simply looking to count the number of rows in the RDD, do:

    val distFile = sc.textFile(file)
    println(distFile.count)

If you
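
The answer above is cut off; as a hedged sketch of two common measurements (the data is my own example): count() gives the number of records, and org.apache.spark.util.SizeEstimator gives a rough in-memory byte size, here of the collected result, so it is only appropriate for data that fits on the driver.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.util.SizeEstimator

    object RddSizeDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("RddSizeDemo").master("local[*]").getOrCreate()
        val sc = spark.sparkContext

        val distFile = sc.parallelize(1 to 100000).map(_.toString)

        // Number of records (what distFile.length was presumably meant to return).
        println(s"rows = ${distFile.count()}")

        // Rough size in bytes of the materialized data, including JVM object overhead;
        // it is an estimate of memory footprint, not the file size on disk.
        println(s"approx bytes = ${SizeEstimator.estimate(distFile.collect())}")

        spark.stop()
      }
    }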

Spark ALS predictAll returns empty

旧街凉风 posted on 2019-11-26 17:07:51
Question: I have the following Python test code (the arguments to ALS.train are defined elsewhere):

    r1 = (2, 1)
    r2 = (3, 1)
    test = sc.parallelize([r1, r2])
    model = ALS.train(ratings, rank, numIter, lmbda)
    predictions = model.predictAll(test)
    print test.take(1)
    print predictions.count()
    print predictions

This works, because it produces a count of 1 for the predictions variable and outputs:

    [(2, 1)]
    1
    ParallelCollectionRDD[2691] at parallelize at PythonRDD.scala:423

However, when I try and use an RDD I
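
The failing case is cut off above, so this is only a hedged sketch of the API contract, written against the Scala API where predict is the counterpart of Python's predictAll: it expects integer (user, product) pairs, and because it works by joining the request against the learned factor matrices, any pair whose IDs never appeared in the training ratings (or are not integers after parsing) simply produces no prediction.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    object AlsPredictDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("AlsPredictDemo").setMaster("local[*]"))

        // Tiny synthetic ratings: (user, product, rating).
        val ratings = sc.parallelize(Seq(Rating(2, 1, 5.0), Rating(3, 1, 3.0), Rating(2, 2, 1.0)))
        val model = ALS.train(ratings, /* rank = */ 5, /* iterations = */ 10, /* lambda = */ 0.01)

        // predict (predictAll in the Python API) takes integer (user, product) pairs.
        val test = sc.parallelize(Seq((2, 1), (3, 1)))
        val predictions = model.predict(test)

        println(predictions.count())           // 2 here, since both pairs were seen in training
        predictions.collect().foreach(println)
        sc.stop()
      }
    }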

Why is the fold action necessary in Spark?

谁说胖子不能爱 posted on 2019-11-26 16:19:09
Question: I have a silly question involving fold and reduce in PySpark. I understand the difference between these two methods, but if both require the applied function to be a commutative monoid, I cannot figure out an example in which fold cannot be substituted by reduce. Besides, the PySpark implementation of fold uses acc = op(obj, acc); why is this operation order used instead of acc = op(acc, obj)? (This second order sounds closer to a leftFold to me.) Cheers, Tomas

Answer 1: Empty RDD It
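
The answer excerpt above breaks off at its first point, the empty-RDD case; a minimal Scala sketch of mine illustrating it: fold has a zero value and is therefore defined on an empty RDD, whereas reduce has no neutral element and throws.

    import org.apache.spark.{SparkConf, SparkContext}

    object FoldVsReduce {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("FoldVsReduce").setMaster("local[*]"))

        val empty = sc.parallelize(Seq.empty[Int])

        // fold returns the zero value when there is nothing to combine.
        println(empty.fold(0)(_ + _))  // 0

        // reduce cannot produce a result from an empty RDD.
        try {
          empty.reduce(_ + _)
        } catch {
          case e: UnsupportedOperationException => println(s"reduce failed: ${e.getMessage}")
        }

        sc.stop()
      }
    }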

Does a join of co-partitioned RDDs cause a shuffle in Apache Spark?

帅比萌擦擦* posted on 2019-11-26 16:14:47
Question: Will rdd1.join(rdd2) cause a shuffle to happen if rdd1 and rdd2 have the same partitioner?

Answer 1: No. If two RDDs have the same partitioner, the join will not cause a shuffle. You can see this in CoGroupedRDD.scala:

    override def getDependencies: Seq[Dependency[_]] = {
      rdds.map { rdd: RDD[_ <: Product2[K, _]] =>
        if (rdd.partitioner == Some(part)) {
          logDebug("Adding one-to-one dependency with " + rdd)
          new OneToOneDependency(rdd)
        } else {
          logDebug("Adding shuffle dependency with " + rdd)
          new
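
The quoted source is cut off above. A hedged sketch of my own that exercises the same code path: pre-partition both RDDs with one HashPartitioner, and the subsequent join adds only one-to-one (narrow) dependencies rather than a new shuffle stage.

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    object CoPartitionedJoin {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("CoPartitionedJoin").setMaster("local[*]"))

        val part = new HashPartitioner(8)

        // partitionBy shuffles each RDD once, up front, and records the partitioner.
        val rdd1 = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(part).cache()
        val rdd2 = sc.parallelize(Seq((1, "x"), (2, "y"))).partitionBy(part).cache()

        // Both sides share the same partitioner, so the join itself introduces no shuffle;
        // the only shuffles in the lineage below come from the partitionBy calls.
        val joined = rdd1.join(rdd2)
        println(joined.toDebugString)
        joined.collect().foreach(println)

        sc.stop()
      }
    }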