rdd

What is the difference between cache and persist?

荒凉一梦 submitted on 2019-11-27 10:07:23
In terms of RDD persistence, what are the differences between cache() and persist() in Spark? ahars: With cache(), you use only the default storage level, MEMORY_ONLY. With persist(), you can specify which storage level you want (see rdd-persistence). From the official docs: you can mark an RDD to be persisted using the persist() or cache() methods on it; each persisted RDD can be stored using a different storage level. The cache() method is a shorthand for using the default storage level, which is StorageLevel.MEMORY_ONLY (store deserialized objects in memory). Use persist() if you want to assign a storage level other than MEMORY_ONLY.
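
A minimal Scala sketch of the difference (the input path is hypothetical):

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("data.txt")   // hypothetical input

// cache() always uses the default storage level, MEMORY_ONLY
rdd.cache()

// persist() lets you choose the storage level explicitly; a level can only be
// assigned once, so unpersist first if the RDD was already cached.
rdd.unpersist()
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
```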

What is RDD in spark

喜夏-厌秋 submitted on 2019-11-27 09:40:51
Question: The definition says: RDD is an immutable distributed collection of objects. I don't quite understand what that means. Is it like data (partitioned objects) stored on the hard disk? If so, how come RDDs can hold user-defined classes (such as Java, Scala or Python classes)? From this link: https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch03.html it mentions: Users create RDDs in two ways: by loading an external dataset, or by distributing a collection of objects (e.g., a list or set) in their driver program.
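
A minimal Scala sketch of the two creation paths described in that quote (the file path is an assumption for illustration):

```scala
// 1. Load an external dataset
val fromFile = sc.textFile("hdfs:///data/input.txt")

// 2. Distribute a collection of objects from the driver program
val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Elements can be user-defined classes: the RDD itself is a lazy, partitioned
// description of the data plus its lineage, not objects sitting on disk.
case class Person(name: String, age: Int)
val people = sc.parallelize(Seq(Person("Ann", 30), Person("Bob", 25)))
```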

Save a spark RDD to the local file system using Java

五迷三道 submitted on 2019-11-27 09:32:01
I have an RDD that is generated using Spark. Now if I write this RDD to a csv file, I am provided with some methods like saveAsTextFile(), which outputs a csv file to HDFS. I want to write the file to my local file system so that my SSIS process can pick the files up from there and load them into the DB. I am currently unable to use sqoop. Is this possible in Java, other than by writing shell scripts to do it? If any clarity is needed, please let me know. saveAsTextFile is able to take in local file system paths (e.g. file:///tmp/magic/...). However, if you're running on a distributed cluster, each worker writes its own partitions of the data to its own local filesystem, so the output does not all end up on one machine.
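
The question asks about Java, but the idea is the same in either API (JavaRDD exposes the same saveAsTextFile call). A sketch with the Scala API, assuming rdd is an RDD[String] and the output paths are hypothetical:

```scala
import java.io.PrintWriter

// Option 1: write to the local filesystem with a file:// URI.
// On a distributed cluster each executor writes its own partitions locally.
rdd.saveAsTextFile("file:///tmp/output")

// Option 2: pull everything to the driver and write a single local file
// (only viable when the result fits in driver memory).
val writer = new PrintWriter("/tmp/output.csv")
try rdd.collect().foreach(writer.println)
finally writer.close()
```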

How to control preferred locations of RDD partitions?

房东的猫 submitted on 2019-11-27 09:09:54
Is there a way to set the preferred locations of RDD partitions manually? I want to make sure a certain partition is computed on a certain machine. I'm using an array and the parallelize method to create an RDD from it. Also, I'm not using HDFS; the files are on the local disk. That's why I want to control the execution node. Is there a way to set the preferredLocations of RDD partitions manually? Yes, there is, but it's RDD-specific, so different kinds of RDDs have different ways to do it. Spark uses RDD.preferredLocations to get a list of preferred locations on which to compute each partition/split.
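
For collections distributed from the driver, one concrete option is SparkContext.makeRDD, which accepts a location preference per element. A minimal sketch (the hostnames are hypothetical):

```scala
// Each element becomes its own partition, scheduled on the preferred host when possible.
val rdd = sc.makeRDD(Seq(
  (1, Seq("host1.example.com")),
  (2, Seq("host2.example.com"))
))

// preferredLocations shows the hint the scheduler will try to honour.
rdd.partitions.foreach(p => println(rdd.preferredLocations(p)))
```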

Reading in multiple files compressed in tar.gz archive into Spark [duplicate]

南楼画角 submitted on 2019-11-27 09:08:05
This question already has an answer here: Read whole text files from a compression in Spark (2 answers). I'm trying to create a Spark RDD from several JSON files compressed into a tar archive. For example, I have 3 files, file1.json, file2.json and file3.json, and these are contained in archive.tar.gz. I want to create a dataframe from the json files. The problem is that Spark is not reading the json files correctly. Creating an RDD using sqlContext.read.json("archive.tar.gz") or sc.textFile("archive.tar.gz") results in garbled/extra output. Is there some way to handle gzipped archives containing multiple files?
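
Spark's built-in codecs decompress .gz transparently but do not understand the tar layer inside, so one common workaround is to read the archive as binary and untar it manually inside a task. A sketch, assuming Apache Commons Compress is on the classpath, each entry is a reasonably small file, and each file holds one JSON document:

```scala
import java.io.ByteArrayOutputStream
import java.util.zip.GZIPInputStream
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream

// Read the archive as a single binary stream, untar it in a flatMap,
// and emit the text of each contained file as one RDD element.
val fileContents = sc.binaryFiles("archive.tar.gz").flatMap { case (_, stream) =>
  val tar = new TarArchiveInputStream(new GZIPInputStream(stream.open()))
  Iterator.continually(tar.getNextTarEntry)
    .takeWhile(_ != null)
    .filter(_.isFile)
    .map { _ =>
      val out = new ByteArrayOutputStream()
      val buf = new Array[Byte](4096)
      Iterator.continually(tar.read(buf))
        .takeWhile(_ != -1)
        .foreach(n => out.write(buf, 0, n))
      new String(out.toByteArray, "UTF-8")
    }
    .toList
}

// One JSON document per element, so the JSON reader can parse it directly.
val df = sqlContext.read.json(fileContents)
```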

Big Data Notes

老子叫甜甜 submitted on 2019-11-27 08:44:06
1. What is Hadoop? Why use Hadoop? How do you use Hadoop in day-to-day work? Hadoop is an open-source big-data framework. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. The project includes these modules: Hadoop Common: the common utilities that support the other Hadoop modules. Hadoop Distributed File System (HDFS™): a distributed file system that provides high-throughput access to application data. Hadoop YARN: a framework for job scheduling and cluster resource management.

Spark Series (Part 14): Spark Streaming Basic Operations

假如想象 submitted on 2019-11-27 08:34:53
1. An Introductory Example

To demonstrate how a stream is created, let's start with a basic example: receive data from a given port and run a word count over it. The project dependency and implementation are as follows:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.12</artifactId>
    <version>2.4.3</version>
</dependency>

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]) {
    /* set the batch interval to 5 seconds */
    val sparkConf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    /* create a socket text stream and run a word count over it */
    val lines = ssc.socketTextStream("hadoop001", 9999)
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Apache Spark: What is the equivalent implementation of RDD.groupByKey() using RDD.aggregateByKey()?

北城余情 submitted on 2019-11-27 08:27:41
The Apache Spark pyspark.RDD API docs mention that groupByKey() is inefficient. Instead, it is recommended to use reduceByKey(), aggregateByKey(), combineByKey(), or foldByKey(). These perform some of the aggregation on the workers prior to the shuffle, thus reducing the amount of data shuffled across workers. Given the following data set and groupByKey() expression, what is an equivalent and efficient implementation (with reduced cross-worker data shuffling) that does not use groupByKey() but delivers the same result? dataset = [("a", 7), ("b", 3), ("a", 8)] rdd = sc.parallelize(dataset)
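
A sketch of one equivalent, written here in Scala (the question uses pyspark, but the operators match): build the grouped result with aggregateByKey so values are combined map-side before the shuffle. For plain grouping the savings are modest, since every value still crosses the network; the real win comes when the combine step shrinks the data, as a sum does.

```scala
val dataset = Seq(("a", 7), ("b", 3), ("a", 8))
val rdd = sc.parallelize(dataset)

// groupByKey result: ("a", Iterable(7, 8)), ("b", Iterable(3))
val grouped = rdd.groupByKey()

// Same result via aggregateByKey: start from an empty List per key, fold each
// value into the per-partition accumulator, then concatenate partial lists.
val aggregated = rdd.aggregateByKey(List.empty[Int])(
  (acc, v)     => v :: acc,        // seqOp: merge a value into the accumulator
  (acc1, acc2) => acc1 ::: acc2    // combOp: merge accumulators across partitions
)

// When the goal is a true aggregation, reduceByKey is the natural fit:
val summed = rdd.reduceByKey(_ + _)   // ("a", 15), ("b", 3)
```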

How to use reduceByKey in Spark with Scala

做~自己de王妃 submitted on 2019-11-27 08:16:43
[Study notes]

/* reduceByKey(function)
   For an RDD whose elements are key-value pairs, reduceByKey applies the given function to reduce the values of all elements that share the same key (as described earlier). The values of each key are therefore reduced to a single value, which is then paired with the original key to form a new KV pair.
   reduceByKey(_ + _) is a shorthand for reduceByKey((x, y) => x + y). */
val rdd08 = sc.parallelize(List((1, 1), (1, 4), (1, 3), (3, 7), (3, 5)))
val rdd08_1 = rdd08.reduceByKey((x, y) => x + y)
println("reduceByKey usage " + rdd08_1.collect().mkString(","))   // prints (1,8),(3,12)
sc.stop()
}

def myunion(rdd05: RDD[Int], rdd06: RDD[Int]): Unit = {
  val res: RDD[Int] = rdd05.union(rdd06)

collect: gathers all elements of a resilient distributed dataset into an array so that we can inspect them, since a distributed dataset is otherwise rather abstract. Spark's collect method is an action operator: it pulls the data from the remote cluster back to the driver.
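
A minimal sketch completing the myunion fragment above (rdd05 and rdd06 are assumed to be RDD[Int] values built elsewhere, e.g. with sc.parallelize):

```scala
import org.apache.spark.rdd.RDD

def myunion(rdd05: RDD[Int], rdd06: RDD[Int]): Unit = {
  // union is a lazy transformation that concatenates the partitions of both RDDs
  val res: RDD[Int] = rdd05.union(rdd06)
  // collect is an action: it pulls every element back to the driver as an Array
  println("union usage " + res.collect().mkString(","))
}
```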

How to sort an RDD in Scala Spark?

强颜欢笑 submitted on 2019-11-27 08:13:24
Reading the Spark method sortByKey: sortByKey([ascending], [numTasks]): When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument. Is it possible to return just the top N results instead of all of them, e.g. just the top 10? I could convert the sorted collection to an Array and use the take method, but since this is an O(N) operation, is there a more efficient approach? Most likely you have already perused the source code: class
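
A sketch of the usual alternatives for a top-N that avoid sorting and collecting the whole dataset (the pair RDD below is just illustrative):

```scala
val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))

// sortByKey followed by take(10) still sorts every partition,
// but ships at most 10 records to the driver.
val firstTen = pairs.sortByKey().take(10)

// takeOrdered keeps a bounded priority queue per partition and merges them,
// avoiding a full shuffle-based sort; here pairs are ordered by key.
val smallestTen = pairs.takeOrdered(10)(Ordering.by[(String, Int), String](_._1))

// top is the mirror image: the 10 largest under the given ordering.
val largestTen = pairs.top(10)(Ordering.by[(String, Int), String](_._1))
```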