rdd

Spark: Cluster Computing with Working Sets

主宰稳场 Posted on 2019-11-26 16:09:35
Table of contents: Abstract; 1. Introduction; 2. Programming Model; 2.1 Resilient Distributed Datasets (RDDs); 2.2 Parallel Operations; 2.3 Shared Variables; 3. Examples; 3.1 Text Search; 3.2 Logistic Regression; 3.3 Alternating Least Squares; 4. Implementation; 4.1 Shared Variables; 4.2 Interpreter Integration; 5. Results; 5.1 Logistic Regression; 5.2 Alternating Least Squares; 5.3 Interactive Spark; 6. Related Work; 6.1 Distributed Shared Memory; 6.2 Cluster Computing Frameworks; 6.3 Language Integration; 6.4 Lineage; 7. Discussion and Future Work; 8. Acknowledgements; References.

Abstract: MapReduce and its variants have been highly successful in implementing large-scale data-intensive applications on commodity clusters. However, most of these systems are built around an acyclic data flow model that is not suitable for other popular applications. This paper focuses on one such class of applications: those that reuse a working set of data across multiple parallel operations. This includes many iterative machine learning algorithms, as well as interactive data analysis tools. We propose a new framework called Spark that supports these applications while retaining the scalability and fault tolerance of MapReduce. To achieve these goals, Spark introduces an abstraction called resilient distributed datasets (RDDs). An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Spark can outperform Hadoop by 10x in iterative machine learning jobs and can interactively query a 39 GB dataset with sub-second response time. 1. Introduction A new model of cluster computing has become widely popular

Spark performance for Scala vs Python

淺唱寂寞╮ Posted on 2019-11-26 15:36:01
I prefer Python over Scala. But since Spark is natively written in Scala, I was expecting my code to run faster in Scala than in the Python version, for obvious reasons. With that assumption, I decided to learn and write the Scala version of some very common preprocessing code for about 1 GB of data. The data is taken from the SpringLeaf competition on Kaggle. To give an overview of the data: it contains 1,936 dimensions and 145,232 rows, composed of various types, e.g. int, float, string, boolean. I am using 6 of 8 cores for Spark processing; that's why I used minPartitions=6 so that
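For context, a minimal sketch of what reading the CSV with an explicit partition count looks like in Scala; the file path and preprocessing steps are assumptions, not the asker's actual code:

import org.apache.spark.{SparkConf, SparkContext}

object SpringLeafLoad {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("springleaf-prep").setMaster("local[6]"))

    // Ask for at least 6 input partitions, one per core in use.
    val raw = sc.textFile("train.csv", minPartitions = 6)

    // Typical start of preprocessing: drop the header, then split rows into fields.
    val header = raw.first()
    val rows = raw.filter(_ != header).map(_.split(",", -1))

    println(s"partitions = ${rows.getNumPartitions}, rows = ${rows.count()}")
    sc.stop()
  }
}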

Is groupByKey ever preferred over reduceByKey

五迷三道 Posted on 2019-11-26 15:25:41
I always use reduceByKey when I need to group data in RDDs, because it performs a map-side reduce before shuffling data, which often means that less data gets shuffled around, and I thus get better performance. Even when the map-side reduce function collects all values and does not actually reduce the data volume, I still use reduceByKey, because I'm assuming that the performance of reduceByKey will never be worse than groupByKey. However, I'm wondering whether this assumption is correct, or whether there are indeed situations where groupByKey should be preferred? zero323: I believe there are other
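As an illustration of the trade-off being discussed, here is a hedged word-count-style sketch (the data and key names are invented):

import org.apache.spark.{SparkConf, SparkContext}

object GroupVsReduce {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("group-vs-reduce").setMaster("local[*]"))
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("a", 1)))

    // reduceByKey combines values on each map-side partition first,
    // so only partial sums cross the shuffle boundary.
    val reduced = pairs.reduceByKey(_ + _)

    // groupByKey ships every single value across the network and
    // materialises the whole group in memory before you touch it.
    val grouped = pairs.groupByKey().mapValues(_.sum)

    println(reduced.collect().toList) // List((a,3), (b,1))
    println(grouped.collect().toList) // same result, more shuffle traffic
    sc.stop()
  }
}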

Difference between SparkContext, JavaSparkContext, SQLContext, and SparkSession?

梦想与她 Posted on 2019-11-26 15:24:20
Question: What is the difference between SparkContext, JavaSparkContext, SQLContext, and SparkSession? Is there any method to convert or create a context using a SparkSession? Can I completely replace all the contexts with one single entry point, SparkSession? Are all the functions in SQLContext, SparkContext, and JavaSparkContext also in SparkSession? Some functions like parallelize have different behaviors in SparkContext and JavaSparkContext. How do they behave in SparkSession? How can I create the
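As a rough sketch of how these entry points relate since Spark 2.0 (names and values are illustrative, not from the question):

import org.apache.spark.sql.SparkSession

object EntryPoints {
  def main(args: Array[String]): Unit = {
    // SparkSession is the single entry point since Spark 2.0.
    val spark = SparkSession.builder()
      .appName("entry-points")
      .master("local[*]")
      .getOrCreate()

    // The older contexts are reachable from the session rather than built separately.
    val sc = spark.sparkContext          // SparkContext
    val sqlContext = spark.sqlContext    // SQLContext, kept for compatibility

    // RDD-style APIs such as parallelize live on the SparkContext, not on SparkSession itself.
    val rdd = sc.parallelize(1 to 5)
    println(rdd.sum())

    spark.stop()
  }
}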

Spark specify multiple column conditions for dataframe join

六月ゝ 毕业季﹏ Posted on 2019-11-26 15:18:26
Question: How do I specify multiple column conditions when joining two DataFrames? For example, I want to run the following: val Lead_all = Leads.join(Utm_Master, Leaddetails.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign") == Utm_Master.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign"), "left") I want to join only when these columns match. But the above syntax is not valid, as cols only takes one string. So how do I get what I want? Answer 1: There is a Spark column/expression API join for
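One way this is commonly written is sketched below; the tiny stand-in DataFrames are assumptions, only the column names come from the question:

import org.apache.spark.sql.SparkSession

object MultiColumnJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("multi-col-join").master("local[*]").getOrCreate()
    import spark.implicits._

    // Tiny stand-ins for the Leads and Utm_Master DataFrames from the question.
    val Leads = Seq(("web", "google", "cpc", "spring", 1))
      .toDF("LeadSource", "Utm_Source", "Utm_Medium", "Utm_Campaign", "LeadId")
    val Utm_Master = Seq(("web", "google", "cpc", "spring", "summary"))
      .toDF("LeadSource", "Utm_Source", "Utm_Medium", "Utm_Campaign", "Detail")

    // Variant 1: join on a Seq of shared column names (keeps one copy of each key column).
    val joined1 = Leads.join(Utm_Master,
      Seq("LeadSource", "Utm_Source", "Utm_Medium", "Utm_Campaign"), "left")

    // Variant 2: build an explicit boolean join expression column by column.
    val joined2 = Leads.join(Utm_Master,
      Leads("LeadSource") === Utm_Master("LeadSource") &&
        Leads("Utm_Source") === Utm_Master("Utm_Source") &&
        Leads("Utm_Medium") === Utm_Master("Utm_Medium") &&
        Leads("Utm_Campaign") === Utm_Master("Utm_Campaign"),
      "left")

    joined1.show()
    joined2.show()
    spark.stop()
  }
}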

Difference between DataFrame, Dataset, and RDD in Spark

风格不统一 Posted on 2019-11-26 14:47:34
I'm just wondering what the difference is between an RDD and a DataFrame (in Spark 2.0.0, DataFrame is a mere type alias for Dataset[Row]) in Apache Spark. Can you convert one to the other? Justin Pihony: A DataFrame is defined well by a Google search for "DataFrame definition": A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case. So, a DataFrame has additional metadata due to its tabular format, which allows Spark to run certain optimizations on the finalized query. An RDD, on the other hand,
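To make the "convert one to the other" part concrete, a small sketch; the Person case class and sample data are invented for illustration:

import org.apache.spark.sql.{Row, SparkSession}

case class Person(name: String, age: Int)

object RddDataFrameConversion {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-df").master("local[*]").getOrCreate()
    import spark.implicits._

    // RDD -> DataFrame: attach a schema via a case class and toDF.
    val rdd = spark.sparkContext.parallelize(Seq(Person("Ann", 34), Person("Bob", 29)))
    val df = rdd.toDF()

    // DataFrame -> RDD: df.rdd gives back an RDD[Row] (the static typing is lost).
    df.rdd.collect().foreach { case Row(name: String, age: Int) => println(s"$name is $age") }

    // DataFrame -> Dataset[Person]: regain a typed view with as[...].
    val ds = df.as[Person]
    ds.show()
    spark.stop()
  }
}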

Reading in multiple files compressed in tar.gz archive into Spark [duplicate]

[亡魂溺海] Posted on 2019-11-26 14:30:51
Question: This question already has answers here: Read whole text files from a compression in Spark (2 answers). Closed 3 years ago. I'm trying to create a Spark RDD from several JSON files compressed into a tar. For example, I have three files, file1.json, file2.json, and file3.json, and they are contained in archive.tar.gz. I want to create a DataFrame from the JSON files. The problem is that Spark is not reading the JSON files in correctly. Creating an RDD using sqlContext.read.json("archive.tar.gz") or sc
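One common workaround, not from the question itself, is to read the archive as binary and unpack it manually; this sketch assumes the Apache Commons Compress library is on the classpath and that each archive fits in an executor's memory:

import java.io.ByteArrayInputStream
import scala.io.Source

import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream
import org.apache.spark.sql.SparkSession

object TarGzJson {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("targz-json").master("local[*]").getOrCreate()
    import spark.implicits._
    val sc = spark.sparkContext

    // Each tar.gz becomes one (path, bytes) record; gzip is not splittable anyway.
    val jsonLines = sc.binaryFiles("archive.tar.gz").flatMap { case (_, stream) =>
      val tar = new TarArchiveInputStream(
        new GzipCompressorInputStream(new ByteArrayInputStream(stream.toArray())))
      Iterator.continually(tar.getNextTarEntry)
        .takeWhile(_ != null)
        .filter(e => !e.isDirectory && e.getName.endsWith(".json"))
        .flatMap { _ =>
          // Read the current entry's bytes as text lines before moving to the next entry.
          Source.fromInputStream(tar, "UTF-8").getLines().toList
        }.toList
    }

    // Hand the extracted JSON strings to the DataFrame reader.
    val df = spark.read.json(jsonLines.toDS())
    df.show()
    spark.stop()
  }
}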

How to sort an RDD in Scala Spark?

江枫思渺然 Posted on 2019-11-26 14:04:15
Question: Reading the Spark method sortByKey: sortByKey([ascending], [numTasks]) When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument. Is it possible to return just "N" results? So instead of returning all results, just return the top 10. I could convert the sorted collection to an Array and use the take method, but since this is an O(N) operation, is there
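A hedged sketch of the usual alternatives (the sample data is invented):

import org.apache.spark.{SparkConf, SparkContext}

object TopNByKey {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("top-n").setMaster("local[*]"))
    val pairs = sc.parallelize(Seq(("apple", 12), ("pear", 7), ("plum", 30), ("fig", 2)))

    // Option 1: full sort across the cluster, then take the first 10 on the driver.
    val sortedTop = pairs.sortByKey().take(10)

    // Option 2: takeOrdered keeps only the smallest N elements per partition
    // and merges them on the driver, avoiding a full sort.
    val top = pairs.takeOrdered(10)(Ordering.by[(String, Int), String](_._1))

    println(sortedTop.toList)
    println(top.toList)
    sc.stop()
  }
}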

Why does a Spark RDD partition have a 2GB limit for HDFS?

戏子无情 Posted on 2019-11-26 13:52:23
Question: I get an error when using MLlib RandomForest to train on my data. My dataset is huge and the default partitioning is relatively small, so an exception is thrown indicating "Size exceeds Integer.MAX_VALUE". The original stack trace is as follows: 15/04/16 14:13:03 WARN scheduler.TaskSetManager: Lost task 19.0 in stage 6.0 (TID 120, 10.215.149.47): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828) at org.apache.spark
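The usual mitigation is to raise the partition count so that no single block approaches 2 GB; a sketch follows, with the path and partition counts as placeholders:

import org.apache.spark.{SparkConf, SparkContext}

object AvoidTwoGbBlocks {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("avoid-2gb-blocks"))

    // Read with an explicit minimum partition count instead of the HDFS default...
    val data = sc.textFile("hdfs:///data/train.txt", minPartitions = 1000)

    // ...or repartition an existing RDD so that no single partition (and hence
    // no single cached or shuffle block) approaches Integer.MAX_VALUE bytes.
    val resized = data.repartition(2000)

    println(s"partitions = ${resized.getNumPartitions}")
    sc.stop()
  }
}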

spark ml

不想你离开。 Posted on 2019-11-26 13:51:05
Spark ML is built on the Spark 2.0 environment and uses the DataFrame as its unit of data processing. Spark has gone through three generations of data abstraction, as follows. The DataFrame is a columnar, structured dataset, while the RDD is unstructured; the second generation outperforms the first on structured-data computation. The third-generation Dataset holds serialized (encoded) data already converted to binary, meaning Spark itself implements the encoding and decoding. Its performance therefore improves further because no third-party structure is needed to process the data, and the RDD will gradually fade from the stage. A DataFrame processes data by column; it is not object-oriented in style and performs no type-safety checks until runtime. A Dataset must declare every column explicitly; it is strongly typed, type-checked at compile time, and defined with a case class. The post then covers: creating an RDD; a creation example; converting an RDD to a DataFrame; recovering from a checkpoint after an error instead of rerunning the whole program; registering a temporary SQL table so it can be queried with SQL; a deduplication example; expr operations; a split example; withColumn, which adds a column, for instance a constant column; an aggregation example; JSON support; date and time operations; numeric operations; string operations. Source: https://www.cnblogs.com/chenglansky/p/11934851.html
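Since the original post's code screenshots did not survive extraction, here is a hedged sketch of the operations those headings refer to; the Sale case class, column names, and data are invented:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, expr, lit, split}

case class Sale(shop: String, item: String, amount: Double)

object SparkSqlPrimerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-sql-primer").master("local[*]").getOrCreate()
    import spark.implicits._

    // RDD creation, then RDD -> DataFrame / Dataset via a case class.
    val rdd = spark.sparkContext.parallelize(Seq(
      Sale("north", "tea,green", 3.5), Sale("south", "tea,black", 4.0), Sale("north", "tea,green", 3.5)))
    val df = rdd.toDF()   // untyped rows, checked only at runtime
    val ds = rdd.toDS()   // strongly typed Dataset[Sale], checked at compile time

    // Register a temporary view so the data can be queried with SQL.
    df.createOrReplaceTempView("sales")
    spark.sql("SELECT shop, SUM(amount) AS total FROM sales GROUP BY shop").show()

    // Deduplication, withColumn (including a constant column), expr, and split.
    val enriched = df.distinct()
      .withColumn("source", lit("demo"))                   // add a constant column
      .withColumn("amount_cny", expr("amount * 7.2"))      // expression-based column
      .withColumn("flavor", split($"item", ",").getItem(1))
    enriched.show()

    // Aggregation example.
    enriched.groupBy("shop").agg(avg("amount").as("avg_amount")).show()

    spark.stop()
  }
}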