rdd

How to add a new column to a Spark RDD?

走远了吗. Submitted on 2019-12-04 03:24:44
I have an RDD with MANY columns (e.g., hundreds); how do I add one more column at the end of this RDD? For example, if my RDD is like below:

123, 523, 534, ..., 893
536, 98, 1623, ..., 98472
537, 89, 83640, ..., 9265
7297, 98364, 9, ..., 735
......
29, 94, 956, ..., 758

how can I add a column to it whose value is the sum of the second and the third columns? Thank you very much. You do not have to use Tuple* objects at all to add a new column to an RDD. It can be done by mapping each row, taking its original contents plus the elements you want to append, for example: val rdd = ...
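
A minimal sketch of that map-based approach, assuming each row has already been parsed into an Array[Int] (the sample values and names below are illustrative, not from the original post):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("add-column").setMaster("local[*]"))

// In practice the rows would come from a file or an upstream RDD.
val rdd = sc.parallelize(Seq(
  Array(123, 523, 534, 893),
  Array(536, 98, 1623, 98472)
))

// Append one column whose value is the sum of the second and third columns.
val withSumColumn = rdd.map(row => row :+ (row(1) + row(2)))

withSumColumn.collect().foreach(row => println(row.mkString(", ")))
sc.stop()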

When to use Kryo serialization in Spark?

会有一股神秘感。 Submitted on 2019-12-04 02:51:57
I am already compressing RDDs using conf.set("spark.rdd.compress","true") and persist(MEMORY_AND_DISK_SER) . Will using Kryo serialization make the program even more efficient, or is it not useful in this case? I know that Kryo is for sending the data between the nodes in a more efficient way. But if the communicated data is already compressed, is it even needed? Tim Both of the RDD states you described (compressed and persisted) use serialization. When you persist an RDD, you are serializing it and saving it to disk (in your case, compressing the serialized output as well). You are right that
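
For reference, a minimal sketch of what enabling Kryo looks like next to the settings from the question; the record class, file path, and app name are illustrative, not from the original post:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

case class MyRecord(line: String)  // hypothetical record type, used only for illustration

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.rdd.compress", "true")                                      // as in the question
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")  // switch the serializer to Kryo
  .registerKryoClasses(Array(classOf[MyRecord]))                          // optional: avoids writing full class names into the stream

val sc = new SparkContext(conf)
val records = sc.textFile("data.txt").map(line => MyRecord(line))         // "data.txt" is a placeholder path
records.persist(StorageLevel.MEMORY_AND_DISK_SER)                         // blocks are Kryo-serialized, then compressed
println(records.count())

Kryo applies both to the persisted serialized blocks and to data shuffled between nodes, so it can still help even when spark.rdd.compress is already on.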

Spark cached RDD doesn't show up on Spark History WebUI - Storage

孤人 Submitted on 2019-12-04 01:56:25
Question: I am using Spark 1.4.1 in CDH 5.4.4. I call the rdd.cache() function, but nothing shows up in the Storage tab of the Spark History WebUI. Does anyone have the same issue? How can I fix it?

Answer 1: Your RDD will only be cached once it has been evaluated; the most common way to force evaluation (and therefore populate the cache) is to call count, e.g.:

rdd.cache() // Nothing in the Storage page yet & nothing cached
rdd.count() // RDD evaluated, cached & visible in the Storage page

Source: https://stackoverflow.com/questions/31715698/spark

Spark: the definition of RDD and its five key properties

我的梦境 Submitted on 2019-12-04 01:46:55
RDD is an abstraction over distributed memory and a highly restricted shared-memory model: an RDD is a read-only collection of partitioned records that can be computed in parallel across all nodes of a cluster, an application abstraction based on the working set.

How RDDs are stored under the hood: an RDD's data is distributed across multiple machines and is in fact stored as Blocks. Each Executor starts a BlockManagerSlave that manages a subset of those Blocks, while the Block metadata is kept by the BlockManagerMaster on the Driver node. After a BlockManagerSlave creates a Block, it registers the Block with the BlockManagerMaster, which maintains the mapping between RDDs and Blocks; when an RDD no longer needs to be stored, the master instructs the slaves to delete the corresponding Blocks.

The BlockManager manages an RDD's physical partitions: each Block is a data block on a node and can be kept in memory or on disk. A Partition in an RDD, by contrast, is a logical data block that maps to a physical Block. Essentially, an RDD in code is a metadata structure describing the data: it records the data partitions and their logical mapping, as well as the dependency and transformation relationships between RDDs.

A BlockManager runs on every node (the Driver and the Executors) to manage Blocks, and it provides an interface for retrieving data stored locally or remotely, such as memory, disk
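
As a small illustration of how user code drives this Block machinery (a sketch using the standard persist/unpersist APIs; the numbers and names are arbitrary): persist only records the desired StorageLevel, and the BlockManager on each Executor materializes and registers the Blocks when the RDD is first computed.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("block-demo").setMaster("local[*]"))

val rdd = sc.parallelize(1 to 1000000, 8)     // 8 logical Partitions
rdd.persist(StorageLevel.MEMORY_AND_DISK)     // each Partition will be stored as a Block, in memory or spilled to disk

rdd.count()                                   // first action: Executors compute the Blocks and register them with the master
sc.getRDDStorageInfo.foreach(println)         // per-RDD summary of the cached Blocks

rdd.unpersist()                               // the master instructs the BlockManagers to drop the corresponding Blocks
sc.stop()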

Spark study notes (1)

独自空忆成欢 Submitted on 2019-12-04 01:23:01
Overview: Our company has been using Spark for some time; here I organize my earlier study notes so I can record them and learn together with everyone. Part of this is excerpted from the web, with thanks to those who shared it. This article only covers Spark 2. The basic idea of Spark, in short: Spark is a fast, general-purpose compute engine designed specifically for large-scale data processing, and an Apache open-source project. It is a general-purpose distributed parallel computing framework similar to Hadoop, but Spark is an in-memory distributed execution framework, faster than Hadoop in execution speed, and it provides a comprehensive, unified framework for big-data processing over datasets and data sources of different natures.

SPARK architecture and ecosystem: Spark mainly consists of Spark Core and the application frameworks built on top of it: the data analysis engine Spark SQL, the graph computation framework GraphX, the machine learning library MLlib, and the stream processing engine Spark Streaming. The Core library mainly includes the context (SparkContext), the data abstractions (RDD, DataFrame and DataSet), the scheduler (Scheduler), shuffle, and the serializer (Serializer). On top of the Core library, depending on business needs, sit the four frameworks: SQL for interactive queries, Streaming for real-time stream processing, MLlib for machine learning, and GraphX for graph computation; besides these there are some experimental projects such as Tachyon, BlinkDB and Tungsten. HDFS is the main persistent storage system Spark works with
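
A minimal sketch (Spark 2.x, local mode) that puts the three data abstractions mentioned above side by side; the Person case class and the sample values are illustrative:

import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)  // illustrative schema

val spark = SparkSession.builder().appName("overview-demo").master("local[*]").getOrCreate()
import spark.implicits._

val rdd = spark.sparkContext.parallelize(Seq(Person("Ann", 30), Person("Bob", 25)))  // RDD[Person]
val df  = rdd.toDF()                                                                 // DataFrame: rows plus a schema
val ds  = df.as[Person]                                                              // Dataset[Person]: a typed view of the same data

ds.filter(_.age > 26).show()  // executed by the Spark SQL engine
spark.stop()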

In Apache Spark, is it possible to specify a partition's preferred location for a shuffled RDD or a cogrouped RDD?

房东的猫 Submitted on 2019-12-04 01:01:23
Question: As of Spark 1.6+, the only API that supports customizing partition locations is the one used when the RDD is created:

/** Distribute a local Scala collection to form an RDD, with one or more
  * location preferences (hostnames of Spark nodes) for each object.
  * Create a new partition for each collection item. */
def makeRDD[T: ClassTag](seq: Seq[(T, Seq[String])]): RDD[T]

Despite being very useful in some cases (e.g. when RDD.compute() has to access some local resources, not just HDFS), this is the only
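
A minimal sketch of that makeRDD overload in use: each element is paired with the hostnames preferred for its partition (the hostnames here are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("preferred-locations").setMaster("local[*]"))

// One partition per element; the Seq[String] lists the preferred hosts for that partition.
val rdd = sc.makeRDD(Seq(
  ("task-a", Seq("host1.example.com")),
  ("task-b", Seq("host2.example.com", "host3.example.com"))
))

// The scheduler consults these hints through preferredLocations when assigning tasks.
rdd.partitions.foreach(p => println(s"partition ${p.index}: ${rdd.preferredLocations(p)}"))
sc.stop()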

SPARK basic operations

…衆ロ難τιáo~ Submitted on 2019-12-03 23:42:53
./bin/hdfs dfs -put /usr/local/spark/mycode/wordcount/word.txt .

This uploads the local file "/usr/local/spark/mycode/wordcount/word.txt" to the distributed file system HDFS (into the hadoop user's home directory). In the pyspark shell, any one of the following commands then loads the data from HDFS:

>>> lines = sc.textFile("hdfs://localhost:9000/user/hadoop/word.txt")
>>> lines = sc.textFile("/user/hadoop/word.txt")
>>> lines = sc.textFile("word.txt")

You can also call SparkContext's parallelize method to create an RDD from a collection (array) that already exists in the Driver:

>>> nums = [1,2,3,4,5]
>>> rdd = sc.parallelize(nums)

2. Start the Spark cluster
Start the Hadoop cluster:
cd /usr/local/hadoop/
sbin/start-all.sh
Start the Spark Master node and all slave nodes:
cd /usr/local/spark/
sbin/start-master.sh
sbin

PySpark: Map a SchemaRDD into a SchemaRDD

北慕城南 Submitted on 2019-12-03 22:36:01
Question: I am loading a file of JSON objects as a PySpark SchemaRDD. I want to change the "shape" of the objects (basically, I'm flattening them) and then insert them into a Hive table. The problem I have is that the following returns a PipelinedRDD, not a SchemaRDD: log_json.map(flatten_function) (where log_json is a SchemaRDD). Is there either a way to preserve the type, cast back to the desired type, or efficiently insert from the new type?

Answer 1: More an idea than a real solution. Let's assume your data

Spark: Not enough space to cache rdd in container while still a lot of total storage memory

本小妞迷上赌 Submitted on 2019-12-03 21:58:27
I have a 30-node cluster; each node has 32 cores and 240 GB of memory (AWS cr1.8xlarge instances). I have the following configuration: --driver-memory 200g --driver-cores 30 --executor-memory 70g --executor-cores 8 --num-executors 90 I can see from the job tracker that I still have a lot of total storage memory left, but in one of the containers I got a message saying Storage limit = 28.3 GB. I am wondering where this 28.3 GB comes from; my memoryFraction for storage is 0.45. And how do I solve this "Not enough space to cache rdd" issue? Should I use more partitions or change the default
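
One plausible reading of the 28.3 GB figure, assuming the legacy (pre-unified) memory manager that a spark.storage.memoryFraction setting implies: per-executor storage is roughly the executor heap times memoryFraction times the default safetyFraction of 0.9, which with the numbers from the question works out to about 28.3 GB. A sketch of the arithmetic:

// Legacy storage-memory accounting (an assumption; values taken from the question above).
val executorHeapGb = 70.0  // --executor-memory 70g (the JVM reports slightly less than this)
val memoryFraction = 0.45  // spark.storage.memoryFraction, as stated in the question
val safetyFraction = 0.9   // spark.storage.safetyFraction default
val storageLimitGb = executorHeapGb * memoryFraction * safetyFraction
println(f"$storageLimitGb%.1f GB")  // ≈ 28.3 GB, matching the reported storage limit

If that accounting holds, the usual levers are raising spark.storage.memoryFraction, giving executors more heap, or repartitioning so that individual cached partitions are smaller.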

Spark Streaming basics

独自空忆成欢 Submitted on 2019-12-03 19:05:51
Preface: After version 2.2.1, a similar real-time computing framework, Structured Streaming, appeared alongside Spark Streaming. Quoting a blog post on the differences between Spark Streaming and Structured Streaming (recommended further reading): Structured Streaming provides a set of high-level declarative APIs that make writing streaming computations considerably simpler than Spark Streaming, while also providing end-to-end exactly-once semantics. Its core advantages: streaming computation replaces batch computation, the declarative API reduces the difficulty of writing code, and exactly-once semantics can be guaranteed.

Part 1: StreamingContext in detail. There are two ways to create one.

1. From a SparkConf:
val conf = new SparkConf().setAppName(appName).setMaster(master);
val ssc = new StreamingContext(conf, Seconds(1));

2. From a SparkContext:
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(1));

Once a StreamingContext has been defined
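
To round out the truncated example, a minimal word-count sketch built on the second creation style above (the socket host and port are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-demo").setMaster("local[2]")  // at least 2 threads: one for the receiver, one for processing
val sc   = new SparkContext(conf)
val ssc  = new StreamingContext(sc, Seconds(1))

// Placeholder source: text lines arriving on a local socket (e.g. fed by `nc -lk 9999`).
val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()             // start receiving and processing micro-batches
ssc.awaitTermination()  // block until the streaming job is stopped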