RDD

Spark -- Big Data's "Lightning Flash"

僤鯓⒐⒋嵵緔 submitted on 2019-12-05 01:34:36
Spark has formally applied to join the Apache Incubator, growing from a flash-of-inspiration laboratory "spark" into a rising star among big data platforms. This article focuses on Spark's design philosophy. True to its name, Spark displays a speed rarely seen in big data; its characteristics can be summed up as "light, fast, flexible, and clever".

Light: Spark 0.6's core is about 20,000 lines of code, compared with roughly 90,000 lines for Hadoop 1.0 and 220,000 for Hadoop 2.0. This is partly thanks to the conciseness and expressiveness of Scala, and partly because Spark makes good use of the infrastructure of Hadoop and Mesos (another Berkeley project in the incubator, focused on dynamic cluster resource management). Being lightweight does not mean cutting corners on fault tolerance: lead author Matei puts it as "don't treat failures as special cases", meaning fault tolerance is part of the infrastructure.

Fast: Spark can reach sub-second latency on small datasets, which is unthinkable for Hadoop MapReduce (hereafter MapReduce), where the heartbeat mechanism alone adds several seconds of delay just to launch a task. On large datasets, for typical iterative machine learning, ad-hoc queries, and graph computation, Spark runs ten to a hundred times faster than implementations based on MapReduce, Hive, and Pregel. In-memory computation, data locality and transfer optimization, and scheduling optimization deserve most of the credit, and the lightweight philosophy adopted from the start also plays its part.

Flexible: Spark offers flexibility at several levels. At the implementation level, it makes elegant use of Scala's dynamic trait mixin strategy

Spark Study Notes (2)

半腔热情 submitted on 2019-12-05 00:34:54
What is an RDD?
An RDD (Resilient Distributed Dataset) is the most basic data abstraction in Spark. It represents an immutable, partitioned, resilient collection whose elements can be computed in parallel.
An RDD lets users explicitly cache a working set in memory across multiple queries, so that later queries can reuse it, which greatly speeds them up.
RDDs support two kinds of operations: transformations and actions.
Spark uses lazy evaluation: an RDD is only actually computed the first time it is used in an action.
Properties:
a list of partitions
a function for computing each partition
dependencies on other RDDs
a Partitioner (for key-value RDDs)
a list of preferred locations for each partition (moving computation is cheaper than moving data)
Each node can run one or more Executors. Each Executor consists of several cores, and each core of each Executor runs only one Task at a time. The result of each Task is one partition of the next RDD.
Characteristics:
Partitioned: an RDD is logically partitioned; the data of each partition exists only abstractly until it is computed.
Read-only: an RDD is read-only; to change the data in an RDD you must create a new RDD from the existing one.
Dependencies: RDDs are transformed by operators, and each new RDD carries the information needed to derive it from its parents; this relationship between RDDs is called lineage, also known as dependency.
Caching: if the same RDD is used multiple times in an application, it can be cached, which speeds up later reuse.
checkpoint
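
A minimal Scala sketch of these ideas, assuming a local SparkContext named sc (names and values here are illustrative, not from the original post):

    // Build an RDD from a collection; nothing is computed yet (lazy evaluation).
    val nums = sc.parallelize(1 to 10, numSlices = 4)   // 4 partitions

    // Transformations only record lineage and return new RDDs.
    val squares = nums.map(x => x * x)
    val evens   = squares.filter(_ % 2 == 0)

    // Cache the working set so later actions reuse it instead of recomputing it.
    evens.cache()

    // Actions trigger the actual computation.
    println(evens.count())                   // first action: computes and caches
    println(evens.collect().mkString(","))   // second action: reuses the cache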

Spark Summary

爱⌒轻易说出口 submitted on 2019-12-05 00:22:59
RDDs and their characteristics
1. The RDD is Spark's core data model. It is an abstract class whose full name is Resilient Distributed Dataset, i.e. a resilient distributed dataset.
2. Conceptually, an RDD is a collection of elements that holds the data. It is partitioned into multiple partitions, and each partition is placed on a different node in the cluster, so the data in an RDD can be operated on in parallel (distributed dataset).
3. An RDD is usually created from files on Hadoop, i.e. HDFS files or Hive tables; sometimes it can also be created from a collection in the application.
4. The most important property of an RDD is fault tolerance: it can recover automatically from node failures. If a partition of an RDD is lost because the node holding it fails, the RDD automatically recomputes that partition from its data source, completely transparently to the user.
5. By default the data of an RDD is kept in memory, but when memory is insufficient, Spark automatically spills RDD data to disk (resilience).

Creating RDDs
The first step of Spark Core programming is to create an initial RDD. That RDD usually represents and contains the input data of the Spark application. Other RDDs are then derived from it through the transformation operators provided by Spark Core.
Spark Core provides three ways to create an RDD:
1. Create the RDD from a collection in the program (mainly used for testing), e.g.
List<Integer> numbers = Arrays
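
The excerpt is cut off after the first of the three ways; a hedged Scala sketch of the commonly listed trio (a program collection, a local file, and an HDFS file) follows, with paths and values purely illustrative and an existing SparkContext sc assumed:

    // 1. From a collection in the driver program (mainly for testing)
    val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))
    val sum = numbers.reduce(_ + _)

    // 2. From a local file, one element per line (path is illustrative)
    val localLines = sc.textFile("file:///tmp/example.txt")

    // 3. From an HDFS file (path is illustrative)
    val hdfsLines = sc.textFile("hdfs:///user/example/input.txt")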

Will there be any scenario where Spark RDDs fail to satisfy immutability?

给你一囗甜甜゛ submitted on 2019-12-05 00:11:34
Question: Spark RDDs are constructed in an immutable, fault-tolerant and resilient manner. Do RDDs satisfy immutability in all scenarios? Or is there any case, be it in Streaming or Core, where an RDD might fail to satisfy immutability?
Answer 1: It depends on what you mean when you talk about RDD. Strictly speaking, an RDD is just a description of lineage which exists only on the driver, and it doesn't provide any methods which can be used to mutate its lineage. When data is processed we can no longer talk about
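
As an aside (not part of the quoted answer), the API-level sense in which RDDs are immutable can be shown with a short Scala sketch, assuming an existing SparkContext sc:

    val base = sc.parallelize(Seq(1, 2, 3))

    // Transformations never modify 'base'; they return new RDDs
    // whose lineage points back to it.
    val doubled = base.map(_ * 2)

    println(base.collect().mkString(","))     // still 1,2,3
    println(doubled.collect().mkString(","))  // 2,4,6
    println(doubled.toDebugString)            // prints the lineage back to 'base'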

spark in python: creating an rdd by loading binary data with numpy.fromfile

我是研究僧i submitted on 2019-12-04 21:15:58
The Spark Python API currently has limited support for loading large binary data files, and so I tried to get numpy.fromfile to help me out. I first got a list of filenames I'd like to load, e.g.:

In [9]: filenames
Out[9]: ['A0000.dat', 'A0001.dat', 'A0002.dat', 'A0003.dat', 'A0004.dat']

I can load these files without problems with a crude iterative unionization:

    for i in range(len(filenames)):
        rdd = sc.parallelize([np.fromfile(filenames[i], dtype="int16", count=-1, sep='')])
        if i == 0:
            allRdd = rdd
        else:
            allRdd = allRdd.union(rdd)

It would be great to load the files all at once, and into

How many partitions does Spark create when a file is loaded from S3 bucket?

末鹿安然 submitted on 2019-12-04 20:47:29
Question: If the file is loaded from HDFS, Spark by default creates one partition per block. But how does Spark decide the partitions when a file is loaded from an S3 bucket?
Answer 1: See the code of org.apache.hadoop.mapred.FileInputFormat.getSplits(). The block size depends on the S3 file system implementation (see FileStatus.getBlockSize()). E.g. S3AFileStatus just sets it to 0 (and then FileInputFormat.computeSplitSize() comes into play). Also, you don't get splits if your InputFormat is not splittable :)
Answer 2:
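
For a quick empirical check (a sketch, not taken from the quoted answers; the bucket name and block-size value are illustrative, and fs.s3a.block.size assumes the s3a connector from hadoop-aws):

    // Hint a larger block size to the s3a connector; it feeds into split computation.
    sc.hadoopConfiguration.set("fs.s3a.block.size", (64 * 1024 * 1024).toString)  // 64 MB

    // Load a file from S3 and inspect how many partitions Spark actually created.
    val rdd = sc.textFile("s3a://my-bucket/path/to/file.csv")
    println(s"partitions = ${rdd.getNumPartitions}")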

Spark Kafka Streaming CommitAsync Error [duplicate]

偶尔善良 submitted on 2019-12-04 20:42:42
This question already has an answer here: Exception while accessing KafkaOffset from RDD (1 answer)

I am new to Scala and the RDD concept. I am reading messages from Kafka using the Kafka stream API in Spark and trying to commit the offsets after the business logic runs, but I am getting an error. Note: I am using repartition for parallel work. How do I read the offsets from the stream API and commit them to Kafka?

scalaVersion := "2.11.8"
val sparkVersion = "2.2.0"
val connectorVersion = "2.0.7"
val kafka_stream_version = "1.6.3"

Code:

val ssc = new StreamingContext(spark.sparkContext, Seconds(2))
ssc.checkpoint("C:/Gnana/cp")
val kafkaStream = { val
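
The excerpt is cut off before the error itself, but with the spark-streaming-kafka-0-10 integration the usual commit pattern looks roughly like the sketch below (assuming kafkaStream is the direct stream returned by KafkaUtils.createDirectStream; the partition count and print logic are illustrative). The key point is that the offset ranges must be read from the stream's own RDD before any repartition, because repartitioned RDDs no longer implement HasOffsetRanges:

    import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

    kafkaStream.foreachRDD { rdd =>
      // Read the offset ranges from the RDD produced by the direct stream,
      // BEFORE any repartition/transformation that discards them.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      // Business logic can still work on a repartitioned view of the data.
      rdd.repartition(8).foreachPartition { records =>
        records.foreach(record => println(record.value()))
      }

      // Commit the offsets back to Kafka asynchronously.
      kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }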

How to specify only particular fields using read.schema in JSON : SPARK Scala

谁说我不能喝 submitted on 2019-12-04 20:02:29
I am trying to programmatically enforce a schema (JSON) on a textFile that looks like JSON. I tried jsonFile, but the issue is that, to create a DataFrame from a list of JSON files, Spark has to do one pass through the data to build a schema for the DataFrame. So it needs to parse all the data, which takes too long (4 hours, since my data is zipped and TBs in size). So I want to try reading it as a textFile and enforce a schema to pick out only the fields I am interested in, to later query on the resulting data frame. But I am not sure how to map it to the input. Can someone give me some reference on how do I
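
The question is cut off before an answer, but the usual approach is to pass an explicit schema to the reader so that Spark skips the schema-inference pass entirely; a hedged Scala sketch (field names, types, and path are illustrative, assuming a SparkSession named spark):

    import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

    // Declare only the fields you care about; other fields in the JSON are ignored.
    val schema = StructType(Seq(
      StructField("id", LongType, nullable = true),
      StructField("name", StringType, nullable = true)
    ))

    // With an explicit schema, Spark does not scan the data to infer one.
    val df = spark.read.schema(schema).json("hdfs:///data/events/*.json.gz")
    df.select("id", "name").show(5)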

Dropping the first and last row of an RDD with Spark

*爱你&永不变心* submitted on 2019-12-04 20:01:20
I'm reading in a text file using Spark with sc.textFile(fileLocation) and need to be able to quickly drop the first and last row (they could be a header or trailer). I've found good ways of returning the first and last row, but no good one for removing them. Is this possible?

One way of doing this would be to zipWithIndex, and then filter out the records with indices 0 and count - 1:

    // We're going to perform multiple actions on this RDD,
    // so it's usually better to cache it so we don't read the file twice.
    rdd.cache()
    // Unfortunately, we have to count() to be able to identify the last
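
The answer is cut off, but a complete version of this zipWithIndex approach might look like the following sketch (continuing with the same RDD of lines named rdd):

    rdd.cache()

    // count() is needed to know the index of the last record.
    val total = rdd.count()

    // Pair every record with its index, drop index 0 (header) and total - 1 (trailer),
    // then strip the indices again.
    val trimmed = rdd
      .zipWithIndex()
      .filter { case (_, idx) => idx != 0 && idx != total - 1 }
      .map { case (line, _) => line }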

Spark project technical notes

不羁的心 submitted on 2019-12-04 18:01:41
1. Performance tuning:
  1> Allocate more resources: the golden rule of performance tuning is to allocate and add more resources. After writing a Spark job, the first thing to do is tune the resource configuration to its optimum; only once the resources you can allocate have reached the top of what is available to you should you move on to other kinds of performance tuning.
  2> Which resources to allocate: the number of executors, CPU cores per executor, memory per executor, and driver memory.
  3> Where to allocate them: when submitting the Spark job, in the spark-submit script, by adjusting its parameters, e.g.:
  /usr/local/spark/bin/spark-submit \
  --class cn.spark.sparktest.core.WordCountCluster \
  --num-executors 3 \
  --driver-memory 100m \
  --executor-memory 100m \
  --executor-cores 2 \
  /usr/local/SparkTest-0.0.1-SNAPSHOT-jar-with-dependencies.jar
  4> How much is appropriate:
  Spark standalone: base it on the company's cluster configuration. For example, if each machine can provide 4 GB of memory and 2 CPU cores and there are 20 machines, then for one job submitted at a time: 20 executors, with 4 GB of memory and 2 CPU cores per executor on average.