rdd

Spark Getting Started 01

一曲冷凌霜 Submitted on 2019-12-04 11:53:05
I. Spark Overview
Spark framework links:
1. Official site: http://spark.apache.org/
2. Source code hosting: https://github.com/apache/spark
3. Parent company site: https://databricks.com/; official blogs: https://databricks.com/blog/, https://databricks.com/blog/category/engineering/spark

1. Official definition (http://spark.apache.org/docs/2.2.0/): the Spark framework, like the MapReduce framework, is a framework for large-scale data analysis.
2. Types of big data analysis:
Offline processing: the data being analyzed is static and unchanging, e.g. the MapReduce and Hive frameworks.
Interactive analysis: ad-hoc queries, e.g. Impala.
Real-time analysis: processing streaming data in real time and presenting the results.
3. Introduction to the Spark framework: sorting 100 TB of data on disk shows that Spark is much faster and more efficient than Hadoop. Why is the Spark framework so fast?
Data structure. RDD (Resilient Distributed Dataset): Spark wraps the data to be processed in an RDD collection and processes it by calling the RDD's functions. RDD data can be kept in memory and spilled to disk when memory is insufficient.
Tasks run differently. In a MapReduce application, every MapTask and ReduceTask is a separate JVM process, and starting a JVM process is slow.
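A minimal PySpark sketch of the RDD idea described above, assuming a local SparkContext: data is wrapped in an RDD, processed by calling RDD functions, and cached in memory with spill to disk when memory runs short (the numbers are made up for illustration).

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext.getOrCreate()

# Wrap a local collection in an RDD and process it with RDD functions.
rdd = sc.parallelize(range(1, 101), numSlices=4)
even_squares = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

# Keep the data in memory, spilling partitions to disk if memory is insufficient.
even_squares.persist(StorageLevel.MEMORY_AND_DISK)

print(even_squares.count())
print(even_squares.take(5))
```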

Return an RDD from takeOrdered, instead of a list

我们两清 Submitted on 2019-12-04 11:06:50
I'm using pyspark to do some data cleaning. A very common operation is to take a small-ish subset of a file and export it for inspection:

(self.spark_context.textFile(old_filepath+filename)
    .takeOrdered(100)
    .saveAsTextFile(new_filepath+filename))

My problem is that takeOrdered is returning a list instead of an RDD, so saveAsTextFile doesn't work:

AttributeError: 'list' object has no attribute 'saveAsTextFile'

Of course, I could implement my own file writer. Or I could convert the list back into an RDD with parallelize. But I'm trying to be a spark purist here. Isn't there any way to return an
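For what it's worth, a sketch of the parallelize workaround the question already mentions: takeOrdered is an action that returns a plain Python list on the driver, so handing that list back to Spark is one way to keep the pipeline chainable. The paths below are placeholders standing in for the question's variables, and this assumes the 100 rows fit comfortably in driver memory.

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Placeholder path standing in for old_filepath + filename.
lines = sc.textFile("old_filepath/filename")

# takeOrdered is an action: it returns a Python list on the driver, not an RDD.
smallest_100 = lines.takeOrdered(100)

# Hand the list back to Spark so saveAsTextFile is available again.
sc.parallelize(smallest_100).saveAsTextFile("new_filepath/filename")
```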

Spark SQL Basics, Part 1

柔情痞子 Submitted on 2019-12-04 09:24:36
Goals: understand the underlying principles of Spark SQL; understand the DataFrame and DataSet data structures in Spark SQL and how to use them; be able to develop applications with Spark SQL.
Key points
1. Spark SQL overview
1.1 The history of Spark SQL
Shark was a framework built specifically on Spark for constructing large-scale data warehouse systems. Shark was compatible with Hive, but it also depended on the Spark version. Hive SQL parses SQL into MapReduce programs underneath, whereas Shark parsed SQL statements into Spark jobs. As performance optimization approached its ceiling, and as more complex analytic features were integrated into SQL, it became clear that Hive's MapReduce design was limiting Shark's development. Databricks eventually ended development of Shark and decided to build a separate framework that no longer depends on Hive, shifting its focus to Spark SQL.
1.2 What is Spark SQL
Spark SQL is Apache Spark's module for working with structured data.
2. The four key features of Spark SQL
1. Easy integration: SQL queries mix seamlessly with Spark programs, and code can be written in different languages: Java, Scala, Python, R.
2. Unified data source access: connect to any data source in the same way
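A minimal PySpark sketch of the "easy integration" point above, mixing the DataFrame API with SQL in one program; the file path and column names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparksql-basics").getOrCreate()

# Read structured data into a DataFrame (hypothetical JSON file with name/age fields).
df = spark.read.json("people.json")

# DataFrame (DSL) style ...
df.filter(df["age"] > 21).select("name", "age").show()

# ... and plain SQL over the same data, mixed freely.
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 21").show()
```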

mapPartitions returns empty array

六眼飞鱼酱① Submitted on 2019-12-04 08:59:25
I have the following RDD, which has 4 partitions:

val rdd=sc.parallelize(1 to 20,4)

Now I try to call mapPartitions on it:

scala> rdd.mapPartitions(x=> { println(x.size); x }).collect
5
5
5
5
res98: Array[Int] = Array()

Why does it return an empty array? The anonymous function simply returns the same iterator it received, so how can it return an empty array? The interesting part is that if I remove the println statement, it does return a non-empty array:

scala> rdd.mapPartitions(x=> { x }).collect
res101: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
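The same pitfall exists in PySpark, where mapPartitions also hands the function a one-shot iterator: calling size (or otherwise consuming the iterator) before returning it leaves nothing left to emit. Below is a sketch of the usual fix, materializing the partition before inspecting it; the Scala fix is analogous (e.g. convert to a List, print its size, and return list.iterator).

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize(range(1, 21), 4)

def inspect_partition(it):
    items = list(it)    # materialize first; counting the raw iterator would exhaust it
    print(len(items))   # printed from the worker; visible on the console in local mode
    return iter(items)  # return a fresh iterator over the same elements

print(rdd.mapPartitions(inspect_partition).collect())
```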

SparkStreaming

你离开我真会死。 Submitted on 2019-12-04 08:47:33
Spark Streaming (1) ~ Spark Streaming Programming Guide. The reason for writing this material is that the resources you can find online, as well as the various published books, mostly cover versions somewhere between 1.6 and 2.0, and they are scattered all over the place, so you end up searching everywhere to piece things together. From my own experience after developing with Spark for a while, the vast majority of problems you run into can be answered in the official documentation, so you can think of this as a partial translation of the official docs. My English is limited, so corrections for any errors or omissions are welcome. For now, the material falls into these sections:
Spark Streaming Programming Guide
Submitting Applications: deploying and releasing Spark applications
Tuning Spark: Spark performance tuning
Spark Configuration: the available Spark configuration options and parameters
There are already Chinese translations of Spark Streaming; see the Spark Streaming Programming Guide and the Spark Programming Guide. The content itself is fairly long, so it will be split across multiple posts. Rather than starting from the simple word count example, we will begin directly with the basic concepts.
Maven dependency:
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>

Does spark keep all elements of an RDD[K,V] for a particular key in a single partition after “groupByKey” even if the data for a key is very huge?

自闭症网瘾萝莉.ら Submitted on 2019-12-04 08:08:11
Consider I have a PairedRDD with, say, 10 partitions. But the keys are not evenly distributed: 9 of the partitions hold data belonging to a single key, say a, and the rest of the keys, say b and c, are in the last partition only. This is represented by the figure below. Now if I do a groupByKey on this rdd, from my understanding all the data for the same key will eventually go to the same partition; in other words, data for one key will never be spread across multiple partitions. Please correct me if I am wrong. If that is the case, then there is a chance that the partition for key a can be of a size that may not fit in a
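That understanding matches how groupByKey behaves: after the shuffle, all values for a given key land in a single partition, which is exactly why a heavily skewed key can blow up one partition. A small PySpark sketch (with made-up skewed data) that shows which keys end up in which partition after grouping:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# A deliberately skewed pair RDD: key 'a' dominates (values are hypothetical).
pairs = sc.parallelize([("a", i) for i in range(1000)] + [("b", 1), ("c", 2)], 10)

grouped = pairs.groupByKey()

# For each partition, list the keys it holds; each key shows up in exactly one partition.
def keys_per_partition(index, it):
    yield (index, [key for key, _ in it])

print(grouped.mapPartitionsWithIndex(keys_per_partition).collect())
```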

how to divide rdd data into two in spark?

。_饼干妹妹 Submitted on 2019-12-04 07:45:38
I have data in a Spark RDD and I want to divide it into two parts with a ratio such as 0.7. For example, if the RDD looks like this:

[1,2,3,4,5,6,7,8,9,10]

I want to divide it into rdd1: [1,2,3,4,5,6,7] and rdd2: [8,9,10] with the ratio 0.7. The rdd1 and rdd2 should be random every time. I tried this way:

seed = random.randint(0,10000)
rdd1 = data.sample(False,scale,seed)
rdd2 = data.subtract(rdd1)

and it works sometimes, but when my data contains dicts I run into problems. For example, with data as follows:

[{1:2},{3:1},{5:4,2:6}]

I get TypeError: unhashable type: 'dict'

zero323: Both
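The answer itself is cut off above, so as a hedged aside: one common way to split an RDD by weights without sample-plus-subtract is randomSplit, which never needs to hash the elements, so dict values are fine. Note the 0.7/0.3 split is probabilistic, not exact.

```python
import random
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
data = sc.parallelize([{1: 2}, {3: 1}, {5: 4, 2: 6}] * 10)

seed = random.randint(0, 10000)
# Each element is assigned to one side independently with roughly 0.7 / 0.3 probability.
rdd1, rdd2 = data.randomSplit([0.7, 0.3], seed)

print(rdd1.count(), rdd2.count())
```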

Is DAG created when we perform operations over dataframes?

有些话、适合烂在心里 Submitted on 2019-12-04 06:16:25
Question: I have seen a DAG getting generated whenever we perform any operation on an RDD, but what happens when we perform operations on our DataFrame? When executing multiple operations on a DataFrame, are those lazily evaluated just like RDDs? When does the Catalyst optimizer come into the picture? I am somewhat confused about these points. If anyone can throw some light on these topics, it would really be of great help. Thanks

Answer 1: Every operation on a Dataset, continuous processing mode notwithstanding, is
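The answer is truncated above; as a quick way to see the behavior yourself, DataFrame transformations are lazy, and explain() prints the plans Catalyst produces before anything executes (a sketch with made-up column names).

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

df = spark.range(1000).withColumn("even", F.col("id") % 2 == 0)

# Nothing runs yet: transformations only build up a logical plan.
filtered = df.filter("even").select("id")

# explain(True) prints the parsed, analyzed, optimized (Catalyst) and physical plans.
filtered.explain(True)

# Only an action such as count() actually triggers execution of the job's DAG.
print(filtered.count())
```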

Access dependencies available in Scala but not PySpark

为君一笑 Submitted on 2019-12-04 05:33:52
Question: I am trying to access the dependencies of an RDD. In Scala it is a pretty simple piece of code:

scala> val myRdd = sc.parallelize(0 to 9).groupBy(_ % 2)
myRdd: org.apache.spark.rdd.RDD[(Int, Iterable[Int])] = ShuffledRDD[2] at groupBy at <console>:24
scala> myRdd.dependencies
res0: Seq[org.apache.spark.Dependency[_]] = List(org.apache.spark.ShuffleDependency@6c427386)

But dependencies is not available in PySpark. Any pointers on how I can access them?

>>> myRdd.dependencies
Traceback (most recent call
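The traceback is cut off above, but the gist is that PySpark's RDD class does not expose dependencies at all. One workaround sometimes suggested — an assumption here, since it relies on PySpark internals (_jrdd and Py4J) that are private and may differ between versions — is to reach into the JVM-side RDD directly. What it reports may also not match the Scala example one-to-one, because PySpark inserts extra PythonRDD layers.

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
myRdd = sc.parallelize(range(10)).groupBy(lambda x: x % 2)

# _jrdd is PySpark's private handle to the JVM-side JavaRDD; .rdd() unwraps the Scala RDD.
jvm_rdd = myRdd._jrdd.rdd()

# dependencies() is the Scala method; toString() renders the Seq[Dependency[_]] via Py4J.
print(jvm_rdd.dependencies().toString())
```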

26. Creating an RDD Collection in Spark

拜拜、爱过 Submitted on 2019-12-04 04:15:00
Open Eclipse and create a Maven project. The pom.xml file:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.it19gong</groupId>
  <artifactId>sparkproject</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>
  <name>sparkproject</name>
  <url>http://maven.apache.org</url>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
  <dependencies>
    <dependency>
      <groupId>junit<