rdd

Spark processing columns in parallel

Posted by 跟風遠走 on 2019-12-09 17:44:20
Question: I've been playing with Spark, and I managed to get it to crunch my data. My data consists of a flat delimited text file with 50 columns and about 20 million rows. I have Scala scripts that will process each column. In terms of parallel processing, I know that RDD operations run on multiple nodes. So every time I process a column, its rows are processed in parallel, but the columns themselves are processed one after another. A simple example: if my data is a 5-column delimited text file and each
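The excerpt is cut off above; one common way to get the per-column work parallelised is to apply all of the column logic inside a single pass over the rows, so the parallelism comes from the row partitions rather than from looping over the columns one at a time. A rough spark-shell sketch (assuming sc is the shell's SparkContext; the path, the delimiter, and processColumn are made-up stand-ins for the asker's actual scripts):

// Hypothetical stand-in for the per-column Scala scripts mentioned above.
def processColumn(colIndex: Int, value: String): String = value.trim

// One pass over the rows: each row is split once and every column is
// transformed inside the same map, so parallelism comes from the row
// partitions rather than from looping over columns one at a time.
val processed = sc.textFile("hdfs:///path/to/flat_file.txt")   // assumed path
  .map(_.split("\\|", -1))                                     // assumed delimiter
  .map(cols => cols.zipWithIndex.map { case (value, i) => processColumn(i, value) })

processed.take(5).foreach(row => println(row.mkString("|")))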

Comprehensive Introduction to Apache Spark

Posted by 北城以北 on 2019-12-09 16:43:52
Introduction Industry estimates suggest that we are creating more than 2.5 quintillion bytes of data every year. Think of it for a moment: 1 quintillion is 10^18, a billion billion! Can you imagine how many drives / CDs / Blu-ray DVDs would be required to store all of it? It is difficult to imagine this scale of data generation even as a data science professional. While this pace of data generation is very exciting, it has created an entirely new set of challenges and has forced us to find new ways to handle such huge volumes of data effectively. Big Data is not a new phenomenon. It has been around for a while now. However, it

Spark Performance Tuning Guide for Big Data: The Basics

Posted by 烈酒焚心 on 2019-12-09 15:04:36
In the big data computing space, Spark has become one of the most popular and widely adopted computing platforms. Its capabilities cover offline batch processing, SQL-style processing, streaming/real-time computation, machine learning, graph computation, and other kinds of workloads in the big data domain, so its range of applications and its prospects are very broad. At Meituan-Dianping, many engineers have already tried Spark in all kinds of projects. For most of them (the author included), the initial reason for trying Spark was simple: to make big data jobs run faster and perform better. However, developing high-performance big data jobs with Spark is not that simple. If a Spark job is not tuned properly, it may run very slowly, which completely negates Spark's advantage as a fast big data computing engine. To use Spark well, you therefore have to tune its performance properly. Spark performance tuning is made up of many parts; it is not a matter of tweaking a few parameters for an instant speedup. You need to analyze the Spark job comprehensively, based on the specific business scenario and the characteristics of the data, and then tune and optimize it along several dimensions to obtain the best performance. Based on earlier experience developing Spark jobs and accumulated practice, the author has summarized a performance-tuning methodology for Spark jobs. The overall methodology consists of several parts: development tuning, resource tuning, data-skew tuning, and shuffle tuning. Development tuning and resource tuning are basic principles that every Spark job needs to follow; they are the foundation of a high-performance Spark job. Data-skew tuning

What is the result of RDD transformation in Spark?

Posted by 懵懂的女人 on 2019-12-09 13:36:42
Question: Can anyone explain what the result of an RDD transformation is? Is it a new set of data (a copy of the data), or is it only a new set of pointers to filtered blocks of the old data? Answer 1: RDD transformations allow you to create dependencies between RDDs. The dependencies are only the steps for producing results (a program). Each RDD in the lineage chain (the chain of dependencies) has a function for computing its data and a pointer (dependency) to its parent RDD. Spark will divide RDD dependencies into stages
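A small spark-shell sketch of that idea (assuming sc is the shell's SparkContext): the two transformations below only record dependencies, and no data is copied or computed until the action at the end.

// Each transformation returns a new RDD that holds only a pointer to its
// parent plus the function to apply -- no data is materialised yet.
val numbers = sc.parallelize(1 to 1000000)
val evens   = numbers.filter(_ % 2 == 0)
val squares = evens.map(n => n.toLong * n)

// The lineage (chain of dependencies) Spark has recorded so far:
println(squares.toDebugString)

// Only an action triggers evaluation of the whole chain:
println(squares.count())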

Spark: Transformations and Actions

Posted by 霸气de小男生 on 2019-12-09 09:47:52
This article covers the transformation and action interfaces as of version 0.7.3. RDDs provide two types of operations: transformations and actions. 1. A transformation produces a new RDD, in many possible ways, e.g. by generating a new RDD from a data source or deriving a new RDD from an existing RDD. 2. An action produces a value or a result (or caches the RDD directly in memory). All transformations are lazy: submitting a transformation by itself does not trigger any computation; computation is only triggered when an action is submitted. Below are the common RDD operations (note whether each works on a dataset or an RDD). Transformation operations: map(func): applies func to every element of the RDD on which map is called and returns a new RDD; the returned dataset is a distributed dataset. filter(func): applies func to every element of the RDD on which filter is called and returns an RDD made up of the elements for which func is true. flatMap(func): much like map, but flatMap can produce multiple output elements per input element. mapPartitions(func): much like map, but map works per element while mapPartitions works per partition. mapPartitionsWithSplit(func)
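A quick spark-shell illustration of a few of the operators listed above (assuming sc is the shell's SparkContext); because transformations are lazy, nothing runs until the collect at the end:

val lines  = sc.parallelize(Seq("hello spark", "hello rdd"))
val words  = lines.flatMap(_.split(" "))                           // flatMap: one line can yield many words
val upper  = words.map(_.toUpperCase)                              // map: applied to every element
val kept   = upper.filter(_ != "RDD")                              // filter: keep elements where func is true
val tagged = kept.mapPartitions(it => it.map(w => "[" + w + "]"))  // mapPartitions: one call per partition

// Action: triggers the whole lazy chain and brings the results to the driver.
tagged.collect().foreach(println)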

Why is Spark fast at word count? [duplicate]

Posted by 依然范特西╮ on 2019-12-09 01:57:35
Question: This question already has answers here: Why is Spark faster than Hadoop Map Reduce (2 answers). Closed 2 years ago. Test case: Spark counts words in 6 GB of data in 20-odd seconds. I understand the MapReduce, FP, and stream programming models, but I couldn't figure out why the word count is so amazingly fast. I think it is an I/O-intensive computation in this case, and it should be impossible to scan 6 GB of files in 20-odd seconds. I guess an index is built before the word counting, as Lucene does. The magic
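For context, the job being timed is presumably the standard Spark word count, roughly as in the spark-shell sketch below (the input path is an assumption). There is no index: the whole file is scanned, split, and aggregated in parallel across partitions, which is why the timing surprises the asker.

val text = sc.textFile("hdfs:///data/corpus_6g.txt")   // assumed location of the ~6 GB input

val counts = text
  .flatMap(_.split("\\s+"))          // split each line into words
  .map(word => (word, 1))            // emit (word, 1) pairs
  .reduceByKey(_ + _)                // sum the counts per word (this step shuffles)

counts.take(10).foreach(println)     // action: triggers the scan and the aggregation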

Spark: difference when reading in .gz and .bz2

Posted by 可紊 on 2019-12-09 00:39:25
Question: I normally read and write files in Spark using .gz, where the number of files should be the same as the number of RDD partitions; i.e. one giant .gz file is read into a single partition. However, if I read in one single .bz2 file, would I still get one single giant partition, or will Spark automatically split one .bz2 into multiple partitions? Also, how do I know how many partitions there will be while Hadoop reads it in from one .bz2 file? Thanks! Answer 1: However, if I read in one single .bz2,
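The answer is cut off above, but the usual behaviour can be checked directly: gzip is not a splittable codec, so one .gz file becomes one partition, while bzip2 is splittable, so a single large .bz2 file can be cut into several partitions. A spark-shell sketch, with assumed file names:

// gzip is not splittable: the whole file lands in a single partition.
val gz = sc.textFile("hdfs:///logs/big.log.gz")
println(gz.getNumPartitions)                     // typically 1

// bzip2 is splittable: the second argument is a *minimum* number of
// partitions, and Hadoop's input format can honour it by splitting the file.
val bz = sc.textFile("hdfs:///logs/big.log.bz2", 8)
println(bz.getNumPartitions)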

How to filter a dataset according to datetime values in Spark

Posted by ╄→尐↘猪︶ㄣ on 2019-12-08 16:33:39
Question: I am trying to filter my data according to its datetime field. A sample from my data: 303,0.00001747,4351040,75.9054,"2019-03-08 19:29:18" This is how I initialize Spark: SparkConf conf = new SparkConf().setAppName("app name").setMaster("spark://192.168.1.124:7077"); JavaSparkContext sc = JavaSparkContext.fromSparkContext(SparkContext.getOrCreate(conf)); First, I read the data above into my custom object like below: // Read data from file into custom object JavaRDD<CurrencyPair> rdd = sc
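The question's code is Java and is cut off above; as a rough illustration of the filtering step itself, here is a spark-shell (Scala) sketch over the same CSV layout, not the asker's actual solution (path and cutoff are assumptions). Because the timestamps are formatted yyyy-MM-dd HH:mm:ss, they sort chronologically as plain strings, so a string comparison is enough for a simple cutoff filter:

val cutoff = "2019-03-08 00:00:00"                           // assumed threshold
val rows   = sc.textFile("hdfs:///data/currency_pairs.csv")  // assumed path

// Each row looks like: 303,0.00001747,4351040,75.9054,"2019-03-08 19:29:18"
val recent = rows.filter { line =>
  val ts = line.split(",")(4).replace("\"", "")              // 5th field, quotes stripped
  ts >= cutoff                                               // lexicographic == chronological for this format
}

recent.take(5).foreach(println)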

Converting a Scala Iterable[tuple] to RDD

Posted by 核能气质少年 on 2019-12-08 15:48:28
Question: I have a list of tuples, (String, String, Int, Double), that I want to convert to a Spark RDD. In general, how do I convert a Scala Iterable[(a1, a2, a3, ..., an)] into a Spark RDD? Answer 1: There are a few ways to do this, but the most straightforward way is just to use the Spark context: import org.apache.spark._ import org.apache.spark.rdd._ import org.apache.spark.SparkContext._ sc.parallelize(YourIterable.toList) I think sc.parallelize needs the conversion to a List, but it will preserve your structure
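A concrete spark-shell version of that answer, with a tuple shape matching the question (assuming sc is the shell's SparkContext; the sample values are made up):

val data: Iterable[(String, String, Int, Double)] = List(
  ("a1", "b1", 1, 1.0),
  ("a2", "b2", 2, 2.0)
)

// parallelize takes a Seq, so materialise the Iterable first; the tuple
// structure is preserved, giving an RDD[(String, String, Int, Double)].
val rdd = sc.parallelize(data.toList)
println(rdd.count())
rdd.take(2).foreach(println)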

ValueError: could not convert string to float in Pyspark

Posted by ε祈祈猫儿з on 2019-12-08 14:20:35
Question: My Spark RDD looks something like this: totalDistance=flightsParsed.map(lambda x: x.distance) totalDistance.take(5) [1979.0, 640.0, 1947.0, 1590.0, 874.0] But when I run reduce on it, I get the error below: totalDistance=flightsParsed.map(lambda x: x.distance).reduce(lambda y,z: y+z) ValueError: could not convert string to float: Please help. Answer 1: Did you try: totalDistance=flightsParsed.map(lambda x: int(x.distance or 0)) or totalDistance=flightsParsed.map(lambda x: float(x.distance or 0))