rdd

Spark Tuning Guide

Submitted by 情到浓时终转凉″ on 2020-01-03 08:13:03
Spark questions. Why is Spark faster than MapReduce? 1) Spark can keep computation results in memory and supports memory-based iteration, which MR does not. 2) Spark builds a DAG (directed acyclic graph), enabling a pipelined computation model. 3) Resource scheduling model: Spark uses coarse-grained resource scheduling, while MR uses fine-grained scheduling. Resource reuse: tasks in Spark can reuse the resources of the same batch of Executors, whereas in MR every map task gets its own JVM and resources cannot be reused. What do Spark's main processes do? Driver process: responsible for distributing tasks and collecting results. Executor process: responsible for executing the actual tasks. Master process: the master process of Spark resource management, responsible for resource scheduling. Worker process: the slave process of Spark resource management; worker nodes mainly run Executors. Spark tuning: 1. Resource tuning. 1) When setting up a Spark cluster, give the cluster enough resources (cores, memory) via spark-env.sh under conf in the Spark installation: SPARK_WORKER_CORES, SPARK_WORKER_MEMORY, SPARK_WORKER_INSTANCES. 2) Allocate more resources to the Application when submitting it. Submit command options (used when submitting the Application): -
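To make the resource options above concrete, here is a minimal PySpark sketch (an illustration, not part of the original guide) that sets application-level resources; spark.executor.memory, spark.executor.cores and spark.cores.max are standard Spark configuration keys, while the values shown are placeholders to be tuned per cluster.

from pyspark import SparkConf, SparkContext

# Hypothetical resource settings for illustration only.
conf = (SparkConf()
        .setAppName("resource-tuning-demo")
        .set("spark.executor.memory", "4g")   # memory per executor
        .set("spark.executor.cores", "2")     # cores per executor
        .set("spark.cores.max", "8"))         # total cores for the app (standalone mode)

sc = SparkContext(conf=conf)
print(sc.getConf().get("spark.executor.memory"))
sc.stop()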

Spark: Writing RDD Results to File System is Slow

Submitted by *爱你&永不变心* on 2020-01-03 05:47:06
Question: I'm developing a Spark application with Scala. My application consists of only one operation that requires shuffling (namely cogroup). It runs flawlessly and in a reasonable time. The issue I'm facing is when I want to write the results back to the file system; for some reason, it takes longer than running the actual program. At first, I tried writing the results without re-partitioning or coalescing, and I realized that the number of generated files is huge, so I thought that was the issue

transform - (Spark Streaming operator)

Submitted by 匆匆过客 on 2020-01-02 23:55:49
transform is a transformation operator applied to a DStream; it can be used to perform arbitrary RDD-to-RDD transformations and thus to implement operations that the DStream API does not provide. package com.shsxt.spark.scala import org.apache.spark.SparkConf import org.apache.spark.streaming.{Seconds, StreamingContext} /** * Created by BF-Lone Silver Wind on 2020-01-02 */ object transform { def main(args: Array[String]): Unit = { val conf = new SparkConf().setMaster("local[2]").setAppName("Tranform") val ssc = new StreamingContext(conf, Seconds(5)) val fileDS = ssc.socketTextStream("192.168.241.211", 9999) val wordcountDS = fileDS.flatMap { line =
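Because the Scala snippet above is cut off, here is a minimal PySpark sketch of the same transform idea (an assumption about intent, not the original code); the host, port and blacklist RDD are hypothetical. transform exposes each batch as a plain RDD, so operations the DStream API lacks, such as joining against a static RDD, become possible.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "TransformDemo")
ssc = StreamingContext(sc, 5)                      # 5-second batches

lines = ssc.socketTextStream("localhost", 9999)    # hypothetical source
words = lines.flatMap(lambda l: l.split(" ")).map(lambda w: (w, 1))
blacklist = sc.parallelize([("spam", True)])       # hypothetical static RDD

# Join each batch against the blacklist and keep only words not in it.
clean = words.transform(
    lambda rdd: rdd.leftOuterJoin(blacklist)
                   .filter(lambda kv: kv[1][1] is None)
                   .map(lambda kv: kv[0]))

clean.pprint()
ssc.start()
ssc.awaitTermination()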

Passing class functions to PySpark RDD

Submitted by 独自空忆成欢 on 2020-01-02 22:01:37
Question: I have a class named some_class() in a Python file here: /some-folder/app/bin/file.py I am importing it to my code here: /some-folder2/app/code/file2.py by import sys sys.path.append('/some-folder/app/bin') from file import some_class clss = some_class() I want to use this class's function named some_function in a Spark map: sc.parallelize(some_data_iterator).map(lambda x: clss.some_function(x)) This is giving me an error: No module named file. While class.some_function when I am calling it
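One commonly suggested way around this kind of import error (offered only as a hedged sketch, not as the accepted answer) is to ship the module to the executors with sc.addPyFile, since changes to sys.path on the driver are not visible to the worker processes; the paths and names below are those from the question, and some_data_iterator is a placeholder.

from pyspark import SparkContext

sc = SparkContext(appName="class-function-demo")

# Distribute file.py so executors can import it as well as the driver.
sc.addPyFile("/some-folder/app/bin/file.py")

from file import some_class      # import after addPyFile
clss = some_class()

some_data_iterator = range(10)   # hypothetical data for illustration
result = sc.parallelize(some_data_iterator) \
           .map(lambda x: clss.some_function(x)) \
           .collect()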

Spark Streaming Official Docs Translation, Basic Concepts: Output Operations

Submitted by 眉间皱痕 on 2020-01-02 21:56:47
This series translates the official Spark Streaming documentation: Spark Streaming Overview; Basic Concepts: Initialization and DStreams; Basic Concepts: Input DStreams and Receivers; Basic Concepts: Transformations; Basic Concepts: Output Operations; Basic Concepts: SQL and MLlib; Basic Concepts: Caching and Checkpointing; Basic Concepts: Accumulators, Broadcast Variables, and Checkpoints; Spark Streaming Applications: Deployment, Upgrading, and Monitoring; Spark Streaming Performance Tuning; Spark Streaming Fault Tolerance; Spark Streaming + Kafka Integration Guide; Spark Streaming Custom Receivers. Basic Concepts: Output Operations on DStreams. Output operations allow a DStream's data to be pushed to external systems such as a database or a file system
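As a minimal illustration of an output operation (a sketch assuming a socket text source and a local file sink, neither of which appears in the excerpt): saveAsTextFiles writes each batch out directly, while foreachRDD is the most general output operation, handing every batch RDD to arbitrary user code that can push it to a database, queue or file system.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "OutputOpsDemo")
ssc = StreamingContext(sc, 10)

words = ssc.socketTextStream("localhost", 9999).flatMap(lambda line: line.split(" "))
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

counts.saveAsTextFiles("/tmp/wordcounts")            # one output directory per batch
counts.foreachRDD(lambda rdd: print(rdd.take(10)))   # stand-in for writing to a database

ssc.start()
ssc.awaitTermination()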

day12-spark RDD

Submitted by 时光总嘲笑我的痴心妄想 on 2020-01-02 20:25:25
Preface: in day11 we covered Spark HA mode, spark-submit, and a spark-shell demo; today we study Spark RDDs. RDD: an RDD (Resilient Distributed Dataset) is Spark's most basic data abstraction; it represents an immutable, partitionable collection whose elements can be computed in parallel. Creating an RDD: there are two ways to create an RDD: from an external data file such as HDFS, e.g. sc.textFile("hdfs://bigdata121:9000/one"); or with sc.parallelize, e.g. sc.parallelize(Array(1,2,3,4,5,6,7,8),3). Operators: the methods called on an RDD are called operators, and they come in two kinds, Transformations and Actions. Transformation: lazily evaluated; all transformations are lazily loaded and do not trigger computation immediately. Action: triggers computation immediately. RDD caching: an RDD can cache earlier computation results via the persist or cache method, but the caching does not happen when these methods are called; instead, when a later action is triggered, the RDD is cached in the memory of the compute nodes and reused afterwards. Looking at the source code, cache ultimately calls persist; the default storage level for both keeps a single copy in memory only, Spark offers many other storage levels, and they are defined in object StorageLevel
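A short PySpark sketch of the points above (the tutorial's own examples are in Scala; this equivalent uses hypothetical data): two ways to create an RDD, a lazy transformation followed by an action, and caching before reuse.

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "rdd-basics-demo")

# 1) From an external file (path is hypothetical):
#    lines = sc.textFile("hdfs://bigdata121:9000/one")
# 2) From an in-memory collection, here with 3 partitions:
nums = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8], 3)

squares = nums.map(lambda x: x * x)         # Transformation: lazy, nothing runs yet
squares.persist(StorageLevel.MEMORY_ONLY)   # cache() is shorthand for this default level
print(squares.sum())                        # Action: triggers computation and caching
print(squares.count())                      # Reuses the cached partitions
sc.stop()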

Spark select top values in RDD

Submitted by 帅比萌擦擦* on 2020-01-02 02:55:50
Question: The original dataset is: # (numbersofrating,title,avg_rating) newRDD =[(3,'monster',4),(4,'minions 3D',5),....] I want to select the top N avg_ratings in newRDD. I use the following code, and it has an error. selectnewRDD = (newRDD.map(x, key =lambda x: x[2]).sortBy(......)) TypeError: map() takes no keyword arguments The expected data should be: # (numbersofrating,title,avg_rating) selectnewRDD =[(4,'minions 3D',5),(3,'monster',4)....] Answer 1: You can use either top or takeOrdered with a key argument:
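Following the answer's hint, a hedged sketch of both approaches (the data is the sample from the question, N=2 is arbitrary, and an existing SparkContext sc is assumed): top and takeOrdered both accept a key function, which avoids passing an invalid keyword argument to map.

newRDD = sc.parallelize([(3, 'monster', 4), (4, 'minions 3D', 5)])

# Top N by avg_rating (the third field), largest first.
top_n = newRDD.top(2, key=lambda x: x[2])

# Equivalent with takeOrdered, ordering by descending avg_rating via a negated key.
top_n_alt = newRDD.takeOrdered(2, key=lambda x: -x[2])

print(top_n)       # [(4, 'minions 3D', 5), (3, 'monster', 4)]
print(top_n_alt)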

spark in python: creating an rdd by loading binary data with numpy.fromfile

Submitted by 对着背影说爱祢 on 2020-01-01 19:54:30
Question: The Spark Python API currently has limited support for loading large binary data files, so I tried to get numpy.fromfile to help me out. I first got a list of filenames I'd like to load, e.g.: In [9] filenames Out[9]: ['A0000.dat', 'A0001.dat', 'A0002.dat', 'A0003.dat', 'A0004.dat'] I can load these files without problems with a crude iterative unionization: for i in range(len(filenames)): rdd = sc.parallelize([np.fromfile(filenames[i], dtype="int16", count=-1, sep='')]) if i==0: allRdd =
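A hedged sketch of the approach described above (the filenames, dtype and one-array-per-file layout follow the question; sc.union simply collapses the per-file RDDs in one call instead of unioning inside the loop). Note that, as in the question, numpy.fromfile still reads each file on the driver.

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="fromfile-demo")
filenames = ['A0000.dat', 'A0001.dat', 'A0002.dat', 'A0003.dat', 'A0004.dat']

# One single-element RDD per file; each element is that file's int16 contents.
per_file_rdds = [sc.parallelize([np.fromfile(f, dtype="int16", count=-1, sep='')])
                 for f in filenames]

# Combine them; equivalent to the iterative union in the question.
allRdd = sc.union(per_file_rdds)
print(allRdd.count())   # 5 elements, one numpy array per file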

Getting error in Spark: Executor lost

Submitted by 对着背影说爱祢 on 2020-01-01 18:20:26
Question: I have one master and two slaves, each running on 32 GB of RAM, and I'm reading a csv file with around 18 million records (the first row contains the headers for the columns). This is the command I am using to run the job: ./spark-submit --master yarn --deploy-mode client --executor-memory 10g <path/to/.py file> I did the following: rdd = sc.textFile("<path/to/file>") h = rdd.first() header_rdd = rdd.map(lambda l: h in l) data_rdd = rdd.subtract(header_rdd) data_rdd.first() I'm getting the following
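As a side note (an assumption, not taken from the question or any answer): the header is often stripped with a plain filter instead of subtract, which avoids the extra shuffle that subtract introduces. A minimal sketch, reusing the question's placeholder path and assuming the same SparkContext sc:

rdd = sc.textFile("<path/to/file>")
header = rdd.first()

# Keep every line except the header; unlike subtract, filter needs no shuffle.
data_rdd = rdd.filter(lambda line: line != header)
print(data_rdd.first())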

Pyspark calculate custom distance between all vectors in a RDD

Submitted by 限于喜欢 on 2020-01-01 16:45:32
Question: I have an RDD consisting of dense vectors which contain probability distributions like those below [DenseVector([0.0806, 0.0751, 0.0786, 0.0753, 0.077, 0.0753, 0.0753, 0.0777, 0.0801, 0.0748, 0.0768, 0.0764, 0.0773]), DenseVector([0.2252, 0.0422, 0.0864, 0.0441, 0.0592, 0.0439, 0.0433, 0.071, 0.1644, 0.0405, 0.0581, 0.0528, 0.0691]), DenseVector([0.0806, 0.0751, 0.0786, 0.0753, 0.077, 0.0753, 0.0753, 0.0777, 0.0801, 0.0748, 0.0768, 0.0764, 0.0773]), DenseVector([0.0924, 0.0699, 0.083, 0.0706, 0.0766,
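The excerpt is cut off before any code, but a common way to get all pairwise distances (offered purely as a hedged sketch; squared Euclidean distance stands in for whatever custom metric is intended, and the vectors are shortened samples) is to take the cartesian product of the RDD with itself.

from pyspark import SparkContext
from pyspark.mllib.linalg import DenseVector

sc = SparkContext(appName="pairwise-distance-demo")

vectors = sc.parallelize([
    DenseVector([0.0806, 0.0751, 0.0786]),   # shortened sample values
    DenseVector([0.2252, 0.0422, 0.0864]),
])

def custom_distance(a, b):
    # Placeholder metric: squared Euclidean distance between two DenseVectors.
    return float(a.squared_distance(b))

# cartesian pairs every vector with every other (including itself), giving n*n pairs;
# key the vectors by index and filter first if only unique pairs are wanted.
distances = vectors.cartesian(vectors).map(lambda ab: custom_distance(ab[0], ab[1]))
print(distances.collect())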