rdd

Spark: Expansion of RDD(Key, List) to RDD(Key, Value)

Submitted by 梦想与她 on 2019-12-18 06:48:09
Question: So I have an RDD of something like this: RDD[(Int, List)], where a single element in the RDD looks like (1, List(1, 2, 3)). My question is: how can I expand the key-value pair to something like (1,1), (1,2), (1,3)? Thank you.

Answer 1:

    rdd.flatMap { case (key, values) => values.map((key, _)) }

Answer 2: And in Python (based on @seanowen's answer):

    rdd.flatMap(lambda x: map(lambda e: (x[0], e), x[1]))

Source: https://stackoverflow.com/questions/36392938/spark-expansion-of-rddkey-list-to-rddkey-value
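For context, a minimal runnable PySpark sketch of the flatMap expansion above; the sample data and variable names are illustrative assumptions, not taken from the original question:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # Illustrative (key, list) pairs shaped like the question's RDD[(Int, List)]
    pairs = sc.parallelize([(1, [1, 2, 3]), (2, [4, 5])])

    # flatMap emits one (key, value) pair per element of each list
    expanded = pairs.flatMap(lambda kv: [(kv[0], v) for v in kv[1]])

    print(expanded.collect())  # [(1, 1), (1, 2), (1, 3), (2, 4), (2, 5)]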

Spark Mlib FPGrowth job fails with Memory Error

Submitted by 杀马特。学长 韩版系。学妹 on 2019-12-18 06:43:20
Question: I have a fairly simple use case, but a potentially very large result set. My code does the following (in the pyspark shell):

    from pyspark.mllib.fpm import FPGrowth
    data = sc.textFile("/Users/me/associationtestproject/data/sourcedata.txt")
    transactions = data.map(lambda line: line.strip().split(' '))
    model = FPGrowth.train(transactions, minSupport=0.000001, numPartitions=1000)
    # Perform any RDD operation
    for item in model.freqItemsets().toLocalIterator():
        pass  # do something with item

I find that
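The excerpt ends mid-sentence, but the FPGrowth API itself can be shown with a small self-contained sketch; the toy transactions and the much higher minSupport are assumptions for illustration (a very low minSupport, as in the question, can produce an enormous number of frequent itemsets), not a fix endorsed by the original thread:

    from pyspark import SparkContext
    from pyspark.mllib.fpm import FPGrowth

    sc = SparkContext.getOrCreate()

    # Hypothetical toy transactions; the original reads them from a text file
    transactions = sc.parallelize([
        ["a", "b", "c"],
        ["a", "b"],
        ["a", "c"],
        ["b", "c"],
    ])

    # A higher minSupport keeps the number of frequent itemsets (and memory use) small
    model = FPGrowth.train(transactions, minSupport=0.5, numPartitions=4)

    for itemset in model.freqItemsets().toLocalIterator():
        print(itemset.items, itemset.freq)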

Spark - scala: shuffle RDD / split RDD into two random parts randomly

Submitted by 别来无恙 on 2019-12-18 04:43:13
Question: How can I take an RDD in Spark and split it randomly into two RDDs, so that each RDD contains a portion of the data (say 97% and 3%)? I thought of shuffling the list and then shuffledList.take((0.97*rddList.count).toInt), but how can I shuffle the RDD? Or is there a better way to split the list?

Answer 1: I've found a simple and fast way to split the array:

    val Array(f1, f2) = data.randomSplit(Array(0.97, 0.03))

It will split the data using the provided weights.

Answer 2: You should use randomSplit
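The same call exists on PySpark RDDs; a short sketch with an assumed input RDD and an optional seed for reproducibility:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    data = sc.parallelize(range(1000))  # hypothetical input RDD

    # randomSplit returns one RDD per weight; the split is random but reproducible with a seed
    train, test = data.randomSplit([0.97, 0.03], seed=42)

    print(train.count(), test.count())  # roughly 970 and 30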

Writing RDD partitions to individual parquet files in its own directory

Submitted by 强颜欢笑 on 2019-12-18 03:36:09
Question: I am struggling with a step where I want to write each RDD partition to a separate parquet file in its own directory. An example would be:

    <root>
        <entity=entity1>
            <year=2015>
                <week=45>
                    data_file.parquet

The advantage of this format is that I can use these directly in SparkSQL as columns, and I will not have to repeat this data in the actual file. This would be a good way to get to a specific partition without storing separate partitioning metadata someplace else. As a preceding step I have all the data loaded
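One common way to produce exactly this directory layout is the DataFrame writer's partitionBy, which creates one sub-directory per distinct value of each partition column; a sketch with made-up column names and output path:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical DataFrame carrying the partition columns from the example layout
    df = spark.createDataFrame(
        [("entity1", 2015, 45, "payload")],
        ["entity", "year", "week", "value"],
    )

    # Writes <root>/entity=entity1/year=2015/week=45/part-*.parquet;
    # the partition columns are encoded in the paths, not repeated inside the files
    df.write.partitionBy("entity", "year", "week").parquet("/tmp/root")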

Spark Architecture Principles

Submitted by a 夏天 on 2019-12-18 00:53:31
I. Spark's basic architecture

Spark's basic architecture is similar to Hadoop's: it uses the Master-Slave model of distributed computing. The Master is the node in the cluster that runs the Master process, and a Slave is a node that runs a Worker process.

1. Physical node roles

Master: the controller of the whole cluster. It keeps the cluster running, receives jobs submitted by the Client, manages the Workers, and instructs Workers to launch the Driver and Executors.

Worker: essentially a compute node. It manages the resources of its own node, periodically sends heartbeats to the Master, receives commands from the Master, and launches the Driver and Executors.

Client: the user-side client, responsible for submitting the application.

2. Management process roles

Driver: a running Spark job includes one Driver process, which is the job's main process. It parses the job, generates Stages, and schedules Tasks onto Executors; it contains the DAGScheduler and TaskScheduler. The Driver runs on a Worker or on the client, not on the Master.

Executor: where the job is actually executed. A cluster generally contains multiple Executors; each Executor receives "launch task" commands from the Driver, and a single Executor can run one or more Tasks.

Cluster manager: the external service that acquires resources on the cluster. There are currently three types: 1. Standalone, Spark's native resource manager
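As a small illustration of how these roles surface in application code, a hedged sketch of a driver connecting to a hypothetical standalone Master and requesting executor resources from the Workers; the master URL and resource values are placeholders, not from the original post:

    from pyspark.sql import SparkSession

    # The master URL points at the standalone Master process; the Master asks
    # Workers to launch executors with the requested cores/memory, and this
    # process (the driver) then schedules tasks onto those executors.
    spark = (SparkSession.builder
             .appName("architecture-demo")
             .master("spark://master-host:7077")     # hypothetical Master address
             .config("spark.executor.memory", "2g")  # memory per executor
             .config("spark.executor.cores", "2")    # cores per executor
             .getOrCreate())

    print(spark.sparkContext.defaultParallelism)
    spark.stop()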

Enforce partition be stored on the specific executor

Submitted by 与世无争的帅哥 on 2019-12-17 19:26:13
Question: I have a 5-partition RDD and 5 workers/executors. How can I ask Spark to save each partition of the RDD on a different worker (IP)? Am I right in saying that Spark can save several partitions on one worker and zero partitions on other workers? That is, I can specify the number of partitions, but Spark can still cache everything on a single node. Replication is not an option since the RDD is huge.

Workarounds I have found:

getPreferredLocations: the RDD's getPreferredLocations method does not provide a 100% guarantee that
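No answer is included in the excerpt; as a purely diagnostic aid, a PySpark sketch that reports which host actually computes each partition (it only observes placement, it does not pin partitions to executors, and the RDD here is a made-up stand-in):

    import socket
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    rdd = sc.parallelize(range(100), numSlices=5)  # hypothetical 5-partition RDD

    # Tag each partition with the hostname of the executor that processes it
    def tag_partition(index, iterator):
        yield (index, socket.gethostname(), sum(1 for _ in iterator))

    for part_index, host, count in rdd.mapPartitionsWithIndex(tag_partition).collect():
        print(part_index, host, count)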

Spark Core: RDD (Part 1)

Submitted by ≡放荡痞女 on 2019-12-17 16:47:28
Spark Core: RDD (Part 1)

I. Prerequisites: RDDs, partitions, pipelined operations, Stages...

0. For iterative computation, RDDs are more than 20 times faster than Hadoop; for data-analysis style reporting, performance improves more than 40-fold, and a 1 TB dataset can be queried interactively within 5-7 seconds. RDDs suit applications with batch-transformation needs, where the same operation is applied to every element of the dataset.

1. RDD persistence: first, the no-argument persist() internally calls persist(StorageLevel.MEMORY_ONLY); second, cache() calls persist(). Note: calling a persistence method does not cache the RDD immediately; only when a later action is triggered is the RDD cached in the memory of the compute nodes, where it can be reused afterwards.

2. RDD serialization: each partition becomes just a byte array, which greatly reduces the number of objects and lowers memory usage.

3. In production, Spark Core uses only two storage levels: MEMORY_ONLY (deserialized Java objects) and MEMORY_ONLY_SER (the RDD's data is serialized). The CPU computes far faster than data can be read from memory, which in turn is far faster than reading from disk, and sometimes recomputing data is faster than reading it back from cache, so there is no need to keep multiple replicas or spill to disk; that is why production use of Spark Core sticks to these two storage levels.

4. Serialization uses less memory, but serializing and deserializing take time and also consume CPU
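A minimal PySpark sketch of the persistence behaviour described in point 1: cache() is shorthand for persist() at the default MEMORY_ONLY level, and nothing is materialized until an action runs. (One caveat, not from the original post: in PySpark, cached objects are always stored in pickled form, so the MEMORY_ONLY vs. MEMORY_ONLY_SER distinction applies on the Scala/Java side.)

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext.getOrCreate()
    rdd = sc.parallelize(range(1000)).map(lambda x: x * x)  # hypothetical RDD

    # Equivalent to rdd.cache(): mark the RDD for caching at MEMORY_ONLY
    rdd.persist(StorageLevel.MEMORY_ONLY)

    # Nothing is cached yet; the first action computes and caches the RDD
    print(rdd.count())  # triggers computation and caching
    print(rdd.sum())    # reuses the cached data

    rdd.unpersist()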

Partition RDD into tuples of length n

Submitted by 别说谁变了你拦得住时间么 on 2019-12-17 14:13:40
Question: I am relatively new to Apache Spark and Python and was wondering if something like what I am going to describe is doable. I have an RDD of the form [m1, m2, m3, m4, m5, m6, ..., mn] (you get this when you run rdd.collect()). I was wondering if it is possible to transform this RDD into another RDD of the form [(m1, m2, m3), (m4, m5, m6), ..., (mn-2, mn-1, mn)]. The inner tuples should be of size k. If n is not divisible by k, then one of the tuples should have
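The excerpt stops before any answer; one way to do this kind of chunking in PySpark is to key each element by its global index divided by k and then group — a sketch under the assumption that the input RDD is called rdd and the chunk size is k:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    k = 3                               # assumed chunk size
    rdd = sc.parallelize(range(1, 11))  # illustrative input: 1..10

    chunks = (rdd.zipWithIndex()                                # (value, global index)
                 .map(lambda vi: (vi[1] // k, (vi[1], vi[0])))  # key by chunk number
                 .groupByKey()
                 .sortByKey()
                 .map(lambda kv: tuple(v for _, v in sorted(kv[1]))))

    print(chunks.collect())  # [(1, 2, 3), (4, 5, 6), (7, 8, 9), (10,)]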

How to extract an element from a array in pyspark

Submitted by 左心房为你撑大大i on 2019-12-17 11:53:15
Question: I have a data frame of the following type:

    col1|col2|col3|col4
    xxxx|yyyy|zzzz|[1111],[2222]

I want my output to be of the following type:

    col1|col2|col3|col4|col5
    xxxx|yyyy|zzzz|1111|2222

My col4 is an array, and I want to convert it to a separate column. What needs to be done? I saw many answers with flatMap, but they increase the number of rows; I just want the tuple to be put in another column in the same row. The following is my actual schema:

    root
     |-- PRIVATE_IP: string (nullable = true)
     |-- PRIVATE_PORT:
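The excerpt cuts off before any answer; one common way to turn fixed-position array elements into their own columns, without adding rows, is Column.getItem — a sketch with made-up data mirroring the shape in the question:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [("xxxx", "yyyy", "zzzz", [1111, 2222])],
        ["col1", "col2", "col3", "col4"],
    )

    # getItem(i) picks the i-th array element, keeping one output row per input row
    result = (df.withColumn("col5", F.col("col4").getItem(1))
                .withColumn("col4", F.col("col4").getItem(0)))

    result.show()
    # +----+----+----+----+----+
    # |col1|col2|col3|col4|col5|
    # +----+----+----+----+----+
    # |xxxx|yyyy|zzzz|1111|2222|
    # +----+----+----+----+----+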