rdd

ReduceByKey with a byte array as the key

我怕爱的太早我们不能终老 submitted on 2019-11-30 18:51:49
Question: I would like to work with RDD pairs of Tuple2<byte[], obj>, but byte[]s with the same contents are considered different keys because their reference values differ. I didn't see any way to pass in a custom comparer. I could convert the byte[] into a String with an explicit charset, but I'm wondering if there's a more efficient way. Answer 1: Custom comparers are insufficient because Spark uses the hashCode of the objects to organize keys in partitions. (At least the HashPartitioner will
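A common workaround, sketched below in Scala, is to wrap the byte array in a key type whose equals and hashCode are defined by the array's contents, so identical byte sequences hash to the same partition and reduce together. This is an assumption-based sketch, not the accepted answer's code; the class name BytesKey is made up.

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical wrapper key: equality and hashCode are based on the array's
// contents rather than its reference identity.
case class BytesKey(bytes: Array[Byte]) {
  override def equals(other: Any): Boolean = other match {
    case BytesKey(b) => java.util.Arrays.equals(bytes, b)
    case _           => false
  }
  override def hashCode(): Int = java.util.Arrays.hashCode(bytes)
}

object BytesKeyExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("BytesKey").setMaster("local[2]"))
    val pairs = sc.parallelize(Seq(
      (BytesKey("a".getBytes("UTF-8")), 1),
      (BytesKey("a".getBytes("UTF-8")), 2) // same contents, different array instance
    ))
    // Both records now reduce into a single key.
    pairs.reduceByKey(_ + _).collect().foreach(println)
    sc.stop()
  }
}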

[Spark Streaming example] -- Counting the top hot search terms over a period of time

这一生的挚爱 submitted on 2019-11-30 18:06:25
As follows:

package com.my.scala

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Durations, StreamingContext}

/**
 * Uses window operations in Spark Streaming to count the top three hot search terms over a period of time.
 *
 * Test result: passed.
 * Steps:
 *   1. Start the Hadoop cluster: start-all.sh
 *   2. Open a port on h15: nc -lk 8888
 *   3. Type in data, e.g. "ds sdf sdfa wfasd sdf" -- words must be separated by spaces
 *   4. Run this program
 *   5. Check that the console output looks correct
 */
object WindowBasedTopWord {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("WindowBasedTopWord").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Durations.seconds(5)) // the 5 seconds here is the batch interval for slicing RDDs
    ssc.checkpoint("hdfs://h15:8020/wordcount_checkpoint")
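The excerpt above is cut off right after the checkpoint call. Below is a minimal sketch of how the windowed top-word logic might continue; the host h15 and port 8888 come from the comments above, while the 60-second window and 10-second slide are assumptions, not the original code.

    val lines = ssc.socketTextStream("h15", 8888)

    // Split each line on spaces and count words over a 60-second window that slides every 10 seconds.
    val wordCounts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Durations.seconds(60), Durations.seconds(10))

    // For every slide, sort the windowed counts and print the top three words.
    wordCounts.foreachRDD { rdd =>
      rdd.sortBy(_._2, ascending = false).take(3).foreach(println)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}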

ERROR WHILE RUNNING collect() in PYSPARK

北城余情 submitted on 2019-11-30 17:31:20
Question: I am trying to separate the website name from the URL. For example, if the URL is www.google.com, the output should be "google". I tried the code below and everything works fine except the last line, "websites.collect()". I used a DataFrame to store the website names, then converted it to an RDD and applied a split function on the values to get the required output. The logic seems fine, but I suspect there is some error in my package configuration and installation. The error is shown
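For reference, a minimal Scala sketch of the same logic (the original question used PySpark; the column name and sample URLs here are assumptions):

import org.apache.spark.sql.SparkSession

object ExtractSiteName {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ExtractSiteName").master("local[2]").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data; the original used a DataFrame of URLs.
    val urls = Seq("www.google.com", "www.example.org").toDF("url")

    // Convert to an RDD of strings and keep the token between the first two dots.
    val websites = urls.rdd.map(_.getString(0).split("\\.")(1))

    websites.collect().foreach(println) // google, example
    spark.stop()
  }
}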

load a local file to spark using sc.textFile()

一世执手 submitted on 2019-11-30 16:43:30
Question: How do I load a file from the local file system into Spark using sc.textFile? Do I need to change any -env variables? I also tried the same on my Windows machine, where Hadoop is not installed, and got the same error.

Code:

> val inputFile = sc.textFile("file///C:/Users/swaapnika/Desktop/to do list")
/17 22:28:18 INFO MemoryStore: ensureFreeSpace(63280) called with curMem=0, maxMem=278019440
/17 22:28:18 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 61.8 KB, free 265.1 MB)
/17 22:28:18 INFO MemoryStore: ensureFreeSpace(19750) called with curMem=63280, maxMem
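Note that the URI in the snippet above is missing a colon after "file". A minimal sketch of the usual form, assuming an existing SparkContext named sc (the Windows path is the one from the question):

// "file:///" with a colon; the drive letter follows the three slashes, and forward slashes are fine.
val inputFile = sc.textFile("file:///C:/Users/swaapnika/Desktop/to do list")
println(inputFile.count()) // an action forces the file to actually be read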

Transforming PySpark RDD with Scala

六眼飞鱼酱① submitted on 2019-11-30 14:56:05
TL;DR - I have what looks like a DStream of Strings in a PySpark application. I want to send it as a DStream[String] to a Scala library. Strings are not converted by Py4j, though. I'm working on a PySpark application that pulls data from Kafka using Spark Streaming. My messages are strings and I would like to call a method in Scala code, passing it a DStream[String] instance. However, I'm unable to receive proper JVM strings in the Scala code. It looks to me like the Python strings are not converted into Java strings but, instead, are serialized. My question would be: how to get Java strings

Spark: java.io.IOException: No space left on device

寵の児 submitted on 2019-11-30 14:19:50
Now I am learning how to use Spark. I have a piece of code which can invert a matrix, and it works when the order of the matrix is small, like 100. But when the order of the matrix is large, like 2000, I get an exception like this:

15/05/10 20:31:00 ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file /tmp/spark-local-20150510200122-effa/28/temp_shuffle_6ba230c3-afed-489b-87aa-91c046cadb22
java.io.IOException: No space left on device

In my program I have lots of lines like this:

val result1 = matrix.map(...).reduce(...)
val result2 = result1.map(...).reduce(...)
val
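A common remedy for this symptom, sketched below, is to point Spark's scratch space at a disk with more room than /tmp via spark.local.dir; the application name and directory path here are placeholders, not values from the original post.

import org.apache.spark.{SparkConf, SparkContext}

// Shuffle files and spills go to spark.local.dir (default /tmp), so a small
// /tmp fills up quickly when large shuffles occur.
val conf = new SparkConf()
  .setAppName("MatrixInverse")                      // hypothetical app name
  .set("spark.local.dir", "/path/with/more/space")  // hypothetical path on a larger disk
val sc = new SparkContext(conf)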

Spark RDDs - how do they work

时光怂恿深爱的人放手 submitted on 2019-11-30 11:00:47
I have a small Scala program that runs fine on a single node. However, I am scaling it out so it runs on multiple nodes. This is my first such attempt. I am just trying to understand how RDDs work in Spark, so this question is based on theory and may not be 100% correct. Let's say I create an RDD:

val rdd = sc.textFile(file)

Now once I've done that, does that mean that the file at file is partitioned across the nodes (assuming all nodes have access to the file path)? Secondly, I want to count the number of objects in the RDD (simple enough); however, I need to use that number in a
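A minimal sketch of the second part, assuming an existing SparkContext sc (the path and the follow-up calculation are made up): compute the count once with an action, then use the resulting driver-side value inside a later transformation.

val rdd = sc.textFile("hdfs:///data/input.txt") // hypothetical path
val total = rdd.count()                         // an action: runs once, the Long lives on the driver
// The driver-side value is captured in the closure and shipped to the executors with the task.
val fractions = rdd.map(line => line.length.toDouble / total)
fractions.take(5).foreach(println)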

Apache Spark: comparison of map vs flatMap vs mapPartitions vs mapPartitionsWithIndex

喜夏-厌秋 submitted on 2019-11-30 10:58:04
Apache Spark: comparison of map vs flatMap vs mapPartitions vs mapPartitionsWithIndex. Suggestions are welcome to improve our knowledge.

map(func): What does it do? Pass each element of the RDD through the supplied function, i.e. func.

flatMap(func): "Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item)." Compare flatMap to map in the following.

mapPartitions(func): Consider mapPartitions a tool for performance optimization. It won't do much for you when running examples on your local machine compared to running across a
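A short sketch contrasting these operations, assuming an existing SparkContext sc (the sample data is made up):

val nums = sc.parallelize(Seq("1 2", "3 4 5"), numSlices = 2)

// map: exactly one output element per input element.
val mapped = nums.map(_.split(" "))            // RDD[Array[String]] with 2 elements

// flatMap: each input element may produce 0..n output elements, flattened into one RDD.
val words = nums.flatMap(_.split(" "))         // RDD[String] with 5 elements

// mapPartitions: the function runs once per partition over an iterator, which is
// useful when per-partition setup (e.g. opening a connection) is expensive.
val partitionSizes = nums.mapPartitions(iter => Iterator(iter.size))

// mapPartitionsWithIndex: the same idea, but the partition index is also supplied.
val indexed = nums.mapPartitionsWithIndex((idx, iter) => iter.map(s => (idx, s)))

println(words.count())                    // 5
println(partitionSizes.collect().toSeq)   // e.g. WrappedArray(1, 1)
println(indexed.collect().toSeq)          // e.g. WrappedArray((0,1 2), (1,3 4 5))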