rdd

ReduceByKey with a byte array as the key

我怕爱的太早我们不能终老 submitted on 2019-11-30 18:51:49
Question: I would like to work with RDD pairs of Tuple2<byte[], obj>, but byte[]s with the same contents are considered different keys because their reference values differ. I didn't see any way to pass in a custom comparer. I could convert the byte[] into a String with an explicit charset, but I'm wondering if there's a more efficient way. Answer 1: Custom comparers are insufficient because Spark uses the hashCode of the objects to organize keys in partitions. (At least the HashPartitioner will
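A common workaround, sketched below in Scala, is to wrap the byte array in a key type whose equals and hashCode are defined by the array's contents, so identical byte sequences hash to the same partition and reduce together. This is an assumption-based sketch, not the accepted answer's code; the class name BytesKey is made up.

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical wrapper key: equality and hashCode are based on the array's
// contents rather than its reference identity.
case class BytesKey(bytes: Array[Byte]) {
  override def equals(other: Any): Boolean = other match {
    case BytesKey(b) => java.util.Arrays.equals(bytes, b)
    case _           => false
  }
  override def hashCode(): Int = java.util.Arrays.hashCode(bytes)
}

object BytesKeyExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("BytesKey").setMaster("local[2]"))
    val pairs = sc.parallelize(Seq(
      (BytesKey("a".getBytes("UTF-8")), 1),
      (BytesKey("a".getBytes("UTF-8")), 2) // same contents, different array instance
    ))
    // Both records now reduce into a single key.
    pairs.reduceByKey(_ + _).collect().foreach(println)
    sc.stop()
  }
}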

[Spark Streaming example] -- Counting the top hot search terms over a period of time

这一生的挚爱 submitted on 2019-11-30 18:06:25
As follows:

package com.my.scala

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Durations, StreamingContext}

/**
 * Uses window operations in Spark Streaming to count the top three hot search terms over a period of time.
 *
 * Test result: passed.
 * Steps:
 *   1. Start the Hadoop cluster: start-all.sh
 *   2. Open a port on h15: nc -lk 8888
 *   3. Type in data, e.g. "ds sdf sdfa wfasd sdf" -- words must be separated by spaces
 *   4. Run this program
 *   5. Check that the console output looks correct
 */
object WindowBasedTopWord {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("WindowBasedTopWord").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Durations.seconds(5)) // the 5 seconds here is the batch interval for slicing RDDs
    ssc.checkpoint("hdfs://h15:8020/wordcount_checkpoint")
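The excerpt above is cut off right after the checkpoint call. Below is a minimal sketch of how the windowed top-word logic might continue; the host h15 and port 8888 come from the comments above, while the 60-second window and 10-second slide are assumptions, not the original code.

    val lines = ssc.socketTextStream("h15", 8888)

    // Split each line on spaces and count words over a 60-second window that slides every 10 seconds.
    val wordCounts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Durations.seconds(60), Durations.seconds(10))

    // For every slide, sort the windowed counts and print the top three words.
    wordCounts.foreachRDD { rdd =>
      rdd.sortBy(_._2, ascending = false).take(3).foreach(println)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}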

ERROR WHILE RUNNING collect() in PYSPARK

北城余情 submitted on 2019-11-30 17:31:20
Question: I am trying to separate the website name from the URL. For example, if the URL is www.google.com, the output should be "google". I tried the code below and everything works fine except the last line, "websites.collect()". I used a DataFrame to store the website names, then converted it to an RDD and applied a split function on the values to get the required output. The logic seems fine, but I suspect there is some error in my package configuration and installation. The error is shown
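For reference, a minimal Scala sketch of the same logic (the original question used PySpark; the column name and sample URLs here are assumptions):

import org.apache.spark.sql.SparkSession

object ExtractSiteName {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ExtractSiteName").master("local[2]").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data; the original used a DataFrame of URLs.
    val urls = Seq("www.google.com", "www.example.org").toDF("url")

    // Convert to an RDD of strings and keep the token between the first two dots.
    val websites = urls.rdd.map(_.getString(0).split("\\.")(1))

    websites.collect().foreach(println) // google, example
    spark.stop()
  }
}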

load a local file to spark using sc.textFile()

一世执手 submitted on 2019-11-30 16:43:30
Question: How do I load a file from the local file system into Spark using sc.textFile? Do I need to change any -env variables? I also tried the same on my Windows machine, where Hadoop is not installed, and got the same error.

Code:

> val inputFile = sc.textFile("file///C:/Users/swaapnika/Desktop/to do list")
/17 22:28:18 INFO MemoryStore: ensureFreeSpace(63280) called with curMem=0, maxMem=278019440
/17 22:28:18 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 61.8 KB, free 265.1 MB)
/17 22:28:18 INFO MemoryStore: ensureFreeSpace(19750) called with curMem=63280, maxMem
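Note that the URI in the snippet above is missing a colon after "file". A minimal sketch of the usual form, assuming an existing SparkContext named sc (the Windows path is the one from the question):

// "file:///" with a colon; the drive letter follows the three slashes, and forward slashes are fine.
val inputFile = sc.textFile("file:///C:/Users/swaapnika/Desktop/to do list")
println(inputFile.count()) // an action forces the file to actually be read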

Transforming PySpark RDD with Scala

六眼飞鱼酱① submitted on 2019-11-30 14:56:05
TL;DR - I have what looks like a DStream of Strings in a PySpark application. I want to send it as a DStream[String] to a Scala library. Strings are not converted by Py4j, though. I'm working on a PySpark application that pulls data from Kafka using Spark Streaming. My messages are strings and I would like to call a method in Scala code, passing it a DStream[String] instance. However, I'm unable to receive proper JVM strings in the Scala code. It looks to me like the Python strings are not converted into Java strings but, instead, are serialized. My question would be: how to get Java strings

Spark: java.io.IOException: No space left on device

寵の児 submitted on 2019-11-30 14:19:50
Now I am learning how to use Spark. I have a piece of code which can invert a matrix, and it works when the order of the matrix is small, like 100. But when the order of the matrix is large, like 2000, I get an exception like this:

15/05/10 20:31:00 ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file /tmp/spark-local-20150510200122-effa/28/temp_shuffle_6ba230c3-afed-489b-87aa-91c046cadb22
java.io.IOException: No space left on device

In my program I have lots of lines like this:

val result1 = matrix.map(...).reduce(...)
val result2 = result1.map(...).reduce(...)
val
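A common remedy for this symptom, sketched below, is to point Spark's scratch space at a disk with more room than /tmp via spark.local.dir; the application name and directory path here are placeholders, not values from the original post.

import org.apache.spark.{SparkConf, SparkContext}

// Shuffle files and spills go to spark.local.dir (default /tmp), so a small
// /tmp fills up quickly when large shuffles occur.
val conf = new SparkConf()
  .setAppName("MatrixInverse")                      // hypothetical app name
  .set("spark.local.dir", "/path/with/more/space")  // hypothetical path on a larger disk
val sc = new SparkContext(conf)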

Spark RDDs - how do they work

时光怂恿深爱的人放手 submitted on 2019-11-30 11:00:47
I have a small Scala program that runs fine on a single node. However, I am scaling it out so it runs on multiple nodes. This is my first such attempt. I am just trying to understand how RDDs work in Spark, so this question is based on theory and may not be 100% correct. Let's say I create an RDD:

val rdd = sc.textFile(file)

Now once I've done that, does that mean that the file at file is partitioned across the nodes (assuming all nodes have access to the file path)? Secondly, I want to count the number of objects in the RDD (simple enough); however, I need to use that number in a
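A minimal sketch of the second part, assuming an existing SparkContext sc (the path and the follow-up calculation are made up): compute the count once with an action, then use the resulting driver-side value inside a later transformation.

val rdd = sc.textFile("hdfs:///data/input.txt") // hypothetical path
val total = rdd.count()                         // an action: runs once, the Long lives on the driver
// The driver-side value is captured in the closure and shipped to the executors with the task.
val fractions = rdd.map(line => line.length.toDouble / total)
fractions.take(5).foreach(println)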

Apache Spark: comparison of map vs flatMap vs mapPartitions vs mapPartitionsWithIndex

喜夏-厌秋 submitted on 2019-11-30 10:58:04
Apache Spark: comparison of map vs flatMap vs mapPartitions vs mapPartitionsWithIndex. Suggestions are welcome to improve our knowledge.

map(func): What does it do? Pass each element of the RDD through the supplied function, i.e. func.

flatMap(func): "Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item)." Compare flatMap to map in the following.

mapPartitions(func): Consider mapPartitions a tool for performance optimization. It won't do much for you when running examples on your local machine compared to running across a
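A short sketch contrasting these operations, assuming an existing SparkContext sc (the sample data is made up):

val nums = sc.parallelize(Seq("1 2", "3 4 5"), numSlices = 2)

// map: exactly one output element per input element.
val mapped = nums.map(_.split(" "))            // RDD[Array[String]] with 2 elements

// flatMap: each input element may produce 0..n output elements, flattened into one RDD.
val words = nums.flatMap(_.split(" "))         // RDD[String] with 5 elements

// mapPartitions: the function runs once per partition over an iterator, which is
// useful when per-partition setup (e.g. opening a connection) is expensive.
val partitionSizes = nums.mapPartitions(iter => Iterator(iter.size))

// mapPartitionsWithIndex: the same idea, but the partition index is also supplied.
val indexed = nums.mapPartitionsWithIndex((idx, iter) => iter.map(s => (idx, s)))

println(words.count())                    // 5
println(partitionSizes.collect().toSeq)   // e.g. WrappedArray(1, 1)
println(indexed.collect().toSeq)          // e.g. WrappedArray((0,1 2), (1,3 4 5))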