rdd

Lesson 1: A Thorough Understanding of Spark Streaming Through a Case Study

旧城冷巷雨未停 submitted on 2019-12-11 13:31:41
1. An unconventional online Spark Streaming experiment

How can we see clearly how data flows in and gets processed? A small trick: enlarge the BatchInterval to reduce the number of batches, which makes each step of the pipeline easier to observe. We start from the Spark Streaming application we have already written for online blacklist filtering of ad clicks. The experiment's source code is below:

package com.dt.spark.streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Background: in an ad-click billing system we filter out blacklisted clicks online, so that
 * advertisers' interests are protected and only valid ad clicks are billed. The same idea applies
 * to anti-fraud scoring (or traffic) systems that filter out invalid votes, ratings, or traffic.
 * Technique: use the transform API to program directly against RDDs and perform a join.
 *
 * Created by Administrator on 2016/4/30.
 */
object OnlineBlackListFilter {
  def main(args: Array[String]) {
    /**
     * Step 1: create the Spark configuration object
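The excerpt above is cut off, so here is a rough, hypothetical reconstruction of the technique it describes rather than the original course code: an enlarged batch interval (300 seconds, so each batch is easy to inspect in the Spark UI) and transform-based join filtering against a blacklist RDD. The socket source on localhost:9999 and the hard-coded blacklist entries are assumptions.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object OnlineBlackListFilterSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("OnlineBlackListFilter").setMaster("local[2]")
    // A deliberately large batch interval makes the generated jobs easy to watch.
    val ssc = new StreamingContext(conf, Seconds(300))

    // Blacklisted user names, flagged true; in production these would come from storage.
    val blackList = ssc.sparkContext.parallelize(Seq(("hacker1", true), ("hacker2", true)))

    // Each incoming line is assumed to look like "timestamp userName".
    val clickStream = ssc.socketTextStream("localhost", 9999)
    val userClicks = clickStream.map(line => (line.split(" ")(1), line))

    // transform joins every batch RDD against the blacklist RDD and keeps
    // only the clicks whose user is not blacklisted.
    val validClicks = userClicks.transform { batchRdd =>
      batchRdd.leftOuterJoin(blackList)
        .filter { case (_, (_, flagged)) => !flagged.getOrElse(false) }
        .map { case (_, (click, _)) => click }
    }

    validClicks.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

With a 300-second interval the Spark UI shows only one batch every five minutes, which makes it easy to inspect the jobs, stages, and tasks that a single batch generates.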

Lesson 2: A Thorough Understanding of Spark Streaming Through Case Studies, Part 2 of Three: Decrypting Spark Streaming

故事扮演 submitted on 2019-12-11 13:27:12
From the blacklist-filtering case in yesterday's first lesson we can see that a single Spark Streaming program automatically generates many different jobs. We can use a diagram with a Y axis and an X axis to understand what a DStream is and how it differs from an RDD. Simply put, a DStream is batch processing plus RDDs: it produces an RDD in every batch interval.

The Y axis is the spatial dimension: the concrete processing steps defined by RDD dependencies, represented by the DStream Graph. The X axis is the time dimension: at a fixed interval, Job instances are generated continuously and run on the cluster. A DStream and an RDD have the same spatial dimension; only the time dimension differs, which is why each batch processes different data and produces different results. As time goes on, Jobs are continuously generated from the DStream Graph in the form of RDD Graphs (that is, DAGs) and submitted through the Job Scheduler's thread pool to the Spark cluster for execution; a minimal code sketch of this pattern follows after the list below.

The following five points are important:
- We need a template from which RDD DAGs are generated
- We need a timeline-based Job controller
- InputStream and OutputStream represent the input and output of data
- The concrete Jobs run on the Spark cluster, so system fault tolerance becomes crucial
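A minimal sketch of the "one template, many jobs" idea (an assumption-based example, not from the course): the DStream lineage below is declared once, and every 10-second batch interval Spark Streaming instantiates it as a fresh RDD DAG and submits it as a job. The socket source on localhost:9999 is made up.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamTemplateSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DStreamTemplate").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))          // X axis: the batch interval

    // Y axis: the processing logic, defined once as a DStream Graph.
    val lines  = ssc.socketTextStream("localhost", 9999)       // input stream
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()                                             // output operation: one job per batch

    ssc.start()
    ssc.awaitTermination()
  }
}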

Find the minimum and maximum date from the data in an RDD in PySpark

时光怂恿深爱的人放手 submitted on 2019-12-11 12:48:47
Question: I am using Spark with IPython and have an RDD which contains data in this format when printed:

print rdd1.collect()
[u'2010-12-08 00:00:00', u'2010-12-18 01:20:00', u'2012-05-13 00:00:00', ...]

Each element is a datetime stamp, and I want to find the minimum and the maximum in this RDD. How can I do that?

Answer 1: You can, for example, use the aggregate function (for an explanation of how it works, see: What is the equivalent implementation of RDD.groupByKey() using RDD.aggregateByKey()?)

from datetime import
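The answer excerpt above goes on to use aggregate; a simpler alternative, sketched here in Scala (the question is PySpark, but the RDD API is analogous), relies on the fact that timestamps in this "yyyy-MM-dd HH:mm:ss" form sort correctly as plain strings:

import org.apache.spark.SparkContext

def minMaxDates(sc: SparkContext): (String, String) = {
  // Sample values copied from the question; real code would use the existing rdd1.
  val rdd1 = sc.parallelize(Seq(
    "2010-12-08 00:00:00", "2010-12-18 01:20:00", "2012-05-13 00:00:00"))
  (rdd1.min(), rdd1.max())   // lexicographic order matches chronological order here
}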

Serialization and Custom Spark RDD Class

懵懂的女人 submitted on 2019-12-11 12:44:59
Question: I'm writing a custom Spark RDD implementation in Scala, and I'm debugging my implementation using the Spark shell. My goal for now is to get:

customRDD.count

to succeed without an exception. Right now this is what I'm getting:

15/03/06 23:02:32 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/03/06 23:02:32 ERROR TaskSetManager: Failed to serialize task 0, not attempting to retry it.
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
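The stack trace is cut off above, but with a custom RDD "Failed to serialize task" usually means the RDD object captures something non-serializable (a SparkContext, a database client, a REPL wrapper, and so on). Below is a hypothetical minimal custom RDD that serializes cleanly: the SparkContext is only passed to the superclass (which stores it transiently), and each partition holds nothing but primitive fields.

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// One partition covering a half-open range of integers.
class RangePartition(val index: Int, val start: Int, val end: Int) extends Partition

// RDD subclasses are shipped to executors, so every field they keep must be serializable.
class SmallRangeRDD(sc: SparkContext, n: Int, numSlices: Int) extends RDD[Int](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    (0 until numSlices).map { i =>
      new RangePartition(i, i * n / numSlices, (i + 1) * n / numSlices): Partition
    }.toArray

  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }
}

In the shell, new SmallRangeRDD(sc, 100, 4).count() should then return 100. If the class is defined directly inside spark-shell, the REPL's generated wrapper objects can also drag non-serializable state into the task, so compiling the class into a jar and adding it with --jars is worth trying as well.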

"Can only zip with RDD which has the same number of partitions" error

匆匆过客 submitted on 2019-12-11 11:07:17
Question: I have an IPython notebook which contains PySpark code. It works fine on my machine, but when I try to run it on a different machine it throws an error at this line (the rdd3 line):

rdd2 = sc.parallelize(list1)
rdd3 = rdd1.zip(rdd2).map(lambda ((x1,x2,x3,x4), y): (y,x2, x3, x4))
list = rdd3.collect()

The error I get is:

ValueError Traceback (most recent call last)
<ipython-input-7-9daab52fc089> in <module>()
---> 16 rdd3 = rdd1.zip(rdd2).map(lambda ((x1,x2,x3,x4), y): (y,x2, x3, x4))
/usr/local/bin
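The error comes from a documented constraint: RDD.zip requires both RDDs to have the same number of partitions and the same number of elements in each partition, and sc.parallelize uses the machine's default parallelism, which can differ between machines. A hedged Scala sketch of a common workaround (the question is PySpark, but zipWithIndex plus join works the same way there):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Pair elements by their position instead of relying on matching partition layouts.
def safeZip[A: ClassTag, B: ClassTag](left: RDD[A], right: RDD[B]): RDD[(A, B)] = {
  val indexedLeft  = left.zipWithIndex().map { case (v, i) => (i, v) }
  val indexedRight = right.zipWithIndex().map { case (v, i) => (i, v) }
  indexedLeft.join(indexedRight).values
}

Note that join does not preserve positional order; if order matters downstream, keep the index key and sortByKey before dropping it. Repartitioning both RDDs to the same count only fixes the partition-count check, not the per-partition element counts, so the index-based join is the safer fix.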

How to merge an RDD array

冷暖自知 submitted on 2019-12-11 09:25:31
Question: I have an array of RDDs, Array[RDD[(String, Double)]]. How do I merge those RDDs into a single RDD[(String, Array[Double])]? For example:

RDD array: [[('x', 1), ('y', 2)], [('x', 3), ('y', 4)], ...]
=> RDD: [('x', [1, 3, ...]), ('y', [2, 4, ...])]

Any help appreciated! Thanks

Answer 1:
- You should merge the array of RDDs into one RDD (line 1)
- Group them by the String value (line 2)
- I see that the expected output is sorted; if that is required you can sort the values (line 3)

val mergeIntoOne: RDD[(String, Double)] =
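A sketch of the three steps in the answer above (union the RDDs, group by key, sort the values), assuming a SparkContext is available for the union:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def mergeRdds(sc: SparkContext, rdds: Array[RDD[(String, Double)]]): RDD[(String, Array[Double])] = {
  val mergedIntoOne: RDD[(String, Double)] = sc.union(rdds.toSeq)   // line 1: one RDD
  val grouped = mergedIntoOne.groupByKey()                          // line 2: group by the String key
  grouped.mapValues(values => values.toArray.sorted)                // line 3: sort only if order matters
}

groupByKey shuffles all values for a key to one executor; if the per-key arrays can be large, aggregateByKey with an array builder is a more scalable variant of step 2.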

Multiple Partitions in Spark RDD

流过昼夜 submitted on 2019-12-11 08:32:56
Question: So I am trying to get data from a MySQL database using Spark within a Play/Scala project. Since the number of rows I am trying to retrieve is huge, my aim is to get an Iterator from the Spark RDD. Here is the Spark context and configuration:

private val configuration = new SparkConf()
  .setAppName("Reporting")
  .setMaster("local[*]")
  .set("spark.executor.memory", "2g")
  .set("spark.akka.timeout", "5")
  .set("spark.driver.allowMultipleContexts", "true")

val sparkContext = new SparkContext
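The excerpt stops before the actual read, but for the stated goal (many MySQL rows, split across several partitions, consumed as an iterator) one RDD-level option is JdbcRDD, which splits a numeric key range into numPartitions. The sketch below is illustrative only: the JDBC URL, credentials, and the reports(id, value) table and its id bounds are invented.

import java.sql.DriverManager
import org.apache.spark.SparkContext
import org.apache.spark.rdd.JdbcRDD

def loadReports(sc: SparkContext): JdbcRDD[(Long, String)] =
  new JdbcRDD[(Long, String)](
    sc,
    () => DriverManager.getConnection("jdbc:mysql://localhost:3306/mydb", "user", "password"),
    "SELECT id, value FROM reports WHERE id >= ? AND id <= ?",   // the two ? bind the partition bounds
    lowerBound = 1L,
    upperBound = 1000000L,
    numPartitions = 8,
    mapRow = rs => (rs.getLong("id"), rs.getString("value"))
  )

Calling loadReports(sc).toLocalIterator then streams the rows back to the driver one partition at a time instead of materializing everything with collect.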

In which situations are the stages of a DAG skipped?

≡放荡痞女 submitted on 2019-12-11 08:08:27
Question: I am trying to find the situations in which Spark would skip stages when I am using RDDs. I know that it will skip stages if there is a shuffle operation happening. So, I wrote the following code to see whether that is true:

def main(args: Array[String]): Unit = {
  val conf = new SparkConf().setMaster("local").setAppName("demo")
  val sc = new SparkContext(conf)
  val d = sc.parallelize(0 until 1000000).map(i => (i%100000, i))
  val c = d.rightOuterJoin(d.reduceByKey(_+_)).collect
  val f = d.leftOuterJoin(d
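The question's own example is truncated above; a more minimal illustration of when a stage is reported as skipped (assuming an existing SparkContext named sc and watching the Spark UI) is to run two jobs that share the same shuffle dependency:

// The lineage is defined once, so both collect() calls depend on the same shuffle.
val pairs   = sc.parallelize(0 until 1000000).map(i => (i % 100000, i))
val reduced = pairs.reduceByKey(_ + _)

reduced.collect()   // job 1: runs the map-side shuffle stage and the result stage
reduced.collect()   // job 2: the map-side stage shows up as "skipped" because its
                    // shuffle output from job 1 is still available

Skipping therefore happens when a stage's shuffle output has already been computed by an earlier job and the shuffle files are still around, not merely because a shuffle operation appears in the code.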

Spark RDD correct date format in Scala?

落花浮王杯 submitted on 2019-12-11 07:36:37
Question: This is the date value I want to use when I convert an RDD to a DataFrame:

Sun Jul 31 10:21:53 PDT 2016

With the schema type DataTypes.DateType it throws an error:

java.util.Date is not a valid external type for schema of date

So I want to prepare the RDD in advance so that the schema above works. How can I correct the date format so the conversion to a DataFrame works?

// Schema for the data frame
val schema = StructType(
  StructField("lotStartDate", DateType, false) ::
  StructField("pm", StringType, false) ::
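One way to prepare the RDD, sketched under the assumption that lotStartDate arrives as a string like the one shown: DateType's external type is java.sql.Date (Spark 3 also accepts java.time.LocalDate), so parse the java.util.Date-style string and wrap it before building each Row.

import java.sql.Date
import java.text.SimpleDateFormat
import java.util.Locale

// "Sun Jul 31 10:21:53 PDT 2016" is java.util.Date's default toString format.
val fmt = new SimpleDateFormat("EEE MMM dd HH:mm:ss zzz yyyy", Locale.US)

def toSqlDate(raw: String): Date = new Date(fmt.parse(raw).getTime)

// Hypothetical usage when building rows for the schema above:
// val rowRdd = rdd.map(fields => Row(toSqlDate(fields(0)), fields(1)))
// val df = spark.createDataFrame(rowRdd, schema)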

How to solve Type mismatch issue (expected: Double, actual: Unit)

喜夏-厌秋 submitted on 2019-12-11 05:38:58
Question: Here is my function that calculates root mean squared error. However, the last line does not compile because of the error "type mismatch (expected: Double, actual: Unit)". I have tried many different ways to solve this, but still without success. Any ideas?

def calculateRMSE(output: DStream[(Double, Double)]): Double = {
  val summse = output.foreachRDD { rdd =>
    rdd.map { case pair: (Double, Double) =>
      val err = math.abs(pair._1 - pair._2); err*err
    }.reduce(_ + _)
  }
  // math.sqrt(summse)
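The mismatch happens because foreachRDD returns Unit, so summse is Unit rather than a Double and cannot be the function's result. One hedged way around it, sketched below: compute and report an RMSE per batch inside foreachRDD instead of returning a single Double from the DStream (this version also divides by the element count, which the original sum did not).

import org.apache.spark.streaming.dstream.DStream

def printBatchRMSE(output: DStream[(Double, Double)]): Unit = {
  output.foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      val n = rdd.count()
      val sumSquaredErr = rdd.map { case (expected, actual) =>
        val err = expected - actual
        err * err
      }.reduce(_ + _)
      println(s"batch RMSE = ${math.sqrt(sumSquaredErr / n)}")
    }
  }
}

If a single RMSE over the whole stream is needed, running sums and counts can instead be kept in accumulators (or external storage) and the square root taken after the streaming context stops.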