rdd

PySpark Suggestion on how to organize RDD

Submitted by 允我心安 on 2020-01-06 23:45:23
Question: I'm a Spark newbie and I'm trying to test whether Spark gives any performance boost for the size of data I'm using. Each object in my RDD contains a time, an id, and a position. I want to compare the positions of groups that share the same time and the same id. So I would first run the following to group by id:

grouped_rdd = rdd.map(lambda x: (x.id, [x])).groupByKey()

I would then like to break this down by the time of each object. Any suggestions? Thanks!

Answer 1: First
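
The answer is cut off here. One common way to attack this, sketched in Scala rather than PySpark and using a hypothetical Record case class for the (time, id, position) objects and the sc available in spark-shell, is to key by the composite (id, time) pair up front instead of grouping by id and splitting afterwards:

case class Record(time: Long, id: String, position: Double)   // hypothetical element type

val rdd = sc.parallelize(Seq(
  Record(1L, "a", 0.5), Record(1L, "b", 0.7), Record(1L, "a", 0.9)
))

// Key by the composite (id, time) pair so that records sharing both values
// land in the same group, rather than grouping by id and splitting by time later.
val grouped = rdd.map(r => ((r.id, r.time), r)).groupByKey()

grouped.collect().foreach { case ((id, time), records) =>
  println(s"id=$id time=$time positions=${records.map(_.position).mkString(",")}")
}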

Spark Streaming

Submitted by 折月煮酒 on 2020-01-06 23:12:18
Introduction to Spark Streaming
• Spark Streaming is an extension of the core Spark API that enables high-throughput, fault-tolerant processing of real-time data streams.
• It supports acquiring data from many sources:
• Spark Streaming receives real-time input from sources such as Kafka, Flume, and HDFS, processes it, and stores the results in HDFS, databases, and other sinks.

Relationship between Spark Core and Spark Streaming:
• Spark Streaming splits the incoming real-time stream into chunks at a fixed time interval and hands them to the Spark engine, producing results batch by batch.
• Inside the Spark core, each batch of data corresponds to one RDD instance.
• A DStream can therefore be viewed as a group of RDDs, i.e. a sequence of RDDs.

DStream
• DStream (discretized stream) is the high-level abstraction Spark Streaming provides to represent a continuous data stream.
• Every operation on a DStream is translated into operations on the underlying RDDs.
• A Spark Streaming program usually performs several DStream operations; the DStreamGraph is built from the dependencies between them.
• Continuous data is persisted, discretized, and then processed in batches. Why?
  – Persistence: received data is buffered temporarily.
  – Discretization: the data is sliced by time into processing units.
  – Batch processing: each slice is processed as a batch.
Operations on a DStream fall into two categories
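
To make the micro-batching concrete, here is a minimal Scala word-count sketch; the socket source on localhost:9999, the application name, and the 5-second batch interval are placeholder choices, not taken from the notes above.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-word-count").setMaster("local[2]")
// Every 5-second interval of the stream becomes one micro-batch, i.e. one RDD in the DStream.
val ssc = new StreamingContext(conf, Seconds(5))

// A DStream backed by a socket source; each batch interval yields one RDD of text lines.
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split("\\s+"))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)

counts.print()        // an output operation; triggers execution of every batch

ssc.start()
ssc.awaitTermination()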

[Spark] The randomSplit and glom functions in detail

Submitted by 不打扰是莪最后的温柔 on 2020-01-06 22:00:24
randomSplit splits one RDD into several RDDs according to the given weights. The weights parameter is an array of Doubles; the second parameter is the random seed and can usually be ignored.

scala> var rdd = sc.makeRDD(1 to 10, 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[16] at makeRDD at <console>:21

scala> rdd.collect
res6: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> var splitRDD = rdd.randomSplit(Array(1.0, 2.0, 3.0, 4.0))
splitRDD: Array[org.apache.spark.rdd.RDD[Int]] = Array(MapPartitionsRDD[17] at randomSplit at <console>:23, MapPartitionsRDD[18] at randomSplit at <console>:23, MapPartitionsRDD[19] at randomSplit at <console>:23, MapPartitionsRDD[20] at
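
A short follow-on sketch that can be pasted into spark-shell, showing both functions named in the title; the seed value 42 is arbitrary:

// Weights are normalized, so Array(1.0, 2.0, 3.0, 4.0) yields roughly
// 10%/20%/30%/40% of the elements; the seed only fixes the randomness.
val data = sc.makeRDD(1 to 10, 10)
val parts = data.randomSplit(Array(1.0, 2.0, 3.0, 4.0), seed = 42L)
parts.zipWithIndex.foreach { case (part, i) =>
  println(s"split $i: ${part.collect().mkString(",")}")
}

// glom() turns each partition into an Array, giving an RDD[Array[Int]];
// collecting it shows how the 10 elements sit across the 10 partitions.
data.glom().collect().foreach(a => println(a.mkString("[", ",", "]")))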

Spark - missing 1 required positional argument (lambda function)

Submitted by 人走茶凉 on 2020-01-06 06:42:28
Question: I'm trying to distribute some text extraction from PDFs across multiple servers using Spark. This uses a custom Python module I made and is an implementation of this question. The extractTextFromPdf function takes two arguments: a string representing the path to the file, and a configuration file used to determine various extraction constraints. In this case the config file is just a simple YAML file sitting in the same folder as the Python script running the extraction, and the files are
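
The question is cut off above. As a general illustration of the pattern this kind of "missing positional argument" error usually calls for, binding the fixed configuration argument in a closure so that map passes only the element, here is a hedged Scala sketch; extractTextFromPdf, the config file name, and the sample paths are stand-ins, not the asker's real code:

// Stand-in for the asker's two-argument extraction function.
def extractTextFromPdf(path: String, configPath: String): String =
  s"text of $path extracted using $configPath"   // placeholder body

val configPath = "config.yml"                    // placeholder config file name
val pdfPaths = sc.parallelize(Seq("a.pdf", "b.pdf"))

// Passing a two-argument function straight to map leaves the second argument
// unbound; wrapping it in a closure supplies configPath for every element.
val texts = pdfPaths.map(path => extractTextFromPdf(path, configPath))
texts.collect().foreach(println)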

ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy in instance of org.apache.spark.rdd.MapPartitionsRDD

Submitted by 走远了吗. on 2020-01-06 04:33:08
Question: I have a Spring Boot microservice talking to a remote Spark cluster with 3 nodes, executing the following logic:

Dataset<Row> df = sparkSession.read().json("/opt/enso/test.json");
StructType schema = df.schema();
JavaPairRDD<Row, Long> zippedRows = df.toJavaRDD().zipWithIndex();
JavaPairRDD<Row, Long> filteredRows = zippedRows.filter(new Function<Tuple2<Row, Long>, Boolean>() {
    @Override
    public Boolean call(Tuple2<Row, Long> v1) throws Exception {
        return v1._2 >= 1 && v1._2 <= 5;
    }
});
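
The snippet is cut off before the failing step. Purely to restate the row-slicing logic more compactly, here is a hedged Scala sketch of the equivalent code, reusing the sparkSession name and the JSON path and 1..5 index range from the Java above; it is not a fix for the ClassCastException itself:

// Equivalent slicing in Scala: pair each Row with its index, keep rows 1..5.
val df = sparkSession.read.json("/opt/enso/test.json")
val schema = df.schema

val slice = df.rdd.zipWithIndex
  .filter { case (_, idx) => idx >= 1 && idx <= 5 }
  .map { case (row, _) => row }

slice.collect().foreach(println)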

Word normalization using RDDs

Submitted by 荒凉一梦 on 2020-01-05 07:02:16
Question: Maybe this question is a little bit strange... but I'll try to ask it. Everyone who has written applications using the Lucene API has seen something like this:

public static String removeStopWordsAndGetNorm(String text, String[] stopWords, Normalizer normalizer) throws IOException {
    TokenStream tokenStream = new ClassicTokenizer(Version.LUCENE_44, new StringReader(text));
    tokenStream = new StopFilter(Version.LUCENE_44, tokenStream, StopFilter.makeStopSet(Version.LUCENE_44, stopWords, true));
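
The question is cut off above. Since the title is about applying this kind of normalization to an RDD, here is a hedged Scala sketch of the usual pattern: build the non-serializable Lucene objects inside mapPartitions so they are created once per partition on the executors rather than shipped from the driver. The linesRdd name, the TextNormalizer wrapper class, the Normalizer constructor, the stop-word list, and the output path are all assumptions for illustration.

val stopWords = Array("a", "an", "the", "of")    // placeholder stop-word list

// Lucene tokenizers and filters are not serializable, so build them inside
// mapPartitions: one Normalizer per partition on the executors, nothing
// shipped from the driver.
val normalized = linesRdd.mapPartitions { lines =>
  val normalizer = new Normalizer()              // assumed constructor, based on the signature above
  lines.map(line => TextNormalizer.removeStopWordsAndGetNorm(line, stopWords, normalizer))
}

normalized.saveAsTextFile("normalized-output")   // placeholder output path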

Changing an attribute in an object that belongs to RDD

Submitted by 醉酒当歌 on 2020-01-05 04:42:21
Question: I have the following code:

def generateStoriesnew(outputPath: String, groupedRDD: RDD[(String, Iterable[String])], isInChurnMode: Boolean, isInChurnPeriod: Boolean) {
  val windowedRDD = groupedRDD.map(SOME CODE)
  var windowedRDD2 = windowedRDD.filter(r => r != null).map(a => a.churnPeriod(isInChurnPeriod, isInChurnMode))
  val prettyStringRDD = windowedRDD2.map(r => { r.toString })
  prettyStringRDD.saveAsTextFile(outputPath)
}

and here is the code for the churnPeriod function:

def churnPeriod( churnPeriod
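
The churnPeriod snippet is cut off above. As a general note, mutating fields of objects held in an RDD is usually avoided; the common pattern is to have the map return a new (or copied) object with the changed fields. A hedged Scala sketch, with a hypothetical Story case class standing in for whatever windowedRDD actually holds:

// Hypothetical element type standing in for the real objects in windowedRDD.
case class Story(id: String, text: String, isChurnPeriod: Boolean, isChurnMode: Boolean)

// Inside generateStoriesnew: rather than mutating a field on each element,
// map to a copy carrying the new flags; the source RDD itself stays immutable.
val windowedRDD2 = windowedRDD
  .filter(_ != null)
  .map(story => story.copy(isChurnPeriod = isInChurnPeriod, isChurnMode = isInChurnMode))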

Spark SQL as explained in the official documentation

Submitted by 霸气de小男生 on 2020-01-05 00:27:17
1. Where this sits in the official docs
The Dataset API appeared in 1.6. Before Spark 1.3 the API was called SchemaRDD; from 1.3 on it is called DataFrame. Dataset supports Scala and Java but not Python; DataFrame supports four languages: Java, Scala, Python, and R. The DataFrame is not a Spark SQL invention; it existed earlier and was borrowed from other frameworks.

2. Notes on DataFrame
A DataFrame is a distributed dataset organized into columns, equivalent to a table in a relational database. DataFrame = Dataset[Row], i.e. its element type is Row.

3. Differences between DataFrame and RDD
(1) Definition: an RDD carries a type parameter, e.g. RDD[Person], but the RDD itself does not know what is inside Person. A DataFrame is different: it is a table, so it exposes far more information (the schema).
(2) Execution: RDD programs run in the runtime of whichever language they were written in, so performance differs a lot between languages. DataFrame programs all go through the same query planning before execution, so performance is roughly the same regardless of the language used.
(3) API: the DataFrame API is richer than the RDD API.

4. Other notes
Entry point of Spark SQL: before 2.0, SQLContext and HiveContext; from 2.0 on, SparkSession. Starting spark-shell creates both sc and spark by default
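
A minimal spark-shell style sketch (not from the original notes) illustrating two of the points above: SparkSession as the 2.x entry point, and the schema a DataFrame exposes compared with a plain RDD. The Person case class and sample rows are made up; in spark-shell the spark object already exists, so getOrCreate simply returns it.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("df-vs-rdd").master("local[*]").getOrCreate()
import spark.implicits._

case class Person(name: String, age: Int)

// A plain RDD[Person]: Spark knows only the element type, not the fields inside it.
val rdd = spark.sparkContext.parallelize(Seq(Person("Ann", 30), Person("Bob", 25)))

// The same data as a DataFrame (Dataset[Row]): the column structure is visible,
// so the planner optimizes queries the same way whatever language drives them.
val df = rdd.toDF()
df.printSchema()
df.filter($"age" > 26).show()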

Spark: Tackle performance-intensive commands like collect(), groupByKey(), reduceByKey()

Submitted by 人走茶凉 on 2020-01-04 05:26:16
Question: I know that some Spark actions, such as collect(), cause performance issues. The documentation notes: to print all elements on the driver, one can use the collect() method to first bring the RDD to the driver node: rdd.collect().foreach(println). This can cause the driver to run out of memory, though, because collect() fetches the entire RDD to a single machine; if you only need to print a few elements of the RDD, a safer approach is to use take(): rdd.take(100).foreach
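
A short Scala sketch of the safer alternatives the question is circling around (the sample data and output path are placeholders): take() bounds what comes back to the driver, and reduceByKey() aggregates before the shuffle instead of shipping every value the way groupByKey() does.

val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

// Inspection: pull only a bounded sample back to the driver.
pairs.take(100).foreach(println)

// Aggregation: reduceByKey combines values on each partition before the shuffle,
// whereas groupByKey ships every value across the network first.
val counts = pairs.reduceByKey(_ + _)
counts.saveAsTextFile("counts-output")   // placeholder path; keeps the full result off the driver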