rdd

Converting a Spark DataFrame (with WrappedArray) to RDD[LabeledPoint] in Scala

Submitted by 一世执手 on 2019-12-14 02:08:52
Question: I am new to Scala and I want to convert a DataFrame to an RDD, turning the label and features columns into an RDD[LabeledPoint] as input for MLlib. But I can't figure out how to handle the WrappedArray. scala> test.printSchema root |-- user_id: long (nullable = true) |-- brand_store_sn: string (nullable = true) |-- label: integer (nullable = true) |-- money_score: double (nullable = true) |-- normal_score: double (nullable = true) |-- action_score: double (nullable = true) |-- features: array (nullable =
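A minimal sketch of one way to do this, assuming the `test` DataFrame with the schema above (label as integer, features as array<double>): select the two columns, drop down to the row RDD, and read the array column back as a Seq (it is a WrappedArray at runtime) before building each LabeledPoint.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Assumes `test` is the DataFrame whose schema is shown above.
val labeledPoints = test.select("label", "features").rdd.map { row =>
  val label = row.getInt(0).toDouble
  // array<double> columns come back as a WrappedArray, which is a Seq[Double]
  val feats = row.getAs[Seq[Double]](1)
  LabeledPoint(label, Vectors.dense(feats.toArray))
}
```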

How to get the specified output without combineByKey and aggregateByKey in spark RDD

Submitted by ≯℡__Kan透↙ on 2019-12-13 22:56:21
Question: Below is my data: val keysWithValuesList = Array("foo=A", "foo=A", "foo=A", "foo=A", "foo=B", "bar=C", "bar=D", "bar=D") Now I want the output below, but without using combineByKey or aggregateByKey: 1) Array[(String, Int)] = Array((foo,5), (bar,3)) 2) Array((foo,Set(B, A)), (bar,Set(C, D))) Below is my attempt: scala> val keysWithValuesList = Array("foo=A", "foo=A", "foo=A", "foo=A", "foo=B", "bar=C", | "bar=D", "bar=D") scala> val sample=keysWithValuesList.map(_.split("=")).map(p=>(p(0)
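A minimal sketch over the same data, using only map, reduceByKey and groupByKey (no combineByKey or aggregateByKey) and assuming the spark-shell's sc:

```scala
val keysWithValuesList = Array("foo=A", "foo=A", "foo=A", "foo=A", "foo=B",
                               "bar=C", "bar=D", "bar=D")
val pairs = sc.parallelize(keysWithValuesList)
  .map(_.split("="))
  .map(p => (p(0), p(1)))

// 1) counts per key: Array((foo,5), (bar,3))
val counts = pairs.mapValues(_ => 1).reduceByKey(_ + _).collect()

// 2) distinct values per key: Array((foo,Set(A, B)), (bar,Set(C, D)))
val valueSets = pairs.groupByKey().mapValues(_.toSet).collect()
```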

PySpark application fails with java.lang.OutOfMemoryError: Java heap space

Submitted by 拜拜、爱过 on 2019-12-13 13:28:04
Question: I'm running Spark via PyCharm and the pyspark shell, respectively. I'm stuck with this error: java.lang.OutOfMemoryError: Java heap space at org.apache.spark.api.python.PythonRDD$.readRDDFromFile(PythonRDD.scala:416) at org.apache.spark.api.python.PythonRDD.readRDDFromFile(PythonRDD.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke
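The readRDDFromFile frame in the trace is the driver-side JVM reading back a collection that was parallelized from Python, so the heap that overflows is the driver's. A hedged sketch of the usual remedies, written against the Scala API since the settings are the same from PySpark; the memory size and path are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// spark.driver.memory only takes effect if it is set before the driver JVM
// starts (e.g. spark-submit --driver-memory 4g, spark-defaults.conf, or
// PYSPARK_SUBMIT_ARGS when launching from an IDE); it is set here only to
// name the relevant option.
val conf = new SparkConf()
  .setAppName("heap-space-example")
  .set("spark.driver.memory", "4g")
val sc = new SparkContext(conf)

// Prefer letting Spark read large inputs itself instead of building a big
// local collection and handing it to sc.parallelize.
val rdd = sc.textFile("/path/to/large/input")
```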

Apache Spark: “SparkException: Task not serializable” in spark-shell for RDD constructed manually

Submitted by …衆ロ難τιáo~ on 2019-12-13 07:52:57
Question: I have the following code to detect the most-used top-level domain from events; I use it to get data via Spark SQL. The functions themselves are tested and work fine. I use Amazon EMR and spark-shell. When Spark sends tasks to the nodes, almost immediately I receive a long stack trace ending in "SparkException: Task not serializable", without anything specific. What's the deal here? import scala.io.Source val suffixesStr = Source.fromURL("https://publicsuffix.org/list/public_suffix_list.dat")
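In the shell this error usually means the closure captures something that cannot be serialized (here, most likely the live BufferedSource or the enclosing shell object). A hedged sketch of one way around it: materialize the suffix list into a plain Set first, broadcast it, and reference only the broadcast inside RDD operations. The helper and RDD names below are illustrative, and sc is the spark-shell context.

```scala
import scala.io.Source

// Read the list eagerly into a serializable Set; the BufferedSource itself
// never needs to travel to the executors.
val suffixes: Set[String] =
  Source.fromURL("https://publicsuffix.org/list/public_suffix_list.dat")
    .getLines()
    .filterNot(line => line.isEmpty || line.startsWith("//"))
    .toSet

val suffixesBc = sc.broadcast(suffixes)

// Illustrative helper: longest matching public suffix for a hostname.
def topLevelDomain(host: String, suffixes: Set[String]): String =
  host.split('.').tails.map(_.mkString("."))
    .find(suffixes.contains).getOrElse(host)

// hosts: RDD[String] of hostnames extracted from the events (illustrative)
// val tlds = hosts.map(h => topLevelDomain(h, suffixesBc.value))
```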

How can I use reduceByKey instead of GroupByKey to construct a list?

Submitted by *爱你&永不变心* on 2019-12-13 04:25:45
Question: My RDD consists of many items, each of which is a tuple as follows: (key1, (val1_key1, val2_key1)) (key2, (val1_key2, val2_key2)) (key1, (val1_again_key1, val2_again_key1)) ... and so on I used groupByKey on the RDD, which gave the result as (key1, [(val1_key1, val2_key1), (val1_again_key1, val2_again_key1), (), ... ()]) (key2, [(val1_key2, val2_key2), (), () ... ())]) ... and so on I need to do the same using reduceByKey. I tried doing RDD.reduceByKey(lambda val1, val2: list(val1).append(val2)
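A Scala sketch of the usual fix (the question is PySpark, but the RDD calls are the same): wrap each value in a one-element list first, then let reduceByKey concatenate the lists; the sample data mirrors the tuples above.

```scala
val rdd = sc.parallelize(Seq(
  ("key1", ("val1_key1", "val2_key1")),
  ("key2", ("val1_key2", "val2_key2")),
  ("key1", ("val1_again_key1", "val2_again_key1"))
))

val grouped = rdd
  .mapValues(v => List(v))   // (key, [(v1, v2)])
  .reduceByKey(_ ++ _)       // concatenate the per-key lists

grouped.collect().foreach(println)
```

Note that building lists this way shuffles the same amount of data as groupByKey; reduceByKey only pays off when the combine step actually shrinks the values.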

Update the internal state of RDD elements

Submitted by 余生长醉 on 2019-12-13 03:07:36
Question: I'm a newbie in Spark and I want to update the internal state of my RDD's elements with the rdd.foreach method, but it doesn't work. Here is my code example: class Test extends Serializable{ var foo = 0.0 var bar = 0.0 def updateFooBar() = { foo = Math.random() bar = Math.random() } } var testList = Array.fill(5)(new Test()) var testRDD = sc.parallelize(testList) testRDD.foreach{ x => x.updateFooBar() } testRDD.collect().foreach { x=> println(x.foo+"~"+x.bar) } and the result is: 0.0~0.0 0.0~0.0 0
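This is expected: foreach runs against serialized copies of the elements on the executors, so the mutations never reach the objects the driver later collects. A minimal sketch of the usual alternative, returning new objects from a map instead of mutating in place (class and field names follow the question):

```scala
// Immutable variant of the Test class from the question.
class Test(val foo: Double, val bar: Double) extends Serializable

val testRDD = sc.parallelize(Array.fill(5)(new Test(0.0, 0.0)))

// Produce updated copies rather than mutating in place.
val updated = testRDD.map(_ => new Test(Math.random(), Math.random()))

updated.collect().foreach(t => println(t.foo + "~" + t.bar))
```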

Write dataframe to csv with datatype map<string,bigint> in Spark

Submitted by 余生颓废 on 2019-12-13 02:48:05
Question: I have a file, file1snappy.parquet, which has a complex data structure with a map and an array inside it. After processing it I got a final result; while writing that result to CSV I get an error: "Exception in thread "main" java.lang.UnsupportedOperationException: CSV data source does not support map<string,bigint> data type." The code I used: val conf=new SparkConf().setAppName("student-example").setMaster("local") val sc = new SparkContext(conf) val sqlcontext =
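The CSV writer only handles flat atomic columns, so the map<string,bigint> column has to be converted before the write. A hedged sketch of one option, serializing the map column to a JSON string; the DataFrame name `result` and the column name `scores` are placeholders for the ones in the question.

```scala
import org.apache.spark.sql.functions.{col, to_json}

// `result` stands for the final DataFrame with the map<string,bigint> column.
// Replace that column with its JSON string representation so the CSV data
// source can write it.
val flattened = result.withColumn("scores", to_json(col("scores")))

flattened.write
  .option("header", "true")
  .csv("/path/to/output")
```

Exploding the map into separate key/value columns would also work if the downstream consumer needs them as plain fields.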

Spark Tuning (Part 1): Development Tuning

Submitted by 喜你入骨 on 2019-12-13 02:04:05
Reposted for study from the Meituan tech team: https://tech.meituan.com Article overview. Development optimization principles:
1. Avoid creating duplicate RDDs.
2. Reuse RDDs wherever possible, to avoid recomputation.
3. Pick an appropriate persistence strategy: MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, etc.
4. Avoid shuffle operators where possible (e.g. reduceByKey, join, distinct), mainly to reduce disk and network IO; in joins, broadcasting the small table avoids the shuffle.
5. Map-side pre-aggregation: when a shuffle cannot be avoided, pre-aggregate on the map side, provided it does not change the business logic, e.g. use reduceByKey or aggregateByKey instead of groupByKey.
6. Use high-performance operators: e.g. mapPartitions instead of map, foreachPartition instead of foreach (especially for database writes), coalesce after filter to shrink the number of partitions, and repartitionAndSortWithinPartitions instead of a separate repartition plus sort.
7. Broadcast large variables.
8. Use Kryo serialization: roughly 10x faster than the default Java serialization; RDDs need their classes registered with Kryo, while the DataFrame/Dataset API already uses an efficient serialization format by default.
9. Optimize data structures.
(Items 5 and 8 are sketched in code below.)
Main text: 1. Preface. In the field of big-data computing
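A small Scala sketch of two of the principles above, map-side pre-aggregation and Kryo registration; the record class, pair RDD and app name are illustrative placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative record type to register with Kryo.
case class MyRecord(id: Long, value: Double)

val conf = new SparkConf()
  .setAppName("tuning-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRecord]))
val sc = new SparkContext(conf)

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Map-side pre-aggregation: reduceByKey combines values per key before the
// shuffle, so less data crosses the network than with groupByKey.
val sums = pairs.reduceByKey(_ + _)                   // preferred
// val sums = pairs.groupByKey().mapValues(_.sum)     // shuffles every value
```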

PySpark: Convert a pair RDD back to a regular RDD

Submitted by ε祈祈猫儿з on 2019-12-13 01:35:08
Question: Is there any way to convert a pair RDD back to a regular RDD? Suppose I have a local CSV file, and I first load it as a regular RDD: rdd = sc.textFile("$path/$csv") Then I create a pair RDD (i.e. the key is the string before "," and the value is the string after ","): pairRDD = rdd.map(lambda x : (x.split(",")[0], x.split(",")[1])) I store the pairRDD using saveAsTextFile(): pairRDD.saveAsTextFile("$savePath") However, as I found, the stored file contains some unnecessary characters,
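A Scala sketch of the idea (the question is PySpark, but the calls are the same): a pair RDD is just an RDD of 2-tuples, so mapping each tuple back to a delimited string gives a plain RDD of lines that saveAsTextFile writes without the tuple formatting. Paths are placeholders.

```scala
val rdd = sc.textFile("/path/to/input.csv")

// Split each line into (key, rest-of-line); assumes every line has a comma.
val pairRDD = rdd.map { line =>
  val parts = line.split(",", 2)
  (parts(0), parts(1))
}

// Turn the pairs back into plain comma-separated lines before saving.
val plainRDD = pairRDD.map { case (k, v) => s"$k,$v" }
plainRDD.saveAsTextFile("/path/to/output")
```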

Ordered union on spark RDDs

Submitted by 喜你入骨 on 2019-12-13 01:27:45
Question: I am trying to sort key-record pairs by key using Apache Spark. The key is 10 bytes long and the value is about 90 bytes long. In other words, I am trying to replicate the sort benchmark Databricks used to break the sorting record. One of the things I noticed from the documentation is that they sorted on key-line-number pairs, as opposed to key-record pairs, probably to be cache/TLB friendly. I tried to replicate this approach but have not found a suitable solution. Here is what I have
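A hedged sketch of the straightforward version, sorting (key, record) pairs with sortByKey, which range-partitions by key so the output partitions are globally ordered; the input path, record layout (10-byte key prefix) and partition count are placeholders, and it does not reproduce the key-plus-pointer trick mentioned in the question.

```scala
// One record per line; the first 10 bytes of each line are the key.
val records = sc.textFile("/path/to/records")
  .map(line => (line.take(10), line))

// sortByKey range-partitions by key and sorts within each partition,
// giving a total order across the output files.
val sorted = records.sortByKey(ascending = true, numPartitions = 200)

sorted.map(_._2).saveAsTextFile("/path/to/sorted-output")
```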