rdd

Converting a Spark DataFrame (with WrappedArray) to RDD[LabeledPoint] in Scala

Submitted by 一世执手 on 2019-12-14 02:08:52
Question: I am new to Scala and I want to convert a DataFrame to an RDD, turning the label and features columns into an RDD[LabeledPoint] as input for MLlib. But I can't figure out how to handle the WrappedArray. scala> test.printSchema root |-- user_id: long (nullable = true) |-- brand_store_sn: string (nullable = true) |-- label: integer (nullable = true) |-- money_score: double (nullable = true) |-- normal_score: double (nullable = true) |-- action_score: double (nullable = true) |-- features: array (nullable =
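A minimal sketch of one way to do this, assuming the `test` DataFrame with the schema above (label as integer, features as array<double>): select the two columns, drop down to the row RDD, and read the array column back as a Seq (it is a WrappedArray at runtime) before building each LabeledPoint.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Assumes `test` is the DataFrame whose schema is shown above.
val labeledPoints = test.select("label", "features").rdd.map { row =>
  val label = row.getInt(0).toDouble
  // array<double> columns come back as a WrappedArray, which is a Seq[Double]
  val feats = row.getAs[Seq[Double]](1)
  LabeledPoint(label, Vectors.dense(feats.toArray))
}
```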

How to get the specified output without combineByKey and aggregateByKey in spark RDD

Submitted by ≯℡__Kan透↙ on 2019-12-13 22:56:21
Question: Below is my data: val keysWithValuesList = Array("foo=A", "foo=A", "foo=A", "foo=A", "foo=B", "bar=C", "bar=D", "bar=D") Now I want the output below, but without using combineByKey or aggregateByKey: 1) Array[(String, Int)] = Array((foo,5), (bar,3)) 2) Array((foo,Set(B, A)), (bar,Set(C, D))) Below is my attempt: scala> val keysWithValuesList = Array("foo=A", "foo=A", "foo=A", "foo=A", "foo=B", "bar=C", | "bar=D", "bar=D") scala> val sample=keysWithValuesList.map(_.split("=")).map(p=>(p(0)
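A minimal sketch over the same data, using only map, reduceByKey and groupByKey (no combineByKey or aggregateByKey) and assuming the spark-shell's sc:

```scala
val keysWithValuesList = Array("foo=A", "foo=A", "foo=A", "foo=A", "foo=B",
                               "bar=C", "bar=D", "bar=D")
val pairs = sc.parallelize(keysWithValuesList)
  .map(_.split("="))
  .map(p => (p(0), p(1)))

// 1) counts per key: Array((foo,5), (bar,3))
val counts = pairs.mapValues(_ => 1).reduceByKey(_ + _).collect()

// 2) distinct values per key: Array((foo,Set(A, B)), (bar,Set(C, D)))
val valueSets = pairs.groupByKey().mapValues(_.toSet).collect()
```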

PySpark application fails with java.lang.OutOfMemoryError: Java heap space

Submitted by 拜拜、爱过 on 2019-12-13 13:28:04
Question: I'm running Spark via PyCharm and the pyspark shell, respectively. I'm stuck with this error: java.lang.OutOfMemoryError: Java heap space at org.apache.spark.api.python.PythonRDD$.readRDDFromFile(PythonRDD.scala:416) at org.apache.spark.api.python.PythonRDD.readRDDFromFile(PythonRDD.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke
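The readRDDFromFile frame in the trace is the driver-side JVM reading back a collection that was parallelized from Python, so the heap that overflows is the driver's. A hedged sketch of the usual remedies, written against the Scala API since the settings are the same from PySpark; the memory size and path are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// spark.driver.memory only takes effect if it is set before the driver JVM
// starts (e.g. spark-submit --driver-memory 4g, spark-defaults.conf, or
// PYSPARK_SUBMIT_ARGS when launching from an IDE); it is set here only to
// name the relevant option.
val conf = new SparkConf()
  .setAppName("heap-space-example")
  .set("spark.driver.memory", "4g")
val sc = new SparkContext(conf)

// Prefer letting Spark read large inputs itself instead of building a big
// local collection and handing it to sc.parallelize.
val rdd = sc.textFile("/path/to/large/input")
```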

Apache Spark: “SparkException: Task not serializable” in spark-shell for RDD constructed manually

Submitted by …衆ロ難τιáo~ on 2019-12-13 07:52:57
Question: I have the following code to detect the most-used top-level domain from events; I use it to get data via Spark SQL. The functions themselves are tested and work fine. I use Amazon EMR and spark-shell. When Spark sends tasks to the nodes, almost immediately I receive a long stack trace ending in "SparkException: Task not serializable", without anything specific. What's the deal here? import scala.io.Source val suffixesStr = Source.fromURL("https://publicsuffix.org/list/public_suffix_list.dat")
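In the shell this error usually means the closure captures something that cannot be serialized (here, most likely the live BufferedSource or the enclosing shell object). A hedged sketch of one way around it: materialize the suffix list into a plain Set first, broadcast it, and reference only the broadcast inside RDD operations. The helper and RDD names below are illustrative, and sc is the spark-shell context.

```scala
import scala.io.Source

// Read the list eagerly into a serializable Set; the BufferedSource itself
// never needs to travel to the executors.
val suffixes: Set[String] =
  Source.fromURL("https://publicsuffix.org/list/public_suffix_list.dat")
    .getLines()
    .filterNot(line => line.isEmpty || line.startsWith("//"))
    .toSet

val suffixesBc = sc.broadcast(suffixes)

// Illustrative helper: longest matching public suffix for a hostname.
def topLevelDomain(host: String, suffixes: Set[String]): String =
  host.split('.').tails.map(_.mkString("."))
    .find(suffixes.contains).getOrElse(host)

// hosts: RDD[String] of hostnames extracted from the events (illustrative)
// val tlds = hosts.map(h => topLevelDomain(h, suffixesBc.value))
```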

How can I use reduceByKey instead of GroupByKey to construct a list?

Submitted by *爱你&永不变心* on 2019-12-13 04:25:45
Question: My RDD consists of many items, each of which is a tuple as follows: (key1, (val1_key1, val2_key1)) (key2, (val1_key2, val2_key2)) (key1, (val1_again_key1, val2_again_key1)) ... and so on I used groupByKey on the RDD, which gave the result as (key1, [(val1_key1, val2_key1), (val1_again_key1, val2_again_key1), (), ... ()]) (key2, [(val1_key2, val2_key2), (), () ... ())]) ... and so on I need to do the same using reduceByKey. I tried doing RDD.reduceByKey(lambda val1, val2: list(val1).append(val2)
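A Scala sketch of the usual fix (the question is PySpark, but the RDD calls are the same): wrap each value in a one-element list first, then let reduceByKey concatenate the lists; the sample data mirrors the tuples above.

```scala
val rdd = sc.parallelize(Seq(
  ("key1", ("val1_key1", "val2_key1")),
  ("key2", ("val1_key2", "val2_key2")),
  ("key1", ("val1_again_key1", "val2_again_key1"))
))

val grouped = rdd
  .mapValues(v => List(v))   // (key, [(v1, v2)])
  .reduceByKey(_ ++ _)       // concatenate the per-key lists

grouped.collect().foreach(println)
```

Note that building lists this way shuffles the same amount of data as groupByKey; reduceByKey only pays off when the combine step actually shrinks the values.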

Update the internal state of RDD elements

Submitted by 余生长醉 on 2019-12-13 03:07:36
Question: I'm a newbie in Spark and I want to update the internal state of my RDD's elements with the rdd.foreach method, but it doesn't work. Here is my code example: class Test extends Serializable{ var foo = 0.0 var bar = 0.0 def updateFooBar() = { foo = Math.random() bar = Math.random() } } var testList = Array.fill(5)(new Test()) var testRDD = sc.parallelize(testList) testRDD.foreach{ x => x.updateFooBar() } testRDD.collect().foreach { x=> println(x.foo+"~"+x.bar) } and the result is: 0.0~0.0 0.0~0.0 0
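This is expected: foreach runs against serialized copies of the elements on the executors, so the mutations never reach the objects the driver later collects. A minimal sketch of the usual alternative, returning new objects from a map instead of mutating in place (class and field names follow the question):

```scala
// Immutable variant of the Test class from the question.
class Test(val foo: Double, val bar: Double) extends Serializable

val testRDD = sc.parallelize(Array.fill(5)(new Test(0.0, 0.0)))

// Produce updated copies rather than mutating in place.
val updated = testRDD.map(_ => new Test(Math.random(), Math.random()))

updated.collect().foreach(t => println(t.foo + "~" + t.bar))
```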

Write dataframe to csv with datatype map<string,bigint> in Spark

Submitted by 余生颓废 on 2019-12-13 02:48:05
Question: I have a file, file1snappy.parquet, which has a complex data structure with a map and an array inside it. After processing it I got a final result; while writing that result to CSV I get an error: "Exception in thread "main" java.lang.UnsupportedOperationException: CSV data source does not support map<string,bigint> data type." The code I used: val conf=new SparkConf().setAppName("student-example").setMaster("local") val sc = new SparkContext(conf) val sqlcontext =
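The CSV writer only handles flat atomic columns, so the map<string,bigint> column has to be converted before the write. A hedged sketch of one option, serializing the map column to a JSON string; the DataFrame name `result` and the column name `scores` are placeholders for the ones in the question.

```scala
import org.apache.spark.sql.functions.{col, to_json}

// `result` stands for the final DataFrame with the map<string,bigint> column.
// Replace that column with its JSON string representation so the CSV data
// source can write it.
val flattened = result.withColumn("scores", to_json(col("scores")))

flattened.write
  .option("header", "true")
  .csv("/path/to/output")
```

Exploding the map into separate key/value columns would also work if the downstream consumer needs them as plain fields.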

Spark Tuning (Part 1): Development Tuning

Submitted by 喜你入骨 on 2019-12-13 02:04:05
Reposted for study from the Meituan tech team: https://tech.meituan.com Article overview. Development optimization principles:
1. Avoid creating duplicate RDDs.
2. Reuse RDDs wherever possible, to avoid recomputation.
3. Pick an appropriate persistence strategy: MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, etc.
4. Avoid shuffle operators where possible (e.g. reduceByKey, join, distinct), mainly to reduce disk and network IO; in joins, broadcasting the small table avoids the shuffle.
5. Map-side pre-aggregation: when a shuffle cannot be avoided, pre-aggregate on the map side, provided it does not change the business logic, e.g. use reduceByKey or aggregateByKey instead of groupByKey.
6. Use high-performance operators: e.g. mapPartitions instead of map, foreachPartition instead of foreach (especially for database writes), coalesce after filter to shrink the number of partitions, and repartitionAndSortWithinPartitions instead of a separate repartition plus sort.
7. Broadcast large variables.
8. Use Kryo serialization: roughly 10x faster than the default Java serialization; RDDs need their classes registered with Kryo, while the DataFrame/Dataset API already uses an efficient serialization format by default.
9. Optimize data structures.
(Items 5 and 8 are sketched in code below.)
Main text: 1. Preface. In the field of big-data computing
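A small Scala sketch of two of the principles above, map-side pre-aggregation and Kryo registration; the record class, pair RDD and app name are illustrative placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative record type to register with Kryo.
case class MyRecord(id: Long, value: Double)

val conf = new SparkConf()
  .setAppName("tuning-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRecord]))
val sc = new SparkContext(conf)

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Map-side pre-aggregation: reduceByKey combines values per key before the
// shuffle, so less data crosses the network than with groupByKey.
val sums = pairs.reduceByKey(_ + _)                   // preferred
// val sums = pairs.groupByKey().mapValues(_.sum)     // shuffles every value
```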

PySpark: Convert a pair RDD back to a regular RDD

Submitted by ε祈祈猫儿з on 2019-12-13 01:35:08
Question: Is there any way to convert a pair RDD back to a regular RDD? Suppose I have a local CSV file, and I first load it as a regular RDD: rdd = sc.textFile("$path/$csv") Then I create a pair RDD (i.e. the key is the string before "," and the value is the string after ","): pairRDD = rdd.map(lambda x : (x.split(",")[0], x.split(",")[1])) I store the pairRDD using saveAsTextFile(): pairRDD.saveAsTextFile("$savePath") However, as I found, the stored file contains some unnecessary characters,
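A Scala sketch of the idea (the question is PySpark, but the calls are the same): a pair RDD is just an RDD of 2-tuples, so mapping each tuple back to a delimited string gives a plain RDD of lines that saveAsTextFile writes without the tuple formatting. Paths are placeholders.

```scala
val rdd = sc.textFile("/path/to/input.csv")

// Split each line into (key, rest-of-line); assumes every line has a comma.
val pairRDD = rdd.map { line =>
  val parts = line.split(",", 2)
  (parts(0), parts(1))
}

// Turn the pairs back into plain comma-separated lines before saving.
val plainRDD = pairRDD.map { case (k, v) => s"$k,$v" }
plainRDD.saveAsTextFile("/path/to/output")
```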

Ordered union on spark RDDs

Submitted by 喜你入骨 on 2019-12-13 01:27:45
Question: I am trying to sort key-record pairs by key using Apache Spark. The key is 10 bytes long and the value is about 90 bytes long. In other words, I am trying to replicate the sort benchmark Databricks used to break the sorting record. One of the things I noticed from the documentation is that they sorted on key-line-number pairs, as opposed to key-record pairs, probably to be cache/TLB friendly. I tried to replicate this approach but have not found a suitable solution. Here is what I have
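A hedged sketch of the straightforward version, sorting (key, record) pairs with sortByKey, which range-partitions by key so the output partitions are globally ordered; the input path, record layout (10-byte key prefix) and partition count are placeholders, and it does not reproduce the key-plus-pointer trick mentioned in the question.

```scala
// One record per line; the first 10 bytes of each line are the key.
val records = sc.textFile("/path/to/records")
  .map(line => (line.take(10), line))

// sortByKey range-partitions by key and sorts within each partition,
// giving a total order across the output files.
val sorted = records.sortByKey(ascending = true, numPartitions = 200)

sorted.map(_._2).saveAsTextFile("/path/to/sorted-output")
```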