apache-spark

How to make VectorAssembler not compress the data?

Submitted on 2021-01-28 05:32:51
Question: I want to transform multiple columns into one column using VectorAssembler, but the data is compressed (stored as sparse vectors) by default and there is no option to change this.

val arr2 = Array((1,2,0,0,0),(1,2,3,0,0),(1,2,4,5,0),(1,2,2,5,6))
val df = sc.parallelize(arr2).toDF("a","b","c","e","f")
val colNames = Array("a","b","c","e","f")
val assembler = new VectorAssembler()
  .setInputCols(colNames)
  .setOutputCol("newCol")
val transDF = assembler.transform(df).select(col("newCol"))
transDF.show(false)

The input is: +---+---+---+---+---+ |
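
VectorAssembler itself has no switch for this: it emits whichever representation (sparse or dense) is smaller. A common workaround, shown here as a minimal sketch rather than as part of the original question, is to convert the assembled column to dense vectors with a UDF:

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Convert the (possibly sparse) VectorAssembler output to a dense vector.
val toDense = udf((v: Vector) => v.toDense)
val denseDF = transDF.withColumn("newCol", toDense(col("newCol")))
denseDF.show(false)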

How to force Spark to only execute a transformation once?

Submitted by 天涯浪子 on 2021-01-28 05:30:30
Question: I have a Spark job which samples my input data randomly. Then I generate a Bloom filter for the input data. Finally, I apply the filter and join the data with dataset A. Since the sampling is random, it should be executed only once. But it executes twice even if I persist it. I can see a green cache step in the Spark DAG of the first step, but the join still starts from data loading and random sampling. I also found that the cached data can be evicted when workers are running out of memory, which
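
A minimal sketch of the usual pattern, assuming a DataFrame pipeline (inputDF and spark are placeholders, not names from the question): persist the sampled data, force it with an action before it is reused, and checkpoint it if eviction must never trigger recomputation.

import org.apache.spark.storage.StorageLevel

// Illustrative sampling step standing in for the question's random sample.
val sampled = inputDF
  .sample(withReplacement = false, fraction = 0.1)
  .persist(StorageLevel.MEMORY_AND_DISK)

// Materialize the sample once, before the Bloom filter and join reuse it.
sampled.count()

// Checkpointing cuts the lineage, so even if cached blocks are evicted
// Spark cannot go back and re-run the random sampling.
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
val sampledStable = sampled.checkpoint()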

Set HBase properties for a Spark job using spark-submit

Submitted by 一曲冷凌霜 on 2021-01-28 05:26:16
Question: During an HBase data migration I encountered a java.lang.IllegalArgumentException: KeyValue size too large. In the long term I need to increase the property hbase.client.keyvalue.maxsize (from 1048576 to 10485760) in /etc/hbase/conf/hbase-site.xml, but I can't change this file right now (I need validation). In the short term, I have successfully imported the data using the command:

hbase org.apache.hadoop.hbase.mapreduce.Import \
  -Dhbase.client.keyvalue.maxsize=10485760 \
  myTable \
  myBackupFile

Now I need to
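
Inside a Spark job, the same property can also be overridden on the HBase Configuration object before a connection is created, instead of editing hbase-site.xml. This is only a sketch under that assumption; the surrounding connection code is illustrative and not taken from the question.

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.ConnectionFactory

// Start from hbase-site.xml on the classpath, then override just this property.
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set("hbase.client.keyvalue.maxsize", "10485760")

// Any connection built from this configuration uses the larger limit.
val connection = ConnectionFactory.createConnection(hbaseConf)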

How to implement existing code with aggregateMessages in Spark's Pregel API?

Submitted by 天大地大妈咪最大 on 2021-01-28 05:20:04
Question: I have implemented computing a similarity between nodes of a graph in aggregateMessages. During this, the intersection (the common neighbors) of two nodes is computed and sent to both of them; the message is a double. Each node receives it and sums it up to compute its own similarity total. The similarity is the Jaccard similarity. I have a graph whose structure looks like this:

(vertexID, List[neighbors ID])
(vertexID, List[neighbors ID])
(vertexID, List[neighbors ID])
...
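
A minimal sketch of that pattern in GraphX, assuming the vertex attribute is the neighbor-ID set as in the structure above (the tiny example graph, and sc, are illustrative and not from the question):

import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Vertex attribute = the vertex's neighbor IDs.
val vertices: RDD[(VertexId, Set[VertexId])] = sc.parallelize(Seq(
  (1L, Set[VertexId](2L, 3L)),
  (2L, Set[VertexId](1L, 3L)),
  (3L, Set[VertexId](1L, 2L))
))
val edges: RDD[Edge[Int]] = sc.parallelize(Seq(
  Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(1L, 3L, 1)
))
val graph = Graph(vertices, edges)

// Send the size of the endpoints' neighbor intersection to both ends of
// every edge, then sum the contributions each vertex receives.
val commonNeighborSums = graph.aggregateMessages[Double](
  ctx => {
    val common = ctx.srcAttr.intersect(ctx.dstAttr).size.toDouble
    ctx.sendToSrc(common)
    ctx.sendToDst(common)
  },
  _ + _
)
commonNeighborSums.collect().foreach(println)

Dividing each vertex's sum by the corresponding neighborhood-union size would turn these intersection counts into the Jaccard similarity itself.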

Spark Streaming + Kafka Integration : Support new topic subscriptions without requiring restart of the streaming context

Submitted by 核能气质少年 on 2021-01-28 05:16:07
Question: I am using a Spark Streaming application (Spark 2.1) to consume data from Kafka (0.10.1) topics. I want to subscribe to new topics without restarting the streaming context. Is there any way to achieve this? I can see a JIRA ticket for this in the Apache Spark project (https://issues.apache.org/jira/browse/SPARK-10320). Even though it was closed in version 2.0, I couldn't find any documentation or example of how to do this. If any of you are familiar with this, please provide me a documentation link or
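
One option the 0.10 integration offers is subscribing by topic pattern rather than by a fixed topic list: topics created later that match the pattern are picked up by the running stream after the consumer's metadata refresh. A minimal sketch, with illustrative Kafka parameters and an assumed StreamingContext named ssc:

import java.util.regex.Pattern
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.SubscribePattern

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "my-streaming-group",
  "auto.offset.reset" -> "latest"
)

// Every topic matching "events-.*" is consumed, including topics that are
// created after the stream has started.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  SubscribePattern[String, String](Pattern.compile("events-.*"), kafkaParams)
)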

Sum one column's values if the other columns match

Submitted by 被刻印的时光 ゝ on 2021-01-28 05:13:49
Question: I have a Spark dataframe like this:

word1 word2 co-occur
----- ----- --------
w1    w2    10
w2    w1    15
w2    w3    11

And my expected result is:

word1 word2 co-occur
----- ----- --------
w1    w2    25
w2    w3    11

I tried the dataframe's groupBy and aggregate functions but I couldn't come up with the solution.

Answer 1: You need a single column containing both words in sorted order; this column can then be used for the groupBy. You can create a new column with an array containing word1 and word2 as follows: df.withColumn(
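
A sketch of the approach the answer describes, continuing past the point where the snippet is cut off (column names come from the question; splitting the pair back into two columns is an added step):

import org.apache.spark.sql.functions.{array, col, sort_array, sum}

val result = df
  .withColumn("pair", sort_array(array(col("word1"), col("word2"))))
  .groupBy("pair")
  .agg(sum("co-occur").as("co-occur"))
  .select(col("pair")(0).as("word1"), col("pair")(1).as("word2"), col("co-occur"))
result.show()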

Task data locality NO_PREF. When is it used?

Submitted by 我怕爱的太早我们不能终老 on 2021-01-28 04:12:06
Question: According to the Spark docs, there are 5 levels of data locality:

PROCESS_LOCAL
NODE_LOCAL
NO_PREF
RACK_LOCAL
ANY

All of them are pretty clear to me apart from NO_PREF (from the Spark docs: "data is accessed equally quickly from anywhere and has no locality preference"). In what case would NO_PREF be used?

Answer 1: One of an RDD's characteristics is its preferred locations. For example, if the RDD's source is an HDFS file, the preferred locations should contain the data nodes where the data is physically located. But if there is
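
A quick way to see this in practice (a small illustrative check, not part of the original answer): an RDD built from an in-memory collection carries no block locations, so its tasks are scheduled with NO_PREF locality.

// An RDD parallelized from a local collection has no preferred locations.
val rdd = sc.parallelize(1 to 1000, numSlices = 4)

// preferredLocations is empty for every partition, so the scheduler has no
// locality preference and the corresponding tasks show up as NO_PREF.
rdd.partitions.foreach { p =>
  println(s"partition ${p.index}: preferred = ${rdd.preferredLocations(p)}")
}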

How can I get a distinct RDD of dicts in PySpark?

Submitted by 人走茶凉 on 2021-01-28 04:10:20
Question: I have an RDD of dictionaries, and I'd like to get an RDD of just the distinct elements. However, when I try to call rdd.distinct(), PySpark gives me the following error:

TypeError: unhashable type: 'dict'
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD

What is the cluster manager used in Databricks? How do I change the number of executors in Databricks clusters?

Submitted by 孤街醉人 on 2021-01-28 04:09:19
Question: What is the cluster manager used in Databricks? How do I change the number of executors in Databricks clusters?

Answer 1: What is the cluster manager used in Databricks? Azure Databricks builds on the capabilities of Spark by providing a zero-management cloud platform that includes:

Fully managed Spark clusters
An interactive workspace for exploration and visualization
A platform for powering your favorite Spark-based applications

The Databricks Runtime is built on top of Apache Spark and is

DataFrame using UDF giving Task not serializable Exception

Submitted by 大憨熊 on 2021-01-28 03:40:03
Question: I am trying to use the show() method on a dataframe, but it gives a Task not serializable exception. I have tried extending Serializable in the object, but the error still persists.

object App extends Serializable {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org.apache").setLevel(Level.WARN)
    val spark = SparkSession.builder()
      .appName("LearningSpark")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext
    val inputPath = "./src/resources/2015-03-01-0.json"
    val ghLog = spark.read
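
The usual cause of this exception is that the UDF's closure captures something non-serializable from the enclosing object (a logger, the SparkSession, a client object); extending Serializable on App does not make that captured field serializable. A minimal sketch of the common fix, with a made-up UDF since the question's UDF is cut off: copy whatever the UDF needs into a local val so that only that value is captured.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object App {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("LearningSpark")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Illustrative data standing in for the question's JSON input.
    val logins = Seq("alice", "bob").toDF("login")

    // Copy what the UDF needs into a local val; the closure then captures
    // only this serializable String, not the enclosing object or SparkSession.
    val suffix = "@example.com"
    val addSuffix = udf((login: String) => login + suffix)

    logins.withColumn("email", addSuffix(col("login"))).show()
  }
}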