apache-spark

How to make VectorAssembler not compress the data?

Submitted on 2021-01-28 05:32:51
Question: I want to transform multiple columns into one column using VectorAssembler, but the data is compressed (stored as sparse vectors) by default and there is no option to change this.

val arr2 = Array((1,2,0,0,0),(1,2,3,0,0),(1,2,4,5,0),(1,2,2,5,6))
val df = sc.parallelize(arr2).toDF("a","b","c","e","f")
val colNames = Array("a","b","c","e","f")
val assembler = new VectorAssembler()
  .setInputCols(colNames)
  .setOutputCol("newCol")
val transDF = assembler.transform(df).select(col("newCol"))
transDF.show(false)

The input is: +---+---+---+---+---+ |
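
VectorAssembler itself has no switch for this: it emits whichever representation (sparse or dense) is smaller. A common workaround, shown here as a minimal sketch rather than as part of the original question, is to convert the assembled column to dense vectors with a UDF:

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Convert the (possibly sparse) VectorAssembler output to a dense vector.
val toDense = udf((v: Vector) => v.toDense)
val denseDF = transDF.withColumn("newCol", toDense(col("newCol")))
denseDF.show(false)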

How to force Spark to only execute a transformation once?

Submitted by 天涯浪子 on 2021-01-28 05:30:30
Question: I have a Spark job which samples my input data randomly. Then I generate a Bloom filter for the input data. Finally, I apply the filter and join the data with dataset A. Since the sampling is random, it should be executed only once. But it executes twice even if I persist it. I can see a green cache step in the Spark DAG of the first step, but the join still starts from data loading and random sampling. I also found that the cached data can be evicted when workers are running out of memory, which
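
A minimal sketch of the usual pattern, assuming a DataFrame pipeline (inputDF and spark are placeholders, not names from the question): persist the sampled data, force it with an action before it is reused, and checkpoint it if eviction must never trigger recomputation.

import org.apache.spark.storage.StorageLevel

// Illustrative sampling step standing in for the question's random sample.
val sampled = inputDF
  .sample(withReplacement = false, fraction = 0.1)
  .persist(StorageLevel.MEMORY_AND_DISK)

// Materialize the sample once, before the Bloom filter and join reuse it.
sampled.count()

// Checkpointing cuts the lineage, so even if cached blocks are evicted
// Spark cannot go back and re-run the random sampling.
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
val sampledStable = sampled.checkpoint()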

Set HBase properties for a Spark job using spark-submit

Submitted by 一曲冷凌霜 on 2021-01-28 05:26:16
Question: During an HBase data migration I encountered a java.lang.IllegalArgumentException: KeyValue size too large. In the long term I need to increase the property hbase.client.keyvalue.maxsize (from 1048576 to 10485760) in /etc/hbase/conf/hbase-site.xml, but I can't change this file right now (I need validation). In the short term, I have successfully imported the data using the command:

hbase org.apache.hadoop.hbase.mapreduce.Import \
  -Dhbase.client.keyvalue.maxsize=10485760 \
  myTable \
  myBackupFile

Now I need to
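
Inside a Spark job, the same property can also be overridden on the HBase Configuration object before a connection is created, instead of editing hbase-site.xml. This is only a sketch under that assumption; the surrounding connection code is illustrative and not taken from the question.

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.ConnectionFactory

// Start from hbase-site.xml on the classpath, then override just this property.
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set("hbase.client.keyvalue.maxsize", "10485760")

// Any connection built from this configuration uses the larger limit.
val connection = ConnectionFactory.createConnection(hbaseConf)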

How to implement existing code with aggregateMessages in Spark's Pregel API?

Submitted by 天大地大妈咪最大 on 2021-01-28 05:20:04
Question: I have implemented computing a similarity between nodes of a graph in aggregateMessages. During this, the intersection (the common neighbors) of two nodes is computed and sent to both of them; the message is a double. Each node receives it and sums it up to compute its own similarity total. The similarity is the Jaccard similarity. I have a graph whose structure looks like this:

(vertexID, List[neighbors ID])
(vertexID, List[neighbors ID])
(vertexID, List[neighbors ID])
...
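
A minimal sketch of that pattern in GraphX, assuming the vertex attribute is the neighbor-ID set as in the structure above (the tiny example graph, and sc, are illustrative and not from the question):

import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Vertex attribute = the vertex's neighbor IDs.
val vertices: RDD[(VertexId, Set[VertexId])] = sc.parallelize(Seq(
  (1L, Set[VertexId](2L, 3L)),
  (2L, Set[VertexId](1L, 3L)),
  (3L, Set[VertexId](1L, 2L))
))
val edges: RDD[Edge[Int]] = sc.parallelize(Seq(
  Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(1L, 3L, 1)
))
val graph = Graph(vertices, edges)

// Send the size of the endpoints' neighbor intersection to both ends of
// every edge, then sum the contributions each vertex receives.
val commonNeighborSums = graph.aggregateMessages[Double](
  ctx => {
    val common = ctx.srcAttr.intersect(ctx.dstAttr).size.toDouble
    ctx.sendToSrc(common)
    ctx.sendToDst(common)
  },
  _ + _
)
commonNeighborSums.collect().foreach(println)

Dividing each vertex's sum by the corresponding neighborhood-union size would turn these intersection counts into the Jaccard similarity itself.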

Spark Streaming + Kafka Integration : Support new topic subscriptions without requiring restart of the streaming context

Submitted by 核能气质少年 on 2021-01-28 05:16:07
Question: I am using a Spark Streaming application (Spark 2.1) to consume data from Kafka (0.10.1) topics. I want to subscribe to new topics without restarting the streaming context. Is there any way to achieve this? I can see a JIRA ticket for this in the Apache Spark project (https://issues.apache.org/jira/browse/SPARK-10320). Even though it was closed in version 2.0, I couldn't find any documentation or example of how to do this. If any of you are familiar with this, please provide me a documentation link or
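
One option the 0.10 integration offers is subscribing by topic pattern rather than by a fixed topic list: topics created later that match the pattern are picked up by the running stream after the consumer's metadata refresh. A minimal sketch, with illustrative Kafka parameters and an assumed StreamingContext named ssc:

import java.util.regex.Pattern
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.SubscribePattern

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "my-streaming-group",
  "auto.offset.reset" -> "latest"
)

// Every topic matching "events-.*" is consumed, including topics that are
// created after the stream has started.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  SubscribePattern[String, String](Pattern.compile("events-.*"), kafkaParams)
)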

Sum one column's values if the other columns match

Submitted by 被刻印的时光 ゝ on 2021-01-28 05:13:49
Question: I have a Spark dataframe like this:

word1 word2 co-occur
----- ----- --------
w1    w2    10
w2    w1    15
w2    w3    11

And my expected result is:

word1 word2 co-occur
----- ----- --------
w1    w2    25
w2    w3    11

I tried the dataframe's groupBy and aggregate functions but I couldn't come up with the solution.

Answer 1: You need a single column containing both words in sorted order; this column can then be used for the groupBy. You can create a new column with an array containing word1 and word2 as follows: df.withColumn(
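
A sketch of the approach the answer describes, continuing past the point where the snippet is cut off (column names come from the question; splitting the pair back into two columns is an added step):

import org.apache.spark.sql.functions.{array, col, sort_array, sum}

val result = df
  .withColumn("pair", sort_array(array(col("word1"), col("word2"))))
  .groupBy("pair")
  .agg(sum("co-occur").as("co-occur"))
  .select(col("pair")(0).as("word1"), col("pair")(1).as("word2"), col("co-occur"))
result.show()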

Task data locality NO_PREF. When is it used?

Submitted by 我怕爱的太早我们不能终老 on 2021-01-28 04:12:06
Question: According to the Spark docs, there are 5 levels of data locality:

PROCESS_LOCAL
NODE_LOCAL
NO_PREF
RACK_LOCAL
ANY

All of them are pretty clear to me apart from NO_PREF (from the Spark docs: "data is accessed equally quickly from anywhere and has no locality preference"). In what case would NO_PREF be used?

Answer 1: One of an RDD's characteristics is its preferred locations. For example, if the RDD's source is an HDFS file, the preferred locations should contain the data nodes where the data is physically located. But if there is
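
A quick way to see this in practice (a small illustrative check, not part of the original answer): an RDD built from an in-memory collection carries no block locations, so its tasks are scheduled with NO_PREF locality.

// An RDD parallelized from a local collection has no preferred locations.
val rdd = sc.parallelize(1 to 1000, numSlices = 4)

// preferredLocations is empty for every partition, so the scheduler has no
// locality preference and the corresponding tasks show up as NO_PREF.
rdd.partitions.foreach { p =>
  println(s"partition ${p.index}: preferred = ${rdd.preferredLocations(p)}")
}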

How can I get a distinct RDD of dicts in PySpark?

Submitted by 人走茶凉 on 2021-01-28 04:10:20
Question: I have an RDD of dictionaries, and I'd like to get an RDD of just the distinct elements. However, when I try to call rdd.distinct(), PySpark gives me the following error:

TypeError: unhashable type: 'dict'
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD

What is the cluster manager used in Databricks? How do I change the number of executors in Databricks clusters?

Submitted by 孤街醉人 on 2021-01-28 04:09:19
Question: What is the cluster manager used in Databricks? How do I change the number of executors in Databricks clusters?

Answer 1: What is the cluster manager used in Databricks? Azure Databricks builds on the capabilities of Spark by providing a zero-management cloud platform that includes:

Fully managed Spark clusters
An interactive workspace for exploration and visualization
A platform for powering your favorite Spark-based applications

The Databricks Runtime is built on top of Apache Spark and is

DataFrame using UDF giving Task not serializable Exception

Submitted by 大憨熊 on 2021-01-28 03:40:03
Question: I am trying to use the show() method on a dataframe, but it gives a Task not serializable exception. I have tried extending Serializable in the object, but the error still persists.

object App extends Serializable {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org.apache").setLevel(Level.WARN)
    val spark = SparkSession.builder()
      .appName("LearningSpark")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext
    val inputPath = "./src/resources/2015-03-01-0.json"
    val ghLog = spark.read
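
The usual cause of this exception is that the UDF's closure captures something non-serializable from the enclosing object (a logger, the SparkSession, a client object); extending Serializable on App does not make that captured field serializable. A minimal sketch of the common fix, with a made-up UDF since the question's UDF is cut off: copy whatever the UDF needs into a local val so that only that value is captured.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object App {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("LearningSpark")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Illustrative data standing in for the question's JSON input.
    val logins = Seq("alice", "bob").toDF("login")

    // Copy what the UDF needs into a local val; the closure then captures
    // only this serializable String, not the enclosing object or SparkSession.
    val suffix = "@example.com"
    val addSuffix = udf((login: String) => login + suffix)

    logins.withColumn("email", addSuffix(col("login"))).show()
  }
}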