apache-spark

Multiply SparseVectors element-wise

匆匆过客 submitted on 2021-02-08 08:14:14
Question: I have two RDDs and I want to multiply them element-wise. Let's say I have the following RDDs (example):

a = ((1,[0.28,1,0.55]),(2,[0.28,1,0.55]),(3,[0.28,1,0.55]))
aRDD = sc.parallelize(a)
b = ((1,[0.28,0,0]),(2,[0,0,0]),(3,[0,1,0]))
bRDD = sc.parallelize(b)

It can be seen that b is sparse and I want to avoid multiplying a zero value with another value. I am doing the following:

from pyspark.mllib.linalg import Vectors
def create_sparce_matrix(a_list):
    length = len(a_list
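A minimal Scala sketch of the same idea (the question uses PySpark; sc is assumed to be the usual SparkContext and the data below just mirrors the example): join the two RDDs on their keys and multiply only the positions where the sparse side is non-zero.

// placeholder data mirroring the question
val a = sc.parallelize(Seq((1, Array(0.28, 1.0, 0.55)), (2, Array(0.28, 1.0, 0.55)), (3, Array(0.28, 1.0, 0.55))))
val b = sc.parallelize(Seq((1, Array(0.28, 0.0, 0.0)), (2, Array(0.0, 0.0, 0.0)), (3, Array(0.0, 1.0, 0.0))))

// pair the vectors by key and keep only the non-zero positions of b
val product = a.join(b).mapValues { case (dense, sparse) =>
  sparse.zipWithIndex.collect { case (v, i) if v != 0.0 => (i, v * dense(i)) }
}

The result keeps (index, product) pairs per key, which is one way to represent the sparse output; building an actual SparseVector from those pairs follows the same pattern.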

add columns in dataframes dynamically with column names as elements in List

青春壹個敷衍的年華 submitted on 2021-02-08 08:06:42
Question: I have a List of N elements like the one below, where N can be any number of elements:

val check = List("a", "b", "c", "d")

I have a dataframe with a single column called "value". Based on the contents of value, I need to create N columns whose names are the elements of the list and whose contents are substring(x,y). I have tried all the approaches I know of, like withColumn and selectExpr, and nothing works. Please consider substring(X,Y) where X and Y are some numbers derived from metadata. Below are the different pieces of code I tried,
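A sketch of one common way to do this, assuming df is the single-column dataframe and that the substring offsets come from the metadata (the offsets used below are placeholders): fold the list over the dataframe and add one column per element with withColumn.

import org.apache.spark.sql.functions.{col, substring}

val check = List("a", "b", "c", "d")

// fold over the list, adding one column per name; the (start, length)
// arguments are placeholders for whatever the metadata dictates
val result = check.zipWithIndex.foldLeft(df) { case (acc, (name, i)) =>
  acc.withColumn(name, substring(col("value"), i * 2 + 1, 2))
}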

How to read many tables from the same database and save them to their own CSV file?

人盡茶涼 submitted on 2021-02-08 08:01:32
Question: Below is working code that connects to a SQL Server and saves one table to a CSV file.

conf = new SparkConf().setAppName("test").setMaster("local").set("spark.driver.allowMultipleContexts", "true");
sc = new SparkContext(conf)
sqlContext = new SQLContext(sc)
df = sqlContext.read.format("jdbc").option("url","jdbc:sqlserver://DBServer:PORT").option("databaseName","xxx").option("driver","com.microsoft.sqlserver.jdbc.SQLServerDriver").option("dbtable","xxx").option("user","xxx").option(
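A sketch of the usual extension (table names, output paths, and credentials below are placeholders): loop over a list of table names, read each one over JDBC, and write it to its own CSV output. On Spark 1.x this relies on the spark-csv package; on Spark 2+ df.write.csv(...) does the same job.

val tables = Seq("table1", "table2", "table3") // placeholder table names

tables.foreach { name =>
  val df = sqlContext.read.format("jdbc")
    .option("url", "jdbc:sqlserver://DBServer:PORT")
    .option("databaseName", "xxx")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("dbtable", name)
    .option("user", "xxx")
    .option("password", "xxx")
    .load()
  // one CSV output directory per table (path is a placeholder)
  df.write.format("com.databricks.spark.csv").save(s"/output/$name")
}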

Remove Null from Array Columns in Dataframe in Scala with Spark (1.6)

假装没事ソ submitted on 2021-02-08 07:57:43
Question: I have a dataframe with a key column and a column that holds an array of structs. The schema looks like this:

root
 |-- id: string (nullable = true)
 |-- desc: array (nullable = false)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- age: long (nullable = false)

The array "desc" can contain any number of null values. Using Spark 1.6, I would like to create a final dataframe in which the array has none of the null values. An example would be: Key . Value 1010
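A sketch of one workaround for Spark 1.6, where there is no built-in way to filter array elements (the column names come from the schema above, the rest is an assumption): drop the nulls on the RDD side and rebuild the dataframe with the original schema.

import org.apache.spark.sql.Row

val schema = df.schema
val cleanedRdd = df.rdd.map { row =>
  val desc = row.getAs[Seq[Row]]("desc")
  // keep only the non-null struct elements of the array
  Row(row.getAs[String]("id"), if (desc == null) desc else desc.filter(_ != null))
}
val cleaned = sqlContext.createDataFrame(cleanedRdd, schema)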

generating join condition dynamically in spark/scala

*爱你&永不变心* submitted on 2021-02-08 07:56:37
Question: I want to be able to pass the join condition for two data frames as an input string. The idea is to make the join generic enough that the user can pass whatever condition they like. Here's how I am doing it right now. Although it works, I don't think it's clean.

val testInput = Array("a=b", "c=d")
val condition: Column = testInput.map(x => testMethod(x)).reduce((a,b) => a.and(b))
firstDataFrame.join(secondDataFrame, condition, "fullouter")

Here's the testMethod: def testMethod(inputString:
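The excerpt is cut off before testMethod, but a hypothetical helper along those lines might look like the sketch below (not the author's code): split each "left=right" string into an equality Column and AND the results together.

import org.apache.spark.sql.{Column, DataFrame}

// hypothetical helper: "a=b" becomes left("a") === right("b")
def buildCondition(pairs: Seq[String], left: DataFrame, right: DataFrame): Column =
  pairs.map { p =>
    val Array(l, r) = p.split("=")
    left(l) === right(r)
  }.reduce(_ && _)

// usage:
// firstDataFrame.join(secondDataFrame,
//   buildCondition(Seq("a=b", "c=d"), firstDataFrame, secondDataFrame), "fullouter")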

MapWithStateRDDRecord with kryo

旧城冷巷雨未停 submitted on 2021-02-08 07:54:49
Question: How can I register MapWithStateRDDRecord with kryo? When I try

sparkConfiguration.registerKryoClasses(Array(classOf[org.apache.spark.streaming.rdd.MapWithStateRDDRecord]))

I get an error:

class MapWithStateRDDRecord in package rdd cannot be accessed in package org.apache.spark.streaming.rdd
[error] classOf[org.apache.spark.streaming.rdd.MapWithStateRDDRecord]

I'd like to make sure that all serialization is done via kryo, so I set SparkConf().set("spark.kryo.registrationRequired", "true")
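A sketch of one commonly suggested workaround (an assumption, not taken from the post): since the class is package-private to Spark, resolve it by name with Class.forName instead of classOf.

import org.apache.spark.SparkConf

val sparkConfiguration = new SparkConf()
  .set("spark.kryo.registrationRequired", "true")
  // classOf[...] fails because the class is private to org.apache.spark.streaming.rdd,
  // so look it up reflectively by its fully qualified name instead
  .registerKryoClasses(Array(
    Class.forName("org.apache.spark.streaming.rdd.MapWithStateRDDRecord")
  ))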

Creating combination of value list with existing key - Pyspark

蹲街弑〆低调 submitted on 2021-02-08 07:45:03
Question: My rdd consists of data that looks like: (k, [v1,v2,v3...]). I want to create all two-element combinations of the value part, so the end result should look like:

(k1, (v1,v2))
(k1, (v1,v3))
(k1, (v2,v3))

I know that to get the value part I could use something like rdd.cartesian(rdd).filter(case (a,b) => a < b). However, that requires the entire rdd to be passed (right?), not just the value part. I am unsure how to arrive at my desired end; I suspect it's a groupBy. Also, ultimately, I want to get
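A minimal Scala sketch of a cartesian-free route (the question is about PySpark, where itertools.combinations inside flatMapValues plays the same role; the data below is a placeholder): generate the 2-element combinations of each value list directly, keeping the key.

// placeholder data in the (k, [v1, v2, v3]) shape from the question
val rdd = sc.parallelize(Seq(("k1", List("v1", "v2", "v3"))))

// flatMapValues keeps the key and expands each list into its 2-element combinations
val pairs = rdd.flatMapValues(_.combinations(2).map { case Seq(a, b) => (a, b) })
// -> (k1,(v1,v2)), (k1,(v1,v3)), (k1,(v2,v3))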
