apache-spark

Multiply SparseVectors element-wise

匆匆过客 submitted on 2021-02-08 08:14:14
Question: I have two RDDs and I want to multiply them element-wise. Let's say I have the following RDDs (example):

a = ((1,[0.28,1,0.55]),(2,[0.28,1,0.55]),(3,[0.28,1,0.55]))
aRDD = sc.parallelize(a)
b = ((1,[0.28,0,0]),(2,[0,0,0]),(3,[0,1,0]))
bRDD = sc.parallelize(b)

It can be seen that b is sparse and I want to avoid multiplying a zero value with another value. I am doing the following:

from pyspark.mllib.linalg import Vectors
def create_sparce_matrix(a_list):
    length = len(a_list
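A minimal Scala sketch of the same idea (the question uses PySpark; sc is assumed to be the usual SparkContext and the data below just mirrors the example): join the two RDDs on their keys and multiply only the positions where the sparse side is non-zero.

// placeholder data mirroring the question
val a = sc.parallelize(Seq((1, Array(0.28, 1.0, 0.55)), (2, Array(0.28, 1.0, 0.55)), (3, Array(0.28, 1.0, 0.55))))
val b = sc.parallelize(Seq((1, Array(0.28, 0.0, 0.0)), (2, Array(0.0, 0.0, 0.0)), (3, Array(0.0, 1.0, 0.0))))

// pair the vectors by key and keep only the non-zero positions of b
val product = a.join(b).mapValues { case (dense, sparse) =>
  sparse.zipWithIndex.collect { case (v, i) if v != 0.0 => (i, v * dense(i)) }
}

The result keeps (index, product) pairs per key, which is one way to represent the sparse output; building an actual SparseVector from those pairs follows the same pattern.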

add columns in dataframes dynamically with column names as elements in List

青春壹個敷衍的年華 submitted on 2021-02-08 08:06:42
Question: I have a List of N elements like the one below, where N can be any number of elements:

val check = List("a", "b", "c", "d")

I have a dataframe with a single column called "value". Based on the contents of value, I need to create N columns whose names are the elements of the list and whose contents are substring(x,y). I have tried all the approaches I know of, like withColumn and selectExpr, and nothing works. Please consider substring(X,Y) where X and Y are some numbers derived from metadata. Below are the different pieces of code I tried,
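A sketch of one common way to do this, assuming df is the single-column dataframe and that the substring offsets come from the metadata (the offsets used below are placeholders): fold the list over the dataframe and add one column per element with withColumn.

import org.apache.spark.sql.functions.{col, substring}

val check = List("a", "b", "c", "d")

// fold over the list, adding one column per name; the (start, length)
// arguments are placeholders for whatever the metadata dictates
val result = check.zipWithIndex.foldLeft(df) { case (acc, (name, i)) =>
  acc.withColumn(name, substring(col("value"), i * 2 + 1, 2))
}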

How to read many tables from the same database and save them to their own CSV file?

人盡茶涼 submitted on 2021-02-08 08:01:32
Question: Below is working code that connects to a SQL Server and saves one table to a CSV file.

conf = new SparkConf().setAppName("test").setMaster("local").set("spark.driver.allowMultipleContexts", "true");
sc = new SparkContext(conf)
sqlContext = new SQLContext(sc)
df = sqlContext.read.format("jdbc").option("url","jdbc:sqlserver://DBServer:PORT").option("databaseName","xxx").option("driver","com.microsoft.sqlserver.jdbc.SQLServerDriver").option("dbtable","xxx").option("user","xxx").option(
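A sketch of the usual extension (table names, output paths, and credentials below are placeholders): loop over a list of table names, read each one over JDBC, and write it to its own CSV output. On Spark 1.x this relies on the spark-csv package; on Spark 2+ df.write.csv(...) does the same job.

val tables = Seq("table1", "table2", "table3") // placeholder table names

tables.foreach { name =>
  val df = sqlContext.read.format("jdbc")
    .option("url", "jdbc:sqlserver://DBServer:PORT")
    .option("databaseName", "xxx")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("dbtable", name)
    .option("user", "xxx")
    .option("password", "xxx")
    .load()
  // one CSV output directory per table (path is a placeholder)
  df.write.format("com.databricks.spark.csv").save(s"/output/$name")
}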

Remove Null from Array Columns in Dataframe in Scala with Spark (1.6)

假装没事ソ submitted on 2021-02-08 07:57:43
Question: I have a dataframe with a key column and a column that holds an array of structs. The schema looks like this:

root
 |-- id: string (nullable = true)
 |-- desc: array (nullable = false)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- age: long (nullable = false)

The array "desc" can contain any number of null values. Using Spark 1.6, I would like to create a final dataframe in which the array has none of the null values. An example would be: Key . Value 1010
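A sketch of one workaround for Spark 1.6, where there is no built-in way to filter array elements (the column names come from the schema above, the rest is an assumption): drop the nulls on the RDD side and rebuild the dataframe with the original schema.

import org.apache.spark.sql.Row

val schema = df.schema
val cleanedRdd = df.rdd.map { row =>
  val desc = row.getAs[Seq[Row]]("desc")
  // keep only the non-null struct elements of the array
  Row(row.getAs[String]("id"), if (desc == null) desc else desc.filter(_ != null))
}
val cleaned = sqlContext.createDataFrame(cleanedRdd, schema)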

generating join condition dynamically in spark/scala

*爱你&永不变心* submitted on 2021-02-08 07:56:37
Question: I want to be able to pass the join condition for two data frames as an input string. The idea is to make the join generic enough that the user can pass whatever condition they like. Here's how I am doing it right now. Although it works, I don't think it's clean.

val testInput = Array("a=b", "c=d")
val condition: Column = testInput.map(x => testMethod(x)).reduce((a,b) => a.and(b))
firstDataFrame.join(secondDataFrame, condition, "fullouter")

Here's the testMethod: def testMethod(inputString:
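The excerpt is cut off before testMethod, but a hypothetical helper along those lines might look like the sketch below (not the author's code): split each "left=right" string into an equality Column and AND the results together.

import org.apache.spark.sql.{Column, DataFrame}

// hypothetical helper: "a=b" becomes left("a") === right("b")
def buildCondition(pairs: Seq[String], left: DataFrame, right: DataFrame): Column =
  pairs.map { p =>
    val Array(l, r) = p.split("=")
    left(l) === right(r)
  }.reduce(_ && _)

// usage:
// firstDataFrame.join(secondDataFrame,
//   buildCondition(Seq("a=b", "c=d"), firstDataFrame, secondDataFrame), "fullouter")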

MapWithStateRDDRecord with kryo

旧城冷巷雨未停 submitted on 2021-02-08 07:54:49
Question: How can I register MapWithStateRDDRecord with kryo? When I try

sparkConfiguration.registerKryoClasses(Array(classOf[org.apache.spark.streaming.rdd.MapWithStateRDDRecord]))

I get an error:

class MapWithStateRDDRecord in package rdd cannot be accessed in package org.apache.spark.streaming.rdd
[error] classOf[org.apache.spark.streaming.rdd.MapWithStateRDDRecord]

I'd like to make sure that all serialization is done via kryo, so I set SparkConf().set("spark.kryo.registrationRequired", "true")
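A sketch of one commonly suggested workaround (an assumption, not taken from the post): since the class is package-private to Spark, resolve it by name with Class.forName instead of classOf.

import org.apache.spark.SparkConf

val sparkConfiguration = new SparkConf()
  .set("spark.kryo.registrationRequired", "true")
  // classOf[...] fails because the class is private to org.apache.spark.streaming.rdd,
  // so look it up reflectively by its fully qualified name instead
  .registerKryoClasses(Array(
    Class.forName("org.apache.spark.streaming.rdd.MapWithStateRDDRecord")
  ))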

Creating combination of value list with existing key - Pyspark

蹲街弑〆低调 submitted on 2021-02-08 07:45:03
Question: My rdd consists of data that looks like: (k, [v1,v2,v3...]). I want to create all two-element combinations of the value part, so the end result should look like:

(k1, (v1,v2))
(k1, (v1,v3))
(k1, (v2,v3))

I know that to get the value part I could use something like rdd.cartesian(rdd).filter(case (a,b) => a < b). However, that requires the entire rdd to be passed (right?), not just the value part. I am unsure how to arrive at my desired end; I suspect it's a groupBy. Also, ultimately, I want to get
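A minimal Scala sketch of a cartesian-free route (the question is about PySpark, where itertools.combinations inside flatMapValues plays the same role; the data below is a placeholder): generate the 2-element combinations of each value list directly, keeping the key.

// placeholder data in the (k, [v1, v2, v3]) shape from the question
val rdd = sc.parallelize(Seq(("k1", List("v1", "v2", "v3"))))

// flatMapValues keeps the key and expands each list into its 2-element combinations
val pairs = rdd.flatMapValues(_.combinations(2).map { case Seq(a, b) => (a, b) })
// -> (k1,(v1,v2)), (k1,(v1,v3)), (k1,(v2,v3))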
