apache-spark-ml

ALS model - how to generate full_u * v^t * v?

Submitted by 做~自己de王妃 on 2019-11-27 05:12:44
I'm trying to figure out how an ALS model can predict values for new users in between its updates by a batch process. In my search, I came across this Stack Overflow answer. I've copied the answer below for the reader's convenience: You can get predictions for new users using the trained model (without updating it): To get predictions for a user in the model, you use its latent representation (a vector u of size f, the number of factors), which is multiplied by the product latent factor matrix (a matrix made of the latent representations of all products, a bunch of vectors of size f) and gives
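
The linked answer is cut off above, but the underlying idea is to fold a new user's full ratings vector into the already-trained latent space. Below is a minimal NumPy sketch of that idea, assuming V is the item-factor matrix taken from the trained ALS model and full_u is the new user's ratings over all items; the regularized least-squares projection shown here is one common way to do it, not necessarily the exact formula from the linked answer.

    import numpy as np

    num_items, f = 100, 10
    V = np.random.rand(num_items, f)      # stand-in for the model's item latent factors
    full_u = np.zeros(num_items)          # the new user's ratings over all items
    full_u[[3, 17, 42]] = [5.0, 3.0, 4.0]

    lam = 0.1                             # regularization, ideally the same as in training
    # Project the new user into factor space: u = full_u . V . (V^T V + lam*I)^-1
    u = full_u @ V @ np.linalg.inv(V.T @ V + lam * np.eye(f))

    scores = u @ V.T                      # predicted preference for every item
    top_items = np.argsort(-scores)[:10]  # indices of the 10 highest-scoring items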

Spark, DataFrame: apply transformer/estimator on groups

Submitted by 点点圈 on 2019-11-27 04:38:31
Question: I have a DataFrame that looks as follows:

    +-----------+-----+------------+
    |     userID|group|    features|
    +-----------+-----+------------+
    |12462563356|    1|  [5.0,43.0]|
    |12462563701|    2|   [1.0,8.0]|
    |12462563701|    1|  [2.0,12.0]|
    |12462564356|    1|   [1.0,1.0]|
    |12462565487|    3|   [2.0,3.0]|
    |12462565698|    2|   [1.0,1.0]|
    |12462565698|    1|   [1.0,1.0]|
    |12462566081|    2|   [1.0,2.0]|
    |12462566081|    1|  [1.0,15.0]|
    |12462566225|    2|   [1.0,1.0]|
    |12462566225|    1|  [9.0,85.0]|
    |12462566526|    2|   [1.0,1.0]|
    |12462566526|    1|  [3.0
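
The question is truncated above, but one straightforward (if not the most efficient) workaround is to fit a separate estimator per group by filtering the DataFrame. A hedged pyspark sketch, using MinMaxScaler purely as a stand-in estimator and a shortened reconstruction of the DataFrame above:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import MinMaxScaler
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(12462563356, 1, Vectors.dense([5.0, 43.0])),
         (12462563701, 2, Vectors.dense([1.0, 8.0])),
         (12462563701, 1, Vectors.dense([2.0, 12.0]))],
        ["userID", "group", "features"])

    # Fit one estimator per group; simple, but launches one Spark job per group.
    models = {}
    for (g,) in df.select("group").distinct().collect():
        scaler = MinMaxScaler(inputCol="features", outputCol="scaled")
        models[g] = scaler.fit(df.filter(df.group == g))

    scaled = {g: m.transform(df.filter(df.group == g)) for g, m in models.items()}

This keeps everything in the DataFrame API but scales poorly with many groups, since each group triggers its own fit.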

Tuning parameters for implicit pyspark.ml ALS matrix factorization model through pyspark.ml CrossValidator

Submitted by 半腔热情 on 2019-11-27 03:54:14
I'm trying to tune the parameters of an ALS matrix factorization model that uses implicit data. For this, I'm trying to use pyspark.ml.tuning.CrossValidator to run through a parameter grid and select the best model. I believe my problem is in the evaluator, but I can't figure it out. I can get this to work for an explicit data model with a regression RMSE evaluator, as follows:

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext
    from pyspark.ml.recommendation import ALS
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
    from pyspark.ml.evaluation import
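
For reference, here is a self-contained sketch of the explicit-data case the question describes (CrossValidator over an ALS parameter grid with an RMSE evaluator); the toy ratings and grid values are my own, and coldStartStrategy="drop" (available in recent Spark versions) keeps RMSE from becoming NaN when a fold contains unseen users or items:

    from pyspark.sql import SparkSession
    from pyspark.ml.recommendation import ALS
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
    from pyspark.ml.evaluation import RegressionEvaluator

    spark = SparkSession.builder.getOrCreate()
    ratings = spark.createDataFrame(
        [(u, i, float((u + 1) * (i + 1) % 5 + 1)) for u in range(3) for i in range(3)],
        ["user", "item", "rating"])

    als = ALS(userCol="user", itemCol="item", ratingCol="rating",
              coldStartStrategy="drop")        # drop NaN predictions for unseen users/items
    grid = (ParamGridBuilder()
            .addGrid(als.rank, [5, 10])
            .addGrid(als.regParam, [0.01, 0.1])
            .build())
    evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                    predictionCol="prediction")
    cv = CrossValidator(estimator=als, estimatorParamMaps=grid,
                        evaluator=evaluator, numFolds=2)
    best_model = cv.fit(ratings).bestModel

For implicit data the sticking point is exactly the evaluator: RMSE against the raw preference matrix is not very meaningful, so a custom evaluator (e.g. one based on a ranking metric) is usually needed.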

Feature normalization algorithm in Spark

Submitted by 我的梦境 on 2019-11-27 02:03:43
Question: I am trying to understand Spark's normalization algorithm. My small test set contains 5 vectors:

    {0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0},
    {1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 70000.0},
    {-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, 70000.0},
    {-0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0},
    {0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 70000.0}

I would expect new Normalizer().transform(vectors) to create a JavaRDD where each vector feature is normalized as (v - mean) / stdev across all values of feature-0, feature-1, etc.
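
For what it's worth, the (v - mean) / stdev behaviour the question expects is what StandardScaler does; Normalizer instead rescales each individual vector to unit p-norm. A small pyspark sketch contrasting the two (the question's code is Java/RDD-based; this uses the DataFrame API with made-up data):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import Normalizer, StandardScaler
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(Vectors.dense([0.95, 0.018, 24.0, 70000.0]),),
         (Vectors.dense([1.0, 1.0, 1.0, 70000.0]),),
         (Vectors.dense([-1.0, -1.0, -1.0, 70000.0]),)],
        ["features"])

    # Normalizer: each row is divided by its own L2 norm (no fitting involved).
    Normalizer(inputCol="features", outputCol="unit_norm", p=2.0) \
        .transform(df).show(truncate=False)

    # StandardScaler: per-feature (v - mean) / stdev computed across the whole dataset.
    scaler = StandardScaler(inputCol="features", outputCol="standardized",
                            withMean=True, withStd=True)
    scaler.fit(df).transform(df).show(truncate=False)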

What's the difference between Spark ML and MLLIB packages

Submitted by 血红的双手。 on 2019-11-27 02:03:02
Question: I noticed there are two LinearRegressionModel classes in Spark ML, one in the ML package and another in the MLlib package. These two are implemented quite differently - e.g. the one from MLlib implements Serializable, while the other one does not. By the way, the same is true of RandomForestModel. Why are there two classes? Which is the "right" one? And is there a way to convert one into the other? Answer 1: o.a.s.mllib contains the old RDD-based API, while o.a.s.ml contains the new API built around Dataset and ML
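
Below is a minimal sketch of the split the answer describes, with made-up data, reaching linear regression through both packages (the pyspark.mllib variant is deprecated in Spark 2.x and works on RDDs of LabeledPoint, while the pyspark.ml variant works on DataFrames):

    from pyspark.sql import SparkSession
    from pyspark.ml.regression import LinearRegression            # new DataFrame-based API
    from pyspark.ml.linalg import Vectors
    from pyspark.mllib.regression import LinearRegressionWithSGD, LabeledPoint  # old RDD-based API

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame([(1.0, Vectors.dense([0.0])),
                                (3.0, Vectors.dense([1.0]))], ["label", "features"])
    ml_model = LinearRegression().fit(df)          # pyspark.ml.regression.LinearRegressionModel

    rdd = spark.sparkContext.parallelize([LabeledPoint(1.0, [0.0]),
                                          LabeledPoint(3.0, [1.0])])
    mllib_model = LinearRegressionWithSGD.train(rdd, iterations=10)  # pyspark.mllib model

There is no general built-in conversion between the two model classes; the usual advice is to stay on the pyspark.ml side for new code.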

How to find mean of grouped Vector columns in Spark SQL?

Submitted by 寵の児 on 2019-11-27 02:02:25
I have created a RelationalGroupedDataset by calling instances.groupBy(instances.col("property_name")):

    val x = instances.groupBy(instances.col("property_name"))

How do I compose a user-defined aggregate function to perform Statistics.colStats().mean on each group? Thanks!

Answer (user6910411): Spark >= 2.4 - You can use Summarizer:

    import org.apache.spark.ml.stat.Summarizer

    val dfNew = df.as[(Int, org.apache.spark.mllib.linalg.Vector)]
      .map { case (group, v) => (group, v.asML) }
      .toDF("group", "features")

    dfNew
      .groupBy($"group")
      .agg(Summarizer.mean($"features").alias("means"))
      .show(false)

    +-----+---
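
For completeness, the same Summarizer aggregation is available from Python in Spark 2.4+; a short pyspark sketch with made-up data:

    from pyspark.sql import SparkSession
    from pyspark.ml.stat import Summarizer          # Spark >= 2.4
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, Vectors.dense([1.0, 2.0])),
                                (1, Vectors.dense([3.0, 4.0])),
                                (2, Vectors.dense([5.0, 6.0]))],
                               ["group", "features"])

    df.groupBy("group") \
      .agg(Summarizer.mean(df.features).alias("means")) \
      .show(truncate=False)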

How to merge multiple feature vectors in DataFrame?

Submitted by 无人久伴 on 2019-11-27 01:57:12
Question: Using Spark ML transformers I arrived at a DataFrame where each row looks like this:

    Row(object_id, text_features_vector, color_features, type_features)

where text_features is a sparse vector of term weights, color_features is a small 20-element (one-hot-encoded) dense vector of colors, and type_features is also a one-hot-encoded dense vector of types. What would be a good approach (using Spark's facilities) to merge these features into one single, large array, so that I can measure things like the
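
The question is cut off above, but the standard Spark facility for concatenating several vector columns into one is VectorAssembler. A hedged sketch with made-up columns mirroring the Row layout described:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1,
          Vectors.sparse(5, [0, 3], [0.5, 1.2]),   # text term weights
          Vectors.dense([0.0, 1.0, 0.0]),          # one-hot colors (shortened)
          Vectors.dense([1.0, 0.0]))],             # one-hot types (shortened)
        ["object_id", "text_features_vector", "color_features", "type_features"])

    assembler = VectorAssembler(
        inputCols=["text_features_vector", "color_features", "type_features"],
        outputCol="features")                      # concatenates the input vectors end to end
    assembler.transform(df).select("object_id", "features").show(truncate=False)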

Spark Structured Streaming and Spark-Ml Regression

Submitted by 六月ゝ 毕业季﹏ on 2019-11-26 23:39:45
Question: Is it possible to apply Spark ML regression to streaming sources? I see there is StreamingLogisticRegressionWithSGD, but it's for the older RDD API and I couldn't use it with structured streaming sources. How am I supposed to apply regressions to structured streaming sources? (A little off-topic) If I cannot use the streaming API for regression, how can I commit offsets or the like to the source in a batch-processing way? (Kafka sink) Answer 1: Today (Spark 2.2 / 2.3) there is no support for machine learning in Structured
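
While Structured Streaming (as of Spark 2.2/2.3) cannot train ML models, a fitted pipeline's transform generally does work on a streaming DataFrame, so one common workaround is to train on static data and only score the stream. A hedged sketch using the built-in rate source as a toy stream:

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    spark = SparkSession.builder.getOrCreate()

    # Train offline on a static DataFrame.
    static_df = spark.createDataFrame([(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)], ["x", "label"])
    pipeline = Pipeline(stages=[VectorAssembler(inputCols=["x"], outputCol="features"),
                                LinearRegression()])
    model = pipeline.fit(static_df)

    # Score a streaming source with the fitted model.
    stream_df = (spark.readStream.format("rate").load()
                 .selectExpr("CAST(value AS DOUBLE) AS x"))
    query = (model.transform(stream_df)
             .writeStream.format("console").outputMode("append").start())
    # query.awaitTermination()  # uncomment to keep the stream running

Retraining on new data still has to happen in a separate batch job; this only covers prediction.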

Attach metadata to vector column in Spark

Submitted by 不问归期 on 2019-11-26 23:03:57
Context: I have a data frame with two columns: label and features.

    org.apache.spark.sql.DataFrame = [label: int, features: vector]

where features is an mllib.linalg.VectorUDT of numeric type built using VectorAssembler. Question: Is there a way to assign a schema to the features vector? I want to keep track of the name of each feature. Tried so far:

    val defaultAttr = NumericAttribute.defaultAttr
    val attrs = Array("feat1", "feat2", "feat3").map(defaultAttr.withName)
    val attrGroup = new
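
The snippet above is cut off; the usual Scala route is to wrap the attributes in an org.apache.spark.ml.attribute.AttributeGroup and attach its metadata to the column. A hedged pyspark sketch of the same idea, writing the "ml_attr" metadata by hand (the feature names come from the question; the metadata layout mirrors what VectorAssembler itself produces):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, Vectors.dense([0.2, 0.5, 1.0]))], ["label", "features"])

    meta = {"ml_attr": {
        "attrs": {"numeric": [{"idx": 0, "name": "feat1"},
                              {"idx": 1, "name": "feat2"},
                              {"idx": 2, "name": "feat3"}]},
        "num_attrs": 3}}

    # Re-alias the column with metadata attached (Column.alias accepts a metadata
    # keyword argument in Spark 2.2+).
    df = df.withColumn("features", col("features").alias("features", metadata=meta))

    print(df.schema["features"].metadata)   # the feature names now travel with the schema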

How to get word details from TF Vector RDD in Spark ML Lib?

Submitted by 断了今生、忘了曾经 on 2019-11-26 22:41:24
I have created term frequencies using HashingTF in Spark. I got the term frequencies using tf.transform for each word. But the results show up in this format:

    [<hashIndexofHashBucketofWord1>, <hashIndexofHashBucketofWord2> ...], [termFrequencyofWord1, termFrequencyOfWord2 ...]

e.g.: (1048576,[105,3116],[1.0,2.0])

I am able to get the index of the hash bucket using tf.indexOf("word"). But how can I get the word using the index?

Answer (zero323): Well, you can't. Since hashing is non-injective, there is no inverse function. In other words, an infinite number of tokens can map to a single bucket, so it is
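
Since HashingTF is one-way, a common alternative when index-to-word lookups are needed is CountVectorizer, which keeps an explicit vocabulary. A small pyspark DataFrame-API sketch (the question uses the RDD-based mllib HashingTF; this shows the alternative with made-up data):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import CountVectorizer

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(["spark", "ml", "spark"],),
                                (["hashing", "tf"],)], ["words"])

    cv_model = CountVectorizer(inputCol="words", outputCol="tf").fit(df)
    print(cv_model.vocabulary)               # index -> word lookup table
    cv_model.transform(df).show(truncate=False)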