apache-spark-ml

ALS model - how to generate full_u * v^t * v?

Submitted by 做~自己de王妃 on 2019-11-27 05:12:44
I'm trying to figure out how an ALS model can predict values for new users in between its updates by a batch process. In my search, I came across this Stack Overflow answer. I've copied the answer below for the reader's convenience: You can get predictions for new users using the trained model (without updating it): To get predictions for a user in the model, you use its latent representation (a vector u of size f, the number of factors), which is multiplied by the product latent factor matrix (a matrix made of the latent representations of all products, a bunch of vectors of size f) and gives
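
The linked answer is cut off above, but the underlying idea is to fold a new user's full ratings vector into the already-trained latent space. Below is a minimal NumPy sketch of that idea, assuming V is the item-factor matrix taken from the trained ALS model and full_u is the new user's ratings over all items; the regularized least-squares projection shown here is one common way to do it, not necessarily the exact formula from the linked answer.

    import numpy as np

    num_items, f = 100, 10
    V = np.random.rand(num_items, f)      # stand-in for the model's item latent factors
    full_u = np.zeros(num_items)          # the new user's ratings over all items
    full_u[[3, 17, 42]] = [5.0, 3.0, 4.0]

    lam = 0.1                             # regularization, ideally the same as in training
    # Project the new user into factor space: u = full_u . V . (V^T V + lam*I)^-1
    u = full_u @ V @ np.linalg.inv(V.T @ V + lam * np.eye(f))

    scores = u @ V.T                      # predicted preference for every item
    top_items = np.argsort(-scores)[:10]  # indices of the 10 highest-scoring items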

Spark, DataFrame: apply transformer/estimator on groups

Submitted by 点点圈 on 2019-11-27 04:38:31
Question: I have a DataFrame that looks as follows:

    +-----------+-----+------------+
    |     userID|group|    features|
    +-----------+-----+------------+
    |12462563356|    1|  [5.0,43.0]|
    |12462563701|    2|   [1.0,8.0]|
    |12462563701|    1|  [2.0,12.0]|
    |12462564356|    1|   [1.0,1.0]|
    |12462565487|    3|   [2.0,3.0]|
    |12462565698|    2|   [1.0,1.0]|
    |12462565698|    1|   [1.0,1.0]|
    |12462566081|    2|   [1.0,2.0]|
    |12462566081|    1|  [1.0,15.0]|
    |12462566225|    2|   [1.0,1.0]|
    |12462566225|    1|  [9.0,85.0]|
    |12462566526|    2|   [1.0,1.0]|
    |12462566526|    1|  [3.0
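
The question is truncated above, but one straightforward (if not the most efficient) workaround is to fit a separate estimator per group by filtering the DataFrame. A hedged pyspark sketch, using MinMaxScaler purely as a stand-in estimator and a shortened reconstruction of the DataFrame above:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import MinMaxScaler
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(12462563356, 1, Vectors.dense([5.0, 43.0])),
         (12462563701, 2, Vectors.dense([1.0, 8.0])),
         (12462563701, 1, Vectors.dense([2.0, 12.0]))],
        ["userID", "group", "features"])

    # Fit one estimator per group; simple, but launches one Spark job per group.
    models = {}
    for (g,) in df.select("group").distinct().collect():
        scaler = MinMaxScaler(inputCol="features", outputCol="scaled")
        models[g] = scaler.fit(df.filter(df.group == g))

    scaled = {g: m.transform(df.filter(df.group == g)) for g, m in models.items()}

This keeps everything in the DataFrame API but scales poorly with many groups, since each group triggers its own fit.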

Tuning parameters for implicit pyspark.ml ALS matrix factorization model through pyspark.ml CrossValidator

Submitted by 半腔热情 on 2019-11-27 03:54:14
I'm trying to tune the parameters of an ALS matrix factorization model that uses implicit data. For this, I'm trying to use pyspark.ml.tuning.CrossValidator to run through a parameter grid and select the best model. I believe my problem is in the evaluator, but I can't figure it out. I can get this to work for an explicit data model with a regression RMSE evaluator, as follows:

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext
    from pyspark.ml.recommendation import ALS
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
    from pyspark.ml.evaluation import
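
For reference, here is a self-contained sketch of the explicit-data case the question describes (CrossValidator over an ALS parameter grid with an RMSE evaluator); the toy ratings and grid values are my own, and coldStartStrategy="drop" (available in recent Spark versions) keeps RMSE from becoming NaN when a fold contains unseen users or items:

    from pyspark.sql import SparkSession
    from pyspark.ml.recommendation import ALS
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
    from pyspark.ml.evaluation import RegressionEvaluator

    spark = SparkSession.builder.getOrCreate()
    ratings = spark.createDataFrame(
        [(u, i, float((u + 1) * (i + 1) % 5 + 1)) for u in range(3) for i in range(3)],
        ["user", "item", "rating"])

    als = ALS(userCol="user", itemCol="item", ratingCol="rating",
              coldStartStrategy="drop")        # drop NaN predictions for unseen users/items
    grid = (ParamGridBuilder()
            .addGrid(als.rank, [5, 10])
            .addGrid(als.regParam, [0.01, 0.1])
            .build())
    evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                    predictionCol="prediction")
    cv = CrossValidator(estimator=als, estimatorParamMaps=grid,
                        evaluator=evaluator, numFolds=2)
    best_model = cv.fit(ratings).bestModel

For implicit data the sticking point is exactly the evaluator: RMSE against the raw preference matrix is not very meaningful, so a custom evaluator (e.g. one based on a ranking metric) is usually needed.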

Feature normalization algorithm in Spark

Submitted by 我的梦境 on 2019-11-27 02:03:43
Question: I am trying to understand Spark's normalization algorithm. My small test set contains 5 vectors:

    {0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0},
    {1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 70000.0},
    {-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, 70000.0},
    {-0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0},
    {0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 70000.0}

I would expect new Normalizer().transform(vectors) to create a JavaRDD where each vector feature is normalized as (v - mean) / stdev across all values of feature-0, feature-1, etc.
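
For what it's worth, the (v - mean) / stdev behaviour the question expects is what StandardScaler does; Normalizer instead rescales each individual vector to unit p-norm. A small pyspark sketch contrasting the two (the question's code is Java/RDD-based; this uses the DataFrame API with made-up data):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import Normalizer, StandardScaler
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(Vectors.dense([0.95, 0.018, 24.0, 70000.0]),),
         (Vectors.dense([1.0, 1.0, 1.0, 70000.0]),),
         (Vectors.dense([-1.0, -1.0, -1.0, 70000.0]),)],
        ["features"])

    # Normalizer: each row is divided by its own L2 norm (no fitting involved).
    Normalizer(inputCol="features", outputCol="unit_norm", p=2.0) \
        .transform(df).show(truncate=False)

    # StandardScaler: per-feature (v - mean) / stdev computed across the whole dataset.
    scaler = StandardScaler(inputCol="features", outputCol="standardized",
                            withMean=True, withStd=True)
    scaler.fit(df).transform(df).show(truncate=False)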

What's the difference between Spark ML and MLLIB packages

Submitted by 血红的双手。 on 2019-11-27 02:03:02
Question: I noticed there are two LinearRegressionModel classes in Spark ML, one in the ML package and another in the MLlib package. These two are implemented quite differently - e.g. the one from MLlib implements Serializable, while the other one does not. By the way, the same is true of RandomForestModel. Why are there two classes? Which is the "right" one? And is there a way to convert one into the other? Answer 1: o.a.s.mllib contains the old RDD-based API, while o.a.s.ml contains the new API built around Dataset and ML
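
Below is a minimal sketch of the split the answer describes, with made-up data, reaching linear regression through both packages (the pyspark.mllib variant is deprecated in Spark 2.x and works on RDDs of LabeledPoint, while the pyspark.ml variant works on DataFrames):

    from pyspark.sql import SparkSession
    from pyspark.ml.regression import LinearRegression            # new DataFrame-based API
    from pyspark.ml.linalg import Vectors
    from pyspark.mllib.regression import LinearRegressionWithSGD, LabeledPoint  # old RDD-based API

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame([(1.0, Vectors.dense([0.0])),
                                (3.0, Vectors.dense([1.0]))], ["label", "features"])
    ml_model = LinearRegression().fit(df)          # pyspark.ml.regression.LinearRegressionModel

    rdd = spark.sparkContext.parallelize([LabeledPoint(1.0, [0.0]),
                                          LabeledPoint(3.0, [1.0])])
    mllib_model = LinearRegressionWithSGD.train(rdd, iterations=10)  # pyspark.mllib model

There is no general built-in conversion between the two model classes; the usual advice is to stay on the pyspark.ml side for new code.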

How to find mean of grouped Vector columns in Spark SQL?

Submitted by 寵の児 on 2019-11-27 02:02:25
I have created a RelationalGroupedDataset by calling instances.groupBy(instances.col("property_name")):

    val x = instances.groupBy(instances.col("property_name"))

How do I compose a user-defined aggregate function to perform Statistics.colStats().mean on each group? Thanks!

Answer (user6910411): Spark >= 2.4 - You can use Summarizer:

    import org.apache.spark.ml.stat.Summarizer

    val dfNew = df.as[(Int, org.apache.spark.mllib.linalg.Vector)]
      .map { case (group, v) => (group, v.asML) }
      .toDF("group", "features")

    dfNew
      .groupBy($"group")
      .agg(Summarizer.mean($"features").alias("means"))
      .show(false)

    +-----+---
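
For completeness, the same Summarizer aggregation is available from Python in Spark 2.4+; a short pyspark sketch with made-up data:

    from pyspark.sql import SparkSession
    from pyspark.ml.stat import Summarizer          # Spark >= 2.4
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, Vectors.dense([1.0, 2.0])),
                                (1, Vectors.dense([3.0, 4.0])),
                                (2, Vectors.dense([5.0, 6.0]))],
                               ["group", "features"])

    df.groupBy("group") \
      .agg(Summarizer.mean(df.features).alias("means")) \
      .show(truncate=False)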

How to merge multiple feature vectors in DataFrame?

Submitted by 无人久伴 on 2019-11-27 01:57:12
Question: Using Spark ML transformers I arrived at a DataFrame where each row looks like this:

    Row(object_id, text_features_vector, color_features, type_features)

where text_features is a sparse vector of term weights, color_features is a small 20-element (one-hot-encoded) dense vector of colors, and type_features is also a one-hot-encoded dense vector of types. What would be a good approach (using Spark's facilities) to merge these features into one single, large array, so that I can measure things like the
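
The question is cut off above, but the standard Spark facility for concatenating several vector columns into one is VectorAssembler. A hedged sketch with made-up columns mirroring the Row layout described:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1,
          Vectors.sparse(5, [0, 3], [0.5, 1.2]),   # text term weights
          Vectors.dense([0.0, 1.0, 0.0]),          # one-hot colors (shortened)
          Vectors.dense([1.0, 0.0]))],             # one-hot types (shortened)
        ["object_id", "text_features_vector", "color_features", "type_features"])

    assembler = VectorAssembler(
        inputCols=["text_features_vector", "color_features", "type_features"],
        outputCol="features")                      # concatenates the input vectors end to end
    assembler.transform(df).select("object_id", "features").show(truncate=False)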

Spark Structured Streaming and Spark-Ml Regression

Submitted by 六月ゝ 毕业季﹏ on 2019-11-26 23:39:45
Question: Is it possible to apply Spark ML regression to streaming sources? I see there is StreamingLogisticRegressionWithSGD, but it's for the older RDD API and I couldn't use it with structured streaming sources. How am I supposed to apply regressions to structured streaming sources? (A little off-topic) If I cannot use the streaming API for regression, how can I commit offsets or the like to the source in a batch-processing way? (Kafka sink) Answer 1: Today (Spark 2.2 / 2.3) there is no support for machine learning in Structured
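
While Structured Streaming (as of Spark 2.2/2.3) cannot train ML models, a fitted pipeline's transform generally does work on a streaming DataFrame, so one common workaround is to train on static data and only score the stream. A hedged sketch using the built-in rate source as a toy stream:

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    spark = SparkSession.builder.getOrCreate()

    # Train offline on a static DataFrame.
    static_df = spark.createDataFrame([(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)], ["x", "label"])
    pipeline = Pipeline(stages=[VectorAssembler(inputCols=["x"], outputCol="features"),
                                LinearRegression()])
    model = pipeline.fit(static_df)

    # Score a streaming source with the fitted model.
    stream_df = (spark.readStream.format("rate").load()
                 .selectExpr("CAST(value AS DOUBLE) AS x"))
    query = (model.transform(stream_df)
             .writeStream.format("console").outputMode("append").start())
    # query.awaitTermination()  # uncomment to keep the stream running

Retraining on new data still has to happen in a separate batch job; this only covers prediction.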

Attach metadata to vector column in Spark

Submitted by 不问归期 on 2019-11-26 23:03:57
Context: I have a data frame with two columns: label and features.

    org.apache.spark.sql.DataFrame = [label: int, features: vector]

where features is an mllib.linalg.VectorUDT of numeric type built using VectorAssembler. Question: Is there a way to assign a schema to the features vector? I want to keep track of the name of each feature. Tried so far:

    val defaultAttr = NumericAttribute.defaultAttr
    val attrs = Array("feat1", "feat2", "feat3").map(defaultAttr.withName)
    val attrGroup = new
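
The snippet above is cut off; the usual Scala route is to wrap the attributes in an org.apache.spark.ml.attribute.AttributeGroup and attach its metadata to the column. A hedged pyspark sketch of the same idea, writing the "ml_attr" metadata by hand (the feature names come from the question; the metadata layout mirrors what VectorAssembler itself produces):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, Vectors.dense([0.2, 0.5, 1.0]))], ["label", "features"])

    meta = {"ml_attr": {
        "attrs": {"numeric": [{"idx": 0, "name": "feat1"},
                              {"idx": 1, "name": "feat2"},
                              {"idx": 2, "name": "feat3"}]},
        "num_attrs": 3}}

    # Re-alias the column with metadata attached (Column.alias accepts a metadata
    # keyword argument in Spark 2.2+).
    df = df.withColumn("features", col("features").alias("features", metadata=meta))

    print(df.schema["features"].metadata)   # the feature names now travel with the schema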

How to get word details from TF Vector RDD in Spark ML Lib?

Submitted by 断了今生、忘了曾经 on 2019-11-26 22:41:24
I have created term frequencies using HashingTF in Spark. I got the term frequencies using tf.transform for each word. But the results show up in this format:

    [<hashIndexofHashBucketofWord1>, <hashIndexofHashBucketofWord2> ...], [termFrequencyofWord1, termFrequencyOfWord2 ...]

e.g.: (1048576,[105,3116],[1.0,2.0])

I am able to get the index of the hash bucket using tf.indexOf("word"). But how can I get the word using the index?

Answer (zero323): Well, you can't. Since hashing is non-injective, there is no inverse function. In other words, an infinite number of tokens can map to a single bucket, so it is
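
Since HashingTF is one-way, a common alternative when index-to-word lookups are needed is CountVectorizer, which keeps an explicit vocabulary. A small pyspark DataFrame-API sketch (the question uses the RDD-based mllib HashingTF; this shows the alternative with made-up data):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import CountVectorizer

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(["spark", "ml", "spark"],),
                                (["hashing", "tf"],)], ["words"])

    cv_model = CountVectorizer(inputCol="words", outputCol="tf").fit(df)
    print(cv_model.vocabulary)               # index -> word lookup table
    cv_model.transform(df).show(truncate=False)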