apache-spark-mllib

Feature normalization algorithm in Spark

我的梦境 submitted on 2019-11-27 02:03:43
Question: Trying to understand Spark's normalization algorithm. My small test set contains 5 vectors: {0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0}, {1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 70000.0}, {-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, 70000.0}, {-0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0}, {0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 70000.0}. I would expect that new Normalizer().transform(vectors) creates a JavaRDD where each vector feature is normalized as (v - mean) / stdev across all values for feature-0, feature-1, etc.
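For reference, MLlib's Normalizer rescales each individual vector to unit p-norm; the per-feature (v - mean) / stdev behaviour described above is what StandardScaler computes. A minimal Scala sketch of the two, assuming a live SparkContext sc:

import org.apache.spark.mllib.feature.{Normalizer, StandardScaler}
import org.apache.spark.mllib.linalg.Vectors

val vectors = sc.parallelize(Seq(
  Vectors.dense(0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0),
  Vectors.dense(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 70000.0)))

// Normalizer: scales every vector to unit L2 norm; no column statistics are used.
val unitNorm = new Normalizer().transform(vectors)

// StandardScaler: standardizes each feature/column to zero mean and unit stdev.
val standardized = new StandardScaler(withMean = true, withStd = true)
  .fit(vectors)
  .transform(vectors)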

What's the difference between the Spark ML and MLlib packages

血红的双手。 submitted on 2019-11-27 02:03:02
Question: I noticed there are two LinearRegressionModel classes in Spark, one in the ml package and another one in the mllib package. These two are implemented quite differently - e.g. the one from MLlib implements Serializable, while the other one does not. By the way, the same is true of RandomForestModel. Why are there two classes? Which is the "right" one? And is there a way to convert one into another? Answer 1: o.a.s.mllib contains the old RDD-based API, while o.a.s.ml contains the new API built around Dataset and ML
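To make the distinction concrete, here is a minimal Scala sketch contrasting the two APIs; the path "data.libsvm" and the sc / spark handles are illustrative assumptions:

// Old RDD-based API (org.apache.spark.mllib):
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.util.MLUtils

val rddData = MLUtils.loadLibSVMFile(sc, "data.libsvm")          // RDD[LabeledPoint]
val oldModel = LinearRegressionWithSGD.train(rddData, 100)

// New DataFrame-based API (org.apache.spark.ml):
import org.apache.spark.ml.regression.LinearRegression

val dfData = spark.read.format("libsvm").load("data.libsvm")     // DataFrame with label/features
val newModel = new LinearRegression().setMaxIter(100).fit(dfData)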

How to update Spark MatrixFactorizationModel for ALS

点点圈 submitted on 2019-11-27 00:47:32
Question: I built a simple recommendation system for the MovieLens DB, inspired by https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html. I also have problems with explicit training, like here: Apache Spark ALS collaborative filtering results. They don't make sense. Using implicit training (on both explicit and implicit data) gives me reasonable results, but explicit training doesn't. While this is OK for me for now, I'm curious how to update a model. While my current solution
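As context for the update question: MLlib's ALS has no incremental update, so the usual approach is to retrain on the old and new ratings combined. A minimal Scala sketch, with oldRatings and newRatings as assumed inputs and the hyperparameters picked arbitrarily:

import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

def retrain(oldRatings: RDD[Rating], newRatings: RDD[Rating]) = {
  val allRatings = oldRatings.union(newRatings)
  // Explicit-feedback training; use ALS.trainImplicit for implicit-feedback data.
  ALS.train(allRatings, 10, 10, 0.01)   // rank, iterations, lambda
}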

How to use the PySpark CountVectorizer on columns that may be null

爱⌒轻易说出口 submitted on 2019-11-26 23:42:43
Question: I have a column in my Spark DataFrame: |-- topics_A: array (nullable = true) | |-- element: string (containsNull = true) I'm using CountVectorizer on it: topic_vectorizer_A = CountVectorizer(inputCol="topics_A", outputCol="topics_vec_A") I get NullPointerExceptions, because sometimes the topics_A column contains null. Is there a way around this? Filling it with a zero-length array would work OK (although it will blow out the data size quite a lot) - but I can't work out how to do a fillNa on
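One common workaround is to coalesce the null entries to an empty array before vectorizing. Sketched below in Scala for consistency with the other snippets on this page (the same coalesce-with-empty-array idea applies in PySpark via pyspark.sql.functions); df stands in for the DataFrame above:

import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.sql.functions.{array, coalesce, col}
import org.apache.spark.sql.types.{ArrayType, StringType}

// Replace null entries with an empty array<string> so CountVectorizer never sees null.
val filled = df.withColumn("topics_A",
  coalesce(col("topics_A"), array().cast(ArrayType(StringType))))

val vectorized = new CountVectorizer()
  .setInputCol("topics_A")
  .setOutputCol("topics_vec_A")
  .fit(filled)
  .transform(filled)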

using Word2VecModel.transform() does not work in map function

女生的网名这么多〃 submitted on 2019-11-26 23:23:50
Question: I have built a Word2Vec model using Spark and saved it as a model. Now, I want to use it in another program as an offline model. I have loaded the model and used it to get the vector of a word (e.g. "Hello"), and it works well. But I need to call it for many words in an RDD using map. When I call model.transform() in a map function, it throws this error: "It appears that you are attempting to reference SparkContext from a broadcast " Exception: It appears that you are attempting to reference
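A common way around this is to extract the model's word-vector map and broadcast that, rather than using the model object on the executors. A minimal Scala sketch, assuming a fitted mllib Word2VecModel named model, an RDD[String] named words, and a live SparkContext sc:

import org.apache.spark.mllib.linalg.Vectors

// getVectors returns a plain Map[String, Array[Float]], which is safe to broadcast.
val wordVectors = sc.broadcast(model.getVectors)

val vectorized = words.map { w =>
  wordVectors.value.get(w).map(arr => Vectors.dense(arr.map(_.toDouble)))
}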

Attach metadata to vector column in Spark

不问归期 submitted on 2019-11-26 23:03:57
Question: Context: I have a data frame with two columns: label and features. org.apache.spark.sql.DataFrame = [label: int, features: vector] where features is a mllib.linalg.VectorUDT of numeric type built using VectorAssembler. Question: Is there a way to assign a schema to the features vector? I want to keep track of the name of each feature. Tried so far: val defaultAttr = NumericAttribute.defaultAttr val attrs = Array("feat1", "feat2", "feat3").map(defaultAttr.withName) val attrGroup = new
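A minimal Scala sketch completing the attempted code above, assuming the data frame is df and has exactly three assembled features; the attribute group's metadata is attached by re-aliasing the column:

import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}

val defaultAttr = NumericAttribute.defaultAttr
val attrs = Array("feat1", "feat2", "feat3").map(defaultAttr.withName)
val attrGroup = new AttributeGroup("features", attrs.asInstanceOf[Array[Attribute]])

// Re-alias the vector column with the attribute metadata so feature names are kept.
val withNames = df.withColumn("features", df("features").as("features", attrGroup.toMetadata()))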

How to get word details from a TF Vector RDD in Spark MLlib?

断了今生、忘了曾经 submitted on 2019-11-26 22:41:24
I have created term frequencies using HashingTF in Spark, calling tf.transform for each word. But the results come out in this format: [<hashIndexofHashBucketofWord1>,<hashIndexofHashBucketofWord2> ...] ,[termFrequencyofWord1, termFrequencyOfWord2 ....] e.g. (1048576,[105,3116],[1.0,2.0]) I am able to get the index in the hash bucket using tf.indexOf("word"). But how can I get the word using the index? zero323: Well, you can't. Since hashing is non-injective, there is no inverse function. In other words, an infinite number of tokens can map to a single bucket, so it is
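If an index-to-word mapping is required, one option is to use the DataFrame-based CountVectorizer instead of HashingTF, since the fitted model keeps its vocabulary. A minimal Scala sketch, assuming a DataFrame docs with an array<string> column named "words":

import org.apache.spark.ml.feature.CountVectorizer

val cvModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("tf")
  .fit(docs)

// vocabulary(i) is the term whose count is stored at index i of each "tf" vector.
val vocabulary: Array[String] = cvModel.vocabulary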

Spark MLlib LDA, how to infer the topics distribution of a new unseen document?

流过昼夜 submitted on 2019-11-26 22:31:44
I am interested in applying LDA topic modelling using Spark MLlib. I have checked the code and the explanations here, but I couldn't find how to then use the model to find the topic distribution of a new, unseen document. Jason Lenderman: As of Spark 1.5 this functionality has not been implemented for the DistributedLDAModel. What you're going to need to do is convert your model to a LocalLDAModel using the toLocal method and then call the topicDistributions(documents: RDD[(Long, Vector)]) method, where documents are the new (i.e. out-of-training) documents, something like this: newDocuments:
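A minimal Scala sketch of the step described above, assuming a trained DistributedLDAModel named distLDAModel and that the new documents have already been converted to the same term-count vectors used for training:

import org.apache.spark.mllib.clustering.LocalLDAModel
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

val newDocuments: RDD[(Long, Vector)] = ???   // (docId, termCountVector) pairs

val localModel: LocalLDAModel = distLDAModel.toLocal
val topicDists: RDD[(Long, Vector)] = localModel.topicDistributions(newDocuments)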

Run ML algorithm inside map function in Spark

走远了吗. submitted on 2019-11-26 22:10:53
Question: So I have been trying for some days now to run ML algorithms inside a map function in Spark. I posted a more specific question, but referencing Spark's ML algorithms gives me the following error: AttributeError: Cannot load _jvm from SparkContext. Is SparkContext initialized? Obviously I cannot reference SparkContext inside the apply_classifier function. My code is similar to what was suggested in the previous question I asked, but I still haven't found a solution to what I am looking for: def
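The underlying constraint is that Spark estimators can only be driven from the driver, not from code running inside map on the executors. One workaround is to fit one model per group in a driver-side loop, sketched below in Scala for consistency with the other snippets (the same pattern applies in PySpark); df, its "group" column, and the choice of LogisticRegression are assumptions for illustration:

import org.apache.spark.ml.classification.LogisticRegression

// Pull the distinct group keys to the driver, then fit each model there
// instead of calling .fit inside a map function on the executors.
// Assumes df also has the usual "label" and "features" columns.
val groups = df.select("group").distinct().collect().map(_.getString(0))
val models = groups.map { g =>
  g -> new LogisticRegression().fit(df.filter(df("group") === g))
}.toMap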

Why does StandardScaler not attach metadata to the output column?

社会主义新天地 submitted on 2019-11-26 22:00:14
Question: I noticed that the ml StandardScaler does not attach metadata to the output column: import org.apache.spark.ml.Pipeline import org.apache.spark.ml.feature._ val df = spark.read.option("header", true) .option("inferSchema", true) .csv("/path/to/cars.data") val strId1 = new StringIndexer() .setInputCol("v7") .setOutputCol("v7_IDX") val strId2 = new StringIndexer() .setInputCol("v8") .setOutputCol("v8_IDX") val assmbleFeatures: VectorAssembler = new VectorAssembler() .setInputCols(Array("v0",
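A quick way to see what a transformer did or did not attach is to read the column's attribute group back out of the schema. A minimal Scala sketch, assuming the scaled output DataFrame is scaledDf and the scaler's output column is "features_scaled":

import org.apache.spark.ml.attribute.AttributeGroup

val group = AttributeGroup.fromStructField(scaledDf.schema("features_scaled"))
// None here means the scaler did not carry per-feature attributes over to its output.
println(group.attributes)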