apache-spark-mllib

Using PCA before Bayes classification

Submitted by 老子叫甜甜 on 2019-12-12 03:47:31
Question: I'm trying to use PCA before Naive Bayes classification, but Spark reports that Naive Bayes requires nonnegative feature values. The training data is nonnegative, but PCA turns some feature values negative. How do I fix this? Thanks for answering my question.

Answer 1: If you want to reduce the dimensionality of your inputs, you can use nonnegative matrix factorization instead. In Spark, this is available through mllib.recommendation.ALS with the nonnegative parameter set to true (see the sketch below).

Source: https://stackoverflow.com/questions/36491852
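A minimal sketch of that suggestion, assuming the nonnegative feature matrix has first been flattened into (rowIndex, columnIndex, value) triples stored as Rating objects; the function name and parameter values below are illustrative, not from the original answer.

```scala
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

// entries: one Rating(rowIndex, colIndex, value) per non-zero cell of the feature matrix.
def nonnegativeFactorize(entries: RDD[Rating], k: Int): MatrixFactorizationModel =
  new ALS()
    .setRank(k)            // k = reduced dimensionality, analogous to the number of PCA components
    .setIterations(10)
    .setNonnegative(true)  // keep both factor matrices nonnegative
    .run(entries)
```

The user-factor matrix of the returned model then plays the role of the reduced, still nonnegative feature representation.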

MLlib Spark ALS trainImplicit value more than 1 [duplicate]

Submitted by 蓝咒 on 2019-12-12 03:36:23
Question: This question already has an answer here: Spark ALS recommendation system have value prediction greater than 1 (1 answer). Closed 7 months ago. I have been experimenting with Spark MLlib ALS ("trainImplicit") for a while now. I would like to understand: 1. Why am I getting rating values greater than 1 in the predictions? 2. Is there any need to normalize the user-product input? Sample result: [Rating(user=316017, product=114019, rating=3.1923), Rating(user=316017, product=41930, rating=2.0146997092620897)] In
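For context, a minimal sketch of the call being discussed (parameter values are illustrative): with trainImplicit, the input values are treated as confidence weights rather than ratings, so the model's predictions are unnormalized preference scores and are not bounded by 1.

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// `interactions` is an assumed RDD[Rating] of implicit feedback counts (e.g. click or view counts).
val model = ALS.trainImplicit(interactions, rank = 10, iterations = 10, lambda = 0.01, alpha = 1.0)

// Predictions are preference scores, not ratings, so values above 1.0 are expected.
val score = model.predict(316017, 114019)
```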

Equivalent of mllib.DecisionTreeModel.toDebugString() in ml.DecisionTreeClassificationModel

Submitted by 爱⌒轻易说出口 on 2019-12-12 03:22:41
Question: As the title says, is there any equivalent of Spark's org.apache.spark.mllib.tree.model.DecisionTreeModel.toDebugString() in org.apache.spark.ml.classification.DecisionTreeClassificationModel? I have gone through the API doc of the latter and found the method rootNode(), which returns an org.apache.spark.ml.tree.Node object that appears to be recursive, so should I use this class to build the tree structure myself? Thanks in anticipation. Answer 1: org.apache.spark
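A sketch of both options, assuming `model` is a fitted ml DecisionTreeClassificationModel; in recent Spark versions the ml model also exposes a toDebugString method, and the manual recursion over rootNode() is shown only as the alternative the question asks about.

```scala
import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.tree.{InternalNode, LeafNode, Node}

// Option 1: the ml API provides toDebugString as well.
def dump(model: DecisionTreeClassificationModel): Unit =
  println(model.toDebugString)

// Option 2: recurse over rootNode() yourself.
def describe(node: Node, indent: String = ""): Unit = node match {
  case leaf: LeafNode =>
    println(s"${indent}Predict: ${leaf.prediction}")
  case internal: InternalNode =>
    println(s"${indent}Split on feature ${internal.split.featureIndex}")
    describe(internal.leftChild, indent + "  ")
    describe(internal.rightChild, indent + "  ")
}
```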

How to convert RDD[Row] to RDD[Vector]

Submitted by 大憨熊 on 2019-12-12 03:06:43
Question: I'm trying to implement the k-means method using Scala. I created an RDD something like this:

val df = sc.parallelize(data).groupByKey().collect().map((chunk) => {
  sc.parallelize(chunk._2.toSeq).toDF()
})
val examples = df.map(dataframe => {
  dataframe.selectExpr(
    "avg(time) as avg_time",
    "variance(size) as var_size",
    "variance(time) as var_time",
    "count(size) as examples"
  ).rdd
})
val rdd_final = examples.reduce(_ union _)
val kmeans = new KMeans()
val model = kmeans.run(rdd_final)

With this code I
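One illustrative way to finish the conversion (an assumption, not the accepted answer): KMeans.run expects an RDD[Vector], so each Row of the unioned RDD has to be mapped to an mllib Vector first.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Turn each Row of numeric columns into a dense mllib Vector.
val vectors = rdd_final.map(row => Vectors.dense(row.toSeq.map(_.toString.toDouble).toArray))

val kmeansModel = new KMeans().setK(2).run(vectors)  // k = 2 is an arbitrary placeholder
```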

pyspark add new column field with the data frame row number

Submitted by 心不动则不痛 on 2019-12-12 02:54:09
Question: Hi, I'm trying to build a recommendation system with Spark. I have a data frame with users' emails and movie ratings.

df = pd.DataFrame(np.array([["aa@gmail.com", 2, 3], ["aa@gmail.com", 5, 5], ["bb@gmail.com", 8, 2], ["cc@gmail.com", 9, 3]]), columns=['user', 'movie', 'rating'])
sparkdf = sqlContext.createDataFrame(df, samplingRatio=0.1)

user           movie  rating
aa@gmail.com   2      3
aa@gmail.com   5      5
bb@gmail.com   8      2
cc@gmail.com   9      3

My first doubt is: pySpark MLlib doesn't accept emails, am I correct? Because of this I need
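The usual workaround (sketched here in Scala; pyspark ships the same StringIndexer class) is to map each email to a numeric index before feeding the data to ALS, which only accepts integer user and item IDs. The DataFrame name below is an assumption.

```scala
import org.apache.spark.ml.feature.StringIndexer

// `ratings` is the assumed DataFrame with columns user (email), movie and rating.
val indexer = new StringIndexer().setInputCol("user").setOutputCol("userIndex")
val indexed = indexer.fit(ratings).transform(ratings)
// userIndex is a Double; cast it to Int before building mllib Rating objects.
```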

How to schedule my Apache Spark application to run every day at 00:30 AM (night) in IBM Bluemix?

Submitted by 白昼怎懂夜的黑 on 2019-12-12 01:58:14
Question: Hi all & IBM Bluemix team, I am using the IBM Analytics for Apache Spark service in IBM Bluemix. I have developed an Apache Spark application and I want it to run every day at 00:30 AM (at night). How do I schedule my Apache Spark application to run every day at 00:30 AM in IBM Bluemix? Answer 1: You can use any scheduling tool, such as crontab on Linux, that allows you to run the spark-submit.sh script from your machine at a specific time (in your case 00:30 AM). A typical crontab entry would look like
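A hypothetical entry for this schedule (the script path and log file are placeholders, not taken from the original answer):

```
# minute hour day-of-month month day-of-week  command
30 0 * * * /path/to/spark-submit.sh >> /var/log/spark-job.log 2>&1
```

The `30 0 * * *` fields mean minute 30 of hour 0, i.e. 00:30 every day.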

Why is my Spark SVM always predicting the same label?

Submitted by 我怕爱的太早我们不能终老 on 2019-12-11 12:23:17
Question: I'm having trouble getting my SVM to predict 0's and 1's where I would expect it to. It seems that after I train it and give it more data, it always wants to predict a 1 or a 0, but it will predict all 1's or all 0's, and never a mix of the two. I'm wondering if one of you could tell me what I'm doing wrong. I've searched for "svm always predicting same value" and similar problems, and it looks like this is pretty common for those of us new to machine learning. I'm afraid though that I don't
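A minimal diagnostic sketch for this situation (assumptions about the setup, not taken from the original post): check whether the label distribution is badly skewed, and look at the raw margins instead of the thresholded 0/1 output.

```scala
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def inspect(training: RDD[LabeledPoint]): Unit = {
  // A heavily imbalanced label distribution often yields a single-class predictor.
  training.map(_.label).countByValue().foreach(println)

  val model = SVMWithSGD.train(training, 100)
  model.clearThreshold()  // return raw margins instead of hard 0/1 labels
  training.map(p => model.predict(p.features))
    .take(10)
    .foreach(println)     // if all margins sit on one side of 0, the threshold is the culprit
}
```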

Distributed BlockMatrix out of Spark Matrices

Submitted by 梦想与她 on 2019-12-11 12:14:11
Question: How do I make a distributed BlockMatrix out of Matrices (of the same size)? For example, let A and B be two 2-by-2 mllib.linalg.Matrices as follows:

import org.apache.spark.mllib.linalg.{Matrix, Matrices}
import org.apache.spark.mllib.linalg.distributed.BlockMatrix

val A: Matrix = Matrices.dense(2, 2, Array(1.0, 2.0, 3.0, 4.0))
val B: Matrix = Matrices.dense(2, 2, Array(5.0, 6.0, 7.0, 8.0))
val C = new BlockMatrix(???)

How can I first make an RDD[((Int, Int), Matrix)] from A, B and second a
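One possible completion of the snippet above (an assumption, not the accepted answer; `sc` is an existing SparkContext): pair each local matrix with its block coordinates, parallelize the pairs, and pass the per-block dimensions to the BlockMatrix constructor.

```scala
// Place A and B side by side as blocks (0,0) and (0,1) of a 2 x 4 distributed matrix.
val blocks = sc.parallelize(Seq(((0, 0), A), ((0, 1), B)))
val C = new BlockMatrix(blocks, rowsPerBlock = 2, colsPerBlock = 2)
C.validate()  // throws if block indices and sizes are inconsistent
```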

How to provide multiple columns to setInputCol()

Submitted by 岁酱吖の on 2019-12-11 07:37:46
Question: I am very new to Spark machine learning. I want to pass multiple columns as features; in my code below I am only passing the Date column to features, but now I want to pass both the Userid and Date columns to features. I tried to use Vector, but it only supports the Double data type, while in my case I have Int and String. I would be thankful if anyone could provide a suggestion, solution, or code example that fulfills my requirement. Code: case class LabeledDocument(Userid: Double, Date: String, label: Double)
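A sketch of the common approach under these assumptions (the DataFrame name is illustrative): index the String column with a StringIndexer, then combine the numeric columns into a single features vector with a VectorAssembler.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// `training` is an assumed DataFrame built from LabeledDocument(Userid, Date, label).
val dateIndexer = new StringIndexer().setInputCol("Date").setOutputCol("DateIndex")
val assembler = new VectorAssembler()
  .setInputCols(Array("Userid", "DateIndex"))  // several input columns become one features vector
  .setOutputCol("features")
val pipeline = new Pipeline().setStages(Array(dateIndexer, assembler))
val prepared = pipeline.fit(training).transform(training)
```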

UDF to check for non-zero vector not working after CountVectorizer through spark-submit

Submitted by 有些话、适合烂在心里 on 2019-12-11 07:24:27
Question: As per this question, I am applying a UDF to filter out empty vectors after CountVectorizer.

val tokenizer = new RegexTokenizer().setPattern("\\|").setInputCol("dataString").setOutputCol("dataStringWords")
val vectorizer = new CountVectorizer().setInputCol("dataStringWords").setOutputCol("features")
val pipelineTV = new Pipeline().setStages(Array(tokenizer, vectorizer))
val modelTV = pipelineTV.fit(dataset1)
val isNoneZeroVector = udf({v: Vector => v.numNonzeros > 0}, DataTypes.BooleanType)
val
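A possible continuation of that snippet (an assumption about how the UDF is applied, not the original code): transform the data with the fitted pipeline and keep only rows whose CountVectorizer output has at least one non-zero entry.

```scala
import org.apache.spark.sql.functions.col

val vectorized = modelTV.transform(dataset1)
val nonEmpty = vectorized.filter(isNoneZeroVector(col("features")))
```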