apache-spark-mllib

Using PCA before Bayes classification

Submitted by 老子叫甜甜 on 2019-12-12 03:47:31
Question: I'm trying to use PCA before Naive Bayes classification, but Spark reports that Naive Bayes requires nonnegative feature values. The training data is nonnegative, but PCA turns some feature values negative. How do I fix this? Thanks for answering my question.

Answer 1: If you want to reduce the dimensionality of your inputs, you can use nonnegative matrix factorization instead. In Spark, this is available through mllib.recommendation.ALS with the nonnegative parameter set to true (see the sketch below).

Source: https://stackoverflow.com/questions/36491852
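A minimal sketch of that suggestion, assuming the nonnegative feature matrix has first been flattened into (rowIndex, columnIndex, value) triples stored as Rating objects; the function name and parameter values below are illustrative, not from the original answer.

```scala
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

// entries: one Rating(rowIndex, colIndex, value) per non-zero cell of the feature matrix.
def nonnegativeFactorize(entries: RDD[Rating], k: Int): MatrixFactorizationModel =
  new ALS()
    .setRank(k)            // k = reduced dimensionality, analogous to the number of PCA components
    .setIterations(10)
    .setNonnegative(true)  // keep both factor matrices nonnegative
    .run(entries)
```

The user-factor matrix of the returned model then plays the role of the reduced, still nonnegative feature representation.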

MLlib Spark ALS trainImplicit value more than 1 [duplicate]

Submitted by 蓝咒 on 2019-12-12 03:36:23
Question: This question already has an answer here: Spark ALS recommendation system have value prediction greater than 1 (1 answer). Closed 7 months ago. I have been experimenting with Spark MLlib ALS ("trainImplicit") for a while now. I would like to understand: 1. Why am I getting rating values greater than 1 in the predictions? 2. Is there any need to normalize the user-product input? Sample result: [Rating(user=316017, product=114019, rating=3.1923), Rating(user=316017, product=41930, rating=2.0146997092620897)] In
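For context, a minimal sketch of the call being discussed (parameter values are illustrative): with trainImplicit, the input values are treated as confidence weights rather than ratings, so the model's predictions are unnormalized preference scores and are not bounded by 1.

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// `interactions` is an assumed RDD[Rating] of implicit feedback counts (e.g. click or view counts).
val model = ALS.trainImplicit(interactions, rank = 10, iterations = 10, lambda = 0.01, alpha = 1.0)

// Predictions are preference scores, not ratings, so values above 1.0 are expected.
val score = model.predict(316017, 114019)
```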

Equivalent of mllib.DecisionTreeModel.toDebugString() in ml.DecisionTreeClassificationModel

Submitted by 爱⌒轻易说出口 on 2019-12-12 03:22:41
Question: As the title says, is there any equivalent of Spark's org.apache.spark.mllib.tree.model.DecisionTreeModel.toDebugString() in org.apache.spark.ml.classification.DecisionTreeClassificationModel? I have gone through the API doc of the latter and found the method rootNode(), which returns an org.apache.spark.ml.tree.Node object that appears to be recursive, so should I use this class to build the tree structure myself? Thanks in anticipation. Answer 1: org.apache.spark
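A sketch of both options, assuming `model` is a fitted ml DecisionTreeClassificationModel; in recent Spark versions the ml model also exposes a toDebugString method, and the manual recursion over rootNode() is shown only as the alternative the question asks about.

```scala
import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.tree.{InternalNode, LeafNode, Node}

// Option 1: the ml API provides toDebugString as well.
def dump(model: DecisionTreeClassificationModel): Unit =
  println(model.toDebugString)

// Option 2: recurse over rootNode() yourself.
def describe(node: Node, indent: String = ""): Unit = node match {
  case leaf: LeafNode =>
    println(s"${indent}Predict: ${leaf.prediction}")
  case internal: InternalNode =>
    println(s"${indent}Split on feature ${internal.split.featureIndex}")
    describe(internal.leftChild, indent + "  ")
    describe(internal.rightChild, indent + "  ")
}
```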

How to convert RDD[Row] to RDD[Vector]

Submitted by 大憨熊 on 2019-12-12 03:06:43
Question: I'm trying to implement the k-means method using Scala. I created an RDD something like this:

val df = sc.parallelize(data).groupByKey().collect().map((chunk) => {
  sc.parallelize(chunk._2.toSeq).toDF()
})
val examples = df.map(dataframe => {
  dataframe.selectExpr(
    "avg(time) as avg_time",
    "variance(size) as var_size",
    "variance(time) as var_time",
    "count(size) as examples"
  ).rdd
})
val rdd_final = examples.reduce(_ union _)
val kmeans = new KMeans()
val model = kmeans.run(rdd_final)

With this code I
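One illustrative way to finish the conversion (an assumption, not the accepted answer): KMeans.run expects an RDD[Vector], so each Row of the unioned RDD has to be mapped to an mllib Vector first.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Turn each Row of numeric columns into a dense mllib Vector.
val vectors = rdd_final.map(row => Vectors.dense(row.toSeq.map(_.toString.toDouble).toArray))

val kmeansModel = new KMeans().setK(2).run(vectors)  // k = 2 is an arbitrary placeholder
```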

pyspark add new column field with the data frame row number

Submitted by 心不动则不痛 on 2019-12-12 02:54:09
Question: Hi, I'm trying to build a recommendation system with Spark. I have a data frame with users' emails and movie ratings.

df = pd.DataFrame(np.array([["aa@gmail.com", 2, 3], ["aa@gmail.com", 5, 5], ["bb@gmail.com", 8, 2], ["cc@gmail.com", 9, 3]]), columns=['user', 'movie', 'rating'])
sparkdf = sqlContext.createDataFrame(df, samplingRatio=0.1)

user           movie  rating
aa@gmail.com   2      3
aa@gmail.com   5      5
bb@gmail.com   8      2
cc@gmail.com   9      3

My first doubt is: pySpark MLlib doesn't accept emails, am I correct? Because of this I need
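The usual workaround (sketched here in Scala; pyspark ships the same StringIndexer class) is to map each email to a numeric index before feeding the data to ALS, which only accepts integer user and item IDs. The DataFrame name below is an assumption.

```scala
import org.apache.spark.ml.feature.StringIndexer

// `ratings` is the assumed DataFrame with columns user (email), movie and rating.
val indexer = new StringIndexer().setInputCol("user").setOutputCol("userIndex")
val indexed = indexer.fit(ratings).transform(ratings)
// userIndex is a Double; cast it to Int before building mllib Rating objects.
```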

How to schedule my Apache Spark application to run every day at 00:30 AM (night) in IBM Bluemix?

Submitted by 白昼怎懂夜的黑 on 2019-12-12 01:58:14
Question: Hi all & IBM Bluemix team, I am using the IBM Analytics for Apache Spark service in IBM Bluemix. I have developed an Apache Spark application and I want it to run every day at 00:30 AM (at night). How do I schedule my Apache Spark application to run every day at 00:30 AM in IBM Bluemix? Answer 1: You can use any scheduling tool, such as crontab on Linux, that allows you to run the spark-submit.sh script from your machine at a specific time (in your case 00:30 AM). A typical crontab entry would look like
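A hypothetical entry for this schedule (the script path and log file are placeholders, not taken from the original answer):

```
# minute hour day-of-month month day-of-week  command
30 0 * * * /path/to/spark-submit.sh >> /var/log/spark-job.log 2>&1
```

The `30 0 * * *` fields mean minute 30 of hour 0, i.e. 00:30 every day.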

Why is my Spark SVM always predicting the same label?

Submitted by 我怕爱的太早我们不能终老 on 2019-12-11 12:23:17
Question: I'm having trouble getting my SVM to predict 0's and 1's where I would expect it to. It seems that after I train it and give it more data, it always wants to predict a 1 or a 0, but it will predict all 1's or all 0's, and never a mix of the two. I'm wondering if one of you could tell me what I'm doing wrong. I've searched for "svm always predicting same value" and similar problems, and it looks like this is pretty common for those of us new to machine learning. I'm afraid though that I don't
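A minimal diagnostic sketch for this situation (assumptions about the setup, not taken from the original post): check whether the label distribution is badly skewed, and look at the raw margins instead of the thresholded 0/1 output.

```scala
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def inspect(training: RDD[LabeledPoint]): Unit = {
  // A heavily imbalanced label distribution often yields a single-class predictor.
  training.map(_.label).countByValue().foreach(println)

  val model = SVMWithSGD.train(training, 100)
  model.clearThreshold()  // return raw margins instead of hard 0/1 labels
  training.map(p => model.predict(p.features))
    .take(10)
    .foreach(println)     // if all margins sit on one side of 0, the threshold is the culprit
}
```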

Distributed BlockMatrix out of Spark Matrices

Submitted by 梦想与她 on 2019-12-11 12:14:11
Question: How do I make a distributed BlockMatrix out of Matrices (of the same size)? For example, let A and B be two 2-by-2 mllib.linalg.Matrices as follows:

import org.apache.spark.mllib.linalg.{Matrix, Matrices}
import org.apache.spark.mllib.linalg.distributed.BlockMatrix

val A: Matrix = Matrices.dense(2, 2, Array(1.0, 2.0, 3.0, 4.0))
val B: Matrix = Matrices.dense(2, 2, Array(5.0, 6.0, 7.0, 8.0))
val C = new BlockMatrix(???)

How can I first make an RDD[((Int, Int), Matrix)] from A, B and second a
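One possible completion of the snippet above (an assumption, not the accepted answer; `sc` is an existing SparkContext): pair each local matrix with its block coordinates, parallelize the pairs, and pass the per-block dimensions to the BlockMatrix constructor.

```scala
// Place A and B side by side as blocks (0,0) and (0,1) of a 2 x 4 distributed matrix.
val blocks = sc.parallelize(Seq(((0, 0), A), ((0, 1), B)))
val C = new BlockMatrix(blocks, rowsPerBlock = 2, colsPerBlock = 2)
C.validate()  // throws if block indices and sizes are inconsistent
```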

How to provide multiple columns to setInputCol()

Submitted by 岁酱吖の on 2019-12-11 07:37:46
Question: I am very new to Spark machine learning. I want to pass multiple columns as features; in my code below I am only passing the Date column to features, but now I want to pass both the Userid and Date columns to features. I tried to use Vector, but it only supports the Double data type, while in my case I have Int and String. I would be thankful if anyone could provide a suggestion, solution, or code example that fulfills my requirement. Code: case class LabeledDocument(Userid: Double, Date: String, label: Double)
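A sketch of the common approach under these assumptions (the DataFrame name is illustrative): index the String column with a StringIndexer, then combine the numeric columns into a single features vector with a VectorAssembler.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// `training` is an assumed DataFrame built from LabeledDocument(Userid, Date, label).
val dateIndexer = new StringIndexer().setInputCol("Date").setOutputCol("DateIndex")
val assembler = new VectorAssembler()
  .setInputCols(Array("Userid", "DateIndex"))  // several input columns become one features vector
  .setOutputCol("features")
val pipeline = new Pipeline().setStages(Array(dateIndexer, assembler))
val prepared = pipeline.fit(training).transform(training)
```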

UDF to check for non-zero vector not working after CountVectorizer through spark-submit

Submitted by 有些话、适合烂在心里 on 2019-12-11 07:24:27
Question: As per this question, I am applying a UDF to filter out empty vectors after CountVectorizer.

val tokenizer = new RegexTokenizer().setPattern("\\|").setInputCol("dataString").setOutputCol("dataStringWords")
val vectorizer = new CountVectorizer().setInputCol("dataStringWords").setOutputCol("features")
val pipelineTV = new Pipeline().setStages(Array(tokenizer, vectorizer))
val modelTV = pipelineTV.fit(dataset1)
val isNoneZeroVector = udf({v: Vector => v.numNonzeros > 0}, DataTypes.BooleanType)
val
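A possible continuation of that snippet (an assumption about how the UDF is applied, not the original code): transform the data with the fitted pipeline and keep only rows whose CountVectorizer output has at least one non-zero entry.

```scala
import org.apache.spark.sql.functions.col

val vectorized = modelTV.transform(dataset1)
val nonEmpty = vectorized.filter(isNoneZeroVector(col("features")))
```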