apache-spark-mllib

Get Column Names after columnSimilarities() in Spark Scala

a 夏天 submitted on 2019-12-06 07:25:17
I'm trying to build an item-based collaborative filtering model with columnSimilarities() in Spark. After using columnSimilarities() I want to assign the original column names back to the results, in Spark Scala. Runnable code to calculate columnSimilarities() on a data frame follows. Data:

    // rdd
    val rowsRdd: RDD[Row] = sc.parallelize(
      Seq(
        Row(2.0, 7.0, 1.0),
        Row(3.5, 2.5, 0.0),
        Row(7.0, 5.9, 0.0)
      )
    )

    // Schema
    val schema = new StructType()
      .add(StructField("item_1", DoubleType, true))
      .add(StructField("item_2", DoubleType, true))
      .add(StructField("item_3", DoubleType, true))

    // Data frame
    val df =
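
The excerpt ends before the similarity step, but a minimal sketch of the idea, assuming the three columns above and an existing SparkContext sc, is to compute columnSimilarities() on a RowMatrix and map each entry's (i, j) indices back to the column names by position:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Column names in the same order as the schema above
    val colNames = Array("item_1", "item_2", "item_3")

    val rows = sc.parallelize(Seq(
      Vectors.dense(2.0, 7.0, 1.0),
      Vectors.dense(3.5, 2.5, 0.0),
      Vectors.dense(7.0, 5.9, 0.0)
    ))

    val mat  = new RowMatrix(rows)
    val sims = mat.columnSimilarities()   // CoordinateMatrix of upper-triangular entries

    // Each MatrixEntry carries the column indices (i, j), so names can be looked up by position
    val named = sims.entries.map(e => (colNames(e.i.toInt), colNames(e.j.toInt), e.value))
    named.collect().foreach(println)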

How to handle categorical features for Decision Tree, Random Forest in spark ml?

倖福魔咒の submitted on 2019-12-06 05:35:51
Question: I am trying to build decision tree and random forest classifiers on the UCI bank marketing data -> https://archive.ics.uci.edu/ml/datasets/bank+marketing. There are many categorical features (with string values) in the data set. The Spark ML documentation mentions that categorical variables can be converted to numeric by indexing with either StringIndexer or VectorIndexer. I chose to use StringIndexer (VectorIndexer requires vector features and a VectorAssembler, which converts features
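
A minimal sketch of the StringIndexer route described above, assuming a DataFrame df loaded from the UCI CSV; the column names job, marital, y, age and balance come from that data set, everything else is illustrative:

    import org.apache.spark.ml.{Pipeline, PipelineStage}
    import org.apache.spark.ml.classification.DecisionTreeClassifier
    import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

    // Index each string-valued column, and the label column "y"
    val jobIndexer     = new StringIndexer().setInputCol("job").setOutputCol("jobIdx")
    val maritalIndexer = new StringIndexer().setInputCol("marital").setOutputCol("maritalIdx")
    val labelIndexer   = new StringIndexer().setInputCol("y").setOutputCol("label")

    // Combine indexed categorical columns with numeric columns into one feature vector
    val assembler = new VectorAssembler()
      .setInputCols(Array("jobIdx", "maritalIdx", "age", "balance"))
      .setOutputCol("features")

    val dt = new DecisionTreeClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")

    val pipeline = new Pipeline()
      .setStages(Array[PipelineStage](jobIndexer, maritalIndexer, labelIndexer, assembler, dt))

    val model = pipeline.fit(df)   // df: DataFrame read from the bank-marketing CSV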

How to convert a map to Spark's RDD

北战南征 submitted on 2019-12-06 01:34:13
Question: I have a data set in the form of some nested maps, and its Scala type is: Map[String, (LabelType, Map[Int, Double])]. The first String key is a unique identifier for each sample, and the value is a tuple containing the label (which is -1 or 1) and a nested map which is the sparse representation of the non-zero elements associated with the sample. I would like to load this data into Spark (using MLUtils) and train and test some machine learning algorithms. It's easy to write
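
A sketch under the stated types, assuming LabelType is numeric (-1 or 1), that the feature dimensionality numFeatures is known, and that a SparkContext sc is available; rather than going through MLUtils, the map is parallelized directly into an RDD of sparse LabeledPoints:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // numFeatures is a placeholder for the true dimensionality of the sparse feature space
    val numFeatures = 1000

    val data: Map[String, (Double, Map[Int, Double])] = Map(
      "sample-1" -> (1.0,  Map(3 -> 0.5, 17 -> 2.0)),
      "sample-2" -> (-1.0, Map(8 -> 1.5))
    )

    // Parallelize the map entries and build a sparse LabeledPoint per sample
    val labeled = sc.parallelize(data.toSeq).map { case (_, (label, feats)) =>
      val (indices, values) = feats.toSeq.sortBy(_._1).unzip
      LabeledPoint(label, Vectors.sparse(numFeatures, indices.toArray, values.toArray))
    }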

Why can't I load a PySpark RandomForestClassifier model?

只谈情不闲聊 submitted on 2019-12-06 00:31:52
I can't load a RandomForestClassificationModel saved by Spark. Environment: Apache Spark 2.0.1, standalone mode running on a small (4 machine) cluster. No HDFS - everything is saved to local disks. Build and save model:

    classifier = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=50)
    model = classifier.fit(train)
    result = model.transform(test)
    model.write().save("/tmp/models/20161030-RF-topics-cats.model")

Later, in a separate program:

    model = RandomForestClassificationModel.load("/tmp/models/20161029-RF-topics-cats.model")

gives: Py4JJavaError: An error occurred
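
The question is PySpark, but the same save/load round trip, sketched here in Scala for consistency with the rest of this page (train is a placeholder DataFrame with "label" and "features" columns), looks like this; note that load() must point at exactly the path used by save(), and on a multi-machine cluster without HDFS that path generally needs to sit on storage visible to every node:

    import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}

    val rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setNumTrees(50)

    val model = rf.fit(train)

    // Save and later load from exactly the same path, on storage visible to every node
    model.write.overwrite().save("/tmp/models/rf-topics-cats.model")
    val restored = RandomForestClassificationModel.load("/tmp/models/rf-topics-cats.model")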

Spark MLlib 2.0 Categorical Features in pipeline

不问归期 submitted on 2019-12-05 22:49:47
I'm trying to build a decision tree based on log files. Some feature sets are large, containing thousands of unique values. I'm trying to use the new pipeline and data frame idioms in Java. I've built a pipeline with several StringIndexer stages, one for each of the categorical feature columns. Then I use a VectorAssembler to create the features vector. The resulting data frame looks perfect to me after the VectorAssembler stage. My pipeline looks approximately like StringIndexer -> StringIndexer -> StringIndexer -> VectorAssembler -> DecisionTreeClassifier. However, I get the following error:
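
The error text is cut off above, so the sketch below is only a Scala outline of the pipeline shape being described (the question itself is in Java; categoricalCols and the column names are placeholders). One common pitfall with high-cardinality indexed columns is the tree's maxBins setting, shown here:

    import org.apache.spark.ml.{Pipeline, PipelineStage}
    import org.apache.spark.ml.classification.DecisionTreeClassifier
    import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

    // Placeholder categorical columns standing in for the log-file features
    val categoricalCols = Array("host", "path", "status")

    val indexers: Array[PipelineStage] = categoricalCols.map { c =>
      new StringIndexer().setInputCol(c).setOutputCol(c + "_idx")
    }

    val assembler = new VectorAssembler()
      .setInputCols(categoricalCols.map(_ + "_idx"))
      .setOutputCol("features")

    val dt = new DecisionTreeClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      // with thousands of distinct values per indexed column, maxBins must be at least
      // as large as the largest category count or tree training will reject the data
      .setMaxBins(10000)

    val pipeline = new Pipeline().setStages(indexers ++ Array[PipelineStage](assembler, dt))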

Spark LDA woes - prediction and OOM questions

安稳与你 submitted on 2019-12-05 21:43:28
I'm evaluating Spark 1.6.0 to build and predict against large (millions of docs, millions of features, thousands of topics) LDA models, something I can accomplish pretty easily with Yahoo! LDA. Starting small, following the Java examples, I built a 100K doc / 600K feature / 250 topic / 100 iteration model using the distributed model/EM optimizer. The model built fine and the resulting topics were coherent. I then wrote a wrapper around the new single-document prediction routine (SPARK-10809, which I cherry-picked into a custom Spark 1.6.0-based distribution) to get topics for new, unseen documents
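
A minimal Scala sketch of that flow, assuming corpus is an RDD[(Long, Vector)] of (document id, term-count vector) and that the single-document topicDistribution method from SPARK-10809 is available (as in the cherry-picked build described above):

    import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
    import org.apache.spark.mllib.linalg.Vector

    val lda = new LDA()
      .setK(250)
      .setMaxIterations(100)
      .setOptimizer("em")

    // With the EM optimizer, run() returns a DistributedLDAModel
    val distModel = lda.run(corpus).asInstanceOf[DistributedLDAModel]

    // The distributed model cannot score unseen documents directly; converting to a
    // LocalLDAModel exposes the per-document prediction routine
    val localModel = distModel.toLocal
    def topicsFor(doc: Vector): Vector = localModel.topicDistribution(doc)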

SPARK, ML, Tuning, CrossValidator: access the metrics

早过忘川 submitted on 2019-12-05 20:17:31
Question: In order to build a NaiveBayes multiclass classifier, I am using a CrossValidator to select the best parameters in my pipeline:

    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEstimatorParamMaps(paramGrid)
      .setEvaluator(new MulticlassClassificationEvaluator)
      .setNumFolds(10)

    val cvModel = cv.fit(trainingSet)

The pipeline contains the usual transformers and estimators in the following order: Tokenizer, StopWordsRemover, HashingTF, IDF, and finally NaiveBayes. Is it possible to access
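
The excerpt is cut off, but one way to inspect the per-grid-point metrics, assuming the cv and cvModel above, is CrossValidatorModel.avgMetrics, which holds one averaged metric per entry of paramGrid:

    // avgMetrics holds one cross-validated metric per ParamMap, in paramGrid order
    val metrics = cvModel.avgMetrics
    paramGrid.zip(metrics).foreach { case (params, metric) =>
      println(s"metric = $metric for:\n$params")
    }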

How to calculate p-values in Spark's Logistic Regression?

可紊 submitted on 2019-12-05 20:07:15
We are using LogisticRegressionWithSGD and would like to figure out which of our variables are predictive and with what significance. Some stats packages (e.g. StatsModels) return p-values for each term; a low p-value (< 0.05) indicates a meaningful addition to the model. How can we get/calculate p-values from a LogisticRegressionWithSGD model? Any help with this is appreciated.

This is a very old question, but some guidance for people coming to it late might be valuable. LogisticRegressionWithSGD is deprecated. In that version, no true set of "summary" information was provided with the model itself. If
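
As a hedged sketch of one alternative (not necessarily what the truncated answer above goes on to recommend): the DataFrame-based GeneralizedLinearRegression with a binomial family fits a logistic model whose training summary exposes p-values; train is assumed to be a DataFrame with "label" (0/1) and "features" columns:

    import org.apache.spark.ml.regression.GeneralizedLinearRegression

    val glr = new GeneralizedLinearRegression()
      .setFamily("binomial")
      .setLink("logit")
      .setLabelCol("label")
      .setFeaturesCol("features")

    val glrModel = glr.fit(train)

    // One p-value per coefficient (plus the intercept when one is fit)
    println(glrModel.summary.pValues.mkString(", "))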

Using Breeze from Java on Spark MLlib

自古美人都是妖i submitted on 2019-12-05 18:44:32
While trying to use MLlib from Java, what is the correct way to use Breeze matrix operations? For example, multiplication in Scala is simply "matrix * vector". How is the corresponding functionality expressed in Java? There are methods like "$colon$times" which might be invoked in the correct way:

    breeze.linalg.DenseMatrix<Double> matrix = ...
    breeze.linalg.DenseVector<Double> vector = ...
    matrix.$colon$times( ... one might need an operator instance ...
    breeze.linalg.operators.OpMulMatrix.Impl2

But which exact typed operation instance and parameters are to be used? It's honestly very hard.
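
A side note rather than a direct Breeze answer: Spark MLlib's own linalg types expose matrix-vector multiplication as a plain method with no Scala implicits involved, so the call shown here in Scala translates to Java essentially one-to-one:

    import org.apache.spark.mllib.linalg.{DenseVector, Matrices}

    // Matrices.dense is column-major: this is the 2x3 matrix [[1, 2, 3], [4, 5, 6]]
    val matrix = Matrices.dense(2, 3, Array(1.0, 4.0, 2.0, 5.0, 3.0, 6.0))
    val vector = new DenseVector(Array(1.0, 1.0, 1.0))

    val result: DenseVector = matrix.multiply(vector)
    println(result)   // [6.0, 15.0]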

How to convert org.apache.spark.rdd.RDD[Array[Double]] to Array[Double] which is required by Spark MLlib

▼魔方 西西 submitted on 2019-12-05 18:14:19
Question: I am trying to implement KMeans using Apache Spark.

    val data = sc.textFile(irisDatasetString)
    val parsedData = data.map(_.split(',').map(_.toDouble)).cache()
    val clusters = KMeans.train(parsedData, 3, numIterations = 20)

on which I get the following error:

    error: overloaded method value train with alternatives:
      (data: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector], k: Int, maxIterations: Int, runs: Int)org.apache.spark.mllib.clustering.KMeansModel <and>
      (data: org.apache.spark
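
The usual fix for this signature mismatch is to wrap each Array[Double] in a Vector, since KMeans.train expects an RDD[Vector]; a sketch using the same variables as the question:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val data = sc.textFile(irisDatasetString)
    val parsedData = data
      .map(_.split(',').map(_.toDouble))
      .map(arr => Vectors.dense(arr))   // Array[Double] -> org.apache.spark.mllib.linalg.Vector
      .cache()

    val clusters = KMeans.train(parsedData, 3, 20)   // k = 3, maxIterations = 20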