apache-spark-mllib

Get Column Names after columnSimilarities() in Spark Scala

a 夏天 submitted on 2019-12-06 07:25:17
I'm trying to build an item-based collaborative filtering model with columnSimilarities() in Spark. After using columnSimilarities() I want to assign the original column names back to the results, in Spark Scala. Runnable code to calculate columnSimilarities() on a data frame follows. Data:

    // rdd
    val rowsRdd: RDD[Row] = sc.parallelize(
      Seq(
        Row(2.0, 7.0, 1.0),
        Row(3.5, 2.5, 0.0),
        Row(7.0, 5.9, 0.0)
      )
    )

    // Schema
    val schema = new StructType()
      .add(StructField("item_1", DoubleType, true))
      .add(StructField("item_2", DoubleType, true))
      .add(StructField("item_3", DoubleType, true))

    // Data frame
    val df =
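
The excerpt ends before the similarity step, but a minimal sketch of the idea, assuming the three columns above and an existing SparkContext sc, is to compute columnSimilarities() on a RowMatrix and map each entry's (i, j) indices back to the column names by position:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Column names in the same order as the schema above
    val colNames = Array("item_1", "item_2", "item_3")

    val rows = sc.parallelize(Seq(
      Vectors.dense(2.0, 7.0, 1.0),
      Vectors.dense(3.5, 2.5, 0.0),
      Vectors.dense(7.0, 5.9, 0.0)
    ))

    val mat  = new RowMatrix(rows)
    val sims = mat.columnSimilarities()   // CoordinateMatrix of upper-triangular entries

    // Each MatrixEntry carries the column indices (i, j), so names can be looked up by position
    val named = sims.entries.map(e => (colNames(e.i.toInt), colNames(e.j.toInt), e.value))
    named.collect().foreach(println)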

How to handle categorical features for Decision Tree, Random Forest in spark ml?

倖福魔咒の submitted on 2019-12-06 05:35:51
Question: I am trying to build decision tree and random forest classifiers on the UCI bank marketing data -> https://archive.ics.uci.edu/ml/datasets/bank+marketing. There are many categorical features (with string values) in the data set. The Spark ML documentation mentions that categorical variables can be converted to numeric by indexing with either StringIndexer or VectorIndexer. I chose to use StringIndexer (VectorIndexer requires vector features and a VectorAssembler, which converts features
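
A minimal sketch of the StringIndexer route described above, assuming a DataFrame df loaded from the UCI CSV; the column names job, marital, y, age and balance come from that data set, everything else is illustrative:

    import org.apache.spark.ml.{Pipeline, PipelineStage}
    import org.apache.spark.ml.classification.DecisionTreeClassifier
    import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

    // Index each string-valued column, and the label column "y"
    val jobIndexer     = new StringIndexer().setInputCol("job").setOutputCol("jobIdx")
    val maritalIndexer = new StringIndexer().setInputCol("marital").setOutputCol("maritalIdx")
    val labelIndexer   = new StringIndexer().setInputCol("y").setOutputCol("label")

    // Combine indexed categorical columns with numeric columns into one feature vector
    val assembler = new VectorAssembler()
      .setInputCols(Array("jobIdx", "maritalIdx", "age", "balance"))
      .setOutputCol("features")

    val dt = new DecisionTreeClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")

    val pipeline = new Pipeline()
      .setStages(Array[PipelineStage](jobIndexer, maritalIndexer, labelIndexer, assembler, dt))

    val model = pipeline.fit(df)   // df: DataFrame read from the bank-marketing CSV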

How to convert a map to Spark's RDD

北战南征 submitted on 2019-12-06 01:34:13
Question: I have a data set in the form of some nested maps, and its Scala type is: Map[String, (LabelType, Map[Int, Double])]. The first String key is a unique identifier for each sample, and the value is a tuple containing the label (which is -1 or 1) and a nested map which is the sparse representation of the non-zero elements associated with the sample. I would like to load this data into Spark (using MLUtils) and train and test some machine learning algorithms. It's easy to write
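
A sketch under the stated types, assuming LabelType is numeric (-1 or 1), that the feature dimensionality numFeatures is known, and that a SparkContext sc is available; rather than going through MLUtils, the map is parallelized directly into an RDD of sparse LabeledPoints:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // numFeatures is a placeholder for the true dimensionality of the sparse feature space
    val numFeatures = 1000

    val data: Map[String, (Double, Map[Int, Double])] = Map(
      "sample-1" -> (1.0,  Map(3 -> 0.5, 17 -> 2.0)),
      "sample-2" -> (-1.0, Map(8 -> 1.5))
    )

    // Parallelize the map entries and build a sparse LabeledPoint per sample
    val labeled = sc.parallelize(data.toSeq).map { case (_, (label, feats)) =>
      val (indices, values) = feats.toSeq.sortBy(_._1).unzip
      LabeledPoint(label, Vectors.sparse(numFeatures, indices.toArray, values.toArray))
    }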

Why can't I load a PySpark RandomForestClassifier model?

只谈情不闲聊 submitted on 2019-12-06 00:31:52
I can't load a RandomForestClassificationModel saved by Spark. Environment: Apache Spark 2.0.1, standalone mode running on a small (4 machine) cluster. No HDFS - everything is saved to local disks. Build and save model:

    classifier = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=50)
    model = classifier.fit(train)
    result = model.transform(test)
    model.write().save("/tmp/models/20161030-RF-topics-cats.model")

Later, in a separate program:

    model = RandomForestClassificationModel.load("/tmp/models/20161029-RF-topics-cats.model")

gives: Py4JJavaError: An error occurred
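
The question is PySpark, but the same save/load round trip, sketched here in Scala for consistency with the rest of this page (train is a placeholder DataFrame with "label" and "features" columns), looks like this; note that load() must point at exactly the path used by save(), and on a multi-machine cluster without HDFS that path generally needs to sit on storage visible to every node:

    import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}

    val rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setNumTrees(50)

    val model = rf.fit(train)

    // Save and later load from exactly the same path, on storage visible to every node
    model.write.overwrite().save("/tmp/models/rf-topics-cats.model")
    val restored = RandomForestClassificationModel.load("/tmp/models/rf-topics-cats.model")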

Spark MLlib 2.0 Categorical Features in pipeline

不问归期 submitted on 2019-12-05 22:49:47
I'm trying to build a decision tree based on log files. Some feature sets are large, containing thousands of unique values. I'm trying to use the new pipeline and data frame idioms in Java. I've built a pipeline with several StringIndexer stages, one for each of the categorical feature columns. Then I use a VectorAssembler to create the features vector. The resulting data frame looks perfect to me after the VectorAssembler stage. My pipeline looks approximately like StringIndexer -> StringIndexer -> StringIndexer -> VectorAssembler -> DecisionTreeClassifier. However, I get the following error:
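
The error text is cut off above, so the sketch below is only a Scala outline of the pipeline shape being described (the question itself is in Java; categoricalCols and the column names are placeholders). One common pitfall with high-cardinality indexed columns is the tree's maxBins setting, shown here:

    import org.apache.spark.ml.{Pipeline, PipelineStage}
    import org.apache.spark.ml.classification.DecisionTreeClassifier
    import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

    // Placeholder categorical columns standing in for the log-file features
    val categoricalCols = Array("host", "path", "status")

    val indexers: Array[PipelineStage] = categoricalCols.map { c =>
      new StringIndexer().setInputCol(c).setOutputCol(c + "_idx")
    }

    val assembler = new VectorAssembler()
      .setInputCols(categoricalCols.map(_ + "_idx"))
      .setOutputCol("features")

    val dt = new DecisionTreeClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      // with thousands of distinct values per indexed column, maxBins must be at least
      // as large as the largest category count or tree training will reject the data
      .setMaxBins(10000)

    val pipeline = new Pipeline().setStages(indexers ++ Array[PipelineStage](assembler, dt))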

Spark LDA woes - prediction and OOM questions

安稳与你 submitted on 2019-12-05 21:43:28
I'm evaluating Spark 1.6.0 to build and predict against large (millions of docs, millions of features, thousands of topics) LDA models, something I can accomplish pretty easily with Yahoo! LDA. Starting small, following the Java examples, I built a 100K doc / 600K feature / 250 topic / 100 iteration model using the distributed model/EM optimizer. The model built fine and the resulting topics were coherent. I then wrote a wrapper around the new single-document prediction routine (SPARK-10809, which I cherry-picked into a custom Spark 1.6.0-based distribution) to get topics for new, unseen documents
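
A minimal Scala sketch of that flow, assuming corpus is an RDD[(Long, Vector)] of (document id, term-count vector) and that the single-document topicDistribution method from SPARK-10809 is available (as in the cherry-picked build described above):

    import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
    import org.apache.spark.mllib.linalg.Vector

    val lda = new LDA()
      .setK(250)
      .setMaxIterations(100)
      .setOptimizer("em")

    // With the EM optimizer, run() returns a DistributedLDAModel
    val distModel = lda.run(corpus).asInstanceOf[DistributedLDAModel]

    // The distributed model cannot score unseen documents directly; converting to a
    // LocalLDAModel exposes the per-document prediction routine
    val localModel = distModel.toLocal
    def topicsFor(doc: Vector): Vector = localModel.topicDistribution(doc)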

SPARK, ML, Tuning, CrossValidator: access the metrics

早过忘川 submitted on 2019-12-05 20:17:31
Question: In order to build a NaiveBayes multiclass classifier, I am using a CrossValidator to select the best parameters in my pipeline:

    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEstimatorParamMaps(paramGrid)
      .setEvaluator(new MulticlassClassificationEvaluator)
      .setNumFolds(10)

    val cvModel = cv.fit(trainingSet)

The pipeline contains the usual transformers and estimators in the following order: Tokenizer, StopWordsRemover, HashingTF, IDF, and finally NaiveBayes. Is it possible to access
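
The excerpt is cut off, but one way to inspect the per-grid-point metrics, assuming the cv and cvModel above, is CrossValidatorModel.avgMetrics, which holds one averaged metric per entry of paramGrid:

    // avgMetrics holds one cross-validated metric per ParamMap, in paramGrid order
    val metrics = cvModel.avgMetrics
    paramGrid.zip(metrics).foreach { case (params, metric) =>
      println(s"metric = $metric for:\n$params")
    }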

How to calculate p-values in Spark's Logistic Regression?

可紊 submitted on 2019-12-05 20:07:15
We are using LogisticRegressionWithSGD and would like to figure out which of our variables are predictive and with what significance. Some stats packages (e.g. StatsModels) return p-values for each term; a low p-value (< 0.05) indicates a meaningful addition to the model. How can we get/calculate p-values from a LogisticRegressionWithSGD model? Any help with this is appreciated.

This is a very old question, but some guidance for people coming to it late might be valuable. LogisticRegressionWithSGD is deprecated. In that version, no true set of "summary" information was provided with the model itself. If
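
As a hedged sketch of one alternative (not necessarily what the truncated answer above goes on to recommend): the DataFrame-based GeneralizedLinearRegression with a binomial family fits a logistic model whose training summary exposes p-values; train is assumed to be a DataFrame with "label" (0/1) and "features" columns:

    import org.apache.spark.ml.regression.GeneralizedLinearRegression

    val glr = new GeneralizedLinearRegression()
      .setFamily("binomial")
      .setLink("logit")
      .setLabelCol("label")
      .setFeaturesCol("features")

    val glrModel = glr.fit(train)

    // One p-value per coefficient (plus the intercept when one is fit)
    println(glrModel.summary.pValues.mkString(", "))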

Using Breeze from Java on Spark MLlib

自古美人都是妖i submitted on 2019-12-05 18:44:32
While trying to use MLlib from Java, what is the correct way to use Breeze matrix operations? For example, multiplication in Scala is simply "matrix * vector". How is the corresponding functionality expressed in Java? There are methods like "$colon$times" which might be invoked in the correct way:

    breeze.linalg.DenseMatrix<Double> matrix = ...
    breeze.linalg.DenseVector<Double> vector = ...
    matrix.$colon$times( ... one might need an operator instance ...
    breeze.linalg.operators.OpMulMatrix.Impl2

But which exact typed operation instance and parameters are to be used? It's honestly very hard.
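
A side note rather than a direct Breeze answer: Spark MLlib's own linalg types expose matrix-vector multiplication as a plain method with no Scala implicits involved, so the call shown here in Scala translates to Java essentially one-to-one:

    import org.apache.spark.mllib.linalg.{DenseVector, Matrices}

    // Matrices.dense is column-major: this is the 2x3 matrix [[1, 2, 3], [4, 5, 6]]
    val matrix = Matrices.dense(2, 3, Array(1.0, 4.0, 2.0, 5.0, 3.0, 6.0))
    val vector = new DenseVector(Array(1.0, 1.0, 1.0))

    val result: DenseVector = matrix.multiply(vector)
    println(result)   // [6.0, 15.0]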

How to convert org.apache.spark.rdd.RDD[Array[Double]] to Array[Double] which is required by Spark MLlib

▼魔方 西西 submitted on 2019-12-05 18:14:19
Question: I am trying to implement KMeans using Apache Spark.

    val data = sc.textFile(irisDatasetString)
    val parsedData = data.map(_.split(',').map(_.toDouble)).cache()
    val clusters = KMeans.train(parsedData, 3, numIterations = 20)

on which I get the following error:

    error: overloaded method value train with alternatives:
      (data: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector], k: Int, maxIterations: Int, runs: Int)org.apache.spark.mllib.clustering.KMeansModel <and>
      (data: org.apache.spark
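
The usual fix for this signature mismatch is to wrap each Array[Double] in a Vector, since KMeans.train expects an RDD[Vector]; a sketch using the same variables as the question:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val data = sc.textFile(irisDatasetString)
    val parsedData = data
      .map(_.split(',').map(_.toDouble))
      .map(arr => Vectors.dense(arr))   // Array[Double] -> org.apache.spark.mllib.linalg.Vector
      .cache()

    val clusters = KMeans.train(parsedData, 3, 20)   // k = 3, maxIterations = 20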