apache-spark-mllib

Use of foreachActive for spark Vector in Java

Posted by 落爺英雄遲暮 on 2019-12-04 20:32:08
How can I write simple Java code that iterates over the active elements of a sparse vector? Let's say we have such a Vector:

Vector sv = Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0});

I was trying with a lambda or with Function2 (from three different imports) but always failed. If you use Function2, please provide the necessary import.

Adrian, here is how you can use the foreachActive method on the sparse Vector:

AbstractFunction2<Object, Object, BoxedUnit> f = new AbstractFunction2<Object, Object, BoxedUnit>() {
    public BoxedUnit apply(Object t1, Object t2) {
        System.out.println("Index: " + t1 + " Value: " + t2);
        return BoxedUnit.UNIT;
    }
};
sv.foreachActive(f);

Understanding Spark MLlib LDA input format

Posted by 雨燕双飞 on 2019-12-04 19:14:16
I am trying to implement LDA using Spark MLlib, but I am having difficulty understanding the input format. I was able to run the sample implementation, which takes its input from a file containing only numbers, as shown:

1 2 6 0 2 3 1 1 0 0 3
1 3 0 1 3 0 0 2 0 0 1
1 4 1 0 0 4 9 0 1 2 0
2 1 0 3 0 0 5 0 2 3 9
3 1 1 9 3 0 2 0 0 1 3
4 2 0 3 4 5 1 1 1 4 0
2 1 0 3 0 0 5 0 2 2 9
1 1 1 9 2 1 2 0 0 1 3
4 4 0 3 4 2 1 3 0 0 0
2 8 2 0 3 0 2 0 2 7 2
1 1 1 9 0 2 2 0 0 3 3
4 1 0 0 4 5 1 3 0 1 0

I followed http://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda I understand the
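For reference, here is a minimal Scala sketch of how that file is fed to MLlib's LDA, following the linked documentation: each line is one document, and column j holds the count of vocabulary word j in that document (assumes an existing SparkContext named sc; the path and k value are illustrative).

import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// Each line is a document: a vector of term counts over the vocabulary.
val data = sc.textFile("data/mllib/sample_lda_data.txt")
val parsedData = data.map(line => Vectors.dense(line.trim.split(' ').map(_.toDouble)))
// LDA expects (documentId, termCountVector) pairs.
val corpus = parsedData.zipWithIndex.map(_.swap).cache()
val ldaModel = new LDA().setK(3).run(corpus)
// topicsMatrix is vocabSize x k: the weight of each vocabulary word in each topic.
println(ldaModel.topicsMatrix)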

PCA in Spark MLlib and Spark ML

Posted by 笑着哭i on 2019-12-04 18:39:22
Question: Spark now has two machine learning libraries, Spark MLlib and Spark ML. They somewhat overlap in what is implemented, but as I understand it (as a person new to the whole Spark ecosystem), Spark ML is the way to go and MLlib is still around mostly for backward compatibility. My question is very concrete and related to PCA. In the MLlib implementation there seems to be a limitation on the number of columns: spark.mllib supports PCA for tall-and-skinny matrices stored in row-oriented format and any
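For the spark.ml side, here is a minimal sketch of PCA on a DataFrame of feature vectors (the toy data and column names are illustrative; assumes a SparkSession named spark and Spark 2.x-style imports):

import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors

// Toy DataFrame with a single "features" column of dense vectors.
val data = Seq(
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0),
  Vectors.dense(1.0, 1.0, 4.0, 2.0, 3.0)
)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

// Project onto the top 3 principal components.
val pcaModel = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(3)
  .fit(df)

pcaModel.transform(df).select("pcaFeatures").show(false)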

How to keep record information when working with MLlib

Posted by 僤鯓⒐⒋嵵緔 on 2019-12-04 18:18:07
I'm working on a classification problem in which I have to use the mllib library. The classification algorithms in mllib (let's say Logistic Regression) require an RDD[LabeledPoint]. A LabeledPoint has only two fields, a label and a feature vector. When doing the scoring (applying my trained model to the test set), my test instances have a few other fields that I'd like to keep. For example, a test instance looks like this: <id, field1, field2, label, features>. When I create an RDD of LabeledPoint, all the other fields (id, field1 and field2) are gone and I can't make the relation between my scored
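One common workaround is to carry the extra fields alongside the LabeledPoint instead of discarding them, and to call the model only on the feature vector. A minimal Scala sketch (the Record case class and its field names are hypothetical):

import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical test-set record with extra fields that must survive scoring.
case class Record(id: String, field1: String, field2: String, label: Double, features: Vector)

def score(model: LogisticRegressionModel, test: RDD[Record]): RDD[(String, Double, Double)] =
  test.map { r =>
    val lp = LabeledPoint(r.label, r.features)
    // (id, true label, predicted label): the id keeps the link back to field1/field2.
    (r.id, lp.label, model.predict(lp.features))
  }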

How to get Precision/Recall using CrossValidator for training NaiveBayes Model using Spark

Posted by 二次信任 on 2019-12-04 18:05:24
Suppose I have a Pipeline like this:

val tokenizer = new Tokenizer().setInputCol("tweet").setOutputCol("words")
val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol("words").setOutputCol("features")
val idf = new IDF().setInputCol("features").setOutputCol("idffeatures")
val nb = new org.apache.spark.ml.classification.NaiveBayes()
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, idf, nb))
val paramGrid = new ParamGridBuilder().addGrid(hashingTF.numFeatures, Array(10, 100, 1000)).addGrid(nb.smoothing, Array(0.01, 0.1, 1)).build()
val cv = new CrossValidator()
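A minimal sketch of how the CrossValidator can be wired to report precision and recall via MulticlassClassificationEvaluator (trainingDf and testDf are assumed DataFrames with "tweet" and "label" columns; the exact metric names vary slightly across Spark versions):

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Metric names such as "weightedPrecision", "weightedRecall" and "f1" select what is reported.
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("weightedPrecision")

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(trainingDf)
val precision = evaluator.evaluate(cvModel.transform(testDf))
val recall = evaluator.setMetricName("weightedRecall").evaluate(cvModel.transform(testDf))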

Matrix Operation in Spark MLlib in Java

Posted by 巧了我就是萌 on 2019-12-04 13:17:26
This question is about MLlib (Spark 1.2.1+). What is the best way to manipulate local matrices (moderate size, under 100x100, so they do not need to be distributed)? For instance, after computing the SVD of a dataset, I need to perform some matrix operations. RowMatrix only provides a multiply function. The toBreeze method returns a DenseMatrix<Object>, but the API does not seem Java-friendly:

public final <TT,B,That> That $plus(B b, UFunc.UImpl2<OpAdd$,TT,B,That> op)

In Spark + Java, how do I do any of the following operations: transpose a matrix, add/subtract two matrices, crop a Matrix, perform
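For reference, a minimal Scala sketch of what those operations look like once the local Matrix is copied into a Breeze DenseMatrix (Matrix.toArray is column-major, which matches Breeze's layout); from Java the usual route is the same copy via toArray() into whatever local linear-algebra library you prefer:

import breeze.linalg.{DenseMatrix => BDM}
import org.apache.spark.mllib.linalg.{Matrices, Matrix}

// Copy a small local Spark Matrix into Breeze; both use column-major storage.
def asBreeze(m: Matrix): BDM[Double] = new BDM[Double](m.numRows, m.numCols, m.toArray)

val a = asBreeze(Matrices.dense(2, 2, Array(1.0, 2.0, 3.0, 4.0)))
val b = asBreeze(Matrices.dense(2, 2, Array(0.5, 0.5, 0.5, 0.5)))

val sum        = a + b          // element-wise addition
val difference = a - b          // element-wise subtraction
val transposed = a.t            // transpose
val cropped    = a(0 to 0, ::)  // crop: keep row 0 and all columns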

How to use RandomForest in Spark Pipeline

Posted by 对着背影说爱祢 on 2019-12-04 13:16:26
Question: I want to tune my model with grid search and cross validation in Spark. In Spark, the base model must be put into a Pipeline; the official Pipeline demo uses LogisticRegression as the base model, which can be instantiated with new. However, the RandomForest model cannot be instantiated by client code, so it seems it cannot be used in the Pipeline API. I don't want to reinvent the wheel, so can anybody give some advice? Thanks

Answer 1: However, the RandomForest model cannot be new by
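A minimal sketch under the assumption that the newer spark.ml API is available: its RandomForestClassifier is a regular estimator that can be constructed with new and dropped into a Pipeline and a grid search (trainingDf is an assumed DataFrame with "label" and "features" columns; the grid values are illustrative):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Unlike the old mllib RandomForest object, the spark.ml estimator is instantiable.
val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

val pipeline = new Pipeline().setStages(Array(rf))

val paramGrid = new ParamGridBuilder()
  .addGrid(rf.numTrees, Array(20, 50))
  .addGrid(rf.maxDepth, Array(5, 10))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new MulticlassClassificationEvaluator().setMetricName("f1"))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(trainingDf)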

Naive Bayes multinomial text classifier using DataFrame in Scala Spark

Posted by 让人想犯罪 __ on 2019-12-04 12:29:55
I am trying to build a NaiveBayes classifier, loading the data from a database as a DataFrame that contains (label, text). Here is a sample of the data (with a multinomial label):

+-----+--------------------+
|label|             feature|
+-----+--------------------+
|    1|combusting prepar...|
|    1|adhesives for ind...|
|    1|                    |
|    1| salt for preserving|
|    1|auxiliary fluids ...|
+-----+--------------------+

I have used the following transformations for tokenization, stop words, n-grams, and hashingTF:

val selectedData = df.select("label", "feature")
// Tokenize RDD
val tokenizer = new Tokenizer().setInputCol("feature").setOutputCol("words")
val regexTokenizer = new RegexTokenizer()
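A minimal Scala sketch of one way to finish the pipeline with spark.ml's multinomial NaiveBayes. Column names follow the question; the stop-word step, feature size, regex pattern and split are illustrative, and the "label" column is assumed to already be numeric:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{HashingTF, IDF, RegexTokenizer, StopWordsRemover}

val tokenizer = new RegexTokenizer().setInputCol("feature").setOutputCol("words").setPattern("\\W")
val remover   = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
val hashingTF = new HashingTF().setInputCol("filtered").setOutputCol("rawFeatures").setNumFeatures(1000)
val idf       = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val nb        = new NaiveBayes().setModelType("multinomial").setLabelCol("label").setFeaturesCol("features")

val pipeline = new Pipeline().setStages(Array(tokenizer, remover, hashingTF, idf, nb))

val Array(train, test) = df.select("label", "feature").randomSplit(Array(0.8, 0.2), seed = 42L)
val model = pipeline.fit(train)
model.transform(test).select("label", "prediction").show()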

How to handle categorical features for Decision Tree, Random Forest in spark ml?

Posted by 流过昼夜 on 2019-12-04 10:38:34
I am trying to build decision tree and random forest classifiers on the UCI bank marketing data -> https://archive.ics.uci.edu/ml/datasets/bank+marketing . There are many categorical features (with string values) in the data set. The spark ml documentation mentions that categorical variables can be converted to numeric by indexing with either StringIndexer or VectorIndexer. I chose to use StringIndexer (VectorIndexer requires a vector feature column, and VectorAssembler, which converts the features into a vector column, accepts only numeric types). Using this approach, each of the level of a
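A minimal sketch of the StringIndexer + VectorAssembler recipe ending in a random forest. The column names are a small subset of the bank-marketing schema and are only illustrative; trainingDf is an assumed DataFrame:

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

val categoricalCols = Array("job", "marital", "education")   // string-valued columns
val numericCols     = Array("age", "balance", "duration")    // already numeric

// One StringIndexer per categorical column: each level becomes a numeric index.
val indexers = categoricalCols.map { c =>
  new StringIndexer().setInputCol(c).setOutputCol(c + "_idx")
}
val labelIndexer = new StringIndexer().setInputCol("y").setOutputCol("label")

// Assemble indexed categorical columns plus numeric columns into one feature vector.
val assembler = new VectorAssembler()
  .setInputCols(categoricalCols.map(_ + "_idx") ++ numericCols)
  .setOutputCol("features")

val rf = new RandomForestClassifier().setLabelCol("label").setFeaturesCol("features")

val stages: Array[PipelineStage] = indexers ++ Array(labelIndexer, assembler, rf)
val model = new Pipeline().setStages(stages).fit(trainingDf)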

How to convert a map to Spark's RDD

Posted by 有些话、适合烂在心里 on 2019-12-04 10:09:45
I have a data set which is in the form of some nested maps, and its Scala type is:

Map[String, (LabelType, Map[Int, Double])]

The first String key is a unique identifier for each sample, and the value is a tuple that contains the label (which is -1 or 1) and a nested map which is the sparse representation of the non-zero elements associated with the sample. I would like to load this data into Spark (using MLUtils) and train and test some machine learning algorithms. It's easy to write this data into a file with LibSVM's sparse encoding and then load it in Spark: writeMapToLibSVMFile
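Alternatively, the map can be parallelized directly into an RDD without the LibSVM round-trip. A minimal Scala sketch, assuming LabelType is a numeric -1/1 label, numFeatures is the vector dimensionality, and sc is an existing SparkContext; keeping the id in a pair RDD also preserves the sample identifier:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def toLabeledPoints(
    data: Map[String, (Double, Map[Int, Double])],
    numFeatures: Int): RDD[(String, LabeledPoint)] =
  sc.parallelize(data.toSeq).map { case (id, (label, featureMap)) =>
    // Sparse vectors need sorted indices; MLlib classifiers expect 0/1 labels.
    val (indices, values) = featureMap.toSeq.sortBy(_._1).unzip
    val features = Vectors.sparse(numFeatures, indices.toArray, values.toArray)
    (id, LabeledPoint(if (label > 0) 1.0 else 0.0, features))
  }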