apache-spark-mllib

Calculating Standard Error of Coefficients for Logistic Regression in Spark

Submitted by 亡梦爱人 on 2019-12-22 18:19:01
Question: I know this question has been asked previously here, but I couldn't find the correct answer. The answer in the previous post suggests using Statistics.chiSqTest(data), which provides a goodness-of-fit test (Pearson's chi-square test), not the Wald chi-square tests for the significance of coefficients. I am trying to build the parameter-estimate table for logistic regression in Spark. I was able to get the coefficients and intercepts, but I couldn't find the Spark API to get the
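One way to obtain per-coefficient standard errors in Spark ML is to fit the logistic model as a binomial GLM, whose training summary exposes Wald statistics. A minimal Scala sketch, assuming Spark 2.0+ and a hypothetical DataFrame training with label and features columns:

    import org.apache.spark.ml.regression.GeneralizedLinearRegression

    // Fit the logistic model as a binomial GLM; unlike LogisticRegression,
    // its training summary exposes Wald statistics per coefficient.
    val glr = new GeneralizedLinearRegression()
      .setFamily("binomial")
      .setLink("logit")
      .setLabelCol("label")
      .setFeaturesCol("features")

    val model = glr.fit(training)

    // Standard errors, Wald statistics, and p-values; the last entry
    // corresponds to the intercept when one is fit.
    println(model.summary.coefficientStandardErrors.mkString(", "))
    println(model.summary.tValues.mkString(", "))
    println(model.summary.pValues.mkString(", "))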

Why can't I load a PySpark RandomForestClassifier model?

Submitted by 痞子三分冷 on 2019-12-22 10:50:04
Question: I can't load a RandomForestClassificationModel saved by Spark. Environment: Apache Spark 2.0.1, standalone mode running on a small (4-machine) cluster. No HDFS; everything is saved to local disks. Build and save the model:

    classifier = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=50)
    model = classifier.fit(train)
    result = model.transform(test)
    model.write().save("/tmp/models/20161030-RF-topics-cats.model")

Later, in a separate program: model =
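Two things commonly go wrong here: the model must be loaded with the fitted-model class, not the estimator, and on a multi-machine cluster without HDFS the save is written out by the executors, so the parts land on different local disks. A Scala sketch of the load path, assuming a filesystem visible to the whole cluster:

    import org.apache.spark.ml.classification.RandomForestClassificationModel

    // Load with the model class, not RandomForestClassifier: the estimator
    // and the fitted model are different classes.
    val restored = RandomForestClassificationModel.load(
      "/tmp/models/20161030-RF-topics-cats.model")
    val result = restored.transform(test)

The PySpark analogue is RandomForestClassificationModel.load(path) from pyspark.ml.classification.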

Using Breeze from Java on Spark MLlib

Submitted by 北慕城南 on 2019-12-22 09:26:19
Question: While trying to use MLlib from Java, what is the correct way to use Breeze matrix operations? For example, multiplication in Scala is simply matrix * vector. How is the corresponding functionality expressed in Java? There are methods like $colon$times which might be invoked in the correct way:

    breeze.linalg.DenseMatrix<Double> matrix = ...
    breeze.linalg.DenseVector<Double> vector = ...
    matrix.$colon$times( ... one might need an operator instance ... breeze.linalg.operators.OpMulMatrix
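One common workaround is to avoid the mangled operator names entirely and put the Breeze arithmetic behind a small Scala shim that Java can call; a sketch, with the object and method names purely illustrative:

    import breeze.linalg.{DenseMatrix, DenseVector}

    // Scala shim: the compiler resolves the implicit OpMulMatrix instance
    // here, so Java callers never touch $times / $colon$times directly.
    object BreezeOps {
      def mulMatVec(m: DenseMatrix[Double], v: DenseVector[Double]): DenseVector[Double] =
        m * v
    }

From Java this is then just BreezeOps.mulMatVec(matrix, vector).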

Spark MLlib / K-Means intuition

Submitted by 此生再无相见时 on 2019-12-22 08:42:31
Question: I'm very new to machine learning algorithms and Spark. I'm following the Twitter Streaming Language Classifier found here: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/README.html Specifically this code: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/scala/src/main/scala/com/databricks/apps/twitter_classifier/ExamineAndTrain.scala Except I'm trying to run it in batch mode on some tweets it pulls
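For orientation, the batch flow in that reference app boils down to hashing each tweet's tokens into a term-frequency vector and clustering the vectors; a condensed Scala sketch, with texts standing in for an assumed RDD[String] of tweet bodies:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.feature.HashingTF

    // Featurize: hash each tweet's words into a fixed-size TF vector.
    val tf = new HashingTF(1000)
    val vectors = texts.map(t => tf.transform(t.split(" ").toSeq)).cache()

    // Cluster the TF vectors: k = 10 clusters, 20 iterations.
    val model = KMeans.train(vectors, 10, 20)

    // Assign each tweet to its nearest cluster centre.
    val assignments = texts.map(t => (model.predict(tf.transform(t.split(" ").toSeq)), t))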

Convert Sparse Vector to Dense Vector in Pyspark

Submitted by 我的梦境 on 2019-12-22 08:10:19
Question: I have a sparse vector like this:

    >>> countVectors.rdd.map(lambda vector: vector[1]).collect()
    [SparseVector(13, {0: 1.0, 2: 1.0, 3: 1.0, 6: 1.0, 8: 1.0, 9: 1.0, 10: 1.0, 12: 1.0}), SparseVector(13, {0: 1.0, 1: 1.0, 2: 1.0, 4: 1.0}), SparseVector(13, {0: 1.0, 1: 1.0, 3: 1.0, 4: 1.0, 7: 1.0}), SparseVector(13, {1: 1.0, 2: 1.0, 5: 1.0, 11: 1.0})]

I am trying to convert this into a dense vector in PySpark 2.0.0, like this:

    >>> frequencyVectors = countVectors.rdd.map(lambda vector: vector[1])
    >>>
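The core of the answer is that every MLlib vector can be densified directly; a Scala sketch of the same conversion, on hypothetical input mirroring the question (in PySpark the usual route is building a DenseVector from v.toArray()):

    import org.apache.spark.mllib.linalg.Vectors

    // Hypothetical RDD of sparse vectors, as in the question.
    val frequencyVectors = sc.parallelize(Seq(
      Vectors.sparse(13, Seq((0, 1.0), (2, 1.0), (3, 1.0), (6, 1.0))),
      Vectors.sparse(13, Seq((0, 1.0), (1, 1.0), (2, 1.0), (4, 1.0)))
    ))

    // Vector.toDense performs the whole conversion.
    val dense = frequencyVectors.map(_.toDense)
    dense.collect().foreach(println)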

How can I train a random forest with a sparse matrix in Spark?

Submitted by 假如想象 on 2019-12-22 07:45:06
Question: Consider this simple example that uses sparklyr:

    library(sparklyr)
    library(janeaustenr) # to get some text data
    library(stringr)
    library(dplyr)

    mytext <- austen_books() %>%
      mutate(label = as.integer(str_detect(text, 'great'))) # create a fake label variable

    mytext_spark <- copy_to(sc, mytext, name = 'mytext_spark', overwrite = TRUE)

    # Source: table<mytext_spark> [?? x 3]
    # Database: spark_connection
      text                  book                label
      <chr>                 <chr>               <int>
    1 SENSE AND SENSIBILITY Sense & Sensibility     0
    2 ""                    Sense &
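Underneath sparklyr this runs through Spark ML, whose tree estimators accept sparse feature vectors natively; a Scala sketch of the equivalent pipeline, with mytext standing in for an assumed DataFrame holding text and label columns:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // HashingTF emits *sparse* vectors, and RandomForestClassifier consumes
    // them as-is, so no densification step is needed.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val tf = new HashingTF().setInputCol("words").setOutputCol("features")
    val rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")

    val model = new Pipeline()
      .setStages(Array(tokenizer, tf, rf))
      .fit(mytext)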

Printing ClusterID and its elements using Spark KMeans algo.

Submitted by ♀尐吖头ヾ on 2019-12-21 20:24:56
Question: I have this program, which prints the MSSE of the KMeans algorithm on apache-spark. Twenty clusters are generated. I am trying to print the cluster ID and the elements assigned to each cluster ID. How do I loop over the cluster IDs to print the elements? Thank you guys!!

    val sc = new SparkContext("local", "KMeansExample", "/usr/local/spark/",
      List("target/scala-2.10/kmeans_2.10-1.0.jar"))

    // Load and parse the data
    val data = sc.textFile("kmeans.csv")
    val parsedData = data.map( s =>
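One way to list the members of each cluster is to pair every point with its predicted cluster ID and group; a sketch, assuming clusters is the KMeansModel returned by KMeans.train on parsedData:

    // Pair each point with its cluster id, then group the points by id.
    val byCluster = parsedData
      .map(point => (clusters.predict(point), point))
      .groupByKey()

    byCluster.collect().foreach { case (clusterId, points) =>
      println(s"Cluster $clusterId (${points.size} points):")
      points.foreach(println)
    }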

Naive Bayes multinomial text classifier using DataFrame in Scala Spark

Submitted by 余生颓废 on 2019-12-21 20:22:23
Question: I am trying to build a NaiveBayes classifier, loading the data from a database as a DataFrame containing (label, text). Here is a sample of the data (multinomial labels):

    +-----+--------------------+
    |label|             feature|
    +-----+--------------------+
    |    1|combusting prepar...|
    |    1|adhesives for ind...|
    |    1|                    |
    |    1| salt for preserving|
    |    1|auxiliary fluids ...|

I have used the following transformations for tokenization, stop-word removal, n-grams, and hashTF:

    val selectedData = df.select("label", "feature")
    // Tokenize RDD
    val tokenizer = new
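A compact way to express that chain is a single ML Pipeline ending in the multinomial NaiveBayes estimator; a Scala sketch, assuming the selectedData DataFrame above (rows with empty feature strings may need filtering first):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.NaiveBayes
    import org.apache.spark.ml.feature.{HashingTF, NGram, StopWordsRemover, Tokenizer}

    // tokenize -> drop stop words -> bigrams -> hashed TF -> NaiveBayes
    val tokenizer = new Tokenizer().setInputCol("feature").setOutputCol("words")
    val remover   = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
    val ngram     = new NGram().setN(2).setInputCol("filtered").setOutputCol("ngrams")
    val hashingTF = new HashingTF().setInputCol("ngrams").setOutputCol("features")
    val nb        = new NaiveBayes().setModelType("multinomial")

    val model = new Pipeline()
      .setStages(Array(tokenizer, remover, ngram, hashingTF, nb))
      .fit(selectedData)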

How to pass params to an ML Pipeline.fit method?

Submitted by 帅比萌擦擦* on 2019-12-21 20:20:07
Question: I am trying to build a clustering mechanism using Google Dataproc + Spark and Google BigQuery, creating a job with the Spark ML KMeans + Pipeline, as follows:

Create a user-level feature table in BigQuery. Example of how the feature table looks:

    userid |x1   |x2 |x3 |x4 |x5 |x6 |x7 |x8   |x9   |x10
    00013  |0.01 | 0 |0  |0  |0  |0  |0  |0.06 |0.09 | 0.001

Spin up a cluster with default settings; I am using the gcloud command-line interface to create the cluster and run jobs as shown here. Using the starter code provided, I
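On the title question: fit() accepts an optional ParamMap whose entries override the stages' current settings at fit time. A minimal Scala sketch over an assumed assembler + KMeans pipeline on the feature table above (featureTable is a hypothetical DataFrame):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.ml.param.ParamMap

    // Assemble x1..x10 into a single feature vector, then cluster.
    val assembler = new VectorAssembler()
      .setInputCols((1 to 10).map(i => s"x$i").toArray)
      .setOutputCol("features")
    val kmeans = new KMeans().setFeaturesCol("features")
    val pipeline = new Pipeline().setStages(Array(assembler, kmeans))

    // The ParamMap passed to fit() overrides the stages' current values.
    val params = ParamMap(kmeans.k -> 5, kmeans.maxIter -> 40)
    val model = pipeline.fit(featureTable, params)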

Matrix Operation in Spark MLlib in Java

Submitted by ≯℡__Kan透↙ on 2019-12-21 19:47:52
Question: This question is about MLlib (Spark 1.2.1+). What is the best way to manipulate local matrices (of moderate size, under 100x100, so they do not need to be distributed)? For instance, after computing the SVD of a dataset, I need to perform some matrix operations. RowMatrix only provides a multiply function. The toBreeze method returns a DenseMatrix<Object>, but the API does not seem Java-friendly:

    public final <TT,B,That> That $plus(B b, UFunc.UImpl2<OpAdd$,TT,B,That> op)

In Spark + Java, how to do
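A practical pattern is to convert the MLlib local matrix to Breeze inside a tiny Scala helper, do the arithmetic there, and convert back, so Java never sees the mangled signatures. A sketch with illustrative helper names:

    import breeze.linalg.{DenseMatrix => BDM}
    import org.apache.spark.mllib.linalg.{Matrices, Matrix}

    object LocalMatrixOps {
      // MLlib local matrices store values column-major, as Breeze expects.
      private def toBreeze(m: Matrix): BDM[Double] =
        new BDM(m.numRows, m.numCols, m.toArray)

      // Results of Breeze arithmetic are freshly allocated and compact,
      // so their backing array can be handed straight back to MLlib.
      private def fromBreeze(m: BDM[Double]): Matrix =
        Matrices.dense(m.rows, m.cols, m.data)

      def add(a: Matrix, b: Matrix): Matrix =
        fromBreeze(toBreeze(a) + toBreeze(b))

      def multiply(a: Matrix, b: Matrix): Matrix =
        fromBreeze(toBreeze(a) * toBreeze(b))
    }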