apache-spark-mllib

Calculating Standard Error of Coefficients for Logistic Regression in Spark

Submitted by 亡梦爱人 on 2019-12-22 18:19:01
Question: I know this question has been asked previously here, but I couldn't find the correct answer. The answer in the previous post suggests using Statistics.chiSqTest(data), which provides a goodness-of-fit test (Pearson's chi-square test), not the Wald chi-square tests for the significance of coefficients. I am trying to build the parameter-estimate table for logistic regression in Spark. I was able to get the coefficients and intercepts, but I couldn't find the Spark API to get the
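One way to obtain per-coefficient standard errors in Spark ML is to fit the logistic model as a binomial GLM, whose training summary exposes Wald statistics. A minimal Scala sketch, assuming Spark 2.0+ and a hypothetical DataFrame training with label and features columns:

    import org.apache.spark.ml.regression.GeneralizedLinearRegression

    // Fit the logistic model as a binomial GLM; unlike LogisticRegression,
    // its training summary exposes Wald statistics per coefficient.
    val glr = new GeneralizedLinearRegression()
      .setFamily("binomial")
      .setLink("logit")
      .setLabelCol("label")
      .setFeaturesCol("features")

    val model = glr.fit(training)

    // Standard errors, Wald statistics, and p-values; the last entry
    // corresponds to the intercept when one is fit.
    println(model.summary.coefficientStandardErrors.mkString(", "))
    println(model.summary.tValues.mkString(", "))
    println(model.summary.pValues.mkString(", "))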

Why can't I load a PySpark RandomForestClassifier model?

Submitted by 痞子三分冷 on 2019-12-22 10:50:04
Question: I can't load a RandomForestClassificationModel saved by Spark. Environment: Apache Spark 2.0.1, standalone mode running on a small (4-machine) cluster. No HDFS; everything is saved to local disks. Build and save the model:

    classifier = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=50)
    model = classifier.fit(train)
    result = model.transform(test)
    model.write().save("/tmp/models/20161030-RF-topics-cats.model")

Later, in a separate program: model =
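Two things commonly go wrong here: the model must be loaded with the fitted-model class, not the estimator, and on a multi-machine cluster without HDFS the save is written out by the executors, so the parts land on different local disks. A Scala sketch of the load path, assuming a filesystem visible to the whole cluster:

    import org.apache.spark.ml.classification.RandomForestClassificationModel

    // Load with the model class, not RandomForestClassifier: the estimator
    // and the fitted model are different classes.
    val restored = RandomForestClassificationModel.load(
      "/tmp/models/20161030-RF-topics-cats.model")
    val result = restored.transform(test)

The PySpark analogue is RandomForestClassificationModel.load(path) from pyspark.ml.classification.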

Using Breeze from Java on Spark MLlib

Submitted by 北慕城南 on 2019-12-22 09:26:19
Question: While trying to use MLlib from Java, what is the correct way to use Breeze matrix operations? For example, multiplication in Scala is simply matrix * vector. How is the corresponding functionality expressed in Java? There are methods like $colon$times which might be invoked in the correct way:

    breeze.linalg.DenseMatrix<Double> matrix = ...
    breeze.linalg.DenseVector<Double> vector = ...
    matrix.$colon$times( ... one might need an operator instance ... breeze.linalg.operators.OpMulMatrix
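One common workaround is to avoid the mangled operator names entirely and put the Breeze arithmetic behind a small Scala shim that Java can call; a sketch, with the object and method names purely illustrative:

    import breeze.linalg.{DenseMatrix, DenseVector}

    // Scala shim: the compiler resolves the implicit OpMulMatrix instance
    // here, so Java callers never touch $times / $colon$times directly.
    object BreezeOps {
      def mulMatVec(m: DenseMatrix[Double], v: DenseVector[Double]): DenseVector[Double] =
        m * v
    }

From Java this is then just BreezeOps.mulMatVec(matrix, vector).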

Spark MLlib / K-Means intuition

Submitted by 此生再无相见时 on 2019-12-22 08:42:31
Question: I'm very new to machine learning algorithms and Spark. I'm following the Twitter Streaming Language Classifier found here: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/README.html Specifically this code: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/scala/src/main/scala/com/databricks/apps/twitter_classifier/ExamineAndTrain.scala Except I'm trying to run it in batch mode on some tweets it pulls
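For orientation, the batch flow in that reference app boils down to hashing each tweet's tokens into a term-frequency vector and clustering the vectors; a condensed Scala sketch, with texts standing in for an assumed RDD[String] of tweet bodies:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.feature.HashingTF

    // Featurize: hash each tweet's words into a fixed-size TF vector.
    val tf = new HashingTF(1000)
    val vectors = texts.map(t => tf.transform(t.split(" ").toSeq)).cache()

    // Cluster the TF vectors: k = 10 clusters, 20 iterations.
    val model = KMeans.train(vectors, 10, 20)

    // Assign each tweet to its nearest cluster centre.
    val assignments = texts.map(t => (model.predict(tf.transform(t.split(" ").toSeq)), t))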

Convert Sparse Vector to Dense Vector in Pyspark

Submitted by 我的梦境 on 2019-12-22 08:10:19
Question: I have a sparse vector like this:

    >>> countVectors.rdd.map(lambda vector: vector[1]).collect()
    [SparseVector(13, {0: 1.0, 2: 1.0, 3: 1.0, 6: 1.0, 8: 1.0, 9: 1.0, 10: 1.0, 12: 1.0}), SparseVector(13, {0: 1.0, 1: 1.0, 2: 1.0, 4: 1.0}), SparseVector(13, {0: 1.0, 1: 1.0, 3: 1.0, 4: 1.0, 7: 1.0}), SparseVector(13, {1: 1.0, 2: 1.0, 5: 1.0, 11: 1.0})]

I am trying to convert this into a dense vector in PySpark 2.0.0, like this:

    >>> frequencyVectors = countVectors.rdd.map(lambda vector: vector[1])
    >>>
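The core of the answer is that every MLlib vector can be densified directly; a Scala sketch of the same conversion, on hypothetical input mirroring the question (in PySpark the usual route is building a DenseVector from v.toArray()):

    import org.apache.spark.mllib.linalg.Vectors

    // Hypothetical RDD of sparse vectors, as in the question.
    val frequencyVectors = sc.parallelize(Seq(
      Vectors.sparse(13, Seq((0, 1.0), (2, 1.0), (3, 1.0), (6, 1.0))),
      Vectors.sparse(13, Seq((0, 1.0), (1, 1.0), (2, 1.0), (4, 1.0)))
    ))

    // Vector.toDense performs the whole conversion.
    val dense = frequencyVectors.map(_.toDense)
    dense.collect().foreach(println)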

How can I train a random forest with a sparse matrix in Spark?

Submitted by 假如想象 on 2019-12-22 07:45:06
Question: Consider this simple example that uses sparklyr:

    library(sparklyr)
    library(janeaustenr) # to get some text data
    library(stringr)
    library(dplyr)

    mytext <- austen_books() %>%
      mutate(label = as.integer(str_detect(text, 'great'))) # create a fake label variable

    mytext_spark <- copy_to(sc, mytext, name = 'mytext_spark', overwrite = TRUE)

    # Source: table<mytext_spark> [?? x 3]
    # Database: spark_connection
      text                  book                label
      <chr>                 <chr>               <int>
    1 SENSE AND SENSIBILITY Sense & Sensibility     0
    2 ""                    Sense &
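Underneath sparklyr this runs through Spark ML, whose tree estimators accept sparse feature vectors natively; a Scala sketch of the equivalent pipeline, with mytext standing in for an assumed DataFrame holding text and label columns:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // HashingTF emits *sparse* vectors, and RandomForestClassifier consumes
    // them as-is, so no densification step is needed.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val tf = new HashingTF().setInputCol("words").setOutputCol("features")
    val rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")

    val model = new Pipeline()
      .setStages(Array(tokenizer, tf, rf))
      .fit(mytext)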

Printing ClusterID and its elements using Spark KMeans algo.

Submitted by ♀尐吖头ヾ on 2019-12-21 20:24:56
Question: I have this program, which prints the MSSE of the KMeans algorithm on apache-spark. Twenty clusters are generated. I am trying to print the cluster ID and the elements assigned to each cluster ID. How do I loop over the cluster IDs to print the elements? Thank you guys!!

    val sc = new SparkContext("local", "KMeansExample", "/usr/local/spark/",
      List("target/scala-2.10/kmeans_2.10-1.0.jar"))

    // Load and parse the data
    val data = sc.textFile("kmeans.csv")
    val parsedData = data.map( s =>
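One way to list the members of each cluster is to pair every point with its predicted cluster ID and group; a sketch, assuming clusters is the KMeansModel returned by KMeans.train on parsedData:

    // Pair each point with its cluster id, then group the points by id.
    val byCluster = parsedData
      .map(point => (clusters.predict(point), point))
      .groupByKey()

    byCluster.collect().foreach { case (clusterId, points) =>
      println(s"Cluster $clusterId (${points.size} points):")
      points.foreach(println)
    }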

Naive Bayes multinomial text classifier using DataFrame in Scala Spark

Submitted by 余生颓废 on 2019-12-21 20:22:23
Question: I am trying to build a NaiveBayes classifier, loading the data from a database as a DataFrame containing (label, text). Here is a sample of the data (multinomial labels):

    +-----+--------------------+
    |label|             feature|
    +-----+--------------------+
    |    1|combusting prepar...|
    |    1|adhesives for ind...|
    |    1|                    |
    |    1| salt for preserving|
    |    1|auxiliary fluids ...|

I have used the following transformations for tokenization, stop-word removal, n-grams, and hashTF:

    val selectedData = df.select("label", "feature")
    // Tokenize RDD
    val tokenizer = new
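A compact way to express that chain is a single ML Pipeline ending in the multinomial NaiveBayes estimator; a Scala sketch, assuming the selectedData DataFrame above (rows with empty feature strings may need filtering first):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.NaiveBayes
    import org.apache.spark.ml.feature.{HashingTF, NGram, StopWordsRemover, Tokenizer}

    // tokenize -> drop stop words -> bigrams -> hashed TF -> NaiveBayes
    val tokenizer = new Tokenizer().setInputCol("feature").setOutputCol("words")
    val remover   = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
    val ngram     = new NGram().setN(2).setInputCol("filtered").setOutputCol("ngrams")
    val hashingTF = new HashingTF().setInputCol("ngrams").setOutputCol("features")
    val nb        = new NaiveBayes().setModelType("multinomial")

    val model = new Pipeline()
      .setStages(Array(tokenizer, remover, ngram, hashingTF, nb))
      .fit(selectedData)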

How to pass params to an ML Pipeline.fit method?

Submitted by 帅比萌擦擦* on 2019-12-21 20:20:07
Question: I am trying to build a clustering mechanism using Google Dataproc + Spark and Google BigQuery, creating a job with the Spark ML KMeans + Pipeline, as follows:

Create a user-level feature table in BigQuery. Example of how the feature table looks:

    userid |x1   |x2 |x3 |x4 |x5 |x6 |x7 |x8   |x9   |x10
    00013  |0.01 | 0 |0  |0  |0  |0  |0  |0.06 |0.09 | 0.001

Spin up a cluster with default settings; I am using the gcloud command-line interface to create the cluster and run jobs as shown here. Using the starter code provided, I
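On the title question: fit() accepts an optional ParamMap whose entries override the stages' current settings at fit time. A minimal Scala sketch over an assumed assembler + KMeans pipeline on the feature table above (featureTable is a hypothetical DataFrame):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.ml.param.ParamMap

    // Assemble x1..x10 into a single feature vector, then cluster.
    val assembler = new VectorAssembler()
      .setInputCols((1 to 10).map(i => s"x$i").toArray)
      .setOutputCol("features")
    val kmeans = new KMeans().setFeaturesCol("features")
    val pipeline = new Pipeline().setStages(Array(assembler, kmeans))

    // The ParamMap passed to fit() overrides the stages' current values.
    val params = ParamMap(kmeans.k -> 5, kmeans.maxIter -> 40)
    val model = pipeline.fit(featureTable, params)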

Matrix Operation in Spark MLlib in Java

Submitted by ≯℡__Kan透↙ on 2019-12-21 19:47:52
Question: This question is about MLlib (Spark 1.2.1+). What is the best way to manipulate local matrices (of moderate size, under 100x100, so they do not need to be distributed)? For instance, after computing the SVD of a dataset, I need to perform some matrix operations. RowMatrix only provides a multiply function. The toBreeze method returns a DenseMatrix<Object>, but the API does not seem Java-friendly:

    public final <TT,B,That> That $plus(B b, UFunc.UImpl2<OpAdd$,TT,B,That> op)

In Spark + Java, how to do
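A practical pattern is to convert the MLlib local matrix to Breeze inside a tiny Scala helper, do the arithmetic there, and convert back, so Java never sees the mangled signatures. A sketch with illustrative helper names:

    import breeze.linalg.{DenseMatrix => BDM}
    import org.apache.spark.mllib.linalg.{Matrices, Matrix}

    object LocalMatrixOps {
      // MLlib local matrices store values column-major, as Breeze expects.
      private def toBreeze(m: Matrix): BDM[Double] =
        new BDM(m.numRows, m.numCols, m.toArray)

      // Results of Breeze arithmetic are freshly allocated and compact,
      // so their backing array can be handed straight back to MLlib.
      private def fromBreeze(m: BDM[Double]): Matrix =
        Matrices.dense(m.rows, m.cols, m.data)

      def add(a: Matrix, b: Matrix): Matrix =
        fromBreeze(toBreeze(a) + toBreeze(b))

      def multiply(a: Matrix, b: Matrix): Matrix =
        fromBreeze(toBreeze(a) * toBreeze(b))
    }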