apache-spark-mllib

Spark MLlib / K-Means intuition

寵の児 submitted on 2019-12-05 17:50:26
I'm very new to machine learning algorithms and Spark. I'm following the Twitter Streaming Language Classifier found here: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/README.html Specifically this code: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/scala/src/main/scala/com/databricks/apps/twitter_classifier/ExamineAndTrain.scala Except I'm trying to run it in batch mode on some tweets it pulls out of Cassandra, in this case 200 total tweets. As the example shows, I am using this object for…
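
For orientation, here is a minimal sketch of the clustering step the reference app performs, written against the Scala MLlib API and assuming the tweets have already been pulled out of Cassandra into an RDD[String] called tweetTexts (the variable name, feature size, k and iteration count are illustrative, not taken from the original code):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.feature.HashingTF

// hash each tweet's tokens into a fixed-size term-frequency vector
val tf = new HashingTF(1000)
val vectors = tweetTexts.map(text => tf.transform(text.toLowerCase.split(" ").toSeq)).cache()

// cluster the vectors; each tweet can then be assigned to its nearest center
val model = KMeans.train(vectors, 10, 20)
val firstCluster = model.predict(vectors.first())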

Computing Pointwise Mutual Information in Spark

本秂侑毒 submitted on 2019-12-05 17:02:41
I'm trying to compute pointwise mutual information (PMI). I have two RDDs as defined here for p(x, y) and p(x) respectively:
pii: RDD[((String, String), Double)]
pi: RDD[(String, Double)]
Any code I'm writing to compute PMI from the RDDs pii and pi is not pretty. My approach is first to flatten the RDD pii and join with pi twice while massaging the tuple elements:
val pmi = pii.map(x => (x._1._1, (x._1._2, x._1, x._2)))
  .join(pi).values
  .map(x => (x._1._1, (x._1._2, x._1._3, x._2)))
  .join(pi).values
  .map(x => (x._1._1, computePMI(x._1._2, x._1._3, x._2)))
// pmi: org.apache.spark.rdd.RDD[(…
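
The same double join can be written a little more readably with pattern matches that name which probability each step contributes. A sketch, assuming computePMI is log(p(x, y) / (p(x) * p(y))), since the question does not show its definition:

import org.apache.spark.rdd.RDD

def computePMI(pxy: Double, px: Double, py: Double): Double =
  math.log(pxy / (px * py))

// key p(x, y) by x to pick up p(x), then re-key by y to pick up p(y)
val pmi: RDD[((String, String), Double)] =
  pii.map { case ((x, y), pxy) => (x, (y, pxy)) }
    .join(pi)
    .map { case (x, ((y, pxy), px)) => (y, (x, pxy, px)) }
    .join(pi)
    .map { case (y, ((x, pxy, px), py)) => ((x, y), computePMI(pxy, px, py)) }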

From DataFrame to RDD[LabeledPoint]

拈花ヽ惹草 submitted on 2019-12-05 14:37:19
Question: I am trying to implement a document classifier using Apache Spark MLlib and I am having some problems representing the data. My code is the following:
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.ml.feature.IDF
val sql = new SQLContext(sc)
// Load raw data from a TSV file
val raw = sc.textFile("data.tsv").map…
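
The code is cut off, but the usual last step of this pipeline is mapping the transformed DataFrame's rows back into LabeledPoints. A hedged sketch, assuming a DataFrame named featurized with a numeric "label" column and a "features" vector column produced by IDF (on Spark 1.x, where the ml feature transformers still emit mllib vectors):

import org.apache.spark.sql.Row
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint

// rebuild a LabeledPoint from the two columns of each Row
val labeledPoints = featurized.select("label", "features").rdd.map {
  case Row(label: Double, features: Vector) => LabeledPoint(label, features)
}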

Convert Sparse Vector to Dense Vector in Pyspark

只愿长相守 submitted on 2019-12-05 12:59:32
I have sparse vectors like this:
>>> countVectors.rdd.map(lambda vector: vector[1]).collect()
[SparseVector(13, {0: 1.0, 2: 1.0, 3: 1.0, 6: 1.0, 8: 1.0, 9: 1.0, 10: 1.0, 12: 1.0}), SparseVector(13, {0: 1.0, 1: 1.0, 2: 1.0, 4: 1.0}), SparseVector(13, {0: 1.0, 1: 1.0, 3: 1.0, 4: 1.0, 7: 1.0}), SparseVector(13, {1: 1.0, 2: 1.0, 5: 1.0, 11: 1.0})]
I am trying to convert these into dense vectors in PySpark 2.0.0 like this:
>>> frequencyVectors = countVectors.rdd.map(lambda vector: vector[1])
>>> frequencyVectors.map(lambda vector: Vectors.dense(vector)).collect()
I am getting an error like this: 16/12…
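
The question targets PySpark 2.0.0, but the underlying idea is the same in either API: materialise the sparse vector with toArray and wrap the result as a dense vector (in PySpark, DenseVector(v.toArray()) plays the same role). A minimal Scala sketch with an illustrative 13-dimensional vector:

import org.apache.spark.mllib.linalg.Vectors

// a sparse vector like the ones shown above (indices and values are made up)
val sv = Vectors.sparse(13, Seq((0, 1.0), (2, 1.0), (3, 1.0), (6, 1.0)))

// toArray fills in the zeros; Vectors.dense wraps the resulting array
val dv = Vectors.dense(sv.toArray)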

extracting numpy array from Pyspark Dataframe

孤者浪人 submitted on 2019-12-05 11:35:13
Question: I have a dataframe gi_man_df where group can be n:
+------------------+-----------------+--------+--------------+
|             group|           number|rand_int|   rand_double|
+------------------+-----------------+--------+--------------+
|          'GI_MAN'|                7|       3|         124.2|
|          'GI_MAN'|                7|      10|        121.15|
|          'GI_MAN'|                7|      11|         129.0|
|          'GI_MAN'|                7|      12|         125.0|
|          'GI_MAN'|                7|      13|         125.0|
|          'GI_MAN'|                7|      21|         127.0|
|          'GI_MAN'|                7|      22|         126.0|
+------------------+-----------------+--------+--------------+
and I am expecting a numpy nd…

How can I train a random forest with a sparse matrix in Spark?

此生再无相见时 submitted on 2019-12-05 08:32:39
Consider this simple example that uses sparklyr:
library(sparklyr)
library(janeaustenr) # to get some text data
library(stringr)
library(dplyr)
mytext <- austen_books() %>%
  mutate(label = as.integer(str_detect(text, 'great'))) # create a fake label variable
mytext_spark <- copy_to(sc, mytext, name = 'mytext_spark', overwrite = TRUE)
# Source:   table<mytext_spark> [?? x 3]
# Database: spark_connection
  text                  book                label
  <chr>                 <chr>               <int>
1 SENSE AND SENSIBILITY Sense & Sensibility     0
2 ""                    Sense & Sensibility     0
3 by Jane Austen        Sense & Sensibility     0
4 ""                    Sense & Sensibility     0
5 (1811)                Sense &…
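
The example above is sparklyr, so Scala is not a drop-in answer for it, but for what it is worth the MLlib API that sparklyr ultimately calls accepts sparse feature vectors directly: a "sparse matrix" is simply an RDD of LabeledPoints whose features are sparse vectors. A sketch of that lower-level path (data and parameters are illustrative):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest

// labelled rows whose features are sparse term-count vectors
val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.sparse(5, Seq((1, 1.0), (3, 2.0)))),
  LabeledPoint(1.0, Vectors.sparse(5, Seq((0, 1.0), (4, 1.0))))
))

val model = RandomForest.trainClassifier(
  data,
  2,                // numClasses
  Map[Int, Int](),  // categoricalFeaturesInfo (none: all features continuous)
  10,               // numTrees
  "auto",           // featureSubsetStrategy
  "gini",           // impurity
  5,                // maxDepth
  32)               // maxBins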

Apache Spark MLLib - Running KMeans with IDF-TF vectors - Java heap space

随声附和 submitted on 2019-12-05 03:24:14
Question: I'm trying to run KMeans from MLlib on a (large) collection of text documents (TF-IDF vectors). The documents are sent through a Lucene English analyzer, and sparse vectors are created by the HashingTF.transform() function. Whatever degree of parallelism I'm using (through the coalesce function), KMeans.train always returns the OutOfMemory exception below. Any thoughts on how to tackle this issue?
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space at scala.reflect…
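
The trace is cut off, but one detail that often explains this error: MLlib's k-means keeps its cluster centers as dense vectors, so memory use grows with numFeatures * k no matter how sparse the TF-IDF input is. A hedged mitigation sketch that simply hashes into a smaller feature space, assuming documents is an RDD[Seq[String]] of analyzed tokens (the dimension, k and iteration count are illustrative; mllib's HashingTF defaults to 2^20 features):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.feature.{HashingTF, IDF}

// hash into a smaller space so the k dense cluster centers stay small
val tf = new HashingTF(1 << 16)   // 65536 features instead of the 2^20 default
val termFreqs = documents.map(tokens => tf.transform(tokens)).cache()

val tfidf = new IDF().fit(termFreqs).transform(termFreqs)
val model = KMeans.train(tfidf, 20, 10)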

How to use spark Naive Bayes classifier for text classification with IDF?

喜夏-厌秋 submitted on 2019-12-04 23:48:29
Question: I want to convert text documents into feature vectors using tf-idf, and then train a Naive Bayes algorithm to classify them. I can easily load my text files without the labels and use HashingTF() to convert them into vectors, and then use IDF() to weight the words according to how important they are. But if I do that I get rid of the labels, and it seems to be impossible to recombine the label with the vector even though the order is the same. On the other hand, I can call HashingTF() on each…
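
The usual way around the "labels get separated from the vectors" problem is to keep labels and documents in the same RDD the whole way through, and only zip the labels back onto the features after the IDF transform (both sides are element-wise maps of the same RDD, so zip keeps them aligned). A Scala sketch, assuming docs is an RDD[(Double, Seq[String])] of (label, tokens):

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.regression.LabeledPoint

val tf = new HashingTF()
val termFreqs = docs.map { case (_, tokens) => tf.transform(tokens) }.cache()

// IDF is fitted on the whole corpus, then applied per document
val tfidf = new IDF().fit(termFreqs).transform(termFreqs)

// both sides derive from docs by one-to-one maps, so label and vector stay paired
val training = docs.map { case (label, _) => label }
  .zip(tfidf)
  .map { case (label, features) => LabeledPoint(label, features) }

val model = NaiveBayes.train(training)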

How to overwrite Spark ML model in PySpark?

大憨熊 submitted on 2019-12-04 23:13:44
Question:
from pyspark.ml.regression import RandomForestRegressor
rf = RandomForestRegressor(labelCol="label", featuresCol="features", numTrees=5, maxDepth=10, seed=42)
rf_model = rf.fit(train_df)
rf_model_path = "./hdfsData/" + "rfr_model"
rf_model.save(rf_model_path)
When I first tried to save the model, these lines worked. But when I wanted to save the model into the same path again, it gave this error:
Py4JJavaError: An error occurred while calling o1695.save.
: java.io.IOException: Path ./hdfsData…
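
The IOException on the second save goes away if the save goes through the model's writer with overwrite enabled; in PySpark that is rf_model.write().overwrite().save(rf_model_path). A Scala sketch of the same pattern (the model value here is just a stand-in for the result of fit):

import org.apache.spark.ml.regression.RandomForestRegressionModel

// MLWritable models expose a writer; overwrite() permits replacing an existing path
val rfModel: RandomForestRegressionModel = ???   // e.g. the model returned by rf.fit(...)
rfModel.write.overwrite().save("./hdfsData/rfr_model")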

Spark MLlib: building classifiers for each data group

回眸只為那壹抹淺笑 submitted on 2019-12-04 22:06:59
Question: I have labeled vectors (LabeledPoints) tagged with a group number. For every group I need to create a separate Logistic Regression classifier:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.{Vector, Vectors}
object Scratch {
  val train = Seq(
    (1, LabeledPoint(0, Vectors.sparse(3, Seq((0, 1…
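
The snippet is cut off, but the usual approach for "one classifier per group" with the RDD API is to collect the distinct group keys and train on a filtered RDD per key (a model cannot be trained from inside another RDD's transformation). A sketch, assuming the Seq above has been parallelized into trainRdd: RDD[(Int, LabeledPoint)]:

import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}

val groups = trainRdd.keys.distinct().collect()

// one logistic regression model per group key
val modelsByGroup: Map[Int, LogisticRegressionModel] = groups.map { g =>
  val groupData = trainRdd.filter { case (group, _) => group == g }.values.cache()
  g -> new LogisticRegressionWithLBFGS().setNumClasses(2).run(groupData)
}.toMap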