apache-spark-mllib

Spark MLlib / K-Means intuition

寵の児 submitted on 2019-12-05 17:50:26
I'm very new to machine learning algorithms and Spark. I'm following the Twitter Streaming Language Classifier found here: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/README.html Specifically this code: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/scala/src/main/scala/com/databricks/apps/twitter_classifier/ExamineAndTrain.scala Except I'm trying to run it in batch mode on some tweets it pulls out of Cassandra, in this case 200 total tweets. As the example shows, I am using this object for…
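
For orientation, here is a minimal sketch of the clustering step the reference app performs, written against the Scala MLlib API and assuming the tweets have already been pulled out of Cassandra into an RDD[String] called tweetTexts (the variable name, feature size, k and iteration count are illustrative, not taken from the original code):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.feature.HashingTF

// hash each tweet's tokens into a fixed-size term-frequency vector
val tf = new HashingTF(1000)
val vectors = tweetTexts.map(text => tf.transform(text.toLowerCase.split(" ").toSeq)).cache()

// cluster the vectors; each tweet can then be assigned to its nearest center
val model = KMeans.train(vectors, 10, 20)
val firstCluster = model.predict(vectors.first())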

Computing Pointwise Mutual Information in Spark

本秂侑毒 submitted on 2019-12-05 17:02:41
I'm trying to compute pointwise mutual information (PMI). I have two RDDs as defined here for p(x, y) and p(x) respectively:
pii: RDD[((String, String), Double)]
pi: RDD[(String, Double)]
Any code I'm writing to compute PMI from the RDDs pii and pi is not pretty. My approach is first to flatten the RDD pii and join with pi twice while massaging the tuple elements:
val pmi = pii.map(x => (x._1._1, (x._1._2, x._1, x._2)))
  .join(pi).values
  .map(x => (x._1._1, (x._1._2, x._1._3, x._2)))
  .join(pi).values
  .map(x => (x._1._1, computePMI(x._1._2, x._1._3, x._2)))
// pmi: org.apache.spark.rdd.RDD[(…
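
The same double join can be written a little more readably with pattern matches that name which probability each step contributes. A sketch, assuming computePMI is log(p(x, y) / (p(x) * p(y))), since the question does not show its definition:

import org.apache.spark.rdd.RDD

def computePMI(pxy: Double, px: Double, py: Double): Double =
  math.log(pxy / (px * py))

// key p(x, y) by x to pick up p(x), then re-key by y to pick up p(y)
val pmi: RDD[((String, String), Double)] =
  pii.map { case ((x, y), pxy) => (x, (y, pxy)) }
    .join(pi)
    .map { case (x, ((y, pxy), px)) => (y, (x, pxy, px)) }
    .join(pi)
    .map { case (y, ((x, pxy, px), py)) => ((x, y), computePMI(pxy, px, py)) }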

From DataFrame to RDD[LabeledPoint]

拈花ヽ惹草 submitted on 2019-12-05 14:37:19
Question: I am trying to implement a document classifier using Apache Spark MLlib and I am having some problems representing the data. My code is the following:
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.ml.feature.IDF
val sql = new SQLContext(sc)
// Load raw data from a TSV file
val raw = sc.textFile("data.tsv").map…
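
The code is cut off, but the usual last step of this pipeline is mapping the transformed DataFrame's rows back into LabeledPoints. A hedged sketch, assuming a DataFrame named featurized with a numeric "label" column and a "features" vector column produced by IDF (on Spark 1.x, where the ml feature transformers still emit mllib vectors):

import org.apache.spark.sql.Row
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint

// rebuild a LabeledPoint from the two columns of each Row
val labeledPoints = featurized.select("label", "features").rdd.map {
  case Row(label: Double, features: Vector) => LabeledPoint(label, features)
}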

Convert Sparse Vector to Dense Vector in Pyspark

只愿长相守 submitted on 2019-12-05 12:59:32
I have sparse vectors like this:
>>> countVectors.rdd.map(lambda vector: vector[1]).collect()
[SparseVector(13, {0: 1.0, 2: 1.0, 3: 1.0, 6: 1.0, 8: 1.0, 9: 1.0, 10: 1.0, 12: 1.0}), SparseVector(13, {0: 1.0, 1: 1.0, 2: 1.0, 4: 1.0}), SparseVector(13, {0: 1.0, 1: 1.0, 3: 1.0, 4: 1.0, 7: 1.0}), SparseVector(13, {1: 1.0, 2: 1.0, 5: 1.0, 11: 1.0})]
I am trying to convert these into dense vectors in PySpark 2.0.0 like this:
>>> frequencyVectors = countVectors.rdd.map(lambda vector: vector[1])
>>> frequencyVectors.map(lambda vector: Vectors.dense(vector)).collect()
I am getting an error like this: 16/12…
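
The question targets PySpark 2.0.0, but the underlying idea is the same in either API: materialise the sparse vector with toArray and wrap the result as a dense vector (in PySpark, DenseVector(v.toArray()) plays the same role). A minimal Scala sketch with an illustrative 13-dimensional vector:

import org.apache.spark.mllib.linalg.Vectors

// a sparse vector like the ones shown above (indices and values are made up)
val sv = Vectors.sparse(13, Seq((0, 1.0), (2, 1.0), (3, 1.0), (6, 1.0)))

// toArray fills in the zeros; Vectors.dense wraps the resulting array
val dv = Vectors.dense(sv.toArray)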

extracting numpy array from Pyspark Dataframe

孤者浪人 submitted on 2019-12-05 11:35:13
Question: I have a dataframe gi_man_df where group can be n:
+------------------+-----------------+--------+--------------+
|             group|           number|rand_int|   rand_double|
+------------------+-----------------+--------+--------------+
|          'GI_MAN'|                7|       3|         124.2|
|          'GI_MAN'|                7|      10|        121.15|
|          'GI_MAN'|                7|      11|         129.0|
|          'GI_MAN'|                7|      12|         125.0|
|          'GI_MAN'|                7|      13|         125.0|
|          'GI_MAN'|                7|      21|         127.0|
|          'GI_MAN'|                7|      22|         126.0|
+------------------+-----------------+--------+--------------+
and I am expecting a numpy nd…

How can I train a random forest with a sparse matrix in Spark?

此生再无相见时 submitted on 2019-12-05 08:32:39
Consider this simple example that uses sparklyr:
library(sparklyr)
library(janeaustenr) # to get some text data
library(stringr)
library(dplyr)
mytext <- austen_books() %>%
  mutate(label = as.integer(str_detect(text, 'great'))) # create a fake label variable
mytext_spark <- copy_to(sc, mytext, name = 'mytext_spark', overwrite = TRUE)
# Source:   table<mytext_spark> [?? x 3]
# Database: spark_connection
  text                  book                label
  <chr>                 <chr>               <int>
1 SENSE AND SENSIBILITY Sense & Sensibility     0
2 ""                    Sense & Sensibility     0
3 by Jane Austen        Sense & Sensibility     0
4 ""                    Sense & Sensibility     0
5 (1811)                Sense &…
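
The example above is sparklyr, so Scala is not a drop-in answer for it, but for what it is worth the MLlib API that sparklyr ultimately calls accepts sparse feature vectors directly: a "sparse matrix" is simply an RDD of LabeledPoints whose features are sparse vectors. A sketch of that lower-level path (data and parameters are illustrative):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest

// labelled rows whose features are sparse term-count vectors
val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.sparse(5, Seq((1, 1.0), (3, 2.0)))),
  LabeledPoint(1.0, Vectors.sparse(5, Seq((0, 1.0), (4, 1.0))))
))

val model = RandomForest.trainClassifier(
  data,
  2,                // numClasses
  Map[Int, Int](),  // categoricalFeaturesInfo (none: all features continuous)
  10,               // numTrees
  "auto",           // featureSubsetStrategy
  "gini",           // impurity
  5,                // maxDepth
  32)               // maxBins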

Apache Spark MLLib - Running KMeans with IDF-TF vectors - Java heap space

随声附和 submitted on 2019-12-05 03:24:14
Question: I'm trying to run KMeans from MLlib on a (large) collection of text documents (TF-IDF vectors). The documents are sent through a Lucene English analyzer, and sparse vectors are created by the HashingTF.transform() function. Whatever degree of parallelism I'm using (through the coalesce function), KMeans.train always returns the OutOfMemory exception below. Any thoughts on how to tackle this issue?
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space at scala.reflect…
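
The trace is cut off, but one detail that often explains this error: MLlib's k-means keeps its cluster centers as dense vectors, so memory use grows with numFeatures * k no matter how sparse the TF-IDF input is. A hedged mitigation sketch that simply hashes into a smaller feature space, assuming documents is an RDD[Seq[String]] of analyzed tokens (the dimension, k and iteration count are illustrative; mllib's HashingTF defaults to 2^20 features):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.feature.{HashingTF, IDF}

// hash into a smaller space so the k dense cluster centers stay small
val tf = new HashingTF(1 << 16)   // 65536 features instead of the 2^20 default
val termFreqs = documents.map(tokens => tf.transform(tokens)).cache()

val tfidf = new IDF().fit(termFreqs).transform(termFreqs)
val model = KMeans.train(tfidf, 20, 10)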

How to use spark Naive Bayes classifier for text classification with IDF?

喜夏-厌秋 submitted on 2019-12-04 23:48:29
Question: I want to convert text documents into feature vectors using tf-idf, and then train a Naive Bayes algorithm to classify them. I can easily load my text files without the labels and use HashingTF() to convert them into vectors, and then use IDF() to weight the words according to how important they are. But if I do that I get rid of the labels, and it seems to be impossible to recombine the label with the vector even though the order is the same. On the other hand, I can call HashingTF() on each…
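
The usual way around the "labels get separated from the vectors" problem is to keep labels and documents in the same RDD the whole way through, and only zip the labels back onto the features after the IDF transform (both sides are element-wise maps of the same RDD, so zip keeps them aligned). A Scala sketch, assuming docs is an RDD[(Double, Seq[String])] of (label, tokens):

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.regression.LabeledPoint

val tf = new HashingTF()
val termFreqs = docs.map { case (_, tokens) => tf.transform(tokens) }.cache()

// IDF is fitted on the whole corpus, then applied per document
val tfidf = new IDF().fit(termFreqs).transform(termFreqs)

// both sides derive from docs by one-to-one maps, so label and vector stay paired
val training = docs.map { case (label, _) => label }
  .zip(tfidf)
  .map { case (label, features) => LabeledPoint(label, features) }

val model = NaiveBayes.train(training)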

How to overwrite Spark ML model in PySpark?

大憨熊 submitted on 2019-12-04 23:13:44
Question:
from pyspark.ml.regression import RandomForestRegressor
rf = RandomForestRegressor(labelCol="label", featuresCol="features", numTrees=5, maxDepth=10, seed=42)
rf_model = rf.fit(train_df)
rf_model_path = "./hdfsData/" + "rfr_model"
rf_model.save(rf_model_path)
When I first tried to save the model, these lines worked. But when I wanted to save the model into the same path again, it gave this error:
Py4JJavaError: An error occurred while calling o1695.save.
: java.io.IOException: Path ./hdfsData…
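
The IOException on the second save goes away if the save goes through the model's writer with overwrite enabled; in PySpark that is rf_model.write().overwrite().save(rf_model_path). A Scala sketch of the same pattern (the model value here is just a stand-in for the result of fit):

import org.apache.spark.ml.regression.RandomForestRegressionModel

// MLWritable models expose a writer; overwrite() permits replacing an existing path
val rfModel: RandomForestRegressionModel = ???   // e.g. the model returned by rf.fit(...)
rfModel.write.overwrite().save("./hdfsData/rfr_model")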

Spark MLlib: building classifiers for each data group

回眸只為那壹抹淺笑 submitted on 2019-12-04 22:06:59
Question: I have labeled vectors (LabeledPoints) tagged with a group number. For every group I need to create a separate Logistic Regression classifier:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.{Vector, Vectors}
object Scratch {
  val train = Seq(
    (1, LabeledPoint(0, Vectors.sparse(3, Seq((0, 1…
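
The snippet is cut off, but the usual approach for "one classifier per group" with the RDD API is to collect the distinct group keys and train on a filtered RDD per key (a model cannot be trained from inside another RDD's transformation). A sketch, assuming the Seq above has been parallelized into trainRdd: RDD[(Int, LabeledPoint)]:

import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}

val groups = trainRdd.keys.distinct().collect()

// one logistic regression model per group key
val modelsByGroup: Map[Int, LogisticRegressionModel] = groups.map { g =>
  val groupData = trainRdd.filter { case (group, _) => group == g }.values.cache()
  g -> new LogisticRegressionWithLBFGS().setNumClasses(2).run(groupData)
}.toMap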