apache-spark-mllib

using Word2VecModel.transform() does not work in map function

落爺英雄遲暮 submitted on 2019-11-28 00:19:11
I have built a Word2Vec model using Spark and saved it as a model. Now I want to use it in other code as an offline model. I have loaded the model and used it to get the vector of a word (e.g. Hello), and it works well. But I need to call it for many words in an RDD using map. When I call model.transform() inside a map function, it throws this error:

Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver,
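A common workaround is to avoid calling the model on executors at all: extract the word-to-vector table on the driver and broadcast it. The sketch below is a minimal illustration, assuming the pyspark.mllib Word2Vec API; the model path and word list are placeholders, not from the original question.

from pyspark import SparkContext
from pyspark.mllib.feature import Word2VecModel

sc = SparkContext(appName="word2vec-lookup")

# Load the saved model on the driver only (path is hypothetical).
model = Word2VecModel.load(sc, "/path/to/word2vec_model")

# getVectors() exposes the word -> vector table; copy it into a plain
# Python dict and broadcast it so the map function never touches the model
# (or the SparkContext it holds).
local_vectors = {w: list(v) for w, v in model.getVectors().items()}
bc_vectors = sc.broadcast(local_vectors)

words = sc.parallelize(["hello", "world", "spark"])
word_vecs = words.map(lambda w: (w, bc_vectors.value.get(w)))
print(word_vecs.collect())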

How to use mllib.recommendation if the user ids are strings instead of contiguous integers?

筅森魡賤 submitted on 2019-11-27 21:46:05
Question: I want to use Spark's mllib.recommendation library to build a prototype recommender system. However, the user data I have is in something like the following format:

AB123XY45678
CD234WZ12345
EF345OOO1234
GH456XY98765
....

If I want to use the mllib.recommendation library, then according to the API of the Rating class the user ids have to be integers (and also have to be contiguous?). It looks like some kind of conversion between the real user ids and the numeric ones used by Spark must be
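One common approach, sketched below in PySpark with made-up sample data, is to build the integer ids inside Spark with zipWithIndex and join them back onto the ratings before calling ALS; the id tables can be kept around to translate recommendations back to the original strings.

from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext(appName="string-ids-als")

# (user_string, product_string, rating) triples -- hypothetical sample data.
raw = sc.parallelize([
    ("AB123XY45678", "SKU-1", 4.0),
    ("CD234WZ12345", "SKU-2", 3.0),
    ("EF345OOO1234", "SKU-1", 5.0),
])

# Contiguous integer ids for users and products, built with zipWithIndex.
user_ids = raw.map(lambda r: r[0]).distinct().zipWithIndex()   # (user_str, uid)
prod_ids = raw.map(lambda r: r[1]).distinct().zipWithIndex()   # (prod_str, pid)

ratings = (raw.map(lambda r: (r[0], (r[1], r[2])))
              .join(user_ids)                                          # (user_str, ((prod, rating), uid))
              .map(lambda kv: (kv[1][0][0], (kv[1][1], kv[1][0][1])))  # (prod_str, (uid, rating))
              .join(prod_ids)                                          # (prod_str, ((uid, rating), pid))
              .map(lambda kv: Rating(kv[1][0][0], kv[1][1], kv[1][0][1])))

model = ALS.train(ratings, rank=10, iterations=5)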

How to extract model hyper-parameters from spark.ml in PySpark?

别来无恙 submitted on 2019-11-27 20:32:52
Question: I'm tinkering with some cross-validation code from the PySpark documentation, and trying to get PySpark to tell me what model was selected:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0),
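For reference, a self-contained sketch of one way to inspect the winning model, assuming a reasonably recent PySpark (2.x or later) where data frames come from a SparkSession and fitted models report their resolved params; the tiny dataset and grid here are placeholders, not the question's code.

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

spark = SparkSession.builder.getOrCreate()

dataset = spark.createDataFrame(
    [(Vectors.dense([0.0]), 0.0), (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0), (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 10,
    ["features", "label"])

lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1, 1.0]).build()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(), numFolds=3)
cv_model = cv.fit(dataset)

# The winning model's resolved parameters (older versions may only show
# defaults here, which is the original complaint):
for p, v in cv_model.bestModel.extractParamMap().items():
    print(p.name, "=", v)

# avgMetrics is ordered like the param grid, so every candidate's score
# is also visible:
for params, metric in zip(grid, cv_model.avgMetrics):
    print({p.name: v for p, v in params.items()}, metric)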

How to assign unique contiguous numbers to elements in a Spark RDD

左心房为你撑大大i submitted on 2019-11-27 17:41:48
I have a dataset of (user, product, review), and want to feed it into mllib's ALS algorithm. The algorithm needs users and products to be numbers, while mine are String usernames and String SKUs. Right now, I get the distinct users and SKUs, then assign numeric IDs to them outside of Spark. I was wondering whether there was a better way of doing this. The one approach I've thought of is to write a custom RDD that essentially enumerates 1 through n, then call zip on the two RDDs.

Starting with Spark 1.0 there are two methods you can use to solve this easily: RDD.zipWithIndex is just like Seq
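The two methods mentioned at the end can be tried directly; a small PySpark sketch (the question is framed around Scala, but the RDD API is the same, and the user list here is made up):

from pyspark import SparkContext

sc = SparkContext(appName="id-assignment")

users = sc.parallelize(["alice", "bob", "carol", "dave"], 2).distinct()

# zipWithIndex assigns contiguous ids 0..n-1; it may launch an extra job
# to compute per-partition offsets first.
print(users.zipWithIndex().collect())

# zipWithUniqueId avoids the extra job, but the ids are only unique, not
# contiguous (element i of partition k gets k + i * numPartitions).
print(users.zipWithUniqueId().collect())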

How to create correct data frame for classification in Spark ML

删除回忆录丶 submitted on 2019-11-27 17:24:06
I am trying to run random forest classification using the Spark ML API, but I am having issues with creating the right data frame input for the pipeline. Here is some sample data:

age,hours_per_week,education,sex,salaryRange
38,40,"hs-grad","male","A"
28,40,"bachelors","female","A"
52,45,"hs-grad","male","B"
31,50,"masters","female","B"
42,40,"bachelors","male","B"

age and hours_per_week are integers, while the other features, including the label salaryRange, are categorical (String). Loading this csv file (let's call it sample.csv) can be done with the Spark CSV library like this:

val data = sqlContext.csvFile("/home/dusan
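The question's code is Scala; below is a hedged PySpark sketch of one way to turn such a CSV into input the classifier can use: StringIndexer for the categorical columns and the label, VectorAssembler for the features. The file name and column choices follow the sample above; everything else is illustrative only.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.getOrCreate()

# 'sample.csv' as described in the question (path is hypothetical).
df = spark.read.csv("sample.csv", header=True, inferSchema=True)

# Index the categorical string columns, including the label column.
# (For strictly correct categorical handling one would usually add a
# OneHotEncoder stage for the feature columns as well.)
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx")
            for c in ["education", "sex", "salaryRange"]]

assembler = VectorAssembler(
    inputCols=["age", "hours_per_week", "education_idx", "sex_idx"],
    outputCol="features")

rf = RandomForestClassifier(labelCol="salaryRange_idx", featuresCol="features")

pipeline = Pipeline(stages=indexers + [assembler, rf])
model = pipeline.fit(df)
model.transform(df).select("salaryRange", "prediction").show()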

Calculate Cosine Similarity Spark Dataframe

…衆ロ難τιáo~ submitted on 2019-11-27 16:14:50
Question: I am using Spark Scala to calculate cosine similarity between the DataFrame rows. The DataFrame schema is below:

root
 |-- SKU: double (nullable = true)
 |-- Features: vector (nullable = true)

A sample of the dataframe:

+-------+--------------------+
|    SKU|            Features|
+-------+--------------------+
| 9970.0|[4.7143,0.0,5.785...|
|19676.0|[5.5,0.0,6.4286,4...|
| 3296.0|[4.7143,1.4286,6....|
|13658.0|[6.2857,0.7143,4....|
|    1.0|[4.2308,0.7692,5....|
|  513.0|[3.0,0.0,4.9091,5...|
| 3753.0|[5.9231,0.0
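The original uses Scala; as an illustration only, one straightforward PySpark approach is a self cross join plus a cosine UDF over the vector column (this quadratic join will not scale to very large data). The SKUs and truncated vectors below are placeholders modelled on the sample.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(9970.0, Vectors.dense([4.7143, 0.0, 5.785])),
     (19676.0, Vectors.dense([5.5, 0.0, 6.4286])),
     (3296.0, Vectors.dense([4.7143, 1.4286, 6.0]))],
    ["SKU", "Features"])

# Cosine similarity between two ml vectors.
cos_sim = F.udf(lambda u, v: float(u.dot(v) / (u.norm(2) * v.norm(2))),
                DoubleType())

pairs = (df.alias("a").crossJoin(df.alias("b"))
           .where(F.col("a.SKU") < F.col("b.SKU"))
           .select(F.col("a.SKU").alias("sku_a"),
                   F.col("b.SKU").alias("sku_b"),
                   cos_sim(F.col("a.Features"), F.col("b.Features")).alias("cosine")))
pairs.show()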

Spark CrossValidatorModel access other models than the bestModel?

南笙酒味 submitted on 2019-11-27 16:04:55
I am using Spark 1.6.1. Currently I am using a CrossValidator to train my ML Pipeline with various parameters. After the training process I can use the bestModel property of the CrossValidatorModel to get the model that performed best during the cross-validation. Are the other models of the cross-validation automatically discarded, or can I select a model that performed worse than the bestModel? I am asking because I am using the F1 score metric for the cross-validation, but I am also interested in the weightedRecall of all of the models, and not just of the model that has performed best during
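As far as I know, CrossValidatorModel only retains bestModel, but avgMetrics keeps the cross-validated score for every ParamMap, and the non-best candidates can be refit explicitly to compute a second metric. A hedged PySpark sketch (the question itself is Spark 1.6 Scala; the estimator, grid, and data below are placeholders):

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(Vectors.dense([0.0]), 0.0), (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0), (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 20,
    ["features", "label"])

lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1, 1.0]).build()
f1_eval = MulticlassClassificationEvaluator(metricName="f1")

cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                    evaluator=f1_eval, numFolds=3)
cv_model = cv.fit(df)

# Only bestModel survives; avgMetrics has one cross-validated F1 score per
# ParamMap, ordered like the grid:
print(cv_model.avgMetrics)

# To inspect a metric the CrossValidator did not optimise (weightedRecall),
# refit every candidate explicitly -- fit() accepts a list of param maps:
recall_eval = MulticlassClassificationEvaluator(metricName="weightedRecall")
for params, model in zip(grid, lr.fit(df, grid)):
    print({p.name: v for p, v in params.items()},
          recall_eval.evaluate(model.transform(df)))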

Split Contents of String column in PySpark Dataframe

青春壹個敷衍的年華 submitted on 2019-11-27 15:55:38
I have a PySpark data frame which has a column containing strings. I want to split this column into words. Code:

>>> sentenceData = sqlContext.read.load('file://sample1.csv', format='com.databricks.spark.csv', header='true', inferSchema='true')
>>> sentenceData.show(truncate=False)
+---+---------------------------+
|key|desc                       |
+---+---------------------------+
|1  |Virat is good batsman      |
|2  |sachin was good            |
|3  |but modi sucks big big time|
|4  |I love the formulas        |
+---+---------------------------+

Expected Output
---------------
>>> sentenceData.show(truncate=False)
+---+---------------------
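For illustration, a minimal sketch using pyspark.sql.functions.split, with a tiny inline DataFrame standing in for the CSV above; explode() would additionally give one word per row if that is the desired shape.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

sentenceData = spark.createDataFrame(
    [(1, "Virat is good batsman"),
     (2, "sachin was good")],
    ["key", "desc"])

# split() turns the string column into an array of words.
words = sentenceData.withColumn("words", F.split(F.col("desc"), r"\s+"))
words.show(truncate=False)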

Understanding Spark RandomForest featureImportances results

纵然是瞬间 submitted on 2019-11-27 15:25:08
Question: I'm using RandomForest.featureImportances but I don't understand the output result. I have 12 features, and this is the output I get. I realize this might not be an apache-spark-specific question, but I cannot find anywhere that explains the output.

// org.apache.spark.mllib.linalg.Vector = (12,[0,1,2,3,4,5,6,7,8,9,10,11], [0.1956128039688559,0.06863606797951556,0.11302128590305296,0.091986700351889,0.03430651625283274,0.05975817050022879,0.06929766152519388,0.052654922125615934,0
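For what it's worth: the output is a Vector of length 12 in sparse form (size, indices, values), where entry i is the normalized importance of feature i in the assembled feature vector and the entries sum to 1. A small hypothetical PySpark sketch of pairing importances with feature names:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.getOrCreate()

feature_cols = ["f0", "f1", "f2"]          # stand-ins for the real 12 features
df = spark.createDataFrame(
    [(1.0, 0.0, 3.0, 0.0), (2.0, 1.0, 1.0, 1.0),
     (1.5, 0.0, 2.5, 0.0), (3.0, 1.0, 0.5, 1.0)],
    feature_cols + ["label"])

data = VectorAssembler(inputCols=feature_cols, outputCol="features").transform(df)
model = RandomForestClassifier(numTrees=10).fit(data)

# featureImportances[i] is the (normalized) importance of the i-th feature
# in the assembled vector; sorting makes the ranking readable.
for name, score in sorted(zip(feature_cols, model.featureImportances.toArray()),
                          key=lambda kv: kv[1], reverse=True):
    print(name, round(score, 4))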

Spark ALS predictAll returns empty

為{幸葍}努か submitted on 2019-11-27 15:17:06
I have the following Python test code (the arguments to ALS.train are defined elsewhere):

r1 = (2, 1)
r2 = (3, 1)
test = sc.parallelize([r1, r2])
model = ALS.train(ratings, rank, numIter, lmbda)
predictions = model.predictAll(test)
print test.take(1)
print predictions.count()
print predictions

This works, because predictions has a count of 1 and the output is:

[(2, 1)]
1
ParallelCollectionRDD[2691] at parallelize at PythonRDD.scala:423

However, when I try to use an RDD I created myself with the following code, it doesn't appear to work anymore:

model = ALS.train(ratings,
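A hedged guess at the usual culprit, sketched below: predictAll only returns predictions for (user, product) pairs whose ids join against the trained factors, so pairs that arrive as strings (for example, parsed from a text file) or that reference unseen ids tend to come back empty; casting to int before calling predictAll typically fixes it. All names and data here are placeholders, not the question's actual RDD.

from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext(appName="als-predict-all")

ratings = sc.parallelize([Rating(2, 1, 5.0), Rating(3, 1, 3.0), Rating(2, 2, 4.0)])
model = ALS.train(ratings, rank=5, iterations=5)

# Pairs loaded from a file often arrive as strings -- cast them to int first,
# otherwise the internal join finds nothing and predictAll returns empty.
raw_pairs = sc.parallelize([("2", "1"), ("3", "1")])
test = raw_pairs.map(lambda p: (int(p[0]), int(p[1])))

print(model.predictAll(test).collect())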