apache-spark-mllib

using Word2VecModel.transform() does not work in map function

落爺英雄遲暮 submitted on 2019-11-28 00:19:11
I have built a Word2Vec model using Spark and saved it as a model. Now I want to use it in other code as an offline model. I have loaded the model and used it to get the vector of a word (e.g. Hello), and it works well. But I need to call it for many words in an RDD using map. When I call model.transform() inside a map function, it throws this error:

Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver,
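A common workaround is to avoid calling the model on executors at all: extract the word-to-vector table on the driver and broadcast it. The sketch below is a minimal illustration, assuming the pyspark.mllib Word2Vec API; the model path and word list are placeholders, not from the original question.

from pyspark import SparkContext
from pyspark.mllib.feature import Word2VecModel

sc = SparkContext(appName="word2vec-lookup")

# Load the saved model on the driver only (path is hypothetical).
model = Word2VecModel.load(sc, "/path/to/word2vec_model")

# getVectors() exposes the word -> vector table; copy it into a plain
# Python dict and broadcast it so the map function never touches the model
# (or the SparkContext it holds).
local_vectors = {w: list(v) for w, v in model.getVectors().items()}
bc_vectors = sc.broadcast(local_vectors)

words = sc.parallelize(["hello", "world", "spark"])
word_vecs = words.map(lambda w: (w, bc_vectors.value.get(w)))
print(word_vecs.collect())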

How to use mllib.recommendation if the user ids are strings instead of contiguous integers?

筅森魡賤 submitted on 2019-11-27 21:46:05
Question: I want to use Spark's mllib.recommendation library to build a prototype recommender system. However, the user data I have is in something like the following format:

AB123XY45678
CD234WZ12345
EF345OOO1234
GH456XY98765
....

If I want to use the mllib.recommendation library, then according to the API of the Rating class the user ids have to be integers (and also have to be contiguous?). It looks like some kind of conversion between the real user ids and the numeric ones used by Spark must be
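One common approach, sketched below in PySpark with made-up sample data, is to build the integer ids inside Spark with zipWithIndex and join them back onto the ratings before calling ALS; the id tables can be kept around to translate recommendations back to the original strings.

from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext(appName="string-ids-als")

# (user_string, product_string, rating) triples -- hypothetical sample data.
raw = sc.parallelize([
    ("AB123XY45678", "SKU-1", 4.0),
    ("CD234WZ12345", "SKU-2", 3.0),
    ("EF345OOO1234", "SKU-1", 5.0),
])

# Contiguous integer ids for users and products, built with zipWithIndex.
user_ids = raw.map(lambda r: r[0]).distinct().zipWithIndex()   # (user_str, uid)
prod_ids = raw.map(lambda r: r[1]).distinct().zipWithIndex()   # (prod_str, pid)

ratings = (raw.map(lambda r: (r[0], (r[1], r[2])))
              .join(user_ids)                                          # (user_str, ((prod, rating), uid))
              .map(lambda kv: (kv[1][0][0], (kv[1][1], kv[1][0][1])))  # (prod_str, (uid, rating))
              .join(prod_ids)                                          # (prod_str, ((uid, rating), pid))
              .map(lambda kv: Rating(kv[1][0][0], kv[1][1], kv[1][0][1])))

model = ALS.train(ratings, rank=10, iterations=5)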

How to extract model hyper-parameters from spark.ml in PySpark?

别来无恙 submitted on 2019-11-27 20:32:52
Question: I'm tinkering with some cross-validation code from the PySpark documentation, and trying to get PySpark to tell me what model was selected:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0),
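For reference, a self-contained sketch of one way to inspect the winning model, assuming a reasonably recent PySpark (2.x or later) where data frames come from a SparkSession and fitted models report their resolved params; the tiny dataset and grid here are placeholders, not the question's code.

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

spark = SparkSession.builder.getOrCreate()

dataset = spark.createDataFrame(
    [(Vectors.dense([0.0]), 0.0), (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0), (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 10,
    ["features", "label"])

lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1, 1.0]).build()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(), numFolds=3)
cv_model = cv.fit(dataset)

# The winning model's resolved parameters (older versions may only show
# defaults here, which is the original complaint):
for p, v in cv_model.bestModel.extractParamMap().items():
    print(p.name, "=", v)

# avgMetrics is ordered like the param grid, so every candidate's score
# is also visible:
for params, metric in zip(grid, cv_model.avgMetrics):
    print({p.name: v for p, v in params.items()}, metric)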

How to assign unique contiguous numbers to elements in a Spark RDD

左心房为你撑大大i submitted on 2019-11-27 17:41:48
I have a dataset of (user, product, review), and want to feed it into mllib's ALS algorithm. The algorithm needs users and products to be numbers, while mine are String usernames and String SKUs. Right now, I get the distinct users and SKUs, then assign numeric IDs to them outside of Spark. I was wondering whether there was a better way of doing this. The one approach I've thought of is to write a custom RDD that essentially enumerates 1 through n, then call zip on the two RDDs.

Starting with Spark 1.0 there are two methods you can use to solve this easily: RDD.zipWithIndex is just like Seq
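The two methods mentioned at the end can be tried directly; a small PySpark sketch (the question is framed around Scala, but the RDD API is the same, and the user list here is made up):

from pyspark import SparkContext

sc = SparkContext(appName="id-assignment")

users = sc.parallelize(["alice", "bob", "carol", "dave"], 2).distinct()

# zipWithIndex assigns contiguous ids 0..n-1; it may launch an extra job
# to compute per-partition offsets first.
print(users.zipWithIndex().collect())

# zipWithUniqueId avoids the extra job, but the ids are only unique, not
# contiguous (element i of partition k gets k + i * numPartitions).
print(users.zipWithUniqueId().collect())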

How to create correct data frame for classification in Spark ML

删除回忆录丶 submitted on 2019-11-27 17:24:06
I am trying to run random forest classification using the Spark ML API, but I am having issues with creating the right data frame input for the pipeline. Here is some sample data:

age,hours_per_week,education,sex,salaryRange
38,40,"hs-grad","male","A"
28,40,"bachelors","female","A"
52,45,"hs-grad","male","B"
31,50,"masters","female","B"
42,40,"bachelors","male","B"

age and hours_per_week are integers, while the other features, including the label salaryRange, are categorical (String). Loading this csv file (let's call it sample.csv) can be done with the Spark CSV library like this:

val data = sqlContext.csvFile("/home/dusan
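The question's code is Scala; below is a hedged PySpark sketch of one way to turn such a CSV into input the classifier can use: StringIndexer for the categorical columns and the label, VectorAssembler for the features. The file name and column choices follow the sample above; everything else is illustrative only.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.getOrCreate()

# 'sample.csv' as described in the question (path is hypothetical).
df = spark.read.csv("sample.csv", header=True, inferSchema=True)

# Index the categorical string columns, including the label column.
# (For strictly correct categorical handling one would usually add a
# OneHotEncoder stage for the feature columns as well.)
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx")
            for c in ["education", "sex", "salaryRange"]]

assembler = VectorAssembler(
    inputCols=["age", "hours_per_week", "education_idx", "sex_idx"],
    outputCol="features")

rf = RandomForestClassifier(labelCol="salaryRange_idx", featuresCol="features")

pipeline = Pipeline(stages=indexers + [assembler, rf])
model = pipeline.fit(df)
model.transform(df).select("salaryRange", "prediction").show()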

Calculate Cosine Similarity Spark Dataframe

…衆ロ難τιáo~ submitted on 2019-11-27 16:14:50
Question: I am using Spark Scala to calculate cosine similarity between the DataFrame rows. The DataFrame schema is below:

root
 |-- SKU: double (nullable = true)
 |-- Features: vector (nullable = true)

A sample of the dataframe:

+-------+--------------------+
|    SKU|            Features|
+-------+--------------------+
| 9970.0|[4.7143,0.0,5.785...|
|19676.0|[5.5,0.0,6.4286,4...|
| 3296.0|[4.7143,1.4286,6....|
|13658.0|[6.2857,0.7143,4....|
|    1.0|[4.2308,0.7692,5....|
|  513.0|[3.0,0.0,4.9091,5...|
| 3753.0|[5.9231,0.0
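The original uses Scala; as an illustration only, one straightforward PySpark approach is a self cross join plus a cosine UDF over the vector column (this quadratic join will not scale to very large data). The SKUs and truncated vectors below are placeholders modelled on the sample.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(9970.0, Vectors.dense([4.7143, 0.0, 5.785])),
     (19676.0, Vectors.dense([5.5, 0.0, 6.4286])),
     (3296.0, Vectors.dense([4.7143, 1.4286, 6.0]))],
    ["SKU", "Features"])

# Cosine similarity between two ml vectors.
cos_sim = F.udf(lambda u, v: float(u.dot(v) / (u.norm(2) * v.norm(2))),
                DoubleType())

pairs = (df.alias("a").crossJoin(df.alias("b"))
           .where(F.col("a.SKU") < F.col("b.SKU"))
           .select(F.col("a.SKU").alias("sku_a"),
                   F.col("b.SKU").alias("sku_b"),
                   cos_sim(F.col("a.Features"), F.col("b.Features")).alias("cosine")))
pairs.show()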

Spark CrossValidatorModel access other models than the bestModel?

南笙酒味 submitted on 2019-11-27 16:04:55
I am using Spark 1.6.1. Currently I am using a CrossValidator to train my ML Pipeline with various parameters. After the training process I can use the bestModel property of the CrossValidatorModel to get the model that performed best during the cross-validation. Are the other models of the cross-validation automatically discarded, or can I select a model that performed worse than the bestModel? I am asking because I am using the F1 score metric for the cross-validation, but I am also interested in the weightedRecall of all of the models, and not just of the model that has performed best during
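As far as I know, CrossValidatorModel only retains bestModel, but avgMetrics keeps the cross-validated score for every ParamMap, and the non-best candidates can be refit explicitly to compute a second metric. A hedged PySpark sketch (the question itself is Spark 1.6 Scala; the estimator, grid, and data below are placeholders):

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(Vectors.dense([0.0]), 0.0), (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0), (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 20,
    ["features", "label"])

lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1, 1.0]).build()
f1_eval = MulticlassClassificationEvaluator(metricName="f1")

cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                    evaluator=f1_eval, numFolds=3)
cv_model = cv.fit(df)

# Only bestModel survives; avgMetrics has one cross-validated F1 score per
# ParamMap, ordered like the grid:
print(cv_model.avgMetrics)

# To inspect a metric the CrossValidator did not optimise (weightedRecall),
# refit every candidate explicitly -- fit() accepts a list of param maps:
recall_eval = MulticlassClassificationEvaluator(metricName="weightedRecall")
for params, model in zip(grid, lr.fit(df, grid)):
    print({p.name: v for p, v in params.items()},
          recall_eval.evaluate(model.transform(df)))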

Split Contents of String column in PySpark Dataframe

青春壹個敷衍的年華 submitted on 2019-11-27 15:55:38
I have a PySpark data frame which has a column containing strings. I want to split this column into words. Code:

>>> sentenceData = sqlContext.read.load('file://sample1.csv', format='com.databricks.spark.csv', header='true', inferSchema='true')
>>> sentenceData.show(truncate=False)
+---+---------------------------+
|key|desc                       |
+---+---------------------------+
|1  |Virat is good batsman      |
|2  |sachin was good            |
|3  |but modi sucks big big time|
|4  |I love the formulas        |
+---+---------------------------+

Expected Output
---------------
>>> sentenceData.show(truncate=False)
+---+---------------------
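For illustration, a minimal sketch using pyspark.sql.functions.split, with a tiny inline DataFrame standing in for the CSV above; explode() would additionally give one word per row if that is the desired shape.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

sentenceData = spark.createDataFrame(
    [(1, "Virat is good batsman"),
     (2, "sachin was good")],
    ["key", "desc"])

# split() turns the string column into an array of words.
words = sentenceData.withColumn("words", F.split(F.col("desc"), r"\s+"))
words.show(truncate=False)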

Understanding Spark RandomForest featureImportances results

纵然是瞬间 submitted on 2019-11-27 15:25:08
Question: I'm using RandomForest.featureImportances but I don't understand the output result. I have 12 features, and this is the output I get. I realize this might not be an apache-spark-specific question, but I cannot find anywhere that explains the output.

// org.apache.spark.mllib.linalg.Vector = (12,[0,1,2,3,4,5,6,7,8,9,10,11], [0.1956128039688559,0.06863606797951556,0.11302128590305296,0.091986700351889,0.03430651625283274,0.05975817050022879,0.06929766152519388,0.052654922125615934,0
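For what it's worth: the output is a Vector of length 12 in sparse form (size, indices, values), where entry i is the normalized importance of feature i in the assembled feature vector and the entries sum to 1. A small hypothetical PySpark sketch of pairing importances with feature names:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.getOrCreate()

feature_cols = ["f0", "f1", "f2"]          # stand-ins for the real 12 features
df = spark.createDataFrame(
    [(1.0, 0.0, 3.0, 0.0), (2.0, 1.0, 1.0, 1.0),
     (1.5, 0.0, 2.5, 0.0), (3.0, 1.0, 0.5, 1.0)],
    feature_cols + ["label"])

data = VectorAssembler(inputCols=feature_cols, outputCol="features").transform(df)
model = RandomForestClassifier(numTrees=10).fit(data)

# featureImportances[i] is the (normalized) importance of the i-th feature
# in the assembled vector; sorting makes the ranking readable.
for name, score in sorted(zip(feature_cols, model.featureImportances.toArray()),
                          key=lambda kv: kv[1], reverse=True):
    print(name, round(score, 4))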

Spark ALS predictAll returns empty

為{幸葍}努か submitted on 2019-11-27 15:17:06
I have the following Python test code (the arguments to ALS.train are defined elsewhere):

r1 = (2, 1)
r2 = (3, 1)
test = sc.parallelize([r1, r2])
model = ALS.train(ratings, rank, numIter, lmbda)
predictions = model.predictAll(test)
print test.take(1)
print predictions.count()
print predictions

This works, because predictions has a count of 1 and the output is:

[(2, 1)]
1
ParallelCollectionRDD[2691] at parallelize at PythonRDD.scala:423

However, when I try to use an RDD I created myself with the following code, it doesn't appear to work anymore:

model = ALS.train(ratings,
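A hedged guess at the usual culprit, sketched below: predictAll only returns predictions for (user, product) pairs whose ids join against the trained factors, so pairs that arrive as strings (for example, parsed from a text file) or that reference unseen ids tend to come back empty; casting to int before calling predictAll typically fixes it. All names and data here are placeholders, not the question's actual RDD.

from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext(appName="als-predict-all")

ratings = sc.parallelize([Rating(2, 1, 5.0), Rating(3, 1, 3.0), Rating(2, 2, 4.0)])
model = ALS.train(ratings, rank=5, iterations=5)

# Pairs loaded from a file often arrive as strings -- cast them to int first,
# otherwise the internal join finds nothing and predictAll returns empty.
raw_pairs = sc.parallelize([("2", "1"), ("3", "1")])
test = raw_pairs.map(lambda p: (int(p[0]), int(p[1])))

print(model.predictAll(test).collect())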