apache-spark-mllib

How to set Spark KMeans initial centers

烈酒焚心 submitted on 2019-12-24 00:10:19
Question: I'm using Spark ML to run KMeans. I have a bunch of data and three existing centers, for example: [1.0,1.0,1.0], [5.0,5.0,5.0], [9.0,9.0,9.0]. How can I make KMeans use those three vectors as its centers? I saw that the KMeans object has a seed parameter, but seed is a long, not an array. So how can I tell Spark KMeans to use only the existing centers for clustering? Or rather, I don't understand what seed means in Spark KMeans; I suppose the seeds should …
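
One way to do this, as a hedged sketch: in the versions discussed here, the DataFrame-based spark.ml KMeans does not expose a public setter for initial centers (seed only steers the random/k-means|| initialization), but the RDD-based pyspark.mllib KMeans.train accepts an initialModel parameter (Spark 1.5 and later) built from known centers. The sample data below is made up:

    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans, KMeansModel
    from pyspark.mllib.linalg import Vectors

    sc = SparkContext(appName="kmeans-initial-centers")

    # Made-up data; replace with your own RDD of vectors.
    data = sc.parallelize([
        Vectors.dense([1.1, 0.9, 1.0]),
        Vectors.dense([5.2, 4.8, 5.1]),
        Vectors.dense([8.9, 9.1, 9.0]),
    ])

    # Wrap the three known centers in a KMeansModel and hand it to train();
    # once initialModel is given, the seed no longer matters.
    initial = KMeansModel([[1.0, 1.0, 1.0], [5.0, 5.0, 5.0], [9.0, 9.0, 9.0]])
    model = KMeans.train(data, k=3, maxIterations=20, initialModel=initial)
    print(model.clusterCenters)

Note that training still iterates from those centers rather than keeping them fixed; if you only want to assign points to the three given centers, call initial.predict(data) directly.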

Handle null/NaN values in spark mllib classifier

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-23 23:04:05
Question: I have a set of categorical columns (strings) that I'm parsing and converting into Vectors of features to pass to an mllib classifier (random forest). In my input data, some columns have null values. Say, in one of those columns, I have p values plus a null value. How should I build my feature Vectors and the categoricalFeaturesInfo map for the classifier? Option 1: I declare p values in categoricalFeaturesInfo and use Double.NaN in my input Vectors? Side question: how are NaNs handled by …
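
One approach, sketched under the assumption that treating null as just another category is acceptable: avoid NaN entirely, give null its own index p, and declare p + 1 values in categoricalFeaturesInfo. The column values and mapping below are made up:

    from pyspark import SparkContext
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.tree import RandomForest

    sc = SparkContext(appName="null-as-category")

    # One categorical column with p = 2 real values; null gets index p,
    # so the column's arity becomes p + 1 = 3.
    mapping = {"red": 0.0, "blue": 1.0}
    NULL_INDEX = 2.0

    raw = sc.parallelize([("red", 1.0), (None, 0.0), ("blue", 1.0), ("red", 0.0)])

    def to_point(row):
        color, label = row
        code = mapping.get(color, NULL_INDEX)  # null/unseen -> extra category
        return LabeledPoint(label, Vectors.dense([code]))

    points = raw.map(to_point)

    # Declare p + 1 categories for feature 0 so the forest treats the null
    # bucket as a category rather than a numeric value.
    model = RandomForest.trainClassifier(
        points, numClasses=2, categoricalFeaturesInfo={0: 3},
        numTrees=5, seed=42)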

How does Spark's StreamingLinearRegressionWithSGD work?

会有一股神秘感。 submitted on 2019-12-23 18:49:38
Question: I am working with StreamingLinearRegressionWithSGD, which has two methods, trainOn and predictOn. The class holds a model object that is updated as training data arrives on the stream passed to trainOn; simultaneously, it makes predictions using the same model. I want to know how the model weights are updated and synchronized across workers/executors. Any link or reference would be helpful. Thanks. Answer 1: There is no magic here. StreamingLinearAlgorithm keeps a mutable reference to the …
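
To make the pattern the answer is describing concrete, here is a minimal runnable sketch (queue-backed streams stand in for real sources): the model object lives on the driver, trainOn folds each batch into its weights, and the current weights travel to the executors inside each batch's prediction job, so there is no worker-to-worker synchronization involved.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.regression import LabeledPoint, StreamingLinearRegressionWithSGD

    sc = SparkContext(appName="streaming-lr")
    ssc = StreamingContext(sc, batchDuration=1)

    # Toy streams standing in for real input sources.
    train_stream = ssc.queueStream([sc.parallelize([
        LabeledPoint(1.0, Vectors.dense([1.0, 0.5])),
    ])])
    test_stream = ssc.queueStream([sc.parallelize([Vectors.dense([0.8, 0.4])])])

    model = StreamingLinearRegressionWithSGD(stepSize=0.1, numIterations=10)
    model.setInitialWeights([0.0, 0.0])

    # trainOn updates the driver-side model once per batch; predictOn uses
    # whatever weights the model holds when each batch is processed.
    model.trainOn(train_stream)
    model.predictOn(test_stream).pprint()

    ssc.start()
    ssc.awaitTermination()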

Stratified sampling with Spark and Java

我怕爱的太早我们不能终老 submitted on 2019-12-23 15:57:37
Question: I'd like to make sure I'm training on a stratified sample of my data. This seems to be supported by Spark 2.1 and earlier versions via JavaPairRDD.sampleByKey(...) and JavaPairRDD.sampleByKeyExact(...), as explained here. But: my data is stored in a Dataset<Row>, not a JavaPairRDD. The first column is the label; all others are features (imported from a libsvm-formatted file). What's the easiest way to get a stratified sample of my dataset instance and end up with a Dataset<Row> again? In …
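
A sketch of what is probably the least-ceremony route: DataFrames have had stratified sampling via the stat functions since Spark 1.5, which keeps everything a Dataset<Row> (in Java: dataset.stat().sampleBy("label", fractions, seed)). Shown in Python for consistency with the other examples; note there is no exact-sample variant on DataFrames, so for sampleByKeyExact semantics you would still round-trip through a pair RDD. The input path is an assumption:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("stratified-sample").getOrCreate()

    # A libsvm load produces "label" and "features" columns.
    df = spark.read.format("libsvm").load("data.libsvm")

    # Keep 10% of each class; keys must match the distinct label values.
    fractions = {0.0: 0.1, 1.0: 0.1}
    sample = df.stat.sampleBy("label", fractions, seed=42)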

MLlib MatrixFactorizationModel recommendProducts(user, num) failing on some users

耗尽温柔 submitted on 2019-12-23 12:26:05
Question: I trained a MatrixFactorizationModel using ALS.train() and am now using model.recommendProducts(user, num) to get the top recommended products, but the code fails for some users with the following error:

    user_products = model.call("recommendProducts", user, prodNum)
      File "/usr/lib/spark/python/pyspark/mllib/common.py", line 136, in call
        return callJavaFunc(self._sc, getattr(self._java_model, name), *a)
      File "/usr/lib/spark/python/pyspark/mllib/common.py", line 113, in callJavaFunc
        return …
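
The traceback is cut off, but a frequent cause of this failure is requesting recommendations for a user id that never appeared in the training set, so the model has no latent factors for it. A defensive sketch under that assumption, with toy ratings:

    from pyspark import SparkContext
    from pyspark.mllib.recommendation import ALS, Rating

    sc = SparkContext(appName="guarded-recommend")

    ratings = sc.parallelize([Rating(1, 10, 5.0), Rating(2, 10, 3.0)])
    model = ALS.train(ratings, rank=5, iterations=5)

    # recommendProducts raises for ids absent from training, so check
    # against the learned user factors first.
    known_users = set(model.userFeatures().keys().collect())

    def safe_recommend(user, num):
        if user not in known_users:
            return []  # cold-start user: fall back to e.g. popular items
        return model.recommendProducts(user, num)

    print(safe_recommend(1, 5))
    print(safe_recommend(99, 5))  # unknown user -> []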

How to do prediction with an sklearn model inside Spark?

牧云@^-^@ submitted on 2019-12-23 08:59:31
Question: I have trained a model in Python using sklearn. How can we load the same model in Spark and generate predictions on a Spark RDD? Answer 1: Well, I will show an example of linear regression in sklearn and show you how to use it to predict elements of a Spark RDD. First, train the model with the sklearn example:

    # Create linear regression object
    regr = linear_model.LinearRegression()
    # Train the model using the training sets
    regr.fit(diabetes_X_train, diabetes_y_train)

Here we just have the fit, …
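
The answer is cut off after the fit, but the usual continuation of this pattern is to broadcast the fitted estimator and call its predict inside a transformation, so each executor gets its own read-only copy. A self-contained sketch with stand-in data replacing the diabetes arrays:

    import numpy as np
    from pyspark import SparkContext
    from sklearn import linear_model

    sc = SparkContext(appName="sklearn-on-spark")

    # Stand-in training data for diabetes_X_train / diabetes_y_train.
    X_train = np.array([[0.0], [1.0], [2.0]])
    y_train = np.array([0.0, 1.0, 2.0])

    regr = linear_model.LinearRegression()
    regr.fit(X_train, y_train)

    # Broadcast the fitted model once; executors deserialize it locally
    # and call predict() per element inside map().
    regr_bc = sc.broadcast(regr)

    features = sc.parallelize([[0.5], [1.5], [2.5]])
    preds = features.map(lambda x: float(regr_bc.value.predict([x])[0]))
    print(preds.collect())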

Why recommendProductsForUsers is not a member of org.apache.spark.mllib.recommendation.MatrixFactorizationModel

南楼画角 submitted on 2019-12-23 06:29:55
Question: I have built a recommendation system using Spark with the mllib ALS collaborative filtering. My snippet code:

    bestModel.get.predict(toBePredictedBroadcasted.value)

Everything is OK, but I need to change the code to fulfil a requirement. I read in the Scala docs here that I need to use def recommendProducts, but when I tried it in my code:

    bestModel.get.recommendProductsForUsers(100)

I get an error at compile time: value recommendProductsForUsers is not a member of org.apache.spark.mllib.recommendation …
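
For context: recommendProductsForUsers(num) was added to MatrixFactorizationModel in Spark 1.4.0, so this compile error usually means the project is built against an older Spark dependency. A sketch of the bulk call plus a per-user fallback for older versions, in Python for consistency with the other examples (the method names are the same on bestModel.get in Scala):

    from pyspark import SparkContext
    from pyspark.mllib.recommendation import ALS, Rating

    sc = SparkContext(appName="recommend-for-all-users")

    ratings = sc.parallelize([Rating(1, 10, 5.0), Rating(2, 20, 4.0)])
    model = ALS.train(ratings, rank=5, iterations=5)

    # Spark >= 1.4 (the PySpark wrapper may need a newer release):
    # one call returns the top-N products for every user.
    per_user = model.recommendProductsForUsers(10).collect()

    # Older versions: one recommendProducts call per known user id.
    user_ids = model.userFeatures().keys().collect()
    fallback = {u: model.recommendProducts(u, 10) for u in user_ids}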

pyspark : ml + streaming

喜你入骨 submitted on 2019-12-23 05:17:20
Question: According to Combining Spark Streaming + MLlib, it is possible to make predictions over a stream of input in Spark. The issue with the given example (which works on my cluster) is that testData is handed over already in the correct format. I am trying to set up a client <-> server TCP exchange based on strings of data. I can't figure out how to transform the string into the correct format. While this works:

    sep = ";"
    str_recue = '0.0;0.1;0.2;0.3;0.4;0.5'
    rdd = sc.parallelize([str_recue])
    chemin …
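
A sketch of the missing transformation step, assuming a streaming regression model and the separator from the question (host and port are placeholders): split each received line on ";", map the pieces to floats, and wrap them in a dense vector before handing the DStream to predictOn.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.regression import StreamingLinearRegressionWithSGD

    sc = SparkContext(appName="tcp-predict")
    ssc = StreamingContext(sc, batchDuration=1)

    sep = ";"
    lines = ssc.socketTextStream("localhost", 9999)  # host/port are assumptions

    # Turn each line like '0.0;0.1;0.2;0.3;0.4;0.5' into a dense vector.
    vectors = lines.map(lambda s: Vectors.dense([float(x) for x in s.split(sep)]))

    model = StreamingLinearRegressionWithSGD()
    model.setInitialWeights([0.0] * 6)  # 6 features, matching the sample line
    model.predictOn(vectors).pprint()

    ssc.start()
    ssc.awaitTermination()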

Apache Spark MLlib with DataFrame API gives java.net.URISyntaxException when createDataFrame() or read().csv(…)

与世无争的帅哥 submitted on 2019-12-23 03:46:06
Question: In a standalone application (running on Java 8, Windows 10, with spark-xxx_2.11:2.0.0 as jar dependencies) the following code gives an error:

    /* this: */
    Dataset<Row> logData = spark_session.createDataFrame(Arrays.asList(
        new LabeledPoint(1.0, Vectors.dense(4.9,3,1.4,0.2)),
        new LabeledPoint(1.0, Vectors.dense(4.7,3.2,1.3,0.2))
    ), LabeledPoint.class);

    /* or this: */
    /* logFile: "C:\files\project\file.csv", "C:\\files\\project\\file.csv",
       "C:/files/project/file.csv", "file:/C:/files/project/file.csv", "file: …
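
A commonly reported fix for this URISyntaxException on Windows with Spark 2.0 (tracked as SPARK-15893) is to point spark.sql.warehouse.dir at a well-formed file: URI when building the session. Sketched in Python for consistency with the other examples; the same config key works from Java's SparkSession.builder(). The warehouse path is an example; any writable local directory works:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("windows-warehouse-fix")
             .config("spark.sql.warehouse.dir", "file:///C:/tmp/spark-warehouse")
             .getOrCreate())

    df = spark.read.csv("file:///C:/files/project/file.csv")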

Error when training a logistic regression model on Apache Spark: SPARK-5063

て烟熏妆下的殇ゞ submitted on 2019-12-22 18:30:43
Question: I am trying to build a logistic regression model with Apache Spark. Here is the code:

    parsedData = raw_data.map(mapper)  # mapper generates a (label, feature vector) pair as a LabeledPoint object
    featureVectors = parsedData.map(lambda point: point.features)  # get feature vectors from the parsed data
    scaler = StandardScaler(True, True).fit(featureVectors)  # creates a standardization model to scale the features
    scaledData = parsedData.map(lambda lp: LabeledPoint(lp.label, …
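
The code is cut off at the failing line, but the shape of the SPARK-5063 fix is well known: scaler.transform goes through the driver's SparkContext, so it cannot be called inside map() on the executors. Instead, transform the whole features RDD on the driver and zip the labels back on. A self-contained sketch with toy data:

    from pyspark import SparkContext
    from pyspark.mllib.feature import StandardScaler
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.regression import LabeledPoint

    sc = SparkContext(appName="spark-5063-fix")

    parsedData = sc.parallelize([
        LabeledPoint(1.0, Vectors.dense([1.0, 2.0])),
        LabeledPoint(0.0, Vectors.dense([3.0, 4.0])),
    ])

    featureVectors = parsedData.map(lambda lp: lp.features)
    scaler = StandardScaler(withMean=True, withStd=True).fit(featureVectors)

    # Transform the features RDD as a whole (driver side), then pair each
    # label with its scaled vector; zip is safe because both RDDs come from
    # the same parent via map() and keep identical partitioning.
    labels = parsedData.map(lambda lp: lp.label)
    scaledData = labels.zip(scaler.transform(featureVectors)) \
                       .map(lambda pair: LabeledPoint(pair[0], pair[1]))
    print(scaledData.collect())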