apache-spark-mllib

How to set Spark KMeans initial centers

烈酒焚心 submitted on 2019-12-24 00:10:19
Question: I'm using Spark ML to run KMeans. I have a bunch of data and three existing centers, for example: [1.0,1.0,1.0], [5.0,5.0,5.0], [9.0,9.0,9.0]. How can I make KMeans use those three vectors as its centers? I saw that the KMeans object has a seed parameter, but seed is a long, not an array. So how can I tell Spark KMeans to use only the existing centers for clustering? Or rather, I don't understand what seed means in Spark KMeans; I suppose the seeds should …
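
One way to do this, as a hedged sketch: in the versions discussed here, the DataFrame-based spark.ml KMeans does not expose a public setter for initial centers (seed only steers the random/k-means|| initialization), but the RDD-based pyspark.mllib KMeans.train accepts an initialModel parameter (Spark 1.5 and later) built from known centers. The sample data below is made up:

    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans, KMeansModel
    from pyspark.mllib.linalg import Vectors

    sc = SparkContext(appName="kmeans-initial-centers")

    # Made-up data; replace with your own RDD of vectors.
    data = sc.parallelize([
        Vectors.dense([1.1, 0.9, 1.0]),
        Vectors.dense([5.2, 4.8, 5.1]),
        Vectors.dense([8.9, 9.1, 9.0]),
    ])

    # Wrap the three known centers in a KMeansModel and hand it to train();
    # once initialModel is given, the seed no longer matters.
    initial = KMeansModel([[1.0, 1.0, 1.0], [5.0, 5.0, 5.0], [9.0, 9.0, 9.0]])
    model = KMeans.train(data, k=3, maxIterations=20, initialModel=initial)
    print(model.clusterCenters)

Note that training still iterates from those centers rather than keeping them fixed; if you only want to assign points to the three given centers, call initial.predict(data) directly.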

Handle null/NaN values in spark mllib classifier

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-23 23:04:05
Question: I have a set of categorical columns (strings) that I'm parsing and converting into Vectors of features to pass to an mllib classifier (random forest). In my input data, some columns have null values. Say, in one of those columns, I have p values plus a null value. How should I build my feature Vectors and the categoricalFeaturesInfo map for the classifier? Option 1: I declare p values in categoricalFeaturesInfo and use Double.NaN in my input Vectors? Side question: how are NaNs handled by …
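
One approach, sketched under the assumption that treating null as just another category is acceptable: avoid NaN entirely, give null its own index p, and declare p + 1 values in categoricalFeaturesInfo. The column values and mapping below are made up:

    from pyspark import SparkContext
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.tree import RandomForest

    sc = SparkContext(appName="null-as-category")

    # One categorical column with p = 2 real values; null gets index p,
    # so the column's arity becomes p + 1 = 3.
    mapping = {"red": 0.0, "blue": 1.0}
    NULL_INDEX = 2.0

    raw = sc.parallelize([("red", 1.0), (None, 0.0), ("blue", 1.0), ("red", 0.0)])

    def to_point(row):
        color, label = row
        code = mapping.get(color, NULL_INDEX)  # null/unseen -> extra category
        return LabeledPoint(label, Vectors.dense([code]))

    points = raw.map(to_point)

    # Declare p + 1 categories for feature 0 so the forest treats the null
    # bucket as a category rather than a numeric value.
    model = RandomForest.trainClassifier(
        points, numClasses=2, categoricalFeaturesInfo={0: 3},
        numTrees=5, seed=42)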

How does Spark's StreamingLinearRegressionWithSGD work?

会有一股神秘感。 submitted on 2019-12-23 18:49:38
Question: I am working with StreamingLinearRegressionWithSGD, which has two methods, trainOn and predictOn. The class holds a model object that is updated as training data arrives on the stream passed to trainOn; simultaneously, it makes predictions using the same model. I want to know how the model weights are updated and synchronized across workers/executors. Any link or reference would be helpful. Thanks. Answer 1: There is no magic here. StreamingLinearAlgorithm keeps a mutable reference to the …
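
To make the pattern the answer is describing concrete, here is a minimal runnable sketch (queue-backed streams stand in for real sources): the model object lives on the driver, trainOn folds each batch into its weights, and the current weights travel to the executors inside each batch's prediction job, so there is no worker-to-worker synchronization involved.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.regression import LabeledPoint, StreamingLinearRegressionWithSGD

    sc = SparkContext(appName="streaming-lr")
    ssc = StreamingContext(sc, batchDuration=1)

    # Toy streams standing in for real input sources.
    train_stream = ssc.queueStream([sc.parallelize([
        LabeledPoint(1.0, Vectors.dense([1.0, 0.5])),
    ])])
    test_stream = ssc.queueStream([sc.parallelize([Vectors.dense([0.8, 0.4])])])

    model = StreamingLinearRegressionWithSGD(stepSize=0.1, numIterations=10)
    model.setInitialWeights([0.0, 0.0])

    # trainOn updates the driver-side model once per batch; predictOn uses
    # whatever weights the model holds when each batch is processed.
    model.trainOn(train_stream)
    model.predictOn(test_stream).pprint()

    ssc.start()
    ssc.awaitTermination()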

Stratified sampling with Spark and Java

我怕爱的太早我们不能终老 submitted on 2019-12-23 15:57:37
Question: I'd like to make sure I'm training on a stratified sample of my data. This seems to be supported by Spark 2.1 and earlier versions via JavaPairRDD.sampleByKey(...) and JavaPairRDD.sampleByKeyExact(...), as explained here. But: my data is stored in a Dataset<Row>, not a JavaPairRDD. The first column is the label; all others are features (imported from a libsvm-formatted file). What's the easiest way to get a stratified sample of my dataset instance and end up with a Dataset<Row> again? In …
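
A sketch of what is probably the least-ceremony route: DataFrames have had stratified sampling via the stat functions since Spark 1.5, which keeps everything a Dataset<Row> (in Java: dataset.stat().sampleBy("label", fractions, seed)). Shown in Python for consistency with the other examples; note there is no exact-sample variant on DataFrames, so for sampleByKeyExact semantics you would still round-trip through a pair RDD. The input path is an assumption:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("stratified-sample").getOrCreate()

    # A libsvm load produces "label" and "features" columns.
    df = spark.read.format("libsvm").load("data.libsvm")

    # Keep 10% of each class; keys must match the distinct label values.
    fractions = {0.0: 0.1, 1.0: 0.1}
    sample = df.stat.sampleBy("label", fractions, seed=42)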

MLlib MatrixFactorizationModel recommendProducts(user, num) failing on some users

耗尽温柔 submitted on 2019-12-23 12:26:05
Question: I trained a MatrixFactorizationModel using ALS.train() and am now using model.recommendProducts(user, num) to get the top recommended products, but the code fails for some users with the following error:

    user_products = model.call("recommendProducts", user, prodNum)
      File "/usr/lib/spark/python/pyspark/mllib/common.py", line 136, in call
        return callJavaFunc(self._sc, getattr(self._java_model, name), *a)
      File "/usr/lib/spark/python/pyspark/mllib/common.py", line 113, in callJavaFunc
        return …
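
The traceback is cut off, but a frequent cause of this failure is requesting recommendations for a user id that never appeared in the training set, so the model has no latent factors for it. A defensive sketch under that assumption, with toy ratings:

    from pyspark import SparkContext
    from pyspark.mllib.recommendation import ALS, Rating

    sc = SparkContext(appName="guarded-recommend")

    ratings = sc.parallelize([Rating(1, 10, 5.0), Rating(2, 10, 3.0)])
    model = ALS.train(ratings, rank=5, iterations=5)

    # recommendProducts raises for ids absent from training, so check
    # against the learned user factors first.
    known_users = set(model.userFeatures().keys().collect())

    def safe_recommend(user, num):
        if user not in known_users:
            return []  # cold-start user: fall back to e.g. popular items
        return model.recommendProducts(user, num)

    print(safe_recommend(1, 5))
    print(safe_recommend(99, 5))  # unknown user -> []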

How to do prediction with an sklearn model inside Spark?

牧云@^-^@ submitted on 2019-12-23 08:59:31
Question: I have trained a model in Python using sklearn. How can we load the same model in Spark and generate predictions on a Spark RDD? Answer 1: Well, I will show an example of linear regression in sklearn and show you how to use it to predict elements of a Spark RDD. First, train the model with the sklearn example:

    # Create linear regression object
    regr = linear_model.LinearRegression()
    # Train the model using the training sets
    regr.fit(diabetes_X_train, diabetes_y_train)

Here we just have the fit, …
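
The answer is cut off after the fit, but the usual continuation of this pattern is to broadcast the fitted estimator and call its predict inside a transformation, so each executor gets its own read-only copy. A self-contained sketch with stand-in data replacing the diabetes arrays:

    import numpy as np
    from pyspark import SparkContext
    from sklearn import linear_model

    sc = SparkContext(appName="sklearn-on-spark")

    # Stand-in training data for diabetes_X_train / diabetes_y_train.
    X_train = np.array([[0.0], [1.0], [2.0]])
    y_train = np.array([0.0, 1.0, 2.0])

    regr = linear_model.LinearRegression()
    regr.fit(X_train, y_train)

    # Broadcast the fitted model once; executors deserialize it locally
    # and call predict() per element inside map().
    regr_bc = sc.broadcast(regr)

    features = sc.parallelize([[0.5], [1.5], [2.5]])
    preds = features.map(lambda x: float(regr_bc.value.predict([x])[0]))
    print(preds.collect())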

Why recommendProductsForUsers is not a member of org.apache.spark.mllib.recommendation.MatrixFactorizationModel

南楼画角 submitted on 2019-12-23 06:29:55
Question: I have built a recommendation system using Spark with the mllib ALS collaborative filtering. My snippet code:

    bestModel.get.predict(toBePredictedBroadcasted.value)

Everything is OK, but I need to change the code to fulfil a requirement. I read in the Scala docs here that I need to use def recommendProducts, but when I tried it in my code:

    bestModel.get.recommendProductsForUsers(100)

I get an error at compile time: value recommendProductsForUsers is not a member of org.apache.spark.mllib.recommendation …
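
For context: recommendProductsForUsers(num) was added to MatrixFactorizationModel in Spark 1.4.0, so this compile error usually means the project is built against an older Spark dependency. A sketch of the bulk call plus a per-user fallback for older versions, in Python for consistency with the other examples (the method names are the same on bestModel.get in Scala):

    from pyspark import SparkContext
    from pyspark.mllib.recommendation import ALS, Rating

    sc = SparkContext(appName="recommend-for-all-users")

    ratings = sc.parallelize([Rating(1, 10, 5.0), Rating(2, 20, 4.0)])
    model = ALS.train(ratings, rank=5, iterations=5)

    # Spark >= 1.4 (the PySpark wrapper may need a newer release):
    # one call returns the top-N products for every user.
    per_user = model.recommendProductsForUsers(10).collect()

    # Older versions: one recommendProducts call per known user id.
    user_ids = model.userFeatures().keys().collect()
    fallback = {u: model.recommendProducts(u, 10) for u in user_ids}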

pyspark : ml + streaming

喜你入骨 submitted on 2019-12-23 05:17:20
Question: According to Combining Spark Streaming + MLlib, it is possible to make predictions over a stream of input in Spark. The issue with the given example (which works on my cluster) is that testData is handed over already in the correct format. I am trying to set up a client <-> server TCP exchange based on strings of data. I can't figure out how to transform the string into the correct format. While this works:

    sep = ";"
    str_recue = '0.0;0.1;0.2;0.3;0.4;0.5'
    rdd = sc.parallelize([str_recue])
    chemin …
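
A sketch of the missing transformation step, assuming a streaming regression model and the separator from the question (host and port are placeholders): split each received line on ";", map the pieces to floats, and wrap them in a dense vector before handing the DStream to predictOn.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.regression import StreamingLinearRegressionWithSGD

    sc = SparkContext(appName="tcp-predict")
    ssc = StreamingContext(sc, batchDuration=1)

    sep = ";"
    lines = ssc.socketTextStream("localhost", 9999)  # host/port are assumptions

    # Turn each line like '0.0;0.1;0.2;0.3;0.4;0.5' into a dense vector.
    vectors = lines.map(lambda s: Vectors.dense([float(x) for x in s.split(sep)]))

    model = StreamingLinearRegressionWithSGD()
    model.setInitialWeights([0.0] * 6)  # 6 features, matching the sample line
    model.predictOn(vectors).pprint()

    ssc.start()
    ssc.awaitTermination()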

Apache Spark MLlib with DataFrame API gives java.net.URISyntaxException when createDataFrame() or read().csv(…)

与世无争的帅哥 submitted on 2019-12-23 03:46:06
Question: In a standalone application (running on Java 8, Windows 10, with spark-xxx_2.11:2.0.0 as jar dependencies) the following code gives an error:

    /* this: */
    Dataset<Row> logData = spark_session.createDataFrame(Arrays.asList(
        new LabeledPoint(1.0, Vectors.dense(4.9,3,1.4,0.2)),
        new LabeledPoint(1.0, Vectors.dense(4.7,3.2,1.3,0.2))
    ), LabeledPoint.class);

    /* or this: */
    /* logFile: "C:\files\project\file.csv", "C:\\files\\project\\file.csv",
       "C:/files/project/file.csv", "file:/C:/files/project/file.csv", "file: …
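
A commonly reported fix for this URISyntaxException on Windows with Spark 2.0 (tracked as SPARK-15893) is to point spark.sql.warehouse.dir at a well-formed file: URI when building the session. Sketched in Python for consistency with the other examples; the same config key works from Java's SparkSession.builder(). The warehouse path is an example; any writable local directory works:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("windows-warehouse-fix")
             .config("spark.sql.warehouse.dir", "file:///C:/tmp/spark-warehouse")
             .getOrCreate())

    df = spark.read.csv("file:///C:/files/project/file.csv")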

Error when training a logistic regression model on Apache Spark: SPARK-5063

て烟熏妆下的殇ゞ submitted on 2019-12-22 18:30:43
Question: I am trying to build a logistic regression model with Apache Spark. Here is the code:

    parsedData = raw_data.map(mapper)  # mapper generates a (label, feature vector) pair as a LabeledPoint object
    featureVectors = parsedData.map(lambda point: point.features)  # get feature vectors from the parsed data
    scaler = StandardScaler(True, True).fit(featureVectors)  # creates a standardization model to scale the features
    scaledData = parsedData.map(lambda lp: LabeledPoint(lp.label, …
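
The code is cut off at the failing line, but the shape of the SPARK-5063 fix is well known: scaler.transform goes through the driver's SparkContext, so it cannot be called inside map() on the executors. Instead, transform the whole features RDD on the driver and zip the labels back on. A self-contained sketch with toy data:

    from pyspark import SparkContext
    from pyspark.mllib.feature import StandardScaler
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.regression import LabeledPoint

    sc = SparkContext(appName="spark-5063-fix")

    parsedData = sc.parallelize([
        LabeledPoint(1.0, Vectors.dense([1.0, 2.0])),
        LabeledPoint(0.0, Vectors.dense([3.0, 4.0])),
    ])

    featureVectors = parsedData.map(lambda lp: lp.features)
    scaler = StandardScaler(withMean=True, withStd=True).fit(featureVectors)

    # Transform the features RDD as a whole (driver side), then pair each
    # label with its scaled vector; zip is safe because both RDDs come from
    # the same parent via map() and keep identical partitioning.
    labels = parsedData.map(lambda lp: lp.label)
    scaledData = labels.zip(scaler.transform(featureVectors)) \
                       .map(lambda pair: LabeledPoint(pair[0], pair[1]))
    print(scaledData.collect())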