apache-spark-mllib

Customize Distance Formula of K-means in Apache Spark Python

我是研究僧i Submitted on 2019-11-28 13:00:19
Now I'm using K-means for clustering and following this tutorial and API. But I want to use a custom formula for calculating distances. So how can I pass custom distance functions to k-means with PySpark? zero323: In general using a different distance measure doesn't make sense, because the k-means algorithm (unlike k-medoids) is well defined only for Euclidean distances. See "Why does k-means clustering algorithm use only Euclidean distance metric?" for an explanation. Moreover, MLlib algorithms are implemented in Scala, and PySpark provides only the wrappers required to execute Scala code. Therefore
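
For reference, a minimal PySpark sketch of the standard MLlib KMeans call the question refers to (run inside pyspark, where sc is predefined; the sample data and parameter values are illustrative). The point to notice is that KMeans.train exposes no distance-function parameter, which is why a custom metric cannot be plugged in:

from numpy import array
from pyspark.mllib.clustering import KMeans

# Toy data; KMeans.train accepts an RDD of numeric vectors.
data = sc.parallelize([
    array([0.0, 0.0]), array([1.0, 1.0]),
    array([9.0, 8.0]), array([8.0, 9.0]),
])

# Only k, iterations and initialization can be tuned; the distance is always Euclidean.
model = KMeans.train(data, k=2, maxIterations=10, initializationMode="k-means||")
print(model.clusterCenters)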

Spark ML - Save OneVsRestModel

风流意气都作罢 Submitted on 2019-11-28 11:31:22
I am in the middle of refactoring my code to take advantage of DataFrames, Estimators, and Pipelines. I was originally using MLlib Multiclass LogisticRegressionWithLBFGS on RDD[LabeledPoint]. I am enjoying learning and using the new API, but I am not sure how to save my new model and apply it to new data. Currently, the ML implementation of LogisticRegression only supports binary classification. I am, instead, using OneVsRest like so: val lr = new LogisticRegression().setFitIntercept(true) val ovr = new OneVsRest() ovr.setClassifier(lr) val ovrModel = ovr.fit(training) I would now like to
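
In recent Spark releases (2.x and later) OneVsRestModel supports the standard ML persistence API, so one option is simply to save and reload the fitted model. A minimal PySpark sketch of that idea (the original snippet is Scala; the training and new_data DataFrames with label/features columns, and the save path, are assumptions):

from pyspark.ml.classification import LogisticRegression, OneVsRest, OneVsRestModel

lr = LogisticRegression(fitIntercept=True)
ovr = OneVsRest(classifier=lr)
ovr_model = ovr.fit(training)                     # 'training' is an assumed DataFrame

# Persist the fitted model, reload it later, and score new data.
ovr_model.write().overwrite().save("/tmp/ovr-model")
reloaded = OneVsRestModel.load("/tmp/ovr-model")
predictions = reloaded.transform(new_data)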

pyspark Linear Regression Example from official documentation - Bad results?

守給你的承諾、 Submitted on 2019-11-28 11:17:13
Question: I am planning to use Linear Regression in Spark. To get started, I checked out the example from the official documentation (which you can find here). I also found this question on Stack Overflow, which is essentially the same question as mine. The answer there suggests tweaking the step size, which I also tried to do; however, the results are still as random as without tweaking the step size. The code I'm using looks like this: from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD,
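
A common reason for the seemingly random results is that SGD is very sensitive to feature scaling and step size. A hedged sketch of one usual fix, standardizing the features before training and using more iterations with a smaller step (run inside pyspark; the file path matches the official example, and the parameter values are illustrative, not taken from the original post):

from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD
from pyspark.mllib.feature import StandardScaler

def parse_point(line):
    values = [float(x) for x in line.replace(",", " ").split(" ")]
    return LabeledPoint(values[0], values[1:])

data = sc.textFile("data/mllib/ridge-data/lpsa.data").map(parse_point)
labels = data.map(lambda p: p.label)
features = data.map(lambda p: p.features)

# Standardize each feature to zero mean / unit variance, then rebuild LabeledPoints.
scaler = StandardScaler(withMean=True, withStd=True).fit(features)
scaled = labels.zip(scaler.transform(features)).map(lambda lv: LabeledPoint(lv[0], lv[1]))

model = LinearRegressionWithSGD.train(scaled, iterations=500, step=0.1, intercept=True)
mse = scaled.map(lambda p: (p.label - model.predict(p.features)) ** 2).mean()
print("Mean Squared Error = " + str(mse))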

Comparing two arrays and getting the difference in PySpark

我怕爱的太早我们不能终老 Submitted on 2019-11-28 10:44:38
Question: I have two array fields in a data frame. I have a requirement to compare these two arrays and get the difference as an array (new column) in the same data frame. Expected output is: Column B is a subset of column A. Also, the words are going to be in the same order in both arrays. Can anyone please help me find a solution for this? Answer 1: You can use a user-defined function. My example dataframe differs a bit from yours, but the code should work fine: import pandas as pd from pyspark.sql.types
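
A hedged sketch of the user-defined-function approach mentioned in the answer; the column names A and B and the sample row are assumptions. On Spark 2.4+ the built-in array_except function achieves the same result:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(["a", "b", "c", "d"], ["a", "c"])],
    ["A", "B"],
)

# Keep the original order of column A while removing everything present in B.
@udf(returnType=ArrayType(StringType()))
def array_diff(a, b):
    b_set = set(b)
    return [x for x in a if x not in b_set]

result = df.withColumn("difference", array_diff(col("A"), col("B")))
result.show(truncate=False)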

What's the difference between Spark ML and MLLIB packages

纵饮孤独 Submitted on 2019-11-28 09:37:10
I noticed there are two LinearRegressionModel classes in Spark ML, one in the ML package and another one in the MLlib package. These two are implemented quite differently - e.g. the one from MLlib implements Serializable, while the other one does not. By the way, the same is true about RandomForestModel. Why are there two classes? Which is the "right" one? And is there a way to convert one into the other? zero323: o.a.s.mllib contains the old RDD-based API, while o.a.s.ml contains the new API built around Dataset and ML Pipelines. ml and mllib reached feature parity in 2.0.0 and mllib is slowly being deprecated (this already
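
An illustrative PySpark sketch of the two APIs side by side (run inside pyspark, where sc and spark are predefined; the toy data is made up): the mllib model is trained on an RDD of LabeledPoints, while the ml model is trained on a DataFrame and can be dropped into a Pipeline:

# Old RDD-based API (pyspark.mllib)
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

rdd = sc.parallelize([LabeledPoint(1.0, [1.0, 2.0]), LabeledPoint(2.0, [2.0, 4.0])])
old_model = LinearRegressionWithSGD.train(rdd, iterations=100, step=0.01)

# New DataFrame-based API (pyspark.ml)
from pyspark.ml.regression import LinearRegression
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame(
    [(1.0, Vectors.dense(1.0, 2.0)), (2.0, Vectors.dense(2.0, 4.0))],
    ["label", "features"],
)
new_model = LinearRegression(maxIter=10).fit(df)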

Predict Class Probabilities in Spark RandomForestClassifier

你。 Submitted on 2019-11-28 08:17:49
Question: I built random forest models using ml.classification.RandomForestClassifier. I am trying to extract the prediction probabilities from the models, but I only see prediction classes instead of the probabilities. According to this issue link, the issue is resolved and it leads to this GitHub pull request and this. However, it seems it was resolved in version 1.5. I'm using AWS EMR, which provides Spark 1.4.1, and still have no idea how to get the prediction probabilities. If anyone knows how to do
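
For context, from Spark 1.5 onwards the ml RandomForestClassificationModel emits a probability column when transforming a DataFrame. A minimal sketch, assuming Spark 1.5 or later and existing train_df/test_df DataFrames with the default label/features columns:

from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=50)
model = rf.fit(train_df)

# transform() adds rawPrediction, probability and prediction columns.
scored = model.transform(test_df)
scored.select("probability", "prediction").show(truncate=False)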

How to cross validate RandomForest model?

房东的猫 Submitted on 2019-11-28 08:16:14
I want to evaluate a random forest being trained on some data. Is there any utility in Apache Spark to do the same or do I have to perform cross-validation manually? zero323: ML provides the CrossValidator class, which can be used to perform cross-validation and parameter search. Assuming your data is already preprocessed, you can add cross-validation as follows: import org.apache.spark.ml.Pipeline import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator} import org.apache.spark.ml.classification.RandomForestClassifier import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
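
The same idea expressed in PySpark, as a hedged sketch (the column names, grid values and the train_df DataFrame are assumptions):

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

rf = RandomForestClassifier(labelCol="label", featuresCol="features")
pipeline = Pipeline(stages=[rf])

# Search over a small grid of forest sizes and depths.
grid = (ParamGridBuilder()
        .addGrid(rf.numTrees, [20, 50])
        .addGrid(rf.maxDepth, [5, 10])
        .build())

evaluator = MulticlassClassificationEvaluator(metricName="f1")
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3)
cv_model = cv.fit(train_df)   # best model selected by 3-fold cross-validation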

Why is Spark Mllib KMeans algorithm extremely slow?

孤人 Submitted on 2019-11-28 07:52:40
Question: I'm having the same problem as in this post, but I don't have enough points to add a comment there. My dataset has 1 million rows and 100 columns. I'm also using MLlib KMeans and it is extremely slow. In fact the job never finishes and I have to kill it. I am running this on Google Cloud (Dataproc). It runs if I ask for a smaller number of clusters (k=1000), but still takes more than 35 minutes. I need it to run for k~5000. I have no idea why it is so slow. The data is properly partitioned given the
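
One commonly suggested mitigation, sketched below with illustrative parameters and an assumed data RDD: for large k the default k-means|| initialization tends to dominate the runtime, so switching to random initialization (or lowering initializationSteps) can shorten the job considerably:

from pyspark.mllib.clustering import KMeans

model = KMeans.train(
    data,                          # RDD of feature vectors (assumed to exist)
    k=5000,
    maxIterations=20,
    initializationMode="random",   # avoid the expensive k-means|| init for large k
)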

Feature normalization algorithm in Spark

狂风中的少年 Submitted on 2019-11-28 07:47:24
Trying to understand Spark's normalization algorithm. My small test set contains 5 vectors: {0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0}, {1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 70000.0}, {-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, 70000.0}, {-0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0}, {0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 70000.0}. I would expect that new Normalizer().transform(vectors) creates a JavaRDD where each vector feature is normalized as (v - mean)/stdev across all values for feature-0, feature-1, etc. The resulting set is: [-1.4285714276967932E-5,-1.4285714276967932E-5,-1.4285714276967932E-5,-1
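
Note that Normalizer rescales each individual vector to unit norm; the per-feature (v - mean)/stdev behaviour described above is what StandardScaler provides. A PySpark sketch of that standardization on the same test vectors (the original code is Java; run inside pyspark, where sc is predefined):

from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.linalg import Vectors

vectors = sc.parallelize([
    Vectors.dense([0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0]),
    Vectors.dense([1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 70000.0]),
    Vectors.dense([-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, 70000.0]),
    Vectors.dense([-0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0]),
    Vectors.dense([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 70000.0]),
])

# Standardize each feature column: (value - mean) / stdev.
scaler = StandardScaler(withMean=True, withStd=True).fit(vectors)
standardized = scaler.transform(vectors)
print(standardized.collect())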

Spark MLLib TFIDF implementation for LogisticRegression

柔情痞子 Submitted on 2019-11-28 07:02:51
I am trying to use the new TF-IDF algorithm that Spark 1.1.0 offers. I'm writing my job for MLlib in Java, but I can't figure out how to get the TF-IDF implementation working. For some reason IDFModel only accepts a JavaRDD as input for the transform method and not a simple Vector. How can I use the given classes to model a TF-IDF vector for my LabeledPoints? Note: The document lines are in the format [Label; Text]. Here is my code so far: // 1.) Load the documents JavaRDD<String> data = sc.textFile("/home/johnny/data.data.new"); // 2.) Hash all documents HashingTF tf = new HashingTF(); JavaRDD<Tuple2<Double,
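
The usual pattern is to transform all documents as a single RDD (since IDFModel.transform here only accepts an RDD of vectors) and then zip the TF-IDF vectors back with their labels. A hedged PySpark sketch of that idea (the original code is Java; the semicolon-separated [Label; Text] file format follows the question):

from pyspark.mllib.feature import HashingTF, IDF
from pyspark.mllib.regression import LabeledPoint

lines = sc.textFile("/home/johnny/data.data.new").map(lambda l: l.split(";"))
labels = lines.map(lambda parts: float(parts[0]))
texts = lines.map(lambda parts: parts[1].split(" "))

tf = HashingTF().transform(texts)     # one term-frequency vector per document
tf.cache()
idf_model = IDF().fit(tf)
tfidf = idf_model.transform(tf)       # RDD of TF-IDF vectors, same order as labels

# zip preserves the per-partition order, so labels line up with their vectors.
training = labels.zip(tfidf).map(lambda lv: LabeledPoint(lv[0], lv[1]))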