apache-spark-ml

Apache Spark throws NullPointerException when encountering missing feature

◇◆丶佛笑我妖孽 submitted on 2019-11-28 13:18:44
I have a bizarre issue with PySpark when indexing a column of strings in features. Here is my tmp.csv file:

    x0,x1,x2,x3
    asd2s,1e1e,1.1,0
    asd2s,1e1e,0.1,0
    ,1e3e,1.2,0
    bd34t,1e1e,5.1,1
    asd2s,1e3e,0.2,0
    bd34t,1e2e,4.3,1

where I have one missing value for 'x0'. At first, I read the features from the csv file into a DataFrame using pyspark_csv: https://github.com/seahboonsiew/pyspark-csv and then index x0 with StringIndexer:

    import pyspark_csv as pycsv
    from pyspark.ml.feature import StringIndexer
    sc.addPyFile('pyspark_csv.py')
    features = pycsv.csvToDataFrame(sqlCtx, sc.textFile('tmp.csv'))
    indexer =
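A minimal sketch of one common workaround, assuming the tmp.csv file and column names from the question (and using the built-in CSV reader instead of pyspark_csv): fill or drop the null values in the string column before fitting the StringIndexer, since the indexer cannot handle null labels. Newer Spark releases also expose a handleInvalid parameter on StringIndexer for this purpose.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import StringIndexer

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.csv('tmp.csv', header=True, inferSchema=True)

    # Replace nulls in the string column with a sentinel value before indexing,
    # because StringIndexer raises an error on null labels.
    df_clean = df.fillna({'x0': '__missing__'})

    indexer = StringIndexer(inputCol='x0', outputCol='x0_idx')
    indexed = indexer.fit(df_clean).transform(df_clean)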

Customize Distance Formula of K-means in Apache Spark Python

我是研究僧i submitted on 2019-11-28 13:00:19
Now I'm using K-means for clustering, following this tutorial and API. But I want to use a custom formula for calculating distances. So how can I pass a custom distance function to k-means with PySpark? zero323: In general, using a different distance measure doesn't make sense, because the k-means algorithm (unlike k-medoids) is well defined only for Euclidean distances. See "Why does k-means clustering algorithm use only Euclidean distance metric?" for an explanation. Moreover, MLlib algorithms are implemented in Scala, and PySpark provides only the wrappers required to execute Scala code. Therefore
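If you still want to experiment, a custom distance has to be implemented outside of MLlib. A rough illustrative sketch of Lloyd-style k-means over an RDD of NumPy arrays with a pluggable distance function (the function names and structure are my own, not a Spark API); note that with a non-Euclidean distance the mean-update step no longer necessarily decreases the objective, which is exactly the caveat raised in the answer.

    import numpy as np

    def manhattan(a, b):
        # Example custom distance: L1 / Manhattan distance.
        return float(np.abs(a - b).sum())

    def kmeans_custom(rdd, k, distance, iterations=10, seed=42):
        # rdd: RDD of NumPy arrays; returns a list of k cluster centers.
        centers = rdd.takeSample(False, k, seed)
        for _ in range(iterations):
            # Assign each point to the closest center under the custom distance.
            assigned = rdd.map(lambda p: (
                min(range(k), key=lambda i: distance(p, centers[i])), (p, 1)))
            # Recompute each center as the mean of its assigned points.
            stats = assigned.reduceByKey(
                lambda a, b: (a[0] + b[0], a[1] + b[1])).collectAsMap()
            centers = [stats[i][0] / stats[i][1] if i in stats else centers[i]
                       for i in range(k)]
        return centers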

Spark ML - Save OneVsRestModel

风流意气都作罢 submitted on 2019-11-28 11:31:22
I am in the middle of refactoring my code to take advantage of DataFrames, Estimators, and Pipelines. I was originally using MLlib's multiclass LogisticRegressionWithLBFGS on RDD[LabeledPoint]. I am enjoying learning and using the new API, but I am not sure how to save my new model and apply it to new data. Currently, the ML implementation of LogisticRegression only supports binary classification. Instead, I am using OneVsRest like so:

    val lr = new LogisticRegression().setFitIntercept(true)
    val ovr = new OneVsRest()
    ovr.setClassifier(lr)
    val ovrModel = ovr.fit(training)

I would now like to
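In recent Spark releases OneVsRestModel implements MLWritable, so it can be saved and reloaded like any other ML model (in older versions this was not supported and the individual binary models had to be persisted manually). A hedged PySpark sketch, assuming a training DataFrame with the usual features/label columns:

    from pyspark.ml.classification import LogisticRegression, OneVsRest, OneVsRestModel

    lr = LogisticRegression(fitIntercept=True)
    ovr = OneVsRest(classifier=lr)
    ovr_model = ovr.fit(training)          # 'training' has 'features' and 'label' columns

    # Persist the fitted model and load it back for scoring new data.
    ovr_model.write().overwrite().save("/tmp/ovr-model")
    reloaded = OneVsRestModel.load("/tmp/ovr-model")
    predictions = reloaded.transform(test)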

What's the difference between Spark ML and MLLIB packages

纵饮孤独 submitted on 2019-11-28 09:37:10
I noticed there are two LinearRegressionModel classes in Spark ML, one in the ml package and another in the mllib package. These two are implemented quite differently - e.g. the one from MLlib implements Serializable, while the other one does not. By the way, the same is true for RandomForestModel. Why are there two classes? Which is the "right" one? And is there a way to convert one into the other? zero323: o.a.s.mllib contains the old RDD-based API while o.a.s.ml contains the new API built around Dataset and ML Pipelines. ml and mllib reached feature parity in 2.0.0 and mllib is slowly being deprecated (this already
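To make the distinction concrete, here is a small sketch of the same model trained with each API (assuming an active SparkContext sc and SparkSession spark; the class names are real, the toy data is made up):

    # Old RDD-based API (pyspark.mllib): operates on RDD[LabeledPoint].
    from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

    rdd = sc.parallelize([LabeledPoint(1.0, [1.0, 2.0]),
                          LabeledPoint(2.0, [2.0, 3.0])])
    mllib_model = LinearRegressionWithSGD.train(rdd, iterations=10)

    # New DataFrame-based API (pyspark.ml): operates on a DataFrame with a Vector column.
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.regression import LinearRegression

    df = spark.createDataFrame([(1.0, Vectors.dense([1.0, 2.0])),
                                (2.0, Vectors.dense([2.0, 3.0]))],
                               ["label", "features"])
    ml_model = LinearRegression(maxIter=10).fit(df)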

Fit a dataframe into randomForest pyspark

元气小坏坏 submitted on 2019-11-28 09:19:35
Question: I have a DataFrame that looks like this:

    +--------------------+------------------+
    |            features|            labels|
    +--------------------+------------------+
    |[-0.38475, 0.568...]|            label1|
    |[0.645734, 0.699...]|            label2|
    |               .....|               ...|
    +--------------------+------------------+

Both columns are of String type (StringType()). I would like to fit this into Spark ML's random forest. To do so, I need to convert the features column into a vector containing floats. Does anyone have any idea how to do so? Answer 1:
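One possible approach (a sketch; the parsing UDF and column names are invented for illustration): parse the string into a DenseVector with a UDF and index the string labels with StringIndexer before fitting the classifier.

    from pyspark.sql.functions import udf
    from pyspark.ml.linalg import Vectors, VectorUDT
    from pyspark.ml.feature import StringIndexer
    from pyspark.ml.classification import RandomForestClassifier

    # Parse a string such as "[-0.38475, 0.568]" into a DenseVector of floats.
    parse_vector = udf(
        lambda s: Vectors.dense([float(x) for x in s.strip('[]').split(',')]),
        VectorUDT())

    parsed = df.withColumn('features_vec', parse_vector('features'))
    indexed = StringIndexer(inputCol='labels', outputCol='label') \
        .fit(parsed).transform(parsed)

    rf = RandomForestClassifier(featuresCol='features_vec', labelCol='label')
    model = rf.fit(indexed)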

How to cross validate RandomForest model?

房东的猫 submitted on 2019-11-28 08:16:14
I want to evaluate a random forest trained on some data. Is there any utility in Apache Spark to do this, or do I have to perform cross-validation manually? zero323: ML provides the CrossValidator class, which can be used to perform cross-validation and parameter search. Assuming your data is already preprocessed, you can add cross-validation as follows:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
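The answer above is in Scala; an equivalent PySpark sketch (the column names and parameter grid are illustrative assumptions, not values from the question) looks like this:

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

    rf = RandomForestClassifier(labelCol='label', featuresCol='features')
    pipeline = Pipeline(stages=[rf])

    # Hyperparameter grid to search with cross-validation.
    grid = (ParamGridBuilder()
            .addGrid(rf.numTrees, [20, 50])
            .addGrid(rf.maxDepth, [5, 10])
            .build())

    evaluator = MulticlassClassificationEvaluator(metricName='accuracy')
    cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                        evaluator=evaluator, numFolds=3)

    cv_model = cv.fit(train_df)             # best model selected by 3-fold CV
    print(evaluator.evaluate(cv_model.transform(test_df)))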

Feature normalization algorithm in Spark

狂风中的少年 submitted on 2019-11-28 07:47:24
I'm trying to understand Spark's normalization algorithm. My small test set contains 5 vectors:

    {0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0},
    {1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 70000.0},
    {-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, 70000.0},
    {-0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0},
    {0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 70000.0},

I would expect new Normalizer().transform(vectors) to create a JavaRDD where each vector feature is normalized as (v - mean) / stdev across all values of feature-0, feature-1, etc. The resulting set is: [-1.4285714276967932E-5,-1.4285714276967932E-5,-1.4285714276967932E-5,-1
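For context: Normalizer rescales each individual vector to unit p-norm (row-wise), whereas (v - mean) / stdev per feature is what StandardScaler does (column-wise). A PySpark sketch of the column-wise behaviour the question expects, assuming an active SparkContext sc (the question itself uses the Java API):

    from pyspark.mllib.feature import StandardScaler
    from pyspark.mllib.linalg import Vectors

    vectors = sc.parallelize([
        Vectors.dense([0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0]),
        Vectors.dense([1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 70000.0]),
        Vectors.dense([-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, 70000.0]),
        Vectors.dense([-0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0]),
        Vectors.dense([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 70000.0]),
    ])

    # StandardScaler computes per-feature mean and stdev over the whole RDD
    # and applies (v - mean) / stdev to every feature column.
    scaler = StandardScaler(withMean=True, withStd=True).fit(vectors)
    print(scaler.transform(vectors).collect())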

Create feature vector programmatically in Spark ML / pyspark

邮差的信 submitted on 2019-11-28 06:54:40
I'm wondering if there is a concise way to run ML (e.g. KMeans) on a DataFrame in PySpark if I have the features in multiple numeric columns, i.e. as in the Iris dataset: (a1=5.1, a2=3.5, a3=1.4, a4=0.2, id=u'id_1', label=u'Iris-setosa', binomial_label=1). I'd like to use KMeans without recreating the dataset with the feature vector added manually as a new column and the original columns hardcoded repeatedly in the code. The solution I'd like to improve:

    from pyspark.mllib.linalg import Vectors
    from pyspark.sql.types import Row
    from pyspark.ml.clustering import KMeans, KMeansModel
    iris =
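The usual concise answer is VectorAssembler, which builds the feature vector column programmatically from a list of column names. A sketch under the assumption that the Iris DataFrame is called iris_df:

    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.clustering import KMeans

    # Assemble the numeric columns into a single 'features' vector column.
    feature_cols = ['a1', 'a2', 'a3', 'a4']
    assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')
    assembled = assembler.transform(iris_df)

    kmeans = KMeans(k=3, featuresCol='features', predictionCol='cluster')
    clustered = kmeans.fit(assembled).transform(assembled)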

Spark, ML, StringIndexer: handling unseen labels

て烟熏妆下的殇ゞ submitted on 2019-11-28 06:29:53
My goal is to build a multiclass classifier. I have built a pipeline for feature extraction; its first step is a StringIndexer transformer that maps each class name to a label, and this label is used in the classifier training step. The pipeline is fitted to the training set. The test set has to be processed by the fitted pipeline in order to extract the same feature vectors, given that my test set files have the same structure as the training set. The possible scenario here is encountering an unseen class name in the test set, in which case the StringIndexer will fail to find the label,
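StringIndexer has a handleInvalid parameter for exactly this case. A small sketch with assumed column names: 'skip' silently drops rows with unseen labels, and newer Spark releases also accept 'keep', which assigns unseen labels an extra index instead of dropping the rows.

    from pyspark.ml.feature import StringIndexer

    indexer = StringIndexer(inputCol='class_name', outputCol='label',
                            handleInvalid='skip')

    indexer_model = indexer.fit(train_df)
    # Rows in the test set with a class name never seen during fitting are dropped.
    test_indexed = indexer_model.transform(test_df)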

Spark Multiclass Classification Example

故事扮演 submitted on 2019-11-28 06:06:40
Do you guys know where I can find examples of multiclass classification in Spark? I spent a lot of time searching in books and on the web, and so far I just know that it is possible as of the latest version, according to the documentation. zero323: ML (recommended in Spark 2.0+) We'll use the same data as in the MLlib section below. There are two basic options. If the Estimator supports multiclass classification out of the box (for example random forest), you can use it directly:

    val trainRawDf = trainRaw.toDF
    import org.apache.spark.ml.feature.{Tokenizer, CountVectorizer, StringIndexer}
    import org.apache
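A self-contained PySpark sketch of the same idea, with made-up toy data (the pipeline stages mirror the imports in the answer; everything else is illustrative and assumes an active SparkSession spark):

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, CountVectorizer, StringIndexer
    from pyspark.ml.classification import RandomForestClassifier

    # Hypothetical raw data: a text column and a string class column.
    train_raw = spark.createDataFrame(
        [("spark ml is great", "positive"),
         ("this is terrible", "negative"),
         ("not sure about it", "neutral")],
        ["text", "category"])

    pipeline = Pipeline(stages=[
        Tokenizer(inputCol="text", outputCol="tokens"),
        CountVectorizer(inputCol="tokens", outputCol="features"),
        StringIndexer(inputCol="category", outputCol="label"),
        RandomForestClassifier(featuresCol="features", labelCol="label"),
    ])

    model = pipeline.fit(train_raw)
    predictions = model.transform(train_raw)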