apache-spark-ml

Apache Spark throws NullPointerException when encountering missing feature

◇◆丶佛笑我妖孽 submitted on 2019-11-28 13:18:44
I have a bizarre issue with PySpark when indexing a column of strings in features. Here is my tmp.csv file:

    x0,x1,x2,x3
    asd2s,1e1e,1.1,0
    asd2s,1e1e,0.1,0
    ,1e3e,1.2,0
    bd34t,1e1e,5.1,1
    asd2s,1e3e,0.2,0
    bd34t,1e2e,4.3,1

where I have one missing value for 'x0'. At first, I read the features from the csv file into a DataFrame using pyspark_csv: https://github.com/seahboonsiew/pyspark-csv and then index x0 with StringIndexer:

    import pyspark_csv as pycsv
    from pyspark.ml.feature import StringIndexer
    sc.addPyFile('pyspark_csv.py')
    features = pycsv.csvToDataFrame(sqlCtx, sc.textFile('tmp.csv'))
    indexer =
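A minimal sketch of one common workaround, assuming the tmp.csv file and column names from the question (and using the built-in CSV reader instead of pyspark_csv): fill or drop the null values in the string column before fitting the StringIndexer, since the indexer cannot handle null labels. Newer Spark releases also expose a handleInvalid parameter on StringIndexer for this purpose.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import StringIndexer

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.csv('tmp.csv', header=True, inferSchema=True)

    # Replace nulls in the string column with a sentinel value before indexing,
    # because StringIndexer raises an error on null labels.
    df_clean = df.fillna({'x0': '__missing__'})

    indexer = StringIndexer(inputCol='x0', outputCol='x0_idx')
    indexed = indexer.fit(df_clean).transform(df_clean)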

Customize Distance Formula of K-means in Apache Spark Python

我是研究僧i submitted on 2019-11-28 13:00:19
Now I'm using K-means for clustering, following this tutorial and API. But I want to use a custom formula for calculating distances. So how can I pass a custom distance function to k-means with PySpark? zero323: In general, using a different distance measure doesn't make sense, because the k-means algorithm (unlike k-medoids) is well defined only for Euclidean distances. See "Why does k-means clustering algorithm use only Euclidean distance metric?" for an explanation. Moreover, MLlib algorithms are implemented in Scala, and PySpark provides only the wrappers required to execute Scala code. Therefore
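If you still want to experiment, a custom distance has to be implemented outside of MLlib. A rough illustrative sketch of Lloyd-style k-means over an RDD of NumPy arrays with a pluggable distance function (the function names and structure are my own, not a Spark API); note that with a non-Euclidean distance the mean-update step no longer necessarily decreases the objective, which is exactly the caveat raised in the answer.

    import numpy as np

    def manhattan(a, b):
        # Example custom distance: L1 / Manhattan distance.
        return float(np.abs(a - b).sum())

    def kmeans_custom(rdd, k, distance, iterations=10, seed=42):
        # rdd: RDD of NumPy arrays; returns a list of k cluster centers.
        centers = rdd.takeSample(False, k, seed)
        for _ in range(iterations):
            # Assign each point to the closest center under the custom distance.
            assigned = rdd.map(lambda p: (
                min(range(k), key=lambda i: distance(p, centers[i])), (p, 1)))
            # Recompute each center as the mean of its assigned points.
            stats = assigned.reduceByKey(
                lambda a, b: (a[0] + b[0], a[1] + b[1])).collectAsMap()
            centers = [stats[i][0] / stats[i][1] if i in stats else centers[i]
                       for i in range(k)]
        return centers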

Spark ML - Save OneVsRestModel

风流意气都作罢 submitted on 2019-11-28 11:31:22
I am in the middle of refactoring my code to take advantage of DataFrames, Estimators, and Pipelines. I was originally using MLlib's multiclass LogisticRegressionWithLBFGS on RDD[LabeledPoint]. I am enjoying learning and using the new API, but I am not sure how to save my new model and apply it to new data. Currently, the ML implementation of LogisticRegression only supports binary classification. Instead, I am using OneVsRest like so:

    val lr = new LogisticRegression().setFitIntercept(true)
    val ovr = new OneVsRest()
    ovr.setClassifier(lr)
    val ovrModel = ovr.fit(training)

I would now like to
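In recent Spark releases OneVsRestModel implements MLWritable, so it can be saved and reloaded like any other ML model (in older versions this was not supported and the individual binary models had to be persisted manually). A hedged PySpark sketch, assuming a training DataFrame with the usual features/label columns:

    from pyspark.ml.classification import LogisticRegression, OneVsRest, OneVsRestModel

    lr = LogisticRegression(fitIntercept=True)
    ovr = OneVsRest(classifier=lr)
    ovr_model = ovr.fit(training)          # 'training' has 'features' and 'label' columns

    # Persist the fitted model and load it back for scoring new data.
    ovr_model.write().overwrite().save("/tmp/ovr-model")
    reloaded = OneVsRestModel.load("/tmp/ovr-model")
    predictions = reloaded.transform(test)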

What's the difference between Spark ML and MLLIB packages

纵饮孤独 submitted on 2019-11-28 09:37:10
I noticed there are two LinearRegressionModel classes in Spark ML, one in the ml package and another in the mllib package. These two are implemented quite differently - e.g. the one from MLlib implements Serializable, while the other one does not. By the way, the same is true for RandomForestModel. Why are there two classes? Which is the "right" one? And is there a way to convert one into the other? zero323: o.a.s.mllib contains the old RDD-based API while o.a.s.ml contains the new API built around Dataset and ML Pipelines. ml and mllib reached feature parity in 2.0.0 and mllib is slowly being deprecated (this already
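To make the distinction concrete, here is a small sketch of the same model trained with each API (assuming an active SparkContext sc and SparkSession spark; the class names are real, the toy data is made up):

    # Old RDD-based API (pyspark.mllib): operates on RDD[LabeledPoint].
    from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

    rdd = sc.parallelize([LabeledPoint(1.0, [1.0, 2.0]),
                          LabeledPoint(2.0, [2.0, 3.0])])
    mllib_model = LinearRegressionWithSGD.train(rdd, iterations=10)

    # New DataFrame-based API (pyspark.ml): operates on a DataFrame with a Vector column.
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.regression import LinearRegression

    df = spark.createDataFrame([(1.0, Vectors.dense([1.0, 2.0])),
                                (2.0, Vectors.dense([2.0, 3.0]))],
                               ["label", "features"])
    ml_model = LinearRegression(maxIter=10).fit(df)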

Fit a dataframe into randomForest pyspark

元气小坏坏 submitted on 2019-11-28 09:19:35
Question: I have a DataFrame that looks like this:

    +--------------------+------------------+
    |            features|            labels|
    +--------------------+------------------+
    |[-0.38475, 0.568...]|            label1|
    |[0.645734, 0.699...]|            label2|
    |               .....|               ...|
    +--------------------+------------------+

Both columns are of String type (StringType()). I would like to fit this into Spark ML's random forest. To do so, I need to convert the features column into a vector containing floats. Does anyone have any idea how to do so? Answer 1:
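One possible approach (a sketch; the parsing UDF and column names are invented for illustration): parse the string into a DenseVector with a UDF and index the string labels with StringIndexer before fitting the classifier.

    from pyspark.sql.functions import udf
    from pyspark.ml.linalg import Vectors, VectorUDT
    from pyspark.ml.feature import StringIndexer
    from pyspark.ml.classification import RandomForestClassifier

    # Parse a string such as "[-0.38475, 0.568]" into a DenseVector of floats.
    parse_vector = udf(
        lambda s: Vectors.dense([float(x) for x in s.strip('[]').split(',')]),
        VectorUDT())

    parsed = df.withColumn('features_vec', parse_vector('features'))
    indexed = StringIndexer(inputCol='labels', outputCol='label') \
        .fit(parsed).transform(parsed)

    rf = RandomForestClassifier(featuresCol='features_vec', labelCol='label')
    model = rf.fit(indexed)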

How to cross validate RandomForest model?

房东的猫 submitted on 2019-11-28 08:16:14
I want to evaluate a random forest trained on some data. Is there any utility in Apache Spark to do this, or do I have to perform cross-validation manually? zero323: ML provides the CrossValidator class, which can be used to perform cross-validation and parameter search. Assuming your data is already preprocessed, you can add cross-validation as follows:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
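The answer above is in Scala; an equivalent PySpark sketch (the column names and parameter grid are illustrative assumptions, not values from the question) looks like this:

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

    rf = RandomForestClassifier(labelCol='label', featuresCol='features')
    pipeline = Pipeline(stages=[rf])

    # Hyperparameter grid to search with cross-validation.
    grid = (ParamGridBuilder()
            .addGrid(rf.numTrees, [20, 50])
            .addGrid(rf.maxDepth, [5, 10])
            .build())

    evaluator = MulticlassClassificationEvaluator(metricName='accuracy')
    cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                        evaluator=evaluator, numFolds=3)

    cv_model = cv.fit(train_df)             # best model selected by 3-fold CV
    print(evaluator.evaluate(cv_model.transform(test_df)))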

Feature normalization algorithm in Spark

狂风中的少年 submitted on 2019-11-28 07:47:24
I'm trying to understand Spark's normalization algorithm. My small test set contains 5 vectors:

    {0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0},
    {1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 70000.0},
    {-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, 70000.0},
    {-0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0},
    {0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 70000.0},

I would expect new Normalizer().transform(vectors) to create a JavaRDD where each vector feature is normalized as (v - mean) / stdev across all values of feature-0, feature-1, etc. The resulting set is: [-1.4285714276967932E-5,-1.4285714276967932E-5,-1.4285714276967932E-5,-1
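For context: Normalizer rescales each individual vector to unit p-norm (row-wise), whereas (v - mean) / stdev per feature is what StandardScaler does (column-wise). A PySpark sketch of the column-wise behaviour the question expects, assuming an active SparkContext sc (the question itself uses the Java API):

    from pyspark.mllib.feature import StandardScaler
    from pyspark.mllib.linalg import Vectors

    vectors = sc.parallelize([
        Vectors.dense([0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0]),
        Vectors.dense([1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 70000.0]),
        Vectors.dense([-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, 70000.0]),
        Vectors.dense([-0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0]),
        Vectors.dense([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 70000.0]),
    ])

    # StandardScaler computes per-feature mean and stdev over the whole RDD
    # and applies (v - mean) / stdev to every feature column.
    scaler = StandardScaler(withMean=True, withStd=True).fit(vectors)
    print(scaler.transform(vectors).collect())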

Create feature vector programmatically in Spark ML / pyspark

邮差的信 submitted on 2019-11-28 06:54:40
I'm wondering if there is a concise way to run ML (e.g. KMeans) on a DataFrame in PySpark if I have the features in multiple numeric columns, i.e. as in the Iris dataset: (a1=5.1, a2=3.5, a3=1.4, a4=0.2, id=u'id_1', label=u'Iris-setosa', binomial_label=1). I'd like to use KMeans without recreating the dataset with the feature vector added manually as a new column and the original columns hardcoded repeatedly in the code. The solution I'd like to improve:

    from pyspark.mllib.linalg import Vectors
    from pyspark.sql.types import Row
    from pyspark.ml.clustering import KMeans, KMeansModel
    iris =
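The usual concise answer is VectorAssembler, which builds the feature vector column programmatically from a list of column names. A sketch under the assumption that the Iris DataFrame is called iris_df:

    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.clustering import KMeans

    # Assemble the numeric columns into a single 'features' vector column.
    feature_cols = ['a1', 'a2', 'a3', 'a4']
    assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')
    assembled = assembler.transform(iris_df)

    kmeans = KMeans(k=3, featuresCol='features', predictionCol='cluster')
    clustered = kmeans.fit(assembled).transform(assembled)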

Spark, ML, StringIndexer: handling unseen labels

て烟熏妆下的殇ゞ submitted on 2019-11-28 06:29:53
My goal is to build a multiclass classifier. I have built a pipeline for feature extraction; its first step is a StringIndexer transformer that maps each class name to a label, and this label is used in the classifier training step. The pipeline is fitted to the training set. The test set has to be processed by the fitted pipeline in order to extract the same feature vectors, given that my test set files have the same structure as the training set. The possible scenario here is encountering an unseen class name in the test set, in which case the StringIndexer will fail to find the label,
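StringIndexer has a handleInvalid parameter for exactly this case. A small sketch with assumed column names: 'skip' silently drops rows with unseen labels, and newer Spark releases also accept 'keep', which assigns unseen labels an extra index instead of dropping the rows.

    from pyspark.ml.feature import StringIndexer

    indexer = StringIndexer(inputCol='class_name', outputCol='label',
                            handleInvalid='skip')

    indexer_model = indexer.fit(train_df)
    # Rows in the test set with a class name never seen during fitting are dropped.
    test_indexed = indexer_model.transform(test_df)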

Spark Multiclass Classification Example

故事扮演 submitted on 2019-11-28 06:06:40
Do you guys know where I can find examples of multiclass classification in Spark? I spent a lot of time searching in books and on the web, and so far I just know that it is possible as of the latest version, according to the documentation. zero323: ML (recommended in Spark 2.0+) We'll use the same data as in the MLlib section below. There are two basic options. If the Estimator supports multiclass classification out of the box (for example random forest), you can use it directly:

    val trainRawDf = trainRaw.toDF
    import org.apache.spark.ml.feature.{Tokenizer, CountVectorizer, StringIndexer}
    import org.apache
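A self-contained PySpark sketch of the same idea, with made-up toy data (the pipeline stages mirror the imports in the answer; everything else is illustrative and assumes an active SparkSession spark):

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, CountVectorizer, StringIndexer
    from pyspark.ml.classification import RandomForestClassifier

    # Hypothetical raw data: a text column and a string class column.
    train_raw = spark.createDataFrame(
        [("spark ml is great", "positive"),
         ("this is terrible", "negative"),
         ("not sure about it", "neutral")],
        ["text", "category"])

    pipeline = Pipeline(stages=[
        Tokenizer(inputCol="text", outputCol="tokens"),
        CountVectorizer(inputCol="tokens", outputCol="features"),
        StringIndexer(inputCol="category", outputCol="label"),
        RandomForestClassifier(featuresCol="features", labelCol="label"),
    ])

    model = pipeline.fit(train_raw)
    predictions = model.transform(train_raw)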