apache-spark-ml

Spark ML VectorAssembler returns strange output

南楼画角 submitted on 2019-11-27 09:29:27
I am experiencing a very strange behaviour from VectorAssembler and I was wondering if anyone else has seen this. My scenario is pretty straightforward: I parse data from a CSV file where I have some standard Int and Double fields and I also calculate some extra columns. My parsing function returns this:

val joined = countPerChannel ++ countPerSource // two arrays of Doubles joined
(label, orderNo, pageNo, Vectors.dense(joinedCounts))

My main function uses the parsing function like this:

val parsedData = rawData.filter(row => row != header).map(parseLine)
val data = sqlContext.createDataFrame
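A frequent source of "strange" output here is simply Spark's sparse printout of assembled vectors: when most slots are zero, the features column is shown as (size,[indices],[values]) rather than a plain array. A minimal sketch, not the asker's code; the column names and a SparkSession named spark are assumptions:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors

// Toy frame mirroring the question's (label, orderNo, pageNo, joinedCounts) tuples.
val df = spark.createDataFrame(Seq(
  (1.0, 10, 2, Vectors.dense(0.0, 3.0, 0.0))
)).toDF("label", "orderNo", "pageNo", "joinedCounts")

val assembler = new VectorAssembler()
  .setInputCols(Array("orderNo", "pageNo", "joinedCounts"))
  .setOutputCol("features")

// Printed as (5,[0,1,3],[10.0,2.0,3.0]) -- the sparse form of the same data.
assembler.transform(df).select("features").show(truncate = false)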

How to save models from ML Pipeline to S3 or HDFS?

流过昼夜 submitted on 2019-11-27 09:04:01
I am trying to save thousands of models produced by ML Pipeline. As indicated in the answer here, the models can be saved as follows:

import java.io._

def saveModel(name: String, model: PipelineModel) = {
  val oos = new ObjectOutputStream(new FileOutputStream(s"/some/path/$name"))
  oos.writeObject(model)
  oos.close
}

schools.zip(bySchoolArrayModels).foreach {
  case (name, model) => saveModel(name, model)
}

I have tried using s3://some/path/$name and /user/hadoop/some/path/$name as I would like the models to be saved to Amazon S3 eventually, but they both fail with messages indicating the path
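For what it's worth, java.io.FileOutputStream can only write to the local filesystem, which is why s3:// and HDFS-style paths fail here. Since Spark 2.0, PipelineModel implements MLWritable, so it can be saved straight to HDFS or S3 through the Hadoop filesystem layer. A hedged sketch; the bucket and path names below are placeholders:

import org.apache.spark.ml.PipelineModel

def saveModel(name: String, model: PipelineModel): Unit = {
  // Spark resolves the scheme (hdfs://, s3a://, file://) via the Hadoop FileSystem API.
  model.write.overwrite().save(s"s3a://my-bucket/models/$name")
}

// Loading back later:
// val model = PipelineModel.load("s3a://my-bucket/models/someName")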

VectorUDT usage

亡梦爱人 submitted on 2019-11-27 08:29:57
Question: I have to get the datatype, do a case match on it, and convert it to some required format. But the usage of org.apache.spark.ml.linalg.VectorUDT is showing that VectorUDT is private. Also, I specifically need to use org.apache.spark.ml.linalg.VectorUDT and not org.apache.spark.mllib.linalg.VectorUDT. Can someone suggest how to go about this?

Answer 1: For org.apache.spark.ml.linalg types you should specify the schema using org.apache.spark.ml.linalg.SQLDataTypes, which provides singleton instances of the
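A minimal sketch of the pattern the answer points at, matching on SQLDataTypes.VectorType instead of the private VectorUDT class; the field and column names are made up:

import org.apache.spark.ml.linalg.SQLDataTypes
import org.apache.spark.sql.types.{DoubleType, StructField}

def describe(field: StructField): String = field.dataType match {
  case dt if dt == SQLDataTypes.VectorType => s"${field.name}: ml Vector column"
  case DoubleType                          => s"${field.name}: Double column"
  case other                               => s"${field.name}: ${other.simpleString}"
}

// df.schema.fields.map(describe).foreach(println)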

How to change column metadata in pyspark?

巧了我就是萌 submitted on 2019-11-27 08:08:56
Question: How can I update column metadata in PySpark? I have metadata values corresponding to the nominal encoding of categorical (string) features and I would like to decode them back in an automated way. Writing the metadata through the PySpark API is not directly possible unless you recreate the schema. Is it possible to edit metadata in PySpark on the fly, without converting the dataset to an RDD and back, given a complete schema description (as described here)? Example listing:

# Create DF
df.show()
# +--
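The question is about PySpark, but for reference here is a small Scala sketch of the JVM-side mechanism that PySpark wraps: metadata lives on the column's StructField and can be replaced by re-aliasing the column with a new Metadata value, with no RDD round trip. The "labels" key and values below are made up, and recent PySpark releases expose a similar metadata argument on Column.alias.

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.MetadataBuilder

val existing = df.schema("category").metadata       // df and "category" are assumed
val updated = new MetadataBuilder()
  .withMetadata(existing)                            // keep whatever is already there
  .putStringArray("labels", Array("a", "b", "c"))    // hypothetical key and values
  .build()

// Re-alias the column with the new metadata; the data itself is untouched.
val df2 = df.withColumn("category", col("category").as("category", updated))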

How can I read LIBSVM models (saved using LIBSVM) into PySpark?

家住魔仙堡 submitted on 2019-11-27 07:24:16
Question: I have a LIBSVM scaling model (generated with svm-scale) that I would like to port over to PySpark. I've naively tried the following:

scaler_path = "path to model"
a = MinMaxScaler().load(scaler_path)

But I'm thrown an error expecting a metadata directory:

Py4JJavaErrorTraceback (most recent call last)
<ipython-input-22-1942e7522174> in <module>()
----> 1 a = MinMaxScaler().load(scaler_path)
/srv/data/spark/spark-2.0.0-bin-hadoop2.6/python/pyspark/ml/util.pyc in load(cls, path)
226 def load
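The error is expected: the ML loader reads the directory layout written by Spark ML's own persistence (a metadata/ folder plus a data/ folder), not an svm-scale text file, so a LIBSVM scaling file would have to be translated by hand. A sketch of the round trip the loader does expect, written in Scala here; the paths, column names, and trainingDF are assumptions:

import org.apache.spark.ml.feature.{MinMaxScaler, MinMaxScalerModel}

val scaler = new MinMaxScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")

val model = scaler.fit(trainingDF)                // trainingDF assumed to exist
model.write.overwrite().save("/models/minmax")    // writes metadata/ and data/

val reloaded = MinMaxScalerModel.load("/models/minmax")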

Customize Distance Formula of K-means in Apache Spark Python

独自空忆成欢 submitted on 2019-11-27 07:19:36
Question: Now I'm using K-means for clustering, following this tutorial and API. But I want to use a custom formula for calculating distances. So how can I pass custom distance functions to k-means in PySpark?

Answer 1: In general, using a different distance measure doesn't make sense, because the k-means algorithm (unlike k-medoids) is well defined only for Euclidean distances. See Why does k-means clustering algorithm use only Euclidean distance metric? for an explanation. Moreover, MLlib algorithms are

Understanding Representation of Vector Column in Spark SQL

一曲冷凌霜 submitted on 2019-11-27 07:12:15
Question: Before I used VectorAssembler() to consolidate some OneHotEncoded categorical features, my data frame looked like this:

| Numerical | HotEncoded1    | HotEncoded2   |
| 14460.0   | (44,[5],[1.0]) | (3,[0],[1.0]) |
| 14460.0   | (44,[9],[1.0]) | (3,[0],[1.0]) |
| 15181.0   | (44,[1],[1.0]) | (3,[0],[1.0]) |

The first column is a numerical column and the other two columns represent the transformed data set for the OneHotEncoded categorical features. After applying VectorAssembler(), my output becomes:

[(48,[0,1,9],[14460
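The bracketed output is Spark's sparse vector notation: (size, [active indices], [active values]). A tiny sketch of the equivalence; the trailing values are assumptions, since the snippet above is truncated:

import org.apache.spark.ml.linalg.Vectors

val sparse = Vectors.sparse(48, Array(0, 1, 9), Array(14460.0, 1.0, 1.0))
println(sparse)                                   // (48,[0,1,9],[14460.0,1.0,1.0])
println(sparse.toArray.mkString("[", ", ", "]"))  // the same 48-slot vector, mostly zeros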

How can I declare a Column as a categorical feature in a DataFrame for use in ml

≡放荡痞女 submitted on 2019-11-27 06:20:24
Question: How can I declare that a given Column in my DataFrame contains categorical information? I have a Spark SQL DataFrame which I loaded from a database. Many of the columns in this DataFrame have categorical information, but they are encoded as Longs (for privacy). I want to be able to tell spark-ml that even though the column is numerical, the information is actually categorical. The indexes of the categories may have a few holes, and that is acceptable. (For example, a column may have the values [1, 0, 0, 4].)
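One way to do this in spark.ml is to attach NominalAttribute metadata to the encoded column; VectorAssembler propagates such attributes into the assembled features vector, which tree-based estimators consult. A hedged sketch, where the column name and the number of categories are assumptions:

import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.sql.functions.col

// Mark the encoded Long column as nominal with 5 possible index values,
// so indexes such as [1, 0, 0, 4] are covered even with holes.
val meta = NominalAttribute.defaultAttr
  .withName("encodedCategory")
  .withNumValues(5)
  .toMetadata()

val marked = df.withColumn("encodedCategory",     // df is assumed
  col("encodedCategory").cast("double").as("encodedCategory", meta))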

Spark ML - Save OneVsRestModel

眉间皱痕 submitted on 2019-11-27 06:15:00
Question: I am in the middle of refactoring my code to take advantage of DataFrames, Estimators, and Pipelines. I was originally using MLlib multiclass LogisticRegressionWithLBFGS on RDD[LabeledPoint]. I am enjoying learning and using the new API, but I am not sure how to save my new model and apply it to new data. Currently, the ML implementation of LogisticRegression only supports binary classification. I am instead using OneVsRest, like so:

val lr = new LogisticRegression().setFitIntercept(true)
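In older releases OneVsRestModel had no built-in persistence, but in Spark 2.x it, like PipelineModel, implements MLWritable, so the fitted model can be saved and reloaded directly. A sketch continuing the snippet above; the paths and the training/test frames are placeholders:

import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest, OneVsRestModel}

val lr = new LogisticRegression().setFitIntercept(true)
val ovr = new OneVsRest().setClassifier(lr)

val ovrModel = ovr.fit(trainingDF)                // trainingDF assumed
ovrModel.write.overwrite().save("/models/ovr")

val restored = OneVsRestModel.load("/models/ovr")
val scored = restored.transform(testDF)           // testDF assumed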

Spark, ML, StringIndexer: handling unseen labels

♀尐吖头ヾ submitted on 2019-11-27 05:38:07
Question: My goal is to build a multiclass classifier. I have built a pipeline for feature extraction, and its first step is a StringIndexer transformer that maps each class name to a label; this label will be used in the classifier training step. The pipeline is fitted on the training set. The test set has to be processed by the fitted pipeline in order to extract the same feature vectors. My test set files have the same structure as the training set. The possible scenario here is to
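For labels that show up only in the test set, StringIndexer exposes a handleInvalid parameter: "skip" drops the offending rows, and in newer Spark releases (2.2+) "keep" buckets unseen labels under an extra index. A short sketch with assumed column names:

import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("className")
  .setOutputCol("label")
  .setHandleInvalid("keep")   // unseen labels are assigned the extra index numLabels

// Fit on the training set, then reuse the same fitted indexer on the test set:
// val indexerModel = indexer.fit(trainDF)
// val testIndexed  = indexerModel.transform(testDF)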