apache-spark-ml

Spark ML VectorAssembler returns strange output

南楼画角 submitted on 2019-11-27 09:29:27
I am experiencing a very strange behaviour from VectorAssembler and I was wondering if anyone else has seen this. My scenario is pretty straightforward: I parse data from a CSV file where I have some standard Int and Double fields and I also calculate some extra columns. My parsing function returns this:

val joined = countPerChannel ++ countPerSource // two arrays of Doubles joined
(label, orderNo, pageNo, Vectors.dense(joinedCounts))

My main function uses the parsing function like this:

val parsedData = rawData.filter(row => row != header).map(parseLine)
val data = sqlContext.createDataFrame
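A frequent source of "strange" output here is simply Spark's sparse printout of assembled vectors: when most slots are zero, the features column is shown as (size,[indices],[values]) rather than a plain array. A minimal sketch, not the asker's code; the column names and a SparkSession named spark are assumptions:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors

// Toy frame mirroring the question's (label, orderNo, pageNo, joinedCounts) tuples.
val df = spark.createDataFrame(Seq(
  (1.0, 10, 2, Vectors.dense(0.0, 3.0, 0.0))
)).toDF("label", "orderNo", "pageNo", "joinedCounts")

val assembler = new VectorAssembler()
  .setInputCols(Array("orderNo", "pageNo", "joinedCounts"))
  .setOutputCol("features")

// Printed as (5,[0,1,3],[10.0,2.0,3.0]) -- the sparse form of the same data.
assembler.transform(df).select("features").show(truncate = false)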

How to save models from ML Pipeline to S3 or HDFS?

流过昼夜 submitted on 2019-11-27 09:04:01
I am trying to save thousands of models produced by ML Pipeline. As indicated in the answer here, the models can be saved as follows:

import java.io._

def saveModel(name: String, model: PipelineModel) = {
  val oos = new ObjectOutputStream(new FileOutputStream(s"/some/path/$name"))
  oos.writeObject(model)
  oos.close
}

schools.zip(bySchoolArrayModels).foreach {
  case (name, model) => saveModel(name, model)
}

I have tried using s3://some/path/$name and /user/hadoop/some/path/$name as I would like the models to be saved to Amazon S3 eventually, but they both fail with messages indicating the path
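For what it's worth, java.io.FileOutputStream can only write to the local filesystem, which is why s3:// and HDFS-style paths fail here. Since Spark 2.0, PipelineModel implements MLWritable, so it can be saved straight to HDFS or S3 through the Hadoop filesystem layer. A hedged sketch; the bucket and path names below are placeholders:

import org.apache.spark.ml.PipelineModel

def saveModel(name: String, model: PipelineModel): Unit = {
  // Spark resolves the scheme (hdfs://, s3a://, file://) via the Hadoop FileSystem API.
  model.write.overwrite().save(s"s3a://my-bucket/models/$name")
}

// Loading back later:
// val model = PipelineModel.load("s3a://my-bucket/models/someName")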

VectorUDT usage

亡梦爱人 submitted on 2019-11-27 08:29:57
Question: I have to get the datatype, do a case match on it, and convert it to some required format. But the usage of org.apache.spark.ml.linalg.VectorUDT is showing that VectorUDT is private. Also, I specifically need to use org.apache.spark.ml.linalg.VectorUDT and not org.apache.spark.mllib.linalg.VectorUDT. Can someone suggest how to go about this?

Answer 1: For org.apache.spark.ml.linalg types you should specify the schema using org.apache.spark.ml.linalg.SQLDataTypes, which provides singleton instances of the
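A minimal sketch of the pattern the answer points at, matching on SQLDataTypes.VectorType instead of the private VectorUDT class; the field and column names are made up:

import org.apache.spark.ml.linalg.SQLDataTypes
import org.apache.spark.sql.types.{DoubleType, StructField}

def describe(field: StructField): String = field.dataType match {
  case dt if dt == SQLDataTypes.VectorType => s"${field.name}: ml Vector column"
  case DoubleType                          => s"${field.name}: Double column"
  case other                               => s"${field.name}: ${other.simpleString}"
}

// df.schema.fields.map(describe).foreach(println)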

How to change column metadata in pyspark?

巧了我就是萌 submitted on 2019-11-27 08:08:56
Question: How can I update column metadata in PySpark? I have metadata values corresponding to the nominal encoding of categorical (string) features and I would like to decode them back in an automated way. Writing the metadata through the PySpark API is not directly possible unless you recreate the schema. Is it possible to edit metadata in PySpark on the fly, without converting the dataset to an RDD and back, given a complete schema description (as described here)? Example listing:

# Create DF
df.show()
# +--
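The question is about PySpark, but for reference here is a small Scala sketch of the JVM-side mechanism that PySpark wraps: metadata lives on the column's StructField and can be replaced by re-aliasing the column with a new Metadata value, with no RDD round trip. The "labels" key and values below are made up, and recent PySpark releases expose a similar metadata argument on Column.alias.

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.MetadataBuilder

val existing = df.schema("category").metadata       // df and "category" are assumed
val updated = new MetadataBuilder()
  .withMetadata(existing)                            // keep whatever is already there
  .putStringArray("labels", Array("a", "b", "c"))    // hypothetical key and values
  .build()

// Re-alias the column with the new metadata; the data itself is untouched.
val df2 = df.withColumn("category", col("category").as("category", updated))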

How can I read LIBSVM models (saved using LIBSVM) into PySpark?

家住魔仙堡 submitted on 2019-11-27 07:24:16
Question: I have a LIBSVM scaling model (generated with svm-scale) that I would like to port over to PySpark. I've naively tried the following:

scaler_path = "path to model"
a = MinMaxScaler().load(scaler_path)

But I'm thrown an error expecting a metadata directory:

Py4JJavaErrorTraceback (most recent call last)
<ipython-input-22-1942e7522174> in <module>()
----> 1 a = MinMaxScaler().load(scaler_path)
/srv/data/spark/spark-2.0.0-bin-hadoop2.6/python/pyspark/ml/util.pyc in load(cls, path)
226 def load
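The error is expected: the ML loader reads the directory layout written by Spark ML's own persistence (a metadata/ folder plus a data/ folder), not an svm-scale text file, so a LIBSVM scaling file would have to be translated by hand. A sketch of the round trip the loader does expect, written in Scala here; the paths, column names, and trainingDF are assumptions:

import org.apache.spark.ml.feature.{MinMaxScaler, MinMaxScalerModel}

val scaler = new MinMaxScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")

val model = scaler.fit(trainingDF)                // trainingDF assumed to exist
model.write.overwrite().save("/models/minmax")    // writes metadata/ and data/

val reloaded = MinMaxScalerModel.load("/models/minmax")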

Customize Distance Formula of K-means in Apache Spark Python

独自空忆成欢 submitted on 2019-11-27 07:19:36
Question: Now I'm using K-means for clustering, following this tutorial and API. But I want to use a custom formula for calculating distances. So how can I pass custom distance functions to k-means in PySpark?

Answer 1: In general, using a different distance measure doesn't make sense, because the k-means algorithm (unlike k-medoids) is well defined only for Euclidean distances. See Why does k-means clustering algorithm use only Euclidean distance metric? for an explanation. Moreover, MLlib algorithms are

Understanding Representation of Vector Column in Spark SQL

一曲冷凌霜 submitted on 2019-11-27 07:12:15
Question: Before I used VectorAssembler() to consolidate some OneHotEncoded categorical features, my data frame looked like this:

| Numerical | HotEncoded1    | HotEncoded2   |
| 14460.0   | (44,[5],[1.0]) | (3,[0],[1.0]) |
| 14460.0   | (44,[9],[1.0]) | (3,[0],[1.0]) |
| 15181.0   | (44,[1],[1.0]) | (3,[0],[1.0]) |

The first column is a numerical column and the other two columns represent the transformed data set for the OneHotEncoded categorical features. After applying VectorAssembler(), my output becomes:

[(48,[0,1,9],[14460
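The bracketed output is Spark's sparse vector notation: (size, [active indices], [active values]). A tiny sketch of the equivalence; the trailing values are assumptions, since the snippet above is truncated:

import org.apache.spark.ml.linalg.Vectors

val sparse = Vectors.sparse(48, Array(0, 1, 9), Array(14460.0, 1.0, 1.0))
println(sparse)                                   // (48,[0,1,9],[14460.0,1.0,1.0])
println(sparse.toArray.mkString("[", ", ", "]"))  // the same 48-slot vector, mostly zeros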

How can I declare a Column as a categorical feature in a DataFrame for use in ml

≡放荡痞女 submitted on 2019-11-27 06:20:24
Question: How can I declare that a given Column in my DataFrame contains categorical information? I have a Spark SQL DataFrame which I loaded from a database. Many of the columns in this DataFrame have categorical information, but they are encoded as Longs (for privacy). I want to be able to tell spark-ml that even though the column is numerical, the information is actually categorical. The indexes of the categories may have a few holes, and that is acceptable. (For example, a column may have the values [1, 0, 0, 4].)
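One way to do this in spark.ml is to attach NominalAttribute metadata to the encoded column; VectorAssembler propagates such attributes into the assembled features vector, which tree-based estimators consult. A hedged sketch, where the column name and the number of categories are assumptions:

import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.sql.functions.col

// Mark the encoded Long column as nominal with 5 possible index values,
// so indexes such as [1, 0, 0, 4] are covered even with holes.
val meta = NominalAttribute.defaultAttr
  .withName("encodedCategory")
  .withNumValues(5)
  .toMetadata()

val marked = df.withColumn("encodedCategory",     // df is assumed
  col("encodedCategory").cast("double").as("encodedCategory", meta))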

Spark ML - Save OneVsRestModel

眉间皱痕 submitted on 2019-11-27 06:15:00
Question: I am in the middle of refactoring my code to take advantage of DataFrames, Estimators, and Pipelines. I was originally using MLlib multiclass LogisticRegressionWithLBFGS on RDD[LabeledPoint]. I am enjoying learning and using the new API, but I am not sure how to save my new model and apply it to new data. Currently, the ML implementation of LogisticRegression only supports binary classification. I am instead using OneVsRest, like so:

val lr = new LogisticRegression().setFitIntercept(true)
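In older releases OneVsRestModel had no built-in persistence, but in Spark 2.x it, like PipelineModel, implements MLWritable, so the fitted model can be saved and reloaded directly. A sketch continuing the snippet above; the paths and the training/test frames are placeholders:

import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest, OneVsRestModel}

val lr = new LogisticRegression().setFitIntercept(true)
val ovr = new OneVsRest().setClassifier(lr)

val ovrModel = ovr.fit(trainingDF)                // trainingDF assumed
ovrModel.write.overwrite().save("/models/ovr")

val restored = OneVsRestModel.load("/models/ovr")
val scored = restored.transform(testDF)           // testDF assumed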

Spark, ML, StringIndexer: handling unseen labels

♀尐吖头ヾ submitted on 2019-11-27 05:38:07
Question: My goal is to build a multiclass classifier. I have built a pipeline for feature extraction, and its first step is a StringIndexer transformer that maps each class name to a label; this label will be used in the classifier training step. The pipeline is fitted on the training set. The test set has to be processed by the fitted pipeline in order to extract the same feature vectors. My test set files have the same structure as the training set. The possible scenario here is to
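For labels that show up only in the test set, StringIndexer exposes a handleInvalid parameter: "skip" drops the offending rows, and in newer Spark releases (2.2+) "keep" buckets unseen labels under an extra index. A short sketch with assumed column names:

import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("className")
  .setOutputCol("label")
  .setHandleInvalid("keep")   // unseen labels are assigned the extra index numLabels

// Fit on the training set, then reuse the same fitted indexer on the test set:
// val indexerModel = indexer.fit(trainDF)
// val testIndexed  = indexerModel.transform(testDF)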