apache-spark-mllib

Pyspark random forest feature importance mapping after column transformations

偶尔善良 submitted on 2019-11-27 14:54:46
I am trying to plot the feature importances of certain tree-based models with column names. I am using PySpark. Since I had both textual categorical variables and numeric ones, I had to use a pipeline, which is something like this: use StringIndexer to index the string columns, use OneHotEncoder for all columns, then use a VectorAssembler to create the feature column containing the feature vector. Some sample code from the docs for steps 1, 2, 3:

from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler

categoricalColumns = ["workclass",
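
One common way to get the mapping back (not shown in the excerpt above) is to read the ML attribute metadata that VectorAssembler attaches to the features column and pair each vector slot with the fitted model's featureImportances. A minimal sketch, assuming a fitted PipelineModel whose last stage is the random forest and a DataFrame already transformed by that pipeline; all names here are placeholders:

def importances_with_names(pipeline_model, transformed_df, features_col="features"):
    # last pipeline stage is assumed to be the RandomForestClassificationModel
    rf_model = pipeline_model.stages[-1]
    # VectorAssembler records one attribute per vector slot in the column metadata
    attrs = transformed_df.schema[features_col].metadata["ml_attr"]["attrs"]
    names = {}
    for group in attrs.values():          # typically "numeric" and "binary" groups
        for attr in group:
            names[attr["idx"]] = attr["name"]
    imp = rf_model.featureImportances     # a SparseVector aligned with the slots
    return sorted(((names[i], imp[i]) for i in names), key=lambda t: -t[1])

The result is a list of (column or dummy-level name, importance) pairs that can be plotted directly.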

How to map variable names to features after pipeline

我的梦境 submitted on 2019-11-27 14:07:52
Question: I have modified the OneHotEncoder example to actually train a LogisticRegression. My question is how to map the generated weights back to the categorical variables?

def oneHotEncoderExample(sqlContext: SQLContext): Unit = {
  val df = sqlContext.createDataFrame(Seq(
    (0, "a", 1.0), (1, "b", 1.0), (2, "c", 0.0),
    (3, "d", 1.0), (4, "e", 1.0), (5, "f", 0.0)
  )).toDF("id", "category", "label")
  df.show()

  val indexer = new StringIndexer()
    .setInputCol("category")
    .setOutputCol("categoryIndex")
    .fit(df)
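
For the mapping itself (not included in the excerpt), one approach is to read the labels recorded by the fitted StringIndexerModel, since the one-hot slots follow the same ordering. A minimal PySpark sketch of the idea (the question uses Scala; the stage positions and the default dropLast behaviour are assumptions):

indexer_model = pipeline_model.stages[0]       # the fitted StringIndexerModel
lr_model = pipeline_model.stages[-1]           # the fitted LogisticRegressionModel
labels = indexer_model.labels                  # label i maps to one-hot slot i
weights = lr_model.coefficients.toArray()
# with the default dropLast=True the last category has no slot of its own
for category, weight in zip(labels, weights):
    print(category, weight)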

How to save and load MLLib model in Apache Spark?

天涯浪子 submitted on 2019-11-27 12:29:48
Question: I trained a classification model in Apache Spark (using pyspark). I stored the model in an object, LogisticRegressionModel. Now, I want to make predictions on new data. I would like to store the model, and read it back into a new program in order to make the predictions. Any idea how to store the model? I'm thinking of maybe pickle, but I'm a newbie to both Python and Spark, so I'd like to hear what the community thinks.

Answer 1: You can save your model by using the save method of mllib models.
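
A minimal sketch of the save/load round trip using mllib's built-in persistence; the path and the new-data RDD are placeholders, and the path can point at HDFS, S3, or the local filesystem:

from pyspark.mllib.classification import LogisticRegressionModel

model.save(sc, "hdfs:///models/my-logreg")                 # sc is the SparkContext
loaded = LogisticRegressionModel.load(sc, "hdfs:///models/my-logreg")
predictions = loaded.predict(new_features_rdd)             # RDD of feature vectors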

Spark ML VectorAssembler returns strange output

南楼画角 submitted on 2019-11-27 09:29:27
I am experiencing a very strange behaviour from VectorAssembler and I was wondering if anyone else has seen this. My scenario is pretty straightforward. I parse data from a CSV file where I have some standard Int and Double fields and I also calculate some extra columns. My parsing function returns this:

val joined = countPerChannel ++ countPerSource // two arrays of Doubles joined
(label, orderNo, pageNo, Vectors.dense(joinedCounts))

My main function uses the parsing function like this:

val parsedData = rawData.filter(row => row != header).map(parseLine)
val data = sqlContext.createDataFrame
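
For context (the accepted explanation is not in the excerpt), the "strange" output usually turns out to be VectorAssembler storing each row as either a dense or a sparse vector, whichever is smaller, so output like (3,[2],[7.0]) is a compressed but equivalent representation rather than corrupted data. A minimal PySpark sketch that shows both printed forms, assuming a SparkSession named spark; the column names are assumptions:

from pyspark.ml.feature import VectorAssembler

df = spark.createDataFrame(
    [(1.0, 0.0, 0.0, 7.0), (2.0, 3.0, 4.0, 5.0)],
    ["label", "a", "b", "c"])
assembled = VectorAssembler(inputCols=["a", "b", "c"], outputCol="features").transform(df)
# rows with many zeros print in sparse form, e.g. (3,[2],[7.0]);
# others print densely, e.g. [3.0,4.0,5.0] -- both are the same Vector type
assembled.select("features").show(truncate=False)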

Dealing with unbalanced datasets in Spark MLlib

て烟熏妆下的殇ゞ submitted on 2019-11-27 09:26:33
Question: I'm working on a particular binary classification problem with a highly unbalanced dataset, and I was wondering if anyone has tried to implement specific techniques for dealing with unbalanced datasets (such as SMOTE) in classification problems using Spark's MLlib. I'm using MLlib's Random Forest implementation and have already tried the simplest approach of randomly undersampling the larger class, but it didn't work as well as I expected. I would appreciate any feedback regarding your experience
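
One simple technique that is easy to express on a DataFrame (beyond what the excerpt covers) is stratified down-sampling with sampleBy. A minimal sketch, assuming a binary label column named "label" with 0 as the majority class:

counts = {row["label"]: row["count"] for row in df.groupBy("label").count().collect()}
fraction = counts[1] / float(counts[0])        # keep roughly as many 0s as there are 1s
balanced = df.sampleBy("label", fractions={0: fraction, 1: 1.0}, seed=42)

For estimators that accept a weightCol (for example LogisticRegression), adding a per-row class weight is an alternative that avoids discarding data.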

Spark mllib predicting weird number or NaN

↘锁芯ラ submitted on 2019-11-27 09:09:48
I am new to Apache Spark and trying to use the machine learning library to predict some data. My dataset right now is only about 350 points. Here are 7 of those points:

"365","4",41401.387,5330569
"364","3",51517.886,5946290
"363","2",55059.838,6097388
"362","1",43780.977,5304694
"361","7",46447.196,5471836
"360","6",50656.121,5849862
"359","5",44494.476,5460289

Here's my code:

def parsePoint(line):
    split = map(sanitize, line.split(','))
    rev = split.pop(-2)
    return LabeledPoint(rev, split)

def sanitize(value):
    return float(value.strip('"'))

parsedData = textFile.map(parsePoint)
model =
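
A frequent cause of NaN or implausible predictions with the SGD-based regressors is training on unscaled features with a large step size. A minimal sketch of standardizing the features and using a smaller step, assuming the model being trained is mllib's LinearRegressionWithSGD (the training call is truncated above) and that the parameter values would still need tuning:

from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

features = parsedData.map(lambda lp: lp.features)
scaler = StandardScaler(withMean=True, withStd=True).fit(features)
scaled = parsedData.map(lambda lp: lp.label) \
                   .zip(scaler.transform(features)) \
                   .map(lambda lf: LabeledPoint(lf[0], lf[1]))
model = LinearRegressionWithSGD.train(scaled, iterations=1000, step=0.01, intercept=True)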

How to save models from ML Pipeline to S3 or HDFS?

流过昼夜 submitted on 2019-11-27 09:04:01
I am trying to save thousands of models produced by ML Pipeline. As indicated in the answer here, the models can be saved as follows:

import java.io._

def saveModel(name: String, model: PipelineModel) = {
  val oos = new ObjectOutputStream(new FileOutputStream(s"/some/path/$name"))
  oos.writeObject(model)
  oos.close
}

schools.zip(bySchoolArrayModels).foreach { case (name, model) => saveModel(name, model) }

I have tried using s3://some/path/$name and /user/hadoop/some/path/$name as I would like the models to be saved to Amazon S3 eventually, but they both fail with messages indicating the path
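
Note that java.io.FileOutputStream only writes to the local filesystem, so s3:// paths cannot work with this approach. In later Spark releases the built-in ML persistence writes through Hadoop's filesystem layer, so S3 and HDFS paths work directly. A minimal PySpark sketch (the question itself is Scala; the bucket, path, and collection names are placeholders):

from pyspark.ml import PipelineModel

for name, fitted in zip(schools, by_school_models):
    fitted.write().overwrite().save("s3a://my-bucket/models/{}".format(name))

loaded = PipelineModel.load("s3a://my-bucket/models/some-school")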

VectorUDT usage

亡梦爱人 submitted on 2019-11-27 08:29:57
Question: I have to get the datatype, do a case match, and convert it to some required format. But the usage of org.apache.spark.ml.linalg.VectorUDT shows that VectorUDT is private. Also, I specifically need to use org.apache.spark.ml.linalg.VectorUDT and not org.apache.spark.mllib.linalg.VectorUDT. Can someone suggest how to go about this?

Answer 1: For org.apache.spark.ml.linalg types you should specify the schema using org.apache.spark.ml.linalg.SQLDataTypes, which provides singleton instances of the
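
On the Python side the situation is simpler, since pyspark.ml.linalg.VectorUDT is importable directly (the Scala answer above goes through SQLDataTypes instead). A minimal PySpark sketch of declaring a schema with it, assuming a SparkSession named spark:

from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.types import StructField, StructType

schema = StructType([StructField("features", VectorUDT(), False)])
df = spark.createDataFrame([(Vectors.dense([1.0, 2.0, 3.0]),)], schema)
df.printSchema()                       # features: vector (nullable = false)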

Handling continuous data in Spark NaiveBayes

只愿长相守 submitted on 2019-11-27 07:31:10
Question: As per the official documentation of Spark NaiveBayes: it supports Multinomial NB (see here), which can handle finitely supported discrete data. How can I handle continuous data (for example: percentage of some in some document) in Spark NaiveBayes?

Answer 1: The current implementation can process only binary features, so for good results you'll have to discretize and encode your data. For discretization you can use either Bucketizer or QuantileDiscretizer. The former is less expensive and might be a
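
A minimal sketch of the discretize-then-encode idea mentioned above, using QuantileDiscretizer on a hypothetical continuous column before handing the data to NaiveBayes; the column names and bucket count are assumptions:

from pyspark.ml.feature import QuantileDiscretizer

discretizer = QuantileDiscretizer(numBuckets=10,
                                  inputCol="percentage",
                                  outputCol="percentage_bucket")
bucketed = discretizer.fit(df).transform(df)
# the bucket index can then be one-hot encoded and assembled into the
# features vector that NaiveBayes is trained on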

Customize Distance Formula of K-means in Apache Spark Python

独自空忆成欢 submitted on 2019-11-27 07:19:36
Question: Now I'm using K-means for clustering and following this tutorial and API. But I want to use a custom formula to calculate distances. So how can I pass custom distance functions to k-means with PySpark?

Answer 1: In general, using a different distance measure doesn't make sense, because the k-means algorithm (unlike k-medoids) is well defined only for Euclidean distances. See "Why does k-means clustering algorithm use only Euclidean distance metric?" for an explanation. Moreover, MLlib algorithms are