apache-spark-mllib

How to handle categorical features with spark-ml?

删除回忆录丶 submitted on 2019-11-26 14:27:08
How do I handle categorical data with spark-ml and not spark-mllib? Though the documentation is not very clear, it seems that classifiers, e.g. RandomForestClassifier, LogisticRegression, have a featuresCol argument, which specifies the name of the column of features in the DataFrame, and a labelCol argument, which specifies the name of the column of labeled classes in the DataFrame. Obviously I want to use more than one feature in my prediction, so I tried using the VectorAssembler to put all my features in a single vector under featuresCol. However, the VectorAssembler only accepts …
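
A minimal PySpark sketch of the pattern this excerpt is working toward, with hypothetical columns "colour" (a string category), "size" (numeric) and "label": StringIndexer turns the string column into an indexed numeric column, and VectorAssembler combines it with the numeric column into the single features vector a classifier expects. Tree-based models such as RandomForestClassifier can consume the indexed column directly thanks to the metadata StringIndexer attaches.

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

# Hypothetical columns: "colour" is a string category, "size" is numeric, "label" is the target.
indexer = StringIndexer(inputCol="colour", outputCol="colour_idx")
assembler = VectorAssembler(inputCols=["colour_idx", "size"], outputCol="features")
rf = RandomForestClassifier(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[indexer, assembler, rf])
# model = pipeline.fit(train_df)   # train_df is an assumed training DataFrame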

How to save models from ML Pipeline to S3 or HDFS?

夙愿已清 submitted on 2019-11-26 14:24:18
Question: I am trying to save thousands of models produced by ML Pipeline. As indicated in the answer here, the models can be saved as follows:

import java.io._

def saveModel(name: String, model: PipelineModel) = {
  val oos = new ObjectOutputStream(new FileOutputStream(s"/some/path/$name"))
  oos.writeObject(model)
  oos.close
}

schools.zip(bySchoolArrayModels).foreach { case (name, model) => saveModel(name, model) }

I have tried using s3://some/path/$name and /user/hadoop/some/path/$name as I would like the …
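
For reference, ML pipelines also have a built-in persistence API (the ML writer/reader added around Spark 1.6/2.0) that writes to whatever Hadoop-compatible filesystem the path points at, including HDFS and S3. A hedged PySpark sketch, assuming fitted_model is a trained PipelineModel and the cluster already has the relevant filesystem credentials configured; the bucket and paths are hypothetical:

from pyspark.ml import PipelineModel

# fitted_model is assumed to be a PipelineModel produced by pipeline.fit(...)
fitted_model.write().overwrite().save("s3a://my-bucket/models/school-001")   # hypothetical path
# or: fitted_model.save("hdfs:///user/hadoop/models/school-001")

# Loading it back later:
restored = PipelineModel.load("s3a://my-bucket/models/school-001")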

Spark DataFrames when udf functions do not accept large enough input variables

时光怂恿深爱的人放手 submitted on 2019-11-26 11:38:31
Question: I am preparing a DataFrame with an id and a vector of my features to be used later for doing predictions. I do a groupBy on my DataFrame, and in my groupBy I am merging a couple of columns as lists into a new column:

def mergeFunction(...) // with 14 input variables

val myudffunction( mergeFunction ) // Spark doesn't support this

df.groupBy("id").agg(
  collect_list(df(...)) as ...
  ... // too many of these (something like 14 of them)
).withColumn("features_labels",
  myudffunction( col(...) , …
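
One common workaround, shown here as a hedged PySpark sketch rather than the asker's Scala, is to keep the UDF at a single argument by packing the collected lists into one array(...) column; df and the column names in feature_cols are assumptions for the example:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, DoubleType

# Flatten a list of collected lists into one flat list of doubles.
def merge_lists(list_of_lists):
    return [x for lst in list_of_lists for x in lst]

merge_udf = F.udf(merge_lists, ArrayType(DoubleType()))

feature_cols = ["f1", "f2", "f3"]   # hypothetical; the question has something like 14 of these
grouped = df.groupBy("id").agg(*[F.collect_list(c).alias(c) for c in feature_cols])

# A single array(...) column sidesteps any limit on the UDF's argument count.
result = grouped.withColumn(
    "features_labels",
    merge_udf(F.array(*[F.col(c) for c in feature_cols])))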

Spark ML VectorAssembler returns strange output

て烟熏妆下的殇ゞ submitted on 2019-11-26 11:34:34
Question: I am experiencing very strange behaviour from VectorAssembler and I was wondering if anyone else has seen this. My scenario is pretty straightforward. I parse data from a CSV file where I have some standard Int and Double fields and I also calculate some extra columns. My parsing function returns this:

val joined = countPerChannel ++ countPerSource // two arrays of Doubles joined
(label, orderNo, pageNo, Vectors.dense(joinedCounts))

My main function uses the parsing function like this: val …
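
The "strange" output usually reported here is a sparse vector: VectorAssembler typically emits whichever of the dense or sparse representations is smaller, so the compact form describes exactly the same data. A small sketch illustrating the equivalence, with made-up values:

import numpy as np
from pyspark.ml.linalg import Vectors

# Output such as (5,[0,3],[8.0,2.0]) is a SparseVector: size 5, non-zeros at
# positions 0 and 3. It is the same vector as the dense form below.
sparse = Vectors.sparse(5, [0, 3], [8.0, 2.0])
dense = Vectors.dense([8.0, 0.0, 0.0, 2.0, 0.0])

print(sparse.toArray())                                   # [8. 0. 0. 2. 0.]
print(np.array_equal(sparse.toArray(), dense.toArray()))  # True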

ALS model - how to generate full_u * v^t * v?

半世苍凉 submitted on 2019-11-26 11:29:26
Question: I'm trying to figure out how an ALS model can predict values for new users in between it being updated by a batch process. In my search, I came across this Stack Overflow answer. I've copied the answer below for the reader's convenience: You can get predictions for new users using the trained model (without updating it): To get predictions for a user in the model, you use its latent representation (vector u of size f (number of factors)), which is multiplied by the product latent factor …
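
One way to read the expression in the title, sketched with NumPy and made-up numbers rather than the exact code from the linked answer: treat full_u as the new user's ratings over all items and V as the ALS item-factor matrix (one row per item); full_u · V projects the ratings into factor space, and multiplying by the transpose of V then scores every item.

import numpy as np

# Hypothetical item-factor matrix from ALS: 4 items, 2 latent factors (one row per item).
V = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.1, 0.9],
              [0.2, 0.8]])

# full_u: the new user's ratings over all items, 0.0 where unrated.
full_u = np.array([5.0, 0.0, 0.0, 1.0])

user_latent = full_u.dot(V)          # 1 x f: approximate latent vector for the new user
predictions = user_latent.dot(V.T)   # 1 x n_items: a score for every item
print(predictions)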

Encode and assemble multiple features in PySpark

北战南征 submitted on 2019-11-26 11:17:10
I have a Python class that I'm using to load and process some data in Spark. Among various things I need to do, I'm generating a list of dummy variables derived from various columns in a Spark DataFrame. My problem is that I'm not sure how to properly define a User Defined Function to accomplish what I need. I do currently have a method that, when mapped over the underlying DataFrame RDD, solves half the problem (remember that this is a method in a larger data_processor class):

def build_feature_arr(self, table):
    # this dict has keys for all the columns for which I need dummy coding categories …
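
The usual DataFrame-level alternative to a hand-rolled UDF is to build one StringIndexer/OneHotEncoder pair per categorical column and let VectorAssembler produce the final feature vector. A hedged sketch, with hypothetical column names and an assumed DataFrame df:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

categorical_cols = ["gender", "country"]   # hypothetical categorical columns
numeric_cols = ["age"]                     # hypothetical numeric columns

indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx") for c in categorical_cols]
encoders = [OneHotEncoder(inputCol=c + "_idx", outputCol=c + "_vec") for c in categorical_cols]
assembler = VectorAssembler(
    inputCols=[c + "_vec" for c in categorical_cols] + numeric_cols,
    outputCol="features")

pipeline = Pipeline(stages=indexers + encoders + [assembler])
# encoded = pipeline.fit(df).transform(df)   # df is the DataFrame being processed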

AttributeError: 'DataFrame' object has no attribute 'map'

半世苍凉 submitted on 2019-11-26 11:16:43
Question: I wanted to convert the Spark data frame to an RDD using the code below:

from pyspark.mllib.clustering import KMeans
spark_df = sqlContext.createDataFrame(pandas_df)
rdd = spark_df.map(lambda data: Vectors.dense([float(c) for c in data]))
model = KMeans.train(rdd, 2, maxIterations=10, runs=30, initializationMode="random")

The detailed error message is:

---------------------------------------------------------------------------
AttributeError Traceback (most recent call last) <ipython-input-11 …
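
In Spark 2.x+ a DataFrame no longer exposes .map(); the usual fix is to go through the underlying RDD via spark_df.rdd. A sketch using the question's variable names (spark_df is assumed to exist, the Vectors import is added, and the deprecated runs argument is omitted):

from pyspark.mllib.clustering import KMeans
from pyspark.mllib.linalg import Vectors

# spark_df is the DataFrame from the question; .rdd exposes the RDD of Rows.
rdd = spark_df.rdd.map(lambda row: Vectors.dense([float(c) for c in row]))
model = KMeans.train(rdd, 2, maxIterations=10, initializationMode="random")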

Is Spark's KMeans unable to handle bigdata?

*爱你&永不变心* submitted on 2019-11-26 10:01:42
Question: KMeans has several parameters for its training, with initialization mode defaulted to kmeans||. The problem is that it marches quickly (less than 10 min) through the first 13 stages, but then hangs completely, without yielding an error! Minimal example which reproduces the issue (it will succeed if I use 1000 points or random initialization):

from pyspark.context import SparkContext
from pyspark.mllib.clustering import KMeans
from pyspark.mllib.random import RandomRDDs

if __name__ == "__main__": …
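
A self-contained sketch of the workaround the excerpt itself mentions: switching initializationMode to "random" avoids the k-means|| initialization step that reportedly hangs on large inputs. The data size, dimensionality and k below are made up:

from pyspark.context import SparkContext
from pyspark.mllib.clustering import KMeans
from pyspark.mllib.random import RandomRDDs

if __name__ == "__main__":
    sc = SparkContext(appName="kmeans-init-demo")

    # Made-up data: one million random 10-dimensional points.
    data = RandomRDDs.uniformVectorRDD(sc, 1000000, 10, numPartitions=50, seed=1).cache()

    # Random initialization sidesteps the expensive k-means|| initialization.
    model = KMeans.train(data, k=1000, maxIterations=10, initializationMode="random")
    print(model.computeCost(data))
    sc.stop()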

How to create a custom Estimator in PySpark

懵懂的女人 submitted on 2019-11-26 09:38:06
Question: I am trying to build a simple custom Estimator in PySpark MLlib. I have read here that it is possible to write a custom Transformer, but I am not sure how to do it for an Estimator. I also don't understand what @keyword_only does and why I need so many setters and getters. Scikit-learn seems to have proper documentation for custom models (see here) but PySpark doesn't. Pseudo code of an example model:

class NormalDeviation():
    def __init__(self, threshold = 3):
    def fit(x, y=None):
        self.model = { …
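
As a rough illustration only (not the accepted answer), here is a hedged sketch of how a NormalDeviation-style Estimator/Model pair can be laid out: the Estimator computes the mean and standard deviation of an input column, and the Model flags rows more than threshold standard deviations from the mean. The class names, the threshold Param and the outlier logic are assumptions for the example, and the pattern leans on the same internals the built-in shared params use (Params._dummy(), _input_kwargs, _copyValues), so details may vary across Spark versions:

from pyspark import keyword_only
from pyspark.ml import Estimator, Model
from pyspark.ml.param import Param, Params
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.sql import functions as F


class HasThreshold(Params):
    # Custom Param mixin, written in the same style as the built-in shared params.
    threshold = Param(Params._dummy(), "threshold",
                      "number of standard deviations tolerated")

    def getThreshold(self):
        return self.getOrDefault(self.threshold)


class NormalDeviation(Estimator, HasInputCol, HasOutputCol, HasThreshold):
    @keyword_only
    def __init__(self, inputCol=None, outputCol=None, threshold=3.0):
        super(NormalDeviation, self).__init__()
        self._setDefault(threshold=3.0)
        kwargs = self._input_kwargs        # populated by @keyword_only
        self._set(**kwargs)

    def _fit(self, dataset):
        c = self.getInputCol()
        row = dataset.agg(F.mean(c).alias("mu"), F.stddev(c).alias("sigma")).first()
        model = NormalDeviationModel(mean=row["mu"], std=row["sigma"])
        return self._copyValues(model)     # hand inputCol/outputCol/threshold to the model


class NormalDeviationModel(Model, HasInputCol, HasOutputCol, HasThreshold):
    def __init__(self, mean=0.0, std=1.0):
        super(NormalDeviationModel, self).__init__()
        self.mean = mean
        self.std = std

    def _transform(self, dataset):
        c = self.getInputCol()
        is_outlier = F.abs(F.col(c) - self.mean) > self.getThreshold() * self.std
        return dataset.withColumn(self.getOutputCol(), is_outlier)

Usage would then look something like NormalDeviation(inputCol="x", outputCol="is_outlier", threshold=2.0).fit(df).transform(df), with df an assumed DataFrame holding a numeric column "x".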

How to get word details from TF Vector RDD in Spark ML Lib?

混江龙づ霸主 submitted on 2019-11-26 08:24:52
Question: I have created term frequencies using HashingTF in Spark. I have got the term frequency for each word using tf.transform. But the results are shown in this format:

[<hashIndexofHashBucketofWord1>, <hashIndexofHashBucketofWord2> ...], [termFrequencyofWord1, termFrequencyOfWord2 ...]

e.g. (1048576,[105,3116],[1.0,2.0])

I am able to get the index in the hash bucket using tf.indexOf("word"). But how can I get the word using the index?

Answer 1: Well, you can't. Since hashing is non-injective there …
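
A hedged sketch of the usual alternative: CountVectorizer in pyspark.ml.feature builds an explicit vocabulary, so unlike HashingTF its vector indices can be mapped back to words. The tiny DataFrame below is made up:

from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, CountVectorizer

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("spark is great spark",), ("hadoop is old",)], ["text"])

tokens = Tokenizer(inputCol="text", outputCol="words").transform(df)
cv_model = CountVectorizer(inputCol="words", outputCol="tf").fit(tokens)

# The fitted model keeps its vocabulary, so index i corresponds to cv_model.vocabulary[i].
print(cv_model.vocabulary)
cv_model.transform(tokens).select("tf").show(truncate=False)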