apache-spark-mllib

Understanding Representation of Vector Column in Spark SQL

一曲冷凌霜 submitted on 2019-11-27 07:12:15
Question: Before I used VectorAssembler() to consolidate some OneHotEncoded categorical features, my data frame looked like this:

| Numerical|    HotEncoded1|   HotEncoded2|
|   14460.0| (44,[5],[1.0])| (3,[0],[1.0])|
|   14460.0| (44,[9],[1.0])| (3,[0],[1.0])|
|   15181.0| (44,[1],[1.0])| (3,[0],[1.0])|

The first column is numerical and the other two hold the transformed output of the OneHotEncoded categorical features. After applying VectorAssembler(), my output becomes: [(48,[0,1,9],[14460
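A minimal PySpark sketch of the layout described above (a SparkSession named spark is assumed; the values mirror one row of the frame shown). VectorAssembler concatenates its inputs in order, so the assembled vector has size 1 + 44 + 3 = 48 and is printed as (size, [active indices], [values]).

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame(
    [(14460.0, Vectors.sparse(44, [5], [1.0]), Vectors.sparse(3, [0], [1.0]))],
    ["Numerical", "HotEncoded1", "HotEncoded2"])

assembler = VectorAssembler(inputCols=["Numerical", "HotEncoded1", "HotEncoded2"],
                            outputCol="features")

# Slot 0 holds the numerical value, slots 1-44 the first one-hot block,
# and slots 45-47 the second one-hot block.
assembler.transform(df).select("features").show(truncate=False)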

Apache Spark MLlib Model File Format

穿精又带淫゛_ submitted on 2019-11-27 06:16:58
Question: Apache Spark MLlib algorithms (e.g., Decision Trees) save the model to a location (e.g., myModelPath), where two directories are created, viz. myModelPath/data and myModelPath/metadata. There are multiple files in these paths, and they are not text files; some are *.parquet files. I have a couple of questions: What is the format of these files? Which file(s) contain the actual model? Can I save the model somewhere else, for example in a DB? Answer 1: Spark >= 2.4 Since Spark 2.4
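A hedged sketch of the standard RDD-based persistence calls that produce the layout the question describes (Parquet under data/, JSON-style metadata under metadata/). The data path, parameters, and model path below are illustrative, and sc is an existing SparkContext.

from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils

data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
model = DecisionTree.trainClassifier(data, numClasses=2,
                                     categoricalFeaturesInfo={}, maxDepth=5)

model.save(sc, "myModelPath")                      # writes myModelPath/data and myModelPath/metadata
sameModel = DecisionTreeModel.load(sc, "myModelPath")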

Spark ML - Save OneVsRestModel

眉间皱痕 submitted on 2019-11-27 06:15:00
Question: I am in the middle of refactoring my code to take advantage of DataFrames, Estimators, and Pipelines. I was originally using the MLlib multiclass LogisticRegressionWithLBFGS on RDD[LabeledPoint]. I am enjoying learning and using the new API, but I am not sure how to save my new model and apply it to new data. Currently, the ML implementation of LogisticRegression only supports binary classification, so I am instead using OneVsRest like so: val lr = new LogisticRegression().setFitIntercept(true)
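A hedged PySpark sketch (the question itself uses Scala): in recent Spark releases OneVsRestModel supports ML persistence, so one common pattern is to wrap everything in a Pipeline and save the fitted PipelineModel. The toy data, column names, and path below are assumptions, and a SparkSession named spark is assumed.

from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.classification import LogisticRegression, OneVsRest
from pyspark.ml.linalg import Vectors

train_df = spark.createDataFrame(
    [(0.0, Vectors.dense(0.0, 1.0)),
     (1.0, Vectors.dense(1.0, 0.0)),
     (2.0, Vectors.dense(1.0, 1.0))],
    ["label", "features"])

lr = LogisticRegression(fitIntercept=True)
model = Pipeline(stages=[OneVsRest(classifier=lr)]).fit(train_df)

model.write().overwrite().save("/tmp/ovr-model")   # persists every fitted stage
reloaded = PipelineModel.load("/tmp/ovr-model")
predictions = reloaded.transform(train_df)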

Spark DataFrames when udf functions do not accept large enough input variables

孤人 submitted on 2019-11-27 05:36:56
I am preparing a DataFrame with an id and a vector of my features to be used later for predictions. I do a groupBy on my dataframe, and in the groupBy I am merging a couple of columns as lists into a new column:

def mergeFunction(...) // with 14 input variables
val myudffunction( mergeFunction ) // Spark doesn't support this

df.groupBy("id").agg(
  collect_list(df(...)) as ...
  ... // too many of these (something like 14 of them)
).withColumn("features_labels",
  myudffunction( col(...), col(...) ))
 .select("id", "feature_labels")

This is how I am creating my feature vectors and their labels. It
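A hedged PySpark sketch of the usual workaround when a UDF needs more arguments than the API accepts: pack the columns into a single array column so the UDF receives one list. The column names and merge logic below are illustrative, and a SparkSession named spark is assumed; the equivalent Scala trick is calling the udf with array(cols.map(col): _*).

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, DoubleType

# Toy frame standing in for the question's data; c1..c3 stand in for the ~14 columns.
df = spark.createDataFrame([(1, 2.0, 3.0, 4.0)], ["id", "c1", "c2", "c3"])
cols = ["c1", "c2", "c3"]

# The UDF receives a single list instead of 14 separate column arguments.
merge_udf = F.udf(lambda values: [float(v) for v in values],
                  ArrayType(DoubleType()))

result = df.withColumn("features_labels", merge_udf(F.array(*cols)))
result.select("id", "features_labels").show()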

ALS model - how to generate full_u * v^t * v?

做~自己de王妃 submitted on 2019-11-27 05:12:44
I'm trying to figure out how an ALS model can predict values for new users in between updates by a batch process. In my search, I came across this Stack Overflow answer. I've copied the answer below for the reader's convenience: You can get predictions for new users using the trained model (without updating it): To get predictions for a user in the model, you use its latent representation (vector u of size f, the number of factors), which is multiplied by the product latent factor matrix (a matrix made of the latent representations of all products, a bunch of vectors of size f) and gives
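A rough NumPy sketch of the shapes involved, under the convention that V is the product-factor matrix (one row per product, f latent columns; in MLlib it can be assembled from model.productFeatures()). The sizes and the interaction vector below are made up for illustration.

import numpy as np

f, n_products = 10, 500              # toy sizes; real ones come from the trained model
V = np.random.rand(n_products, f)    # stand-in for the product latent factor matrix

full_u = np.zeros(n_products)        # the new user's interactions over all products
full_u[123] = 1.0                    # e.g. the user rated/clicked product 123

u = full_u @ V                       # rough latent representation of the new user (size f)
scores = u @ V.T                     # i.e. full_u * V * V^T: one score per product
top10 = np.argsort(-scores)[:10]     # highest-scoring products to recommend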

AttributeError: 'DataFrame' object has no attribute 'map'

独自空忆成欢 submitted on 2019-11-27 04:45:43
I wanted to convert the Spark data frame to an RDD using the code below:

from pyspark.mllib.clustering import KMeans
spark_df = sqlContext.createDataFrame(pandas_df)
rdd = spark_df.map(lambda data: Vectors.dense([float(c) for c in data]))
model = KMeans.train(rdd, 2, maxIterations=10, runs=30, initializationMode="random")

The detailed error message is:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-11-a19a1763d3ac> in <module>()
      1 from pyspark.mllib.clustering import KMeans
      2 spark_df = sqlContext
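A hedged sketch of the usual fix: in Spark 2.x a DataFrame no longer exposes map() directly, so go through its rdd attribute first (pandas_df and sqlContext are the ones from the snippet above).

from pyspark.mllib.clustering import KMeans
from pyspark.mllib.linalg import Vectors

spark_df = sqlContext.createDataFrame(pandas_df)
rdd = spark_df.rdd.map(lambda row: Vectors.dense([float(c) for c in row]))
# The runs argument is deprecated in newer releases, so it is dropped here.
model = KMeans.train(rdd, 2, maxIterations=10, initializationMode="random")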

Spark, DataFrame: apply transformer/estimator on groups

点点圈 submitted on 2019-11-27 04:38:31
Question: I have a DataFrame that looks as follows:

+-----------+-----+------------+
|     userID|group|    features|
+-----------+-----+------------+
|12462563356|    1|  [5.0,43.0]|
|12462563701|    2|   [1.0,8.0]|
|12462563701|    1|  [2.0,12.0]|
|12462564356|    1|   [1.0,1.0]|
|12462565487|    3|   [2.0,3.0]|
|12462565698|    2|   [1.0,1.0]|
|12462565698|    1|   [1.0,1.0]|
|12462566081|    2|   [1.0,2.0]|
|12462566081|    1|  [1.0,15.0]|
|12462566225|    2|   [1.0,1.0]|
|12462566225|    1|  [9.0,85.0]|
|12462566526|    2|   [1.0,1.0]|
|12462566526|    1| [3.0
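A hedged sketch of one common workaround (not necessarily the accepted answer): fit a separate estimator per group by filtering the DataFrame on each distinct group value. MinMaxScaler is only a stand-in estimator, the toy frame mirrors the one shown, and a SparkSession named spark is assumed.

from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame(
    [(12462563356, 1, Vectors.dense(5.0, 43.0)),
     (12462563701, 2, Vectors.dense(1.0, 8.0)),
     (12462563701, 1, Vectors.dense(2.0, 12.0))],
    ["userID", "group", "features"])

# One fitted model per group value.
models = {}
for g in [r["group"] for r in df.select("group").distinct().collect()]:
    scaler = MinMaxScaler(inputCol="features", outputCol="scaled")
    models[g] = scaler.fit(df.where(df.group == g))

scaled_parts = [models[g].transform(df.where(df.group == g)) for g in models]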

Difference between org.apache.spark.ml.classification and org.apache.spark.mllib.classification

感情迁移 submitted on 2019-11-27 03:50:35
Question: I'm writing a Spark application and would like to use algorithms from MLlib. In the API docs I found two different classes for the same algorithm. For example, there is a LogisticRegression in org.apache.spark.ml.classification and a LogisticRegressionWithSGD in org.apache.spark.mllib.classification. The only difference I can find is that the one in org.apache.spark.ml inherits from Estimator and can be used in cross validation. I was quite confused that they are placed in
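A brief hedged sketch contrasting the two packages on toy data (a SparkSession named spark is assumed): spark.ml is the DataFrame-based Estimator/Transformer API, spark.mllib is the older RDD-based API.

from pyspark.ml.classification import LogisticRegression            # DataFrame-based (spark.ml)
from pyspark.ml.linalg import Vectors
from pyspark.mllib.classification import LogisticRegressionWithSGD  # RDD-based (spark.mllib)
from pyspark.mllib.regression import LabeledPoint

# spark.ml: an Estimator fit on a DataFrame with label/features columns,
# so it composes with Pipeline and CrossValidator.
train_df = spark.createDataFrame(
    [(0.0, Vectors.dense(0.0, 1.0)), (1.0, Vectors.dense(1.0, 0.0))],
    ["label", "features"])
ml_model = LogisticRegression(maxIter=10).fit(train_df)

# spark.mllib: a static train() method on an RDD[LabeledPoint].
rdd = spark.sparkContext.parallelize(
    [LabeledPoint(0.0, [0.0, 1.0]), LabeledPoint(1.0, [1.0, 0.0])])
mllib_model = LogisticRegressionWithSGD.train(rdd, iterations=10)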

Pyspark random forest feature importance mapping after column transformations

五迷三道 submitted on 2019-11-27 03:40:47
Question: I am trying to plot the feature importances of certain tree-based models with column names. I am using PySpark. Since I had textual categorical variables as well as numeric ones, I had to use a pipeline method, which goes something like this: use a StringIndexer to index the string columns, use a OneHotEncoder for all columns, and use a VectorAssembler to create the feature column containing the feature vector. Some sample code from the docs for steps 1, 2, and 3: from pyspark.ml import Pipeline from pyspark.ml
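Not from the original answer, but a hedged sketch of one common way to map importances back to column names in PySpark: read the ML attribute metadata that VectorAssembler attaches to the features column and pair it with the fitted model's featureImportances. assembled_df and rf_model are hypothetical names for the transformed frame and the fitted tree-based model.

# Hypothetical names: assembled_df is the frame after the pipeline's
# VectorAssembler; rf_model is the fitted random forest model.
meta = assembled_df.schema["features"].metadata["ml_attr"]["attrs"]
name_by_idx = {a["idx"]: a["name"]
               for group in meta.values()      # e.g. "numeric", "binary"
               for a in group}

importances = rf_model.featureImportances      # a Vector: one weight per feature slot
ranked = sorted(((name_by_idx.get(i, "f%d" % i), float(v))
                 for i, v in enumerate(importances.toArray()) if v > 0),
                key=lambda t: -t[1])
print(ranked)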

Is Spark's KMeans unable to handle bigdata?

一笑奈何 submitted on 2019-11-27 02:14:29
KMeans has several parameters for its training, with the initialization mode defaulting to kmeans||. The problem is that it marches quickly (in less than 10 minutes) through the first 13 stages, but then hangs completely, without yielding an error! Minimal example which reproduces the issue (it will succeed if I use 1000 points or random initialization):

from pyspark.context import SparkContext
from pyspark.mllib.clustering import KMeans
from pyspark.mllib.random import RandomRDDs

if __name__ == "__main__":
    sc = SparkContext(appName='kmeansMinimalExample')
    # same with 10000 points
    data = RandomRDDs
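A hedged sketch of the quickest experiments implied by the excerpt itself (fewer points, or random instead of kmeans|| initialization). Parameter values are illustrative, and sc is the SparkContext from the snippet above.

from pyspark.mllib.clustering import KMeans
from pyspark.mllib.random import RandomRDDs

# Smaller toy data plus random initialization, which the excerpt reports as the
# combination that completes; switch back to kmeans|| to reproduce the hang.
data = RandomRDDs.uniformVectorRDD(sc, 10000, 64, seed=1)
model = KMeans.train(data, k=4, maxIterations=20,
                     initializationMode="random")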