apache-spark-mllib

Spark CountVectorizer returns udt instead of vector [duplicate]

余生颓废 submitted on 2019-12-25 03:15:37
Question: This question already has an answer here: Understanding Representation of Vector Column in Spark SQL (1 answer). Closed last year. I am trying to create a vector of token counts for an LDA analysis in Spark 2.3.0. I have followed some tutorials, and each time they use CountVectorizer to easily convert an Array of Strings into a Vector. I run this short example in my Databricks notebook: import org.apache.spark.ml.feature.CountVectorizer val testW = Seq( (8, Array("Zara", "Nuha", "Ayan", "markle")), (9,
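The body above is cut off; as a minimal sketch of the situation being described (the data is modeled on the truncated snippet, everything else is illustrative), CountVectorizer does produce an ML Vector column, and the "udt" label is only how Spark SQL reports that column's type:

import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cv-sketch").getOrCreate()
import spark.implicits._

// Data modeled on the truncated snippet above
val testW = Seq(
  (8, Array("Zara", "Nuha", "Ayan", "markle")),
  (9, Array("fail", "spark", "doc"))
).toDF("id", "tokens")

val cvModel = new CountVectorizer()
  .setInputCol("tokens")
  .setOutputCol("features")
  .fit(testW)

val vectorized = cvModel.transform(testW)

// The "features" column holds org.apache.spark.ml.linalg.Vector values; its
// SQL dataType is a VectorUDT, which is simply how Spark SQL represents ML
// vectors, so seeing "udt" does not mean the conversion failed.
println(vectorized.schema("features").dataType)
vectorized.show(truncate = false)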

pyspark OneHotEncoded vectors appear to be missing categories?

倾然丶 夕夏残阳落幕 submitted on 2019-12-25 02:21:07
Question: I am seeing a weird problem when trying to generate one-hot encoded vectors for categorical features using pyspark's OneHotEncoder (https://spark.apache.org/docs/2.1.0/ml-features.html#onehotencoder): it seems like the one-hot vectors are missing some categories (or are maybe formatted oddly when displayed?). Having now answered this question (or provided an answer), it appears that the details below are not totally relevant to understanding the problem. I have a dataset of the form: 1. Wife
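The body is truncated above; the usual explanation for categories that appear to be missing is OneHotEncoder's dropLast behavior, which maps k categories to a vector of length k - 1 and encodes the last category as all zeros. A minimal Scala sketch under that assumption, using the Spark 2.x OneHotEncoder transformer from the linked docs (column names and data are illustrative, not taken from the question):

import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ohe-sketch").getOrCreate()
import spark.implicits._

val df = Seq("a", "b", "c", "a").toDF("category")

val indexed = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)
  .transform(df)

// Default behaviour: dropLast = true, so three categories become a
// two-element sparse vector and the last category shows up as all zeros.
new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
  .transform(indexed)
  .show(truncate = false)

// Keeping all k positions makes every category visible in the vector.
new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
  .setDropLast(false)
  .transform(indexed)
  .show(truncate = false)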

java.lang.NoSuchMethodException: <Class>.<init>(java.lang.String) when copying custom Transformer

我是研究僧i submitted on 2019-12-25 00:24:46
Question: I am currently playing with custom transformers in my spark-shell, using both Spark 2.0.1 and 2.2.1. While writing a custom ML transformer, in order to add it to a pipeline, I noticed that there is an issue with the override of the copy method. The copy method is called by the fit method of the TrainValidationSplit in my case. The error I get: java.lang.NoSuchMethodException: Custom.<init>(java.lang.String) at java.lang.Class.getConstructor0(Class.java:3082) at java.lang.Class.getConstructor(Class
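The stack trace is cut off above, but the exception points at a missing single-String constructor: the default copy implementation locates the transformer's (String uid) constructor by reflection. A hedged sketch of a custom transformer that satisfies it (the class body is illustrative; only the constructor and the copy override matter here):

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

class Custom(override val uid: String) extends Transformer {
  // The no-arg constructor delegates to the String(uid) constructor that
  // defaultCopy() looks up via reflection; its absence is what produces
  // NoSuchMethodException: Custom.<init>(java.lang.String).
  def this() = this(Identifiable.randomUID("custom"))

  override def transform(dataset: Dataset[_]): DataFrame = dataset.toDF()

  override def transformSchema(schema: StructType): StructType = schema

  // defaultCopy builds a new Custom(uid) and copies the params over; an
  // explicit alternative is copyValues(new Custom(uid), extra).
  override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
}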

Convert Array[DenseVector] to CSV with Scala

╄→尐↘猪︶ㄣ submitted on 2019-12-24 21:06:21
Question: I am using the Spark KMeans function with Scala and I need to save the cluster centers obtained into a CSV. This val is of type Array[DenseVector]. val clusters = KMeans.train(parsedData, numClusters, numIterations) val centers = clusters.clusterCenters I tried converting centers to an RDD and then from RDD to DF, but I ran into a lot of problems (e.g., import spark.implicits._ / SQLContext.implicits._ is not working and I cannot use .toDF). I was wondering if there is another way to make a
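Since clusterCenters only holds numClusters entries, one workaround (sketched here under the assumption that writing from the driver is acceptable; the helper name and the path are illustrative) is to skip the RDD/DataFrame conversion and write the CSV with plain Scala:

import org.apache.spark.mllib.linalg.DenseVector
import java.io.PrintWriter

def centersToCsv(centers: Array[DenseVector], path: String): Unit = {
  // One comma-separated line per cluster center
  val lines = centers.map(_.toArray.mkString(","))
  val writer = new PrintWriter(path)
  try lines.foreach(line => writer.println(line))
  finally writer.close()
}

// usage: centersToCsv(centers, "/tmp/centers.csv")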

Handling unseen categorical variables and MaxBins calculation in Spark Multiclass-classification

末鹿安然 submitted on 2019-12-24 17:24:42
Question: Below is the code I have for a RandomForest multiclass-classification model. I am reading from a CSV file and doing various transformations as seen in the code. I am calculating the max number of categories and then giving it as a parameter to RF. This takes a lot of time! Is there a parameter to set, or an easier way, to make the model automatically infer the max categories? It can go above 1000 and I cannot omit them. How do I handle unseen labels on new data for prediction, since
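The question is cut off above, but the two pieces it raises can be sketched in a hedged way (column names, data, and the fallback of 32 are illustrative assumptions, not taken from the original code):

import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("maxbins-sketch").getOrCreate()
import spark.implicits._

// Illustrative training data
val df = Seq(("a", 1.0), ("b", 0.0), ("c", 1.0)).toDF("category", "label")

// Unseen labels at prediction time: StringIndexer can keep them as an extra
// index (handleInvalid = "keep", available from Spark 2.2; older releases
// only offer "skip" or "error").
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .setHandleInvalid("keep")

// maxBins for RandomForest must be at least the largest number of categories
// in any categorical feature, so it can be derived from the data instead of
// hard-coded (32 is the RandomForest default).
val maxBins = math.max(df.select("category").distinct().count().toInt, 32)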

Using LSH in spark to run nearest neighbors query on every point in dataframe

。_饼干妹妹 submitted on 2019-12-24 12:34:45
Question: I need the k nearest neighbors for each feature vector in the dataframe. I'm using BucketedRandomProjectionLSHModel from pyspark. Code for creating the model: brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes", seed=12345, bucketLength=n) model = brp.fit(data_df) df_lsh = model.transform(data_df) Now, how do I run an approximate nearest neighbor query for each point in data_df? I have tried broadcasting the model but got a pickle error. Also, defining a udf to access the model gives
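The question is written against pyspark, but the same idea can be sketched in Scala (ids, vectors, and the 2.0 distance threshold are illustrative): instead of calling approxNearestNeighbors once per row, which would require the model inside a udf, approxSimilarityJoin of the dataset with itself returns all pairs within a distance threshold in one pass:

import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("lsh-sketch").getOrCreate()
import spark.implicits._

val dataDF = Seq(
  (0, Vectors.dense(1.0, 1.0)),
  (1, Vectors.dense(1.0, -1.0)),
  (2, Vectors.dense(-1.0, -1.0))
).toDF("id", "features")

val brp = new BucketedRandomProjectionLSH()
  .setInputCol("features")
  .setOutputCol("hashes")
  .setSeed(12345)
  .setBucketLength(2.0)

val model = brp.fit(dataDF)

// Self-join, dropping the trivial pairs of a point with itself; from here the
// k nearest neighbours of each id can be taken by ordering on distCol.
val pairs = model.approxSimilarityJoin(dataDF, dataDF, 2.0, "distCol")
  .filter(col("datasetA.id") =!= col("datasetB.id"))

pairs.show(truncate = false)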

LDA model prediction inconsistency

冷暖自知 submitted on 2019-12-24 11:37:12
Question: I trained an LDA model and loaded it into the environment to transform new data: from pyspark.ml.clustering import LocalLDAModel lda = LocalLDAModel.load(path) df = lda.transform(text) The model adds a new column called topicDistribution. In my opinion, this distribution should be the same for the same input; otherwise the model is not consistent. However, it is not in practice. May I ask why this happens, and how to fix it? Answer 1: LDA uses randomness when training and, depending on the
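The answer is cut off above; a hedged sketch of the point it appears to be making, written in Scala for consistency with the rest of the page (parameter values are illustrative): fixing the seed at training time is the usual first step toward making a saved and reloaded model produce the same topicDistribution for the same input.

import org.apache.spark.ml.clustering.LDA

val lda = new LDA()
  .setK(10)
  .setMaxIter(20)
  .setSeed(12345L)

// val model = lda.fit(text)            // `text` is the training DataFrame
// model.write.overwrite().save(path)   // later: LocalLDAModel.load(path).transform(text)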

Spark: Split RDD elements into chunks

吃可爱长大的小学妹 submitted on 2019-12-24 09:38:50
Question: I've written a relatively simple Spark job in Scala which reads some data from S3, performs some transformations and aggregations, and finally stores the results into a repository. At the final stage, I have an RDD of my domain model and I would like to group the elements into chunks so that I can do some mass inserts into my repository. I used the RDDFunctions.sliding method to achieve that and it's working almost fine. Here is a simplified version of my code: val processedElements: RDD
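For the chunking itself, a hedged alternative sketch (the chunk size is illustrative, and both the `chunked` helper and `repository.insertAll` are hypothetical names, not part of the question's code): grouping each partition's iterator yields non-overlapping batches without a shuffle and sidesteps RDDFunctions.sliding entirely.

import org.apache.spark.rdd.RDD

def chunked[T](rdd: RDD[T], chunkSize: Int): RDD[Seq[T]] =
  rdd.mapPartitions(_.grouped(chunkSize))

// usage (insertAll is hypothetical):
// chunked(processedElements, 1000).foreach(batch => repository.insertAll(batch))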

Initializing logistic regression coefficients when using the Spark dataset-based ML APIs?

China☆狼群 submitted on 2019-12-24 02:05:14
Question: By default, logistic regression training initializes the coefficients to all zeros. However, I would like to initialize the coefficients myself. This would be useful, for example, if a previous training run crashed after several iterations: I could simply restart training with the last known set of coefficients. Is this possible with any of the dataset/dataframe-based APIs, preferably Scala? Looking at the Spark source code, it seems that there is a method setInitialModel to initialize
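The sentence above breaks off at setInitialModel, which in the DataFrame-based API appears not to be publicly accessible. As a hedged workaround sketch (the helper name and the two-class setting are illustrative), the RDD-based mllib API does expose a run overload that accepts initial weights:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Resume training from the last known coefficients using the RDD-based API.
def resumeTraining(trainingData: RDD[LabeledPoint],
                   lastKnownWeights: Array[Double]) = {
  val lr = new LogisticRegressionWithLBFGS().setNumClasses(2)
  lr.run(trainingData, Vectors.dense(lastKnownWeights))
}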

Probability of predictions using Spark LogisticRegressionWithLBFGS for multiclass classification

落爺英雄遲暮 submitted on 2019-12-24 00:45:43
Question: I am using LogisticRegressionWithLBFGS() to train a model with multiple classes. The mllib documentation says that clearThreshold() can be used only if the classification is binary. Is there a way to use something similar for multiclass classification, in order to output the probabilities of each class for a given input to the model? Answer 1: There are two ways to accomplish this. One is to create a method that assumes the responsibility of predictPoint in LogisticRegression.scala
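The answer is cut off above; a hedged sketch of that first approach (it assumes the mllib multinomial layout, a flattened weight matrix of numClasses - 1 rows with class 0 as the pivot, and a model trained without an intercept):

import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vector

def classProbabilities(model: LogisticRegressionModel, features: Vector): Array[Double] = {
  val x = features.toArray
  val w = model.weights.toArray
  val numFeatures = x.length

  // Margins for classes 1 .. numClasses-1; class 0 is the pivot with margin 0.
  val margins = (0 until model.numClasses - 1).map { c =>
    var m = 0.0
    var i = 0
    while (i < numFeatures) { m += x(i) * w(c * numFeatures + i); i += 1 }
    m
  }

  // Softmax against the pivot: P(0) = 1 / (1 + sum exp), P(c) = exp(margin) / (1 + sum exp)
  val expMargins = margins.map(math.exp)
  val denom = 1.0 + expMargins.sum
  (1.0 / denom) +: expMargins.map(_ / denom).toArray
}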