apache-spark-mllib

Spark CountVectorizer returns udt instead of vector [duplicate]

余生颓废 submitted on 2019-12-25 03:15:37
Question: This question already has an answer here: Understanding Representation of Vector Column in Spark SQL (1 answer). Closed last year. I am trying to create a vector of token counts for an LDA analysis in Spark 2.3.0. I have followed some tutorials, and each time they use CountVectorizer to easily convert an Array of Strings into a Vector. I run this short example in my Databricks notebook: import org.apache.spark.ml.feature.CountVectorizer val testW = Seq( (8, Array("Zara", "Nuha", "Ayan", "markle")), (9,
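The body above is cut off; as a minimal sketch of the situation being described (the data is modeled on the truncated snippet, everything else is illustrative), CountVectorizer does produce an ML Vector column, and the "udt" label is only how Spark SQL reports that column's type:

import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cv-sketch").getOrCreate()
import spark.implicits._

// Data modeled on the truncated snippet above
val testW = Seq(
  (8, Array("Zara", "Nuha", "Ayan", "markle")),
  (9, Array("fail", "spark", "doc"))
).toDF("id", "tokens")

val cvModel = new CountVectorizer()
  .setInputCol("tokens")
  .setOutputCol("features")
  .fit(testW)

val vectorized = cvModel.transform(testW)

// The "features" column holds org.apache.spark.ml.linalg.Vector values; its
// SQL dataType is a VectorUDT, which is simply how Spark SQL represents ML
// vectors, so seeing "udt" does not mean the conversion failed.
println(vectorized.schema("features").dataType)
vectorized.show(truncate = false)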

pyspark OneHotEncoded vectors appear to be missing categories?

倾然丶 夕夏残阳落幕 submitted on 2019-12-25 02:21:07
Question: I am seeing a weird problem when trying to generate one-hot encoded vectors for categorical features using pyspark's OneHotEncoder (https://spark.apache.org/docs/2.1.0/ml-features.html#onehotencoder): it seems like the one-hot vectors are missing some categories (or are maybe formatted oddly when displayed?). Having now answered this question (or provided an answer), it appears that the details below are not totally relevant to understanding the problem. I have a dataset of the form: 1. Wife
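The body is truncated above; the usual explanation for categories that appear to be missing is OneHotEncoder's dropLast behavior, which maps k categories to a vector of length k - 1 and encodes the last category as all zeros. A minimal Scala sketch under that assumption, using the Spark 2.x OneHotEncoder transformer from the linked docs (column names and data are illustrative, not taken from the question):

import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ohe-sketch").getOrCreate()
import spark.implicits._

val df = Seq("a", "b", "c", "a").toDF("category")

val indexed = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)
  .transform(df)

// Default behaviour: dropLast = true, so three categories become a
// two-element sparse vector and the last category shows up as all zeros.
new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
  .transform(indexed)
  .show(truncate = false)

// Keeping all k positions makes every category visible in the vector.
new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
  .setDropLast(false)
  .transform(indexed)
  .show(truncate = false)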

java.lang.NoSuchMethodException: <Class>.<init>(java.lang.String) when copying custom Transformer

我是研究僧i submitted on 2019-12-25 00:24:46
Question: I am currently playing with custom transformers in my spark-shell, using both Spark 2.0.1 and 2.2.1. While writing a custom ML transformer, in order to add it to a pipeline, I noticed that there is an issue with the override of the copy method. The copy method is called by the fit method of the TrainValidationSplit in my case. The error I get: java.lang.NoSuchMethodException: Custom.<init>(java.lang.String) at java.lang.Class.getConstructor0(Class.java:3082) at java.lang.Class.getConstructor(Class
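The stack trace is cut off above, but the exception points at a missing single-String constructor: the default copy implementation locates the transformer's (String uid) constructor by reflection. A hedged sketch of a custom transformer that satisfies it (the class body is illustrative; only the constructor and the copy override matter here):

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

class Custom(override val uid: String) extends Transformer {
  // The no-arg constructor delegates to the String(uid) constructor that
  // defaultCopy() looks up via reflection; its absence is what produces
  // NoSuchMethodException: Custom.<init>(java.lang.String).
  def this() = this(Identifiable.randomUID("custom"))

  override def transform(dataset: Dataset[_]): DataFrame = dataset.toDF()

  override def transformSchema(schema: StructType): StructType = schema

  // defaultCopy builds a new Custom(uid) and copies the params over; an
  // explicit alternative is copyValues(new Custom(uid), extra).
  override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
}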

Convert Array[DenseVector] to CSV with Scala

╄→尐↘猪︶ㄣ submitted on 2019-12-24 21:06:21
Question: I am using the Spark KMeans function with Scala and I need to save the cluster centers obtained into a CSV. This val is of type Array[DenseVector]. val clusters = KMeans.train(parsedData, numClusters, numIterations) val centers = clusters.clusterCenters I tried converting centers to an RDD and then from RDD to DF, but I ran into a lot of problems (e.g., import spark.implicits._ / SQLContext.implicits._ is not working and I cannot use .toDF). I was wondering if there is another way to make a
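Since clusterCenters only holds numClusters entries, one workaround (sketched here under the assumption that writing from the driver is acceptable; the helper name and the path are illustrative) is to skip the RDD/DataFrame conversion and write the CSV with plain Scala:

import org.apache.spark.mllib.linalg.DenseVector
import java.io.PrintWriter

def centersToCsv(centers: Array[DenseVector], path: String): Unit = {
  // One comma-separated line per cluster center
  val lines = centers.map(_.toArray.mkString(","))
  val writer = new PrintWriter(path)
  try lines.foreach(line => writer.println(line))
  finally writer.close()
}

// usage: centersToCsv(centers, "/tmp/centers.csv")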

Handling unseen categorical variables and MaxBins calculation in Spark Multiclass-classification

末鹿安然 submitted on 2019-12-24 17:24:42
Question: Below is the code I have for a RandomForest multiclass-classification model. I am reading from a CSV file and doing various transformations as seen in the code. I am calculating the max number of categories and then giving it as a parameter to RF. This takes a lot of time! Is there a parameter to set, or an easier way, to make the model automatically infer the max categories? It can go above 1000 and I cannot omit them. How do I handle unseen labels on new data for prediction, since
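The question is cut off above, but the two pieces it raises can be sketched in a hedged way (column names, data, and the fallback of 32 are illustrative assumptions, not taken from the original code):

import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("maxbins-sketch").getOrCreate()
import spark.implicits._

// Illustrative training data
val df = Seq(("a", 1.0), ("b", 0.0), ("c", 1.0)).toDF("category", "label")

// Unseen labels at prediction time: StringIndexer can keep them as an extra
// index (handleInvalid = "keep", available from Spark 2.2; older releases
// only offer "skip" or "error").
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .setHandleInvalid("keep")

// maxBins for RandomForest must be at least the largest number of categories
// in any categorical feature, so it can be derived from the data instead of
// hard-coded (32 is the RandomForest default).
val maxBins = math.max(df.select("category").distinct().count().toInt, 32)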

Using LSH in spark to run nearest neighbors query on every point in dataframe

。_饼干妹妹 submitted on 2019-12-24 12:34:45
Question: I need the k nearest neighbors for each feature vector in the dataframe. I'm using BucketedRandomProjectionLSHModel from pyspark. Code for creating the model: brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes", seed=12345, bucketLength=n) model = brp.fit(data_df) df_lsh = model.transform(data_df) Now, how do I run an approximate nearest neighbor query for each point in data_df? I have tried broadcasting the model but got a pickle error. Also, defining a udf to access the model gives
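The question is written against pyspark, but the same idea can be sketched in Scala (ids, vectors, and the 2.0 distance threshold are illustrative): instead of calling approxNearestNeighbors once per row, which would require the model inside a udf, approxSimilarityJoin of the dataset with itself returns all pairs within a distance threshold in one pass:

import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("lsh-sketch").getOrCreate()
import spark.implicits._

val dataDF = Seq(
  (0, Vectors.dense(1.0, 1.0)),
  (1, Vectors.dense(1.0, -1.0)),
  (2, Vectors.dense(-1.0, -1.0))
).toDF("id", "features")

val brp = new BucketedRandomProjectionLSH()
  .setInputCol("features")
  .setOutputCol("hashes")
  .setSeed(12345)
  .setBucketLength(2.0)

val model = brp.fit(dataDF)

// Self-join, dropping the trivial pairs of a point with itself; from here the
// k nearest neighbours of each id can be taken by ordering on distCol.
val pairs = model.approxSimilarityJoin(dataDF, dataDF, 2.0, "distCol")
  .filter(col("datasetA.id") =!= col("datasetB.id"))

pairs.show(truncate = false)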

LDA model prediction inconsistency

冷暖自知 submitted on 2019-12-24 11:37:12
Question: I trained an LDA model and loaded it into the environment to transform new data: from pyspark.ml.clustering import LocalLDAModel lda = LocalLDAModel.load(path) df = lda.transform(text) The model adds a new column called topicDistribution. In my opinion, this distribution should be the same for the same input; otherwise the model is not consistent. However, it is not in practice. May I ask why this happens, and how to fix it? Answer 1: LDA uses randomness when training and, depending on the
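The answer is cut off above; a hedged sketch of the point it appears to be making, written in Scala for consistency with the rest of the page (parameter values are illustrative): fixing the seed at training time is the usual first step toward making a saved and reloaded model produce the same topicDistribution for the same input.

import org.apache.spark.ml.clustering.LDA

val lda = new LDA()
  .setK(10)
  .setMaxIter(20)
  .setSeed(12345L)

// val model = lda.fit(text)            // `text` is the training DataFrame
// model.write.overwrite().save(path)   // later: LocalLDAModel.load(path).transform(text)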

Spark: Split RDD elements into chunks

吃可爱长大的小学妹 submitted on 2019-12-24 09:38:50
Question: I've written a relatively simple Spark job in Scala which reads some data from S3, performs some transformations and aggregations, and finally stores the results into a repository. At the final stage, I have an RDD of my domain model and I would like to group the elements into chunks so that I can do some mass inserts into my repository. I used the RDDFunctions.sliding method to achieve that and it's working almost fine. Here is a simplified version of my code: val processedElements: RDD
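For the chunking itself, a hedged alternative sketch (the chunk size is illustrative, and both the `chunked` helper and `repository.insertAll` are hypothetical names, not part of the question's code): grouping each partition's iterator yields non-overlapping batches without a shuffle and sidesteps RDDFunctions.sliding entirely.

import org.apache.spark.rdd.RDD

def chunked[T](rdd: RDD[T], chunkSize: Int): RDD[Seq[T]] =
  rdd.mapPartitions(_.grouped(chunkSize))

// usage (insertAll is hypothetical):
// chunked(processedElements, 1000).foreach(batch => repository.insertAll(batch))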

Initializing logistic regression coefficients when using the Spark dataset-based ML APIs?

China☆狼群 submitted on 2019-12-24 02:05:14
Question: By default, logistic regression training initializes the coefficients to all zeros. However, I would like to initialize the coefficients myself. This would be useful, for example, if a previous training run crashed after several iterations: I could simply restart training with the last known set of coefficients. Is this possible with any of the dataset/dataframe-based APIs, preferably Scala? Looking at the Spark source code, it seems that there is a method setInitialModel to initialize
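The sentence above breaks off at setInitialModel, which in the DataFrame-based API appears not to be publicly accessible. As a hedged workaround sketch (the helper name and the two-class setting are illustrative), the RDD-based mllib API does expose a run overload that accepts initial weights:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Resume training from the last known coefficients using the RDD-based API.
def resumeTraining(trainingData: RDD[LabeledPoint],
                   lastKnownWeights: Array[Double]) = {
  val lr = new LogisticRegressionWithLBFGS().setNumClasses(2)
  lr.run(trainingData, Vectors.dense(lastKnownWeights))
}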

Probability of predictions using Spark LogisticRegressionWithLBFGS for multiclass classification

落爺英雄遲暮 submitted on 2019-12-24 00:45:43
Question: I am using LogisticRegressionWithLBFGS() to train a model with multiple classes. The mllib documentation says that clearThreshold() can be used only if the classification is binary. Is there a way to use something similar for multiclass classification, in order to output the probabilities of each class for a given input to the model? Answer 1: There are two ways to accomplish this. One is to create a method that assumes the responsibility of predictPoint in LogisticRegression.scala
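The answer is cut off above; a hedged sketch of that first approach (it assumes the mllib multinomial layout, a flattened weight matrix of numClasses - 1 rows with class 0 as the pivot, and a model trained without an intercept):

import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vector

def classProbabilities(model: LogisticRegressionModel, features: Vector): Array[Double] = {
  val x = features.toArray
  val w = model.weights.toArray
  val numFeatures = x.length

  // Margins for classes 1 .. numClasses-1; class 0 is the pivot with margin 0.
  val margins = (0 until model.numClasses - 1).map { c =>
    var m = 0.0
    var i = 0
    while (i < numFeatures) { m += x(i) * w(c * numFeatures + i); i += 1 }
    m
  }

  // Softmax against the pivot: P(0) = 1 / (1 + sum exp), P(c) = exp(margin) / (1 + sum exp)
  val expMargins = margins.map(math.exp)
  val denom = 1.0 + expMargins.sum
  (1.0 / denom) +: expMargins.map(_ / denom).toArray
}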