apache-spark-ml

How to get word details from TF Vector RDD in Spark ML Lib?

混江龙づ霸主 submitted on 2019-11-26 08:24:52
Question: I have created term frequencies using HashingTF in Spark and obtained the term frequencies for each word with tf.transform. But the results come back in this format: [<hashIndexOfHashBucketOfWord1>, <hashIndexOfHashBucketOfWord2>, ...], [termFrequencyOfWord1, termFrequencyOfWord2, ...], e.g. (1048576,[105,3116],[1.0,2.0]). I can get the index of a word's hash bucket using tf.indexOf("word"), but how can I get the word back from the index? Answer 1: Well, you can't. Since hashing is non-injective there …
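The excerpt cuts off before the workaround. A common one, shown here as a minimal PySpark sketch rather than the original answer, is to use CountVectorizer instead of HashingTF: it keeps an explicit vocabulary, so every vector index maps back to a word (the toy corpus and column names below are made up).

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizer

spark = SparkSession.builder.master("local[*]").appName("cv-demo").getOrCreate()

# Hypothetical toy corpus: one row per document, already tokenized.
docs = spark.createDataFrame(
    [(0, ["spark", "ml", "spark"]), (1, ["hashing", "tf"])],
    ["id", "words"],
)

# CountVectorizer keeps an explicit vocabulary, so indices map back to words.
cv = CountVectorizer(inputCol="words", outputCol="features")
model = cv.fit(docs)

# model.vocabulary[i] is the word stored at vector index i.
print(model.vocabulary)
model.transform(docs).show(truncate=False)
```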

MatchError while accessing vector column in Spark 2.0

拈花ヽ惹草 submitted on 2019-11-26 07:49:19
Question: I am trying to build an LDA model on a JSON file. Creating a Spark session and loading the JSON file:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder
  .master("local")
  .appName("my-spark-app")
  .config("spark.some.config.option", "config-value")
  .getOrCreate()
val df = spark.read.json("dbfs:/mnt/JSON6/JSON/sampleDoc.txt")
Displaying df should show the DataFrame: display(df)
Tokenize the text:
import org.apache.spark.ml.feature.RegexTokenizer // Set params …
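The original question is in Scala; in Spark 2.0 this MatchError usually comes from mixing org.apache.spark.mllib.linalg and org.apache.spark.ml.linalg vector types. As an illustration only (not the original answer), here is a PySpark sketch that stays entirely inside spark.ml for tokenizing, vectorizing, and fitting LDA, with toy data and column names as assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import RegexTokenizer, CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.master("local[*]").appName("lda-demo").getOrCreate()

# Hypothetical stand-in for the JSON corpus; assumes a "text" column.
df = spark.createDataFrame(
    [(0, "spark ml lda topic model"), (1, "vectors and dataframes in spark")],
    ["id", "text"],
)

tokenizer = RegexTokenizer(inputCol="text", outputCol="words", pattern="\\W")
tokens = tokenizer.transform(df)

cv = CountVectorizer(inputCol="words", outputCol="features")
vectorized = cv.fit(tokens).transform(tokens)

# Staying entirely inside spark.ml keeps every column an ml.linalg Vector,
# which avoids the mllib-vs-ml Vector mismatch behind the MatchError.
lda = LDA(k=2, maxIter=10, featuresCol="features")
model = lda.fit(vectorized)
model.describeTopics().show(truncate=False)
```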

Create a custom Transformer in PySpark ML

我的梦境 submitted on 2019-11-26 07:28:49
Question: I am new to Spark SQL DataFrames and ML on them (PySpark). How can I create a custom tokenizer, which for example removes stop words and uses some libraries from nltk? Can I extend the default one? Thanks. Answer 1: Can I extend the default one? Not really. The default Tokenizer is a subclass of pyspark.ml.wrapper.JavaTransformer and, like the other transformers and estimators from pyspark.ml.feature, delegates the actual processing to its Scala counterpart. Since you want to use Python you should extend …
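The truncated answer points at extending a pure-Python Transformer rather than the JVM-backed Tokenizer. A minimal sketch along those lines follows; the class name, columns, and stop-word list are all made up, and you could swap in nltk.corpus.stopwords.words("english") if nltk is available.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType


class StopWordRemover(Transformer, HasInputCol, HasOutputCol):
    """Drops a fixed set of stop words from an array-of-strings column."""

    def __init__(self, inputCol=None, outputCol=None, stopwords=None):
        super().__init__()
        self._set(inputCol=inputCol, outputCol=outputCol)
        self.stopwords = set(stopwords or [])

    def _transform(self, dataset):
        stopwords = self.stopwords  # local copy so the closure does not capture self
        drop = udf(lambda tokens: [t for t in tokens if t not in stopwords],
                   ArrayType(StringType()))
        return dataset.withColumn(self.getOutputCol(),
                                  drop(dataset[self.getInputCol()]))


spark = SparkSession.builder.master("local[*]").appName("custom-tf").getOrCreate()
df = spark.createDataFrame([(["a", "quick", "brown", "fox"],)], ["words"])

remover = StopWordRemover(inputCol="words", outputCol="filtered",
                          stopwords=["a", "the"])
remover.transform(df).show(truncate=False)
```

Because it subclasses Transformer with the shared input/output column params, this sketch can be dropped into a Pipeline alongside the built-in stages.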

Save ML model for future usage

二次信任 submitted on 2019-11-26 06:35:54
Question: I was applying some machine-learning algorithms such as Linear Regression, Logistic Regression, and Naive Bayes to some data, but I was trying to avoid using RDDs and start using DataFrames, because RDDs are slower than DataFrames under PySpark (see pic 1). The other reason I am using DataFrames is that the ml library has a very useful class for tuning models, CrossValidator. This class returns a model after fitting it; obviously this method has to test several scenarios, and after …
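The excerpt ends before the answer. For completeness, here is a minimal, hedged PySpark sketch of persisting and reloading an ml model with write().save() and load(); the path and toy data are placeholders, and with CrossValidator you would typically persist cvModel.bestModel in the same way.

```python
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel

spark = SparkSession.builder.master("local[*]").appName("persist-demo").getOrCreate()

# Toy training data: (label, features).
train_df = spark.createDataFrame(
    [(0.0, Vectors.dense(0.0, 1.1)), (1.0, Vectors.dense(2.0, 1.0))],
    ["label", "features"],
)

lr = LogisticRegression(maxIter=10)
model = lr.fit(train_df)

# Persist the fitted model; the path is a placeholder.
model.write().overwrite().save("/tmp/lr_model")

# In a later job, load it back and reuse it without refitting.
restored = LogisticRegressionModel.load("/tmp/lr_model")
restored.transform(train_df).select("label", "prediction").show()
```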

How to define a custom aggregation function to sum a column of Vectors?

[亡魂溺海] submitted on 2019-11-26 05:29:31
Question: I have a DataFrame with two columns, ID of type Int and Vec of type Vector (org.apache.spark.mllib.linalg.Vector). The DataFrame looks as follows:
ID,Vec
1,[0,0,5]
1,[4,0,1]
1,[1,2,1]
2,[7,5,0]
2,[3,3,4]
3,[0,8,1]
3,[0,0,1]
3,[7,7,7]
....
I would like to do a groupBy($"ID") and then apply an aggregation on the rows inside each group, summing the vectors. The desired output for the above example would be:
ID,SumOfVectors
1,[5,2,7]
2,[10,8,4]
3,[7,15,9]
...
The available aggregation functions …
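The question is posed in Scala; as a sketch of one common approach (not the original answer), here is a PySpark version that drops to the RDD API and sums the vectors per ID with numpy. The toy rows mirror the example above.

```python
import numpy as np
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.master("local[*]").appName("sum-vectors").getOrCreate()

df = spark.createDataFrame(
    [(1, Vectors.dense(0, 0, 5)), (1, Vectors.dense(4, 0, 1)), (1, Vectors.dense(1, 2, 1)),
     (2, Vectors.dense(7, 5, 0)), (2, Vectors.dense(3, 3, 4)),
     (3, Vectors.dense(0, 8, 1)), (3, Vectors.dense(0, 0, 1)), (3, Vectors.dense(7, 7, 7))],
    ["ID", "Vec"],
)

# Element-wise sum per ID: reduce numpy arrays keyed by ID, then wrap back into Vectors.
summed = (df.rdd
            .map(lambda row: (row["ID"], np.array(row["Vec"].toArray())))
            .reduceByKey(lambda a, b: a + b)
            .mapValues(lambda a: Vectors.dense(a)))

spark.createDataFrame(summed, ["ID", "SumOfVectors"]).show(truncate=False)
```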

Spark Scala: How to convert DataFrame[Vector] to DataFrame[f1: Double, …, fn: Double]

浪子不回头ぞ submitted on 2019-11-26 02:25:23
Question: I just used StandardScaler to normalize my features for an ML application. After selecting the scaled features, I want to convert this back to a DataFrame of Doubles, though the length of my vectors is arbitrary. I know how to do it for a specific 3 features by using
myDF.map{case Row(v: Vector) => (v(0), v(1), v(2))}.toDF("f1", "f2", "f3")
but not for an arbitrary number of features. Is there an easy way to do this? Example:
val testDF = sc.parallelize(List(Vectors.dense(5D, 6D, 7D), …
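Again a Scala question; as an illustrative PySpark sketch under the assumption of Spark 3.0+, vector_to_array turns the vector into a plain array column, after which columns f1..fn can be generated for an arbitrary vector length. On older versions a UDF around v.toArray() plays the same role.

```python
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.functions import vector_to_array  # requires Spark >= 3.0

spark = SparkSession.builder.master("local[*]").appName("vec-to-cols").getOrCreate()

df = spark.createDataFrame(
    [(Vectors.dense(5.0, 6.0, 7.0),), (Vectors.dense(1.0, 2.0, 3.0),)],
    ["features"],
)

# Turn the vector into a plain array column, then fan it out into f1..fn.
arr_df = df.withColumn("arr", vector_to_array("features"))
n = len(arr_df.first()["arr"])
cols = [arr_df["arr"][i].alias("f%d" % (i + 1)) for i in range(n)]
arr_df.select(*cols).show()
```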

Encode and assemble multiple features in PySpark

血红的双手。 submitted on 2019-11-26 02:19:40
Question: I have a Python class that I'm using to load and process some data in Spark. Among various things I need to do, I'm generating a list of dummy variables derived from various columns in a Spark DataFrame. My problem is that I'm not sure how to properly define a User Defined Function to accomplish what I need. I do currently have a method that, when mapped over the underlying DataFrame RDD, solves half the problem (remember that this is a method in a larger data_processor class): def build …
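The usual DataFrame-native route, sketched below in PySpark with made-up column names, is a Pipeline of StringIndexer, OneHotEncoder and VectorAssembler rather than a hand-written UDF. The multi-column OneHotEncoder shown is the Spark 3.x API; Spark 2.3-2.4 call the equivalent class OneHotEncoderEstimator.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

spark = SparkSession.builder.master("local[*]").appName("encode-demo").getOrCreate()

# Hypothetical frame with one categorical and one numeric column.
df = spark.createDataFrame(
    [("gold", 1.0), ("silver", 2.0), ("gold", 3.0)],
    ["tier", "amount"],
)

indexer = StringIndexer(inputCol="tier", outputCol="tier_idx")
encoder = OneHotEncoder(inputCols=["tier_idx"], outputCols=["tier_vec"])
assembler = VectorAssembler(inputCols=["tier_vec", "amount"], outputCol="features")

pipeline = Pipeline(stages=[indexer, encoder, assembler])
pipeline.fit(df).transform(df).select("features").show(truncate=False)
```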

How to split Vector into columns - using PySpark

霸气de小男生 submitted on 2019-11-26 00:28:54
Question: Context: I have a DataFrame with 2 columns: word and vector, where the column type of "vector" is VectorUDT. An example:
word   | vector
assert | [435,323,324,212...]
And I want to get this:
word   | v1  | v2   | v3  | v4  | v5 | v6 ......
assert | 435 | 5435 | 698 | 356 | ....
Question: How can I split a column with vectors into several columns, one per dimension, using PySpark? Thanks in advance. Answer 1: One possible approach is to convert to and from RDD:
from pyspark.ml.linalg import Vectors
df = sc …
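The excerpt stops at the RDD-based answer. An alternative sketch (not the original answer) uses a small UDF that extracts the i-th element of the vector as a double, with the number of output columns read from the first row; the column names v1..vn are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col, lit
from pyspark.sql.types import DoubleType
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.master("local[*]").appName("split-vector").getOrCreate()

df = spark.createDataFrame(
    [("assert", Vectors.dense(435.0, 323.0, 324.0))],
    ["word", "vector"],
)

# UDF that pulls out the i-th element of a Vector as a plain double.
def element_at(v, i):
    try:
        return float(v[i])
    except (IndexError, TypeError):
        return None

ith = udf(element_at, DoubleType())

# Number of dimensions, taken from the first row of the vector column.
n = len(df.first()["vector"])
df.select(col("word"),
          *[ith("vector", lit(i)).alias("v%d" % (i + 1)) for i in range(n)]).show()
```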