apache-spark-ml

Run ML algorithm inside map function in Spark

走远了吗. Submitted on 2019-11-26 22:10:53
Question: So I have been trying for some days now to run ML algorithms inside a map function in Spark. I posted a more specific question, but referencing Spark's ML algorithms gives me the following error: AttributeError: Cannot load _jvm from SparkContext. Is SparkContext initialized? Obviously I cannot reference SparkContext inside the apply_classifier function. My code is similar to what was suggested in the previous question I asked, but I still haven't found a solution to what I am looking for: def …
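The excerpt is cut off before the code, but the error itself is well understood: Spark ML estimators need the driver-side SparkContext and JVM, so they cannot be called from inside map() on the executors. A minimal PySpark sketch of the usual workaround, looping over the groups on the driver instead (toy data and names, not the poster's code):

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.getOrCreate()

# toy data: one tiny model per value of "group"
df = spark.createDataFrame(
    [(0, 1.0, 0.0), (0, 0.0, 1.0), (1, 1.0, 1.0), (1, 0.0, 0.0)],
    ["group", "x", "label"])

assembler = VectorAssembler(inputCols=["x"], outputCol="features")

models = {}
for g in [r["group"] for r in df.select("group").distinct().collect()]:
    subset = assembler.transform(df.filter(df.group == g))
    # the estimator runs on the driver, where the SparkContext lives
    models[g] = LogisticRegression(maxIter=10).fit(subset)

If per-group training really has to happen on the executors, a single-node library such as scikit-learn inside mapPartitions is the usual alternative.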

Why does StandardScaler not attach metadata to the output column?

社会主义新天地 Submitted on 2019-11-26 22:00:14
Question: I noticed that the ml StandardScaler does not attach metadata to the output column:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature._

val df = spark.read.option("header", true)
  .option("inferSchema", true)
  .csv("/path/to/cars.data")

val strId1 = new StringIndexer()
  .setInputCol("v7")
  .setOutputCol("v7_IDX")
val strId2 = new StringIndexer()
  .setInputCol("v8")
  .setOutputCol("v8_IDX")
val assmbleFeatures: VectorAssembler = new VectorAssembler()
  .setInputCols(Array("v0", …
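The thread's code is Scala; the same behaviour is easy to see from PySpark, and since column metadata can be re-attached with Column.alias(..., metadata=...) (available from Spark 2.2), one workaround is simply to copy the attributes from the assembled input column. A hedged sketch with a made-up two-column frame rather than cars.data:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.feature import VectorAssembler, StandardScaler

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, 2.0), (3.0, 4.0)], ["v0", "v1"])

assembled = VectorAssembler(inputCols=["v0", "v1"], outputCol="features").transform(df)
scaled = StandardScaler(inputCol="features", outputCol="scaled").fit(assembled).transform(assembled)

print(scaled.schema["features"].metadata)  # ml_attr metadata from the assembler
print(scaled.schema["scaled"].metadata)    # empty: StandardScaler does not copy it

# copy the attribute metadata across by hand
patched = scaled.withColumn(
    "scaled",
    col("scaled").alias("scaled", metadata=scaled.schema["features"].metadata))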

How to create a custom Transformer from a UDF?

空扰寡人 Submitted on 2019-11-26 21:35:52
Question: I was trying to create and save a Pipeline with custom stages. I need to add a column to my DataFrame by using a UDF. Therefore, I was wondering whether it is possible to convert a UDF or a similar action into a Transformer? My custom UDF looks like this, and I'd like to learn how to use it as a custom Transformer.

def getFeatures(n: String) = {
  val NUMBER_FEATURES = 4
  val name = n.split(" +")(0).toLowerCase
  ((1 to NUMBER_FEATURES)
    .filter(size => size <= name.length)
    .map(size => …
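Not what the thread ultimately builds (a Scala class extending Transformer), but a related shortcut: when the derived column can be expressed in Spark SQL, the built-in SQLTransformer is already a pipeline stage and no custom class is needed. A hedged PySpark sketch with an invented name column:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import SQLTransformer

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Alice Smith",), ("Bob Jones",)], ["name"])

# lower-cased first token of `name`, added as a new column by a pipeline stage
first_name = SQLTransformer(
    statement="SELECT *, lower(split(name, ' +')[0]) AS first_name FROM __THIS__")

Pipeline(stages=[first_name]).fit(df).transform(df).show()

For logic that is awkward in SQL, such as the prefix list built by getFeatures, a custom Transformer remains the way to go; the next two entries cover exactly that.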

MatchError while accessing vector column in Spark 2.0

天涯浪子 Submitted on 2019-11-26 21:06:51
I am trying to create an LDA model on a JSON file. Creating a Spark session with the JSON file:

import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder
  .master("local")
  .appName("my-spark-app")
  .config("spark.some.config.option", "config-value")
  .getOrCreate()

val df = spark.read.json("dbfs:/mnt/JSON6/JSON/sampleDoc.txt")

Displaying the df should show the DataFrame:

display(df)

Tokenize the text:

import org.apache.spark.ml.feature.RegexTokenizer

// Set params for RegexTokenizer
val tokenizer = new RegexTokenizer()
  .setPattern("[\\W_]+")
  .setMinTokenLength(4) // Filter …
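The excerpt stops before the failing step, but the usual cause of this MatchError in Spark 2.0 is mixing the two vector types: spark.ml stages emit org.apache.spark.ml.linalg vectors, while the old mllib API (for example mllib's LDA) still pattern-matches on org.apache.spark.mllib.linalg vectors. A hedged PySpark sketch of the conversion, with a toy tokenized frame:

from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizer
from pyspark.mllib.util import MLUtils

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(0, ["spark", "ml", "lda"]), (1, ["json", "text", "lda"])],
    ["id", "tokens"])

counted = CountVectorizer(inputCol="tokens", outputCol="features").fit(df).transform(df)

# old-style mllib consumers expect the mllib vector type
legacy = MLUtils.convertVectorColumnsFromML(counted, "features")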

Create a custom Transformer in PySpark ML

穿精又带淫゛_ Submitted on 2019-11-26 20:04:30
I am new to Spark SQL DataFrames and ML on them (PySpark). How can I create a custom tokenizer, which for example removes stop words and uses some libraries from nltk? Can I extend the default one? Thanks.

zero323: Can I extend the default one? Not really. The default Tokenizer is a subclass of pyspark.ml.wrapper.JavaTransformer and, like the other transformers and estimators from pyspark.ml.feature, delegates the actual processing to its Scala counterpart. Since you want to use Python, you should extend pyspark.ml.pipeline.Transformer directly.

import nltk
from pyspark import keyword_only  ## < 2.0 -> …
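The code in the answer is cut off; a minimal sketch of the pattern it describes (a pure-Python stage extending Transformer, with the stop words kept in a Param) might look like the following. It assumes nltk is installed on the driver and executors, and the class and column names are illustrative:

from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param import Param, Params, TypeConverters
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
from nltk.tokenize import wordpunct_tokenize


class NLTKWordPunctTokenizer(Transformer, HasInputCol, HasOutputCol):
    """Python-only stage: tokenize with nltk and drop stop words."""

    stopwords = Param(Params._dummy(), "stopwords", "words to drop",
                      typeConverter=TypeConverters.toListString)

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None, stopwords=None):
        super(NLTKWordPunctTokenizer, self).__init__()
        self._setDefault(stopwords=[])
        kwargs = self._input_kwargs
        self._set(**kwargs)

    def getStopwords(self):
        return self.getOrDefault(self.stopwords)

    def _transform(self, dataset):
        stopwords = set(self.getStopwords())

        def tokenize(s):
            return [t for t in wordpunct_tokenize(s) if t.lower() not in stopwords]

        f = udf(tokenize, ArrayType(StringType()))
        return dataset.withColumn(self.getOutputCol(), f(dataset[self.getInputCol()]))

A stage written this way drops into a Pipeline like any built-in transformer, but the Java-based pipeline writer cannot persist it, which is exactly the topic of the next entry.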

Serialize a custom transformer using python to be used within a Pyspark ML pipeline

点点圈 Submitted on 2019-11-26 20:02:01
Question: I found the same discussion in the comments section of Create a custom Transformer in PySpark ML, but there is no clear answer. There is also an unresolved JIRA corresponding to it: https://issues.apache.org/jira/browse/SPARK-17025. Given that the Pyspark ML pipeline provides no option for saving a custom transformer written in Python, what are the other options to get it done? How can I implement the _to_java method in my Python class so that it returns a compatible Java object? Answer 1: As of …
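The answer is cut off at "As of"; for versions from Spark 2.3 onwards there is a simpler route than _to_java: mix DefaultParamsReadable and DefaultParamsWritable into the Python stage and it can be saved and reloaded directly, including inside a Pipeline. A hedged sketch with an illustrative class and path:

from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql.functions import col


class ColumnCopier(Transformer, HasInputCol, HasOutputCol,
                    DefaultParamsReadable, DefaultParamsWritable):
    """Toy pure-Python stage used only to demonstrate persistence."""

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        super(ColumnCopier, self).__init__()
        self._set(**self._input_kwargs)

    def _transform(self, dataset):
        return dataset.withColumn(self.getOutputCol(), col(self.getInputCol()))


stage = ColumnCopier(inputCol="a", outputCol="b")
stage.write().overwrite().save("/tmp/column_copier")   # Params-based writer, no JVM class needed
reloaded = ColumnCopier.load("/tmp/column_copier")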

spark.ml StringIndexer throws 'Unseen label' on fit()

允我心安 Submitted on 2019-11-26 19:06:50
I'm preparing a toy spark.ml example. Spark version 1.6.0, running on top of Oracle JDK version 1.8.0_65, pyspark, IPython notebook. First, this hardly has anything to do with Spark, ML, StringIndexer: handling unseen labels. The exception is thrown while fitting a pipeline to a dataset, not while transforming it. And suppressing the exception might not be a solution here, since, I'm afraid, the dataset gets messed up pretty badly in this case. My dataset is about 800 MB uncompressed, so it might be hard to reproduce (smaller subsets seem to dodge this issue). The dataset looks like this: +-------------- …
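The sample data is cut off, so the sketch below is not the poster's 800 MB dataset. Two points are worth noting: Pipeline.fit() transforms the data with each already-fitted stage before fitting the next one, which is how a transform-time "Unseen label" error can surface during fit(); and StringIndexer's handleInvalid option ("skip", plus "keep" from Spark 2.2) is the usual mitigation, assuming a version that exposes it in pyspark:

from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer

spark = SparkSession.builder.getOrCreate()
train = spark.createDataFrame([("a",), ("b",)], ["cat"])
test = spark.createDataFrame([("a",), ("c",)], ["cat"])   # "c" is unseen

indexer = StringIndexer(inputCol="cat", outputCol="cat_idx").fit(train)
# indexer.transform(test).show()                          # raises: Unseen label: c

tolerant = StringIndexer(inputCol="cat", outputCol="cat_idx",
                         handleInvalid="skip").fit(train)
tolerant.transform(test).show()                           # the "c" row is dropped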

Dropping a nested column from Spark DataFrame

冷暖自知 Submitted on 2019-11-26 18:55:39
I have a DataFrame with the schema:

root
 |-- label: string (nullable = true)
 |-- features: struct (nullable = true)
 |    |-- feat1: string (nullable = true)
 |    |-- feat2: string (nullable = true)
 |    |-- feat3: string (nullable = true)

While I am able to filter the data frame using

val data = rawData
  .filter(!(rawData("features.feat1") <=> "100"))

I am unable to drop the column using

val data = rawData
  .drop("features.feat1")

Is it something that I am doing wrong here? I also tried (unsuccessfully) doing drop(rawData("features.feat1")), though it does not make much sense to do so. Thanks in …
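The thread's code is Scala; the underlying limitation is the same in PySpark: drop() only removes top-level columns, so the usual workaround is to rebuild the struct without the unwanted field. A hedged sketch with a tiny invented frame:

from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import struct, col

spark = SparkSession.builder.getOrCreate()
rawData = spark.createDataFrame(
    [Row(label="x", features=Row(feat1="1", feat2="2", feat3="3"))])

# keep every nested field except feat1 and reassemble the struct
keep = [f for f in rawData.schema["features"].dataType.fieldNames() if f != "feat1"]
data = rawData.withColumn(
    "features", struct(*[col("features." + f).alias(f) for f in keep]))

data.printSchema()   # features now only contains feat2 and feat3

# On Spark 3.1+ the same can be done with:
# rawData.withColumn("features", col("features").dropFields("feat1"))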

Save ML model for future usage

筅森魡賤 Submitted on 2019-11-26 18:49:47
I was applying some machine learning algorithms like Linear Regression, Logistic Regression, and Naive Bayes to some data, but I was trying to avoid using RDDs and start using DataFrames, because RDDs are slower than DataFrames under pyspark (see pic 1). The other reason I am using DataFrames is that the ml library has a class that is very useful for tuning models, CrossValidator. This class returns a model after fitting it; obviously it has to test several scenarios, and afterwards it returns the fitted model with the best combination of parameters. The cluster I use isn't so …
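The excerpt stops mid-sentence; as far as the title's question goes, fitted ml models and pipelines can be written to a path and reloaded later with the matching load(). A hedged PySpark sketch with toy data and an illustrative path (PipelineModel persistence is available from Spark 2.0, CrossValidatorModel persistence in pyspark from 2.3):

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.getOrCreate()
train = spark.createDataFrame(
    [(1.0, 0.0, 0.0), (0.0, 1.0, 1.0)], ["x1", "x2", "label"])

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["x1", "x2"], outputCol="features"),
    LogisticRegression(maxIter=10)])
model = pipeline.fit(train)

model.write().overwrite().save("/tmp/lr_pipeline_model")
same_model = PipelineModel.load("/tmp/lr_pipeline_model")   # ready for .transform()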

How to define a custom aggregation function to sum a column of Vectors?

好久不见. Submitted on 2019-11-26 17:36:21
I have a DataFrame of two columns, ID of type Int and Vec of type Vector (org.apache.spark.mllib.linalg.Vector). The DataFrame looks as follows:

ID,Vec
1,[0,0,5]
1,[4,0,1]
1,[1,2,1]
2,[7,5,0]
2,[3,3,4]
3,[0,8,1]
3,[0,0,1]
3,[7,7,7]
....

I would like to do a groupBy($"ID") and then apply an aggregation on the rows inside each group, summing the vectors. The desired output for the above example would be:

ID,SumOfVectors
1,[5,2,7]
2,[10,8,4]
3,[7,15,9]
...

The available aggregation functions will not work; e.g. df.groupBy($"ID").agg(sum($"Vec")) will lead to a ClassCastException. How to implement …
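The thread is Scala; one language-agnostic way around the missing built-in aggregate is to drop to the RDD API, sum the vectors per key with numpy, and rebuild a vector column. A hedged PySpark sketch using the newer pyspark.ml.linalg type (substitute pyspark.mllib.linalg to match the question exactly):

import numpy as np
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, Vectors.dense([0, 0, 5])), (1, Vectors.dense([4, 0, 1])),
     (2, Vectors.dense([7, 5, 0])), (2, Vectors.dense([3, 3, 4]))],
    ["ID", "Vec"])

summed = (df.rdd
          .map(lambda row: (row["ID"], np.asarray(row["Vec"].toArray())))
          .reduceByKey(lambda a, b: a + b)               # element-wise sum per ID
          .map(lambda kv: (kv[0], Vectors.dense(kv[1]))))

summed.toDF(["ID", "SumOfVectors"]).show()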