apache-spark-ml

Issue with VectorUDT when using Spark ML

£可爱£侵袭症+ Submitted on 2019-12-09 17:02:56
Question: I am writing a UDAF to be applied to a Spark DataFrame column of type Vector (spark.ml.linalg.Vector). I rely on the spark.ml.linalg package so that I do not have to go back and forth between DataFrames and RDDs. Inside the UDAF, I have to specify a data type for the input, buffer, and output schemas: def inputSchema = new StructType().add("features", new VectorUDT()) def bufferSchema: StructType = StructType(StructField("list_of_similarities", ArrayType(new VectorUDT(), true), true) :: Nil)
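For reference, the vector SQL type is also exposed on the Python side. A minimal PySpark sketch (the original question is in Scala; the "features" column name is taken from it) of a schema declaring a vector field with pyspark.ml.linalg.VectorUDT:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField
    from pyspark.ml.linalg import Vectors, VectorUDT

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # Declare a schema whose "features" column holds ml vectors, mirroring the UDAF's inputSchema.
    schema = StructType([StructField("features", VectorUDT(), True)])
    df = spark.createDataFrame([(Vectors.dense([1.0, 2.0]),)], schema)
    df.printSchema()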

pyspark.sql.utils.IllegalArgumentException: u'Field "features" does not exist.'

安稳与你 Submitted on 2019-12-08 11:35:36
Question: I am trying to execute a Random Forest Classifier and evaluate the model using cross-validation. I work with PySpark. The input CSV file is loaded as a Spark DataFrame. But I face an issue while constructing the model. Below is the code. from pyspark import SparkContext from pyspark.sql import SQLContext from pyspark.ml import Pipeline from pyspark.ml.classification import RandomForestClassifier from pyspark.ml.tuning import CrossValidator, ParamGridBuilder from pyspark.ml.evaluation import
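This error usually means the DataFrame has no assembled "features" vector column at the point the classifier is fit. A minimal sketch of one common fix, assuming hypothetical input column names (replace "f1", "f2", "f3" and "label" with the CSV's actual columns):

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import RandomForestClassifier

    # Assemble the raw numeric columns into the single "features" vector the classifier expects.
    assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
    rf = RandomForestClassifier(labelCol="label", featuresCol="features")

    pipeline = Pipeline(stages=[assembler, rf])
    model = pipeline.fit(train_df)  # train_df: the DataFrame loaded from the CSV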

How to use L1 penalty in pyspark.ml.regression.LinearRegressionModel for feature selection?

橙三吉。 Submitted on 2019-12-08 10:21:01
Question: Firstly, I use Spark 1.6.0. I want to use the L1 penalty in pyspark.ml.regression.LinearRegressionModel for feature selection. But I cannot get the detailed coefficients when calling the function: lr = LogisticRegression(elasticNetParam=1.0, regParam=0.01,maxIter=100,fitIntercept=False,standardization=False) model = lr.fit(df_one_hot_train) print model.coefficients.toArray().astype(float).tolist() I only get a sparse list like: [0,0,0,0,0,..,-0.0871650387514,..,] While when I use sklearn.linear
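With elasticNetParam=1.0 the penalty is pure L1 (lasso), so features whose coefficients are driven to zero are effectively dropped. A small sketch, reusing the question's df_one_hot_train, of how one might list only the surviving coefficients:

    from pyspark.ml.classification import LogisticRegression

    lr = LogisticRegression(elasticNetParam=1.0, regParam=0.01, maxIter=100,
                            fitIntercept=False, standardization=False)
    model = lr.fit(df_one_hot_train)

    # Keep only the features whose L1-penalized coefficients are non-zero.
    coefs = model.coefficients.toArray()
    selected = [(i, float(c)) for i, c in enumerate(coefs) if c != 0.0]
    print(selected)  # (feature index, coefficient) pairs that survived the penalty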

Formatting data for Spark ML

萝らか妹 Submitted on 2019-12-08 09:33:46
Question: I'm new to Spark and Spark ML. I generated some data with the function KMeansDataGenerator.generateKMeansRDD but I fail when formatting it so that it can then be used by an ML algorithm (here it's K-Means). The error is Exception in thread "main" java.lang.IllegalArgumentException: Data type ArrayType(DoubleType,false) is not supported. It happens when using VectorAssembler. val generatedData = KMeansDataGenerator.generateKMeansRDD(sc, numPoints = 1000, k = 5, d = 3, r = 5, numPartitions
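VectorAssembler does not accept ArrayType(DoubleType) columns; the array has to become an ml Vector first. A hedged PySpark sketch of that conversion (the original question is in Scala, and the "raw"/"features" column names here are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.ml.linalg import Vectors, VectorUDT

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # Stand-in DataFrame with an array<double> column, like the generated K-Means points.
    df = spark.createDataFrame([([1.0, 2.0, 3.0],), ([4.0, 5.0, 6.0],)], ["raw"])

    # Turn the array column into an ml Vector column that KMeans can consume directly.
    to_vector = udf(lambda xs: Vectors.dense(xs), VectorUDT())
    df_vec = df.withColumn("features", to_vector("raw"))
    df_vec.printSchema()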

Apache Spark MLlib with DataFrame API gives java.net.URISyntaxException when createDataFrame() or read().csv(…)

守給你的承諾、 Submitted on 2019-12-08 08:29:28
In a standalone application (running on Java 8, Windows 10, with spark-xxx_2.11:2.0.0 as jar dependencies) the following code gives an error: /* this: */ Dataset<Row> logData = spark_session.createDataFrame(Arrays.asList( new LabeledPoint(1.0, Vectors.dense(4.9,3,1.4,0.2)), new LabeledPoint(1.0, Vectors.dense(4.7,3.2,1.3,0.2)) ), LabeledPoint.class); /* or this: */ /* logFile: "C:\files\project\file.csv", "C:\\files\\project\\file.csv", "C:/files/project/file.csv", "file:/C:/files/project/file.csv", "file:///C:/files/project/file.csv", "/file.csv" */ Dataset<Row> logData = spark_session.read().csv(logFile);
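On Windows, Spark 2.0 can build an invalid file URI from its default spark.sql.warehouse.dir, which may surface as a java.net.URISyntaxException from createDataFrame() or read(). Assuming that is the cause here, a common workaround is to set that directory explicitly when building the session; sketched in PySpark rather than Java:

    from pyspark.sql import SparkSession

    # Assumption: the URISyntaxException comes from the default warehouse dir on Windows,
    # so point spark.sql.warehouse.dir at an explicit, valid file URI.
    spark = (SparkSession.builder
             .master("local[*]")
             .config("spark.sql.warehouse.dir", "file:///C:/tmp/spark-warehouse")
             .getOrCreate())

    df = spark.read.csv("file:///C:/files/project/file.csv")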

Adding a Vectors Column to a pyspark DataFrame

孤街醉人 Submitted on 2019-12-08 06:35:58
Question: How do I add a Vectors.dense column to a PySpark DataFrame? import pandas as pd from pyspark import SparkContext from pyspark.sql import SQLContext from pyspark.ml.linalg import DenseVector py_df = pd.DataFrame.from_dict({"time": [59., 115., 156., 421.], "event": [1, 1, 1, 0]}) sc = SparkContext(master="local") sqlCtx = SQLContext(sc) sdf = sqlCtx.createDataFrame(py_df) sdf.withColumn("features", DenseVector(1)) Gives an error in file anaconda3/lib/python3.6/site-packages/pyspark/sql
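withColumn expects a Column expression, not a raw DenseVector, so the constant vector has to be produced by something that yields a Column. One sketch of doing that with a UDF returning VectorUDT, reusing the question's sdf:

    from pyspark.sql.functions import udf
    from pyspark.ml.linalg import Vectors, VectorUDT

    # Wrap the constant vector in a zero-argument UDF so withColumn receives a Column.
    make_vector = udf(lambda: Vectors.dense([1.0]), VectorUDT())
    sdf_with_features = sdf.withColumn("features", make_vector())
    sdf_with_features.show()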

PySpark ML: Get KMeans cluster statistics

扶醉桌前 Submitted on 2019-12-08 03:54:29
Question: I have built a KMeansModel. My results are stored in a PySpark DataFrame called transformed. (a) How do I interpret the contents of transformed? (b) How do I create one or more pandas DataFrames from transformed that would show summary statistics for each of the 13 features for each of the 14 clusters? from pyspark.ml.clustering import KMeans # Trains a k-means model. kmeans = KMeans().setK(14).setSeed(1) model = kmeans.fit(X_spark_scaled) # Fits a model to the input dataset with optional
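model.transform() appends a "prediction" column holding each row's cluster id, so per-cluster statistics can be computed by grouping on it. A sketch under the assumption that the 13 features also exist as individual numeric columns (feature_cols below is a hypothetical placeholder for their names):

    from pyspark.sql import functions as F

    feature_cols = ["f1", "f2", "f3"]  # placeholder: the real 13 feature column names

    # One row per cluster, with its size and per-feature mean/stddev.
    per_cluster = (transformed
                   .groupBy("prediction")
                   .agg(F.count("*").alias("size"),
                        *[F.mean(c).alias(c + "_mean") for c in feature_cols],
                        *[F.stddev(c).alias(c + "_std") for c in feature_cols]))
    per_cluster_pdf = per_cluster.toPandas()  # pandas DataFrame of cluster summaries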

Get Column Names after columnSimilarities() in Spark Scala

99封情书 Submitted on 2019-12-07 22:06:52
Question: I'm trying to build an item-based collaborative filtering model with columnSimilarities() in Spark. After using columnSimilarities() I want to assign the original column names back to the results, in Spark Scala. Runnable code to calculate columnSimilarities() on a data frame. Data // rdd val rowsRdd: RDD[Row] = sc.parallelize( Seq( Row(2.0, 7.0, 1.0), Row(3.5, 2.5, 0.0), Row(7.0, 5.9, 0.0) ) ) // Schema val schema = new StructType() .add(StructField("item_1", DoubleType, true)) .add
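columnSimilarities() returns a CoordinateMatrix whose MatrixEntry rows are keyed by column positions, so names have to be joined back by index. A hedged PySpark sketch of that idea (the question is in Scala; sc is the existing SparkContext and the item_* names come from the question's schema):

    from pyspark.mllib.linalg.distributed import RowMatrix

    col_names = ["item_1", "item_2", "item_3"]  # original column order

    mat = RowMatrix(sc.parallelize([[2.0, 7.0, 1.0], [3.5, 2.5, 0.0], [7.0, 5.9, 0.0]]))
    sims = mat.columnSimilarities()  # upper-triangular cosine similarities as a CoordinateMatrix

    # Map each entry's (i, j) column indices back to the original column names.
    named = sims.entries.map(lambda e: (col_names[int(e.i)], col_names[int(e.j)], e.value))
    print(named.collect())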

How to deserialize Pipeline model in spark.ml?

江枫思渺然 Submitted on 2019-12-07 08:34:19
Question: I have serialized a Spark ML Pipeline model that consists of a number of Transformers (org.apache.spark.ml.Transformer) and several logistic regression learners (org.apache.spark.ml.classification.LogisticRegression). It all works fine on my Windows machine where I created the model. I serialized the model to disk using java.io.ObjectOutputStream and read it back in using java.io.ObjectInputStream. It all works fine via sbt and my corresponding unit tests. However, when I assemble my code
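For what it's worth, spark.ml pipelines ship with their own persistence (save/load) that is usually more portable than raw Java serialization. A small sketch in PySpark (the question's pipeline is in Scala; the path and the fitted model name here are assumptions):

    from pyspark.ml import PipelineModel

    # Assumption: "model" is an already fitted PipelineModel and the path is writable.
    model.write().overwrite().save("/tmp/my_pipeline_model")

    # Later, possibly in another process or on another machine, reload it.
    restored = PipelineModel.load("/tmp/my_pipeline_model")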

How to convert from org.apache.spark.mllib.linalg.VectorUDT to ml.linalg.VectorUDT

风格不统一 Submitted on 2019-12-07 03:13:32
Question: I am using a Spark 2.0 cluster and I would like to convert a vector from org.apache.spark.mllib.linalg.VectorUDT to org.apache.spark.ml.linalg.VectorUDT. # Import LinearRegression class from pyspark.ml.regression import LinearRegression # Define LinearRegression algorithm lr = LinearRegression() modelA = lr.fit(data, {lr.regParam:0.0}) Error: u'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually org.apache.spark.mllib.linalg
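One way to bridge the two vector types is MLUtils.convertVectorColumnsToML, which rewrites old mllib vector columns into ml vectors so that spark.ml estimators accept them. A sketch reusing the question's data and lr:

    from pyspark.mllib.util import MLUtils
    from pyspark.ml.regression import LinearRegression

    # Convert every mllib vector column (including "features") to the new ml vector type.
    data_ml = MLUtils.convertVectorColumnsToML(data)

    lr = LinearRegression()
    modelA = lr.fit(data_ml, {lr.regParam: 0.0})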