apache-spark-ml

Issue with VectorUDT when using Spark ML

£可爱£侵袭症+ Submitted on 2019-12-09 17:02:56
Question: I am writing a UDAF to be applied to a Spark DataFrame column of type Vector (spark.ml.linalg.Vector). I rely on the spark.ml.linalg package so that I do not have to go back and forth between DataFrames and RDDs. Inside the UDAF, I have to specify a data type for the input, buffer, and output schemas: def inputSchema = new StructType().add("features", new VectorUDT()) def bufferSchema: StructType = StructType(StructField("list_of_similarities", ArrayType(new VectorUDT(), true), true) :: Nil)
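For reference, the vector SQL type is also exposed on the Python side. A minimal PySpark sketch (the original question is in Scala; the "features" column name is taken from it) of a schema declaring a vector field with pyspark.ml.linalg.VectorUDT:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField
    from pyspark.ml.linalg import Vectors, VectorUDT

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # Declare a schema whose "features" column holds ml vectors, mirroring the UDAF's inputSchema.
    schema = StructType([StructField("features", VectorUDT(), True)])
    df = spark.createDataFrame([(Vectors.dense([1.0, 2.0]),)], schema)
    df.printSchema()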

pyspark.sql.utils.IllegalArgumentException: u'Field "features" does not exist.'

安稳与你 Submitted on 2019-12-08 11:35:36
Question: I am trying to execute a Random Forest Classifier and evaluate the model using cross-validation. I work with PySpark. The input CSV file is loaded as a Spark DataFrame. But I face an issue while constructing the model. Below is the code. from pyspark import SparkContext from pyspark.sql import SQLContext from pyspark.ml import Pipeline from pyspark.ml.classification import RandomForestClassifier from pyspark.ml.tuning import CrossValidator, ParamGridBuilder from pyspark.ml.evaluation import
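This error usually means the DataFrame has no assembled "features" vector column at the point the classifier is fit. A minimal sketch of one common fix, assuming hypothetical input column names (replace "f1", "f2", "f3" and "label" with the CSV's actual columns):

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import RandomForestClassifier

    # Assemble the raw numeric columns into the single "features" vector the classifier expects.
    assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
    rf = RandomForestClassifier(labelCol="label", featuresCol="features")

    pipeline = Pipeline(stages=[assembler, rf])
    model = pipeline.fit(train_df)  # train_df: the DataFrame loaded from the CSV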

How to use L1 penalty in pyspark.ml.regression.LinearRegressionModel for feature selection?

橙三吉。 Submitted on 2019-12-08 10:21:01
Question: Firstly, I use Spark 1.6.0. I want to use the L1 penalty in pyspark.ml.regression.LinearRegressionModel for feature selection. But I cannot get the detailed coefficients when calling the function: lr = LogisticRegression(elasticNetParam=1.0, regParam=0.01,maxIter=100,fitIntercept=False,standardization=False) model = lr.fit(df_one_hot_train) print model.coefficients.toArray().astype(float).tolist() I only get a sparse list like: [0,0,0,0,0,..,-0.0871650387514,..,] While when I use sklearn.linear
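With elasticNetParam=1.0 the penalty is pure L1 (lasso), so features whose coefficients are driven to zero are effectively dropped. A small sketch, reusing the question's df_one_hot_train, of how one might list only the surviving coefficients:

    from pyspark.ml.classification import LogisticRegression

    lr = LogisticRegression(elasticNetParam=1.0, regParam=0.01, maxIter=100,
                            fitIntercept=False, standardization=False)
    model = lr.fit(df_one_hot_train)

    # Keep only the features whose L1-penalized coefficients are non-zero.
    coefs = model.coefficients.toArray()
    selected = [(i, float(c)) for i, c in enumerate(coefs) if c != 0.0]
    print(selected)  # (feature index, coefficient) pairs that survived the penalty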

Formatting data for Spark ML

萝らか妹 Submitted on 2019-12-08 09:33:46
Question: I'm new to Spark and Spark ML. I generated some data with the function KMeansDataGenerator.generateKMeansRDD but I fail when formatting it so that it can then be used by an ML algorithm (here it's K-Means). The error is Exception in thread "main" java.lang.IllegalArgumentException: Data type ArrayType(DoubleType,false) is not supported. It happens when using VectorAssembler. val generatedData = KMeansDataGenerator.generateKMeansRDD(sc, numPoints = 1000, k = 5, d = 3, r = 5, numPartitions
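VectorAssembler does not accept ArrayType(DoubleType) columns; the array has to become an ml Vector first. A hedged PySpark sketch of that conversion (the original question is in Scala, and the "raw"/"features" column names here are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.ml.linalg import Vectors, VectorUDT

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # Stand-in DataFrame with an array<double> column, like the generated K-Means points.
    df = spark.createDataFrame([([1.0, 2.0, 3.0],), ([4.0, 5.0, 6.0],)], ["raw"])

    # Turn the array column into an ml Vector column that KMeans can consume directly.
    to_vector = udf(lambda xs: Vectors.dense(xs), VectorUDT())
    df_vec = df.withColumn("features", to_vector("raw"))
    df_vec.printSchema()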

Apache Spark MLlib with DataFrame API gives java.net.URISyntaxException when createDataFrame() or read().csv(…)

守給你的承諾、 Submitted on 2019-12-08 08:29:28
In a standalone application (running on Java 8, Windows 10, with spark-xxx_2.11:2.0.0 as jar dependencies) the following code gives an error: /* this: */ Dataset<Row> logData = spark_session.createDataFrame(Arrays.asList( new LabeledPoint(1.0, Vectors.dense(4.9,3,1.4,0.2)), new LabeledPoint(1.0, Vectors.dense(4.7,3.2,1.3,0.2)) ), LabeledPoint.class); /* or this: */ /* logFile: "C:\files\project\file.csv", "C:\\files\\project\\file.csv", "C:/files/project/file.csv", "file:/C:/files/project/file.csv", "file:///C:/files/project/file.csv", "/file.csv" */ Dataset<Row> logData = spark_session.read().csv(logFile);
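On Windows, Spark 2.0 can build an invalid file URI from its default spark.sql.warehouse.dir, which may surface as a java.net.URISyntaxException from createDataFrame() or read(). Assuming that is the cause here, a common workaround is to set that directory explicitly when building the session; sketched in PySpark rather than Java:

    from pyspark.sql import SparkSession

    # Assumption: the URISyntaxException comes from the default warehouse dir on Windows,
    # so point spark.sql.warehouse.dir at an explicit, valid file URI.
    spark = (SparkSession.builder
             .master("local[*]")
             .config("spark.sql.warehouse.dir", "file:///C:/tmp/spark-warehouse")
             .getOrCreate())

    df = spark.read.csv("file:///C:/files/project/file.csv")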

Adding a Vectors Column to a pyspark DataFrame

孤街醉人 Submitted on 2019-12-08 06:35:58
Question: How do I add a Vectors.dense column to a PySpark DataFrame? import pandas as pd from pyspark import SparkContext from pyspark.sql import SQLContext from pyspark.ml.linalg import DenseVector py_df = pd.DataFrame.from_dict({"time": [59., 115., 156., 421.], "event": [1, 1, 1, 0]}) sc = SparkContext(master="local") sqlCtx = SQLContext(sc) sdf = sqlCtx.createDataFrame(py_df) sdf.withColumn("features", DenseVector(1)) Gives an error in file anaconda3/lib/python3.6/site-packages/pyspark/sql
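withColumn expects a Column expression, not a raw DenseVector, so the constant vector has to be produced by something that yields a Column. One sketch of doing that with a UDF returning VectorUDT, reusing the question's sdf:

    from pyspark.sql.functions import udf
    from pyspark.ml.linalg import Vectors, VectorUDT

    # Wrap the constant vector in a zero-argument UDF so withColumn receives a Column.
    make_vector = udf(lambda: Vectors.dense([1.0]), VectorUDT())
    sdf_with_features = sdf.withColumn("features", make_vector())
    sdf_with_features.show()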

PySpark ML: Get KMeans cluster statistics

扶醉桌前 Submitted on 2019-12-08 03:54:29
Question: I have built a KMeansModel. My results are stored in a PySpark DataFrame called transformed. (a) How do I interpret the contents of transformed? (b) How do I create one or more pandas DataFrames from transformed that would show summary statistics for each of the 13 features for each of the 14 clusters? from pyspark.ml.clustering import KMeans # Trains a k-means model. kmeans = KMeans().setK(14).setSeed(1) model = kmeans.fit(X_spark_scaled) # Fits a model to the input dataset with optional
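model.transform() appends a "prediction" column holding each row's cluster id, so per-cluster statistics can be computed by grouping on it. A sketch under the assumption that the 13 features also exist as individual numeric columns (feature_cols below is a hypothetical placeholder for their names):

    from pyspark.sql import functions as F

    feature_cols = ["f1", "f2", "f3"]  # placeholder: the real 13 feature column names

    # One row per cluster, with its size and per-feature mean/stddev.
    per_cluster = (transformed
                   .groupBy("prediction")
                   .agg(F.count("*").alias("size"),
                        *[F.mean(c).alias(c + "_mean") for c in feature_cols],
                        *[F.stddev(c).alias(c + "_std") for c in feature_cols]))
    per_cluster_pdf = per_cluster.toPandas()  # pandas DataFrame of cluster summaries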

Get Column Names after columnSimilarities() in Spark Scala

99封情书 Submitted on 2019-12-07 22:06:52
Question: I'm trying to build an item-based collaborative filtering model with columnSimilarities() in Spark. After using columnSimilarities() I want to assign the original column names back to the results, in Spark Scala. Runnable code to calculate columnSimilarities() on a data frame. Data // rdd val rowsRdd: RDD[Row] = sc.parallelize( Seq( Row(2.0, 7.0, 1.0), Row(3.5, 2.5, 0.0), Row(7.0, 5.9, 0.0) ) ) // Schema val schema = new StructType() .add(StructField("item_1", DoubleType, true)) .add
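columnSimilarities() returns a CoordinateMatrix whose MatrixEntry rows are keyed by column positions, so names have to be joined back by index. A hedged PySpark sketch of that idea (the question is in Scala; sc is the existing SparkContext and the item_* names come from the question's schema):

    from pyspark.mllib.linalg.distributed import RowMatrix

    col_names = ["item_1", "item_2", "item_3"]  # original column order

    mat = RowMatrix(sc.parallelize([[2.0, 7.0, 1.0], [3.5, 2.5, 0.0], [7.0, 5.9, 0.0]]))
    sims = mat.columnSimilarities()  # upper-triangular cosine similarities as a CoordinateMatrix

    # Map each entry's (i, j) column indices back to the original column names.
    named = sims.entries.map(lambda e: (col_names[int(e.i)], col_names[int(e.j)], e.value))
    print(named.collect())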

How to deserialize Pipeline model in spark.ml?

江枫思渺然 Submitted on 2019-12-07 08:34:19
Question: I have serialized a Spark ML Pipeline model that consists of a number of Transformers (org.apache.spark.ml.Transformer) and several logistic regression learners (org.apache.spark.ml.classification.LogisticRegression). It all works fine on my Windows machine where I created the model. I serialized the model to disk using java.io.ObjectOutputStream and read it back in using java.io.ObjectInputStream. It all works fine via sbt and my corresponding unit tests. However, when I assemble my code
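For what it's worth, spark.ml pipelines ship with their own persistence (save/load) that is usually more portable than raw Java serialization. A small sketch in PySpark (the question's pipeline is in Scala; the path and the fitted model name here are assumptions):

    from pyspark.ml import PipelineModel

    # Assumption: "model" is an already fitted PipelineModel and the path is writable.
    model.write().overwrite().save("/tmp/my_pipeline_model")

    # Later, possibly in another process or on another machine, reload it.
    restored = PipelineModel.load("/tmp/my_pipeline_model")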

How to convert from org.apache.spark.mllib.linalg.VectorUDT to ml.linalg.VectorUDT

风格不统一 Submitted on 2019-12-07 03:13:32
Question: I am using a Spark 2.0 cluster and I would like to convert a vector from org.apache.spark.mllib.linalg.VectorUDT to org.apache.spark.ml.linalg.VectorUDT. # Import LinearRegression class from pyspark.ml.regression import LinearRegression # Define LinearRegression algorithm lr = LinearRegression() modelA = lr.fit(data, {lr.regParam:0.0}) Error: u'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually org.apache.spark.mllib.linalg
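One way to bridge the two vector types is MLUtils.convertVectorColumnsToML, which rewrites old mllib vector columns into ml vectors so that spark.ml estimators accept them. A sketch reusing the question's data and lr:

    from pyspark.mllib.util import MLUtils
    from pyspark.ml.regression import LinearRegression

    # Convert every mllib vector column (including "features") to the new ml vector type.
    data_ml = MLUtils.convertVectorColumnsToML(data)

    lr = LinearRegression()
    modelA = lr.fit(data_ml, {lr.regParam: 0.0})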