apache-spark-ml

pyspark : NameError: name 'spark' is not defined

二次信任 submitted on 2019-11-30 02:22:36
I am copying the pyspark.ml example from the official documentation: http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.Transformer

data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
        (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
df = spark.createDataFrame(data, ["features"])
kmeans = KMeans(k=2, seed=1)
model = kmeans.fit(df)

However, the example above would not run and gave me the following error:

---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-28
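In a standalone script or notebook (outside the interactive pyspark shell), the name spark is not predefined and has to be created first. A minimal sketch, assuming Spark 2.x, that builds a SparkSession and then runs the documentation example:

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans

# Build (or reuse) a SparkSession so that the name `spark` exists.
spark = SparkSession.builder.appName("kmeans-example").getOrCreate()

data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
        (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
df = spark.createDataFrame(data, ["features"])

kmeans = KMeans(k=2, seed=1)
model = kmeans.fit(df)

In the pyspark shell the session is created for you and exposed as spark, which is why the documentation example omits this step.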

Column name with a dot in Spark

こ雲淡風輕ζ submitted on 2019-11-30 01:00:08
Question: I am trying to take columns from a DataFrame and convert them to an RDD[Vector]. The problem is that I have columns with a "dot" in their name, as in the following dataset:

"col0.1","col1.2","col2.3","col3.4"
1,2,3,4
10,12,15,3
1,12,10,5

This is what I'm doing:

val df = spark.read.format("csv").options(Map("header" -> "true", "inferSchema" -> "true")).load("C:/Users/mhattabi/Desktop/donnee/test.txt")
val column = df.columns.map(c => s"`${c}`")
val rows = new VectorAssembler().setInputCols(column)
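The question's code is Scala; as a PySpark sketch of one common workaround (not necessarily the answer accepted in the thread), the dotted column names can simply be renamed before assembling, assuming a DataFrame df loaded from the CSV above:

from pyspark.ml.feature import VectorAssembler

# Rename "col0.1" -> "col0_1" etc. so later stages never have to escape the dots.
clean_df = df
for c in df.columns:
    clean_df = clean_df.withColumnRenamed(c, c.replace(".", "_"))

assembler = VectorAssembler(inputCols=clean_df.columns, outputCol="features")
vectors = assembler.transform(clean_df).select("features").rdd.map(lambda row: row["features"])

Renaming avoids backtick escaping entirely, which is the part that tends to break once column names containing dots reach downstream transformers.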

How to prepare data in LibSVM format from a DataFrame?

折月煮酒 submitted on 2019-11-29 22:32:53
I want to produce LibSVM-formatted output. I have shaped the DataFrame into the desired layout, but I do not know how to convert it to LibSVM format. The format is as shown in the figure; the desired LibSVM form is user item:rating. Here is what I have so far:

val ratings = sc.textFile(new File("/user/ubuntu/kang/0829/rawRatings.csv").toString).map { line =>
  val fields = line.split(",")
  (fields(0).toInt, fields(1).toInt, fields(2).toDouble)
}
val user = ratings.map { case (user, product, rate) => (user, (product.toInt, rate.toDouble)) }
val usergroup = user.groupByKey
val data = usergroup.map{
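The code in the question is Scala; below is a PySpark sketch of the same grouping idea, formatting one LibSVM-style line per user ("user item:rating item:rating ..."). The output path is hypothetical and sc is assumed to be an existing SparkContext:

# Parse (user, (item, rating)) pairs from the raw CSV.
ratings = sc.textFile("/user/ubuntu/kang/0829/rawRatings.csv") \
    .map(lambda line: line.split(",")) \
    .map(lambda f: (int(f[0]), (int(f[1]), float(f[2]))))

def to_libsvm(user, pairs):
    # LibSVM expects feature indices in ascending order.
    feats = " ".join("%d:%s" % (item, rating) for item, rating in sorted(pairs))
    return "%d %s" % (user, feats)

lines = ratings.groupByKey().map(lambda kv: to_libsvm(kv[0], kv[1]))
lines.saveAsTextFile("/user/ubuntu/kang/0829/ratings_libsvm")

The resulting files can then be read back with spark.read.format("libsvm") if the ratings are meant to be consumed as labeled points.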

Issues with Logistic Regression for multiclass classification using PySpark

為{幸葍}努か submitted on 2019-11-29 22:29:42
Question: I am trying to use Logistic Regression to classify a dataset whose feature vectors are SparseVectors. For the full code base and error log, please check my github repo.

Case 1: I tried using the ML pipeline as follows:

# imported library from ML
from pyspark.ml.feature import HashingTF
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression

print(type(trainingData))  # for checking only
print(trainingData.take(2))  # for checking the data type
lr = LogisticRegression
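Where the snippet breaks off, a complete multiclass pipeline would look roughly like the sketch below. It assumes hypothetical "text" and "label" columns on trainingData/testData and Spark 2.1+ for the multinomial family; it is an illustration, not the asker's exact code:

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features", numFeatures=1 << 18)
lr = LogisticRegression(maxIter=10, regParam=0.01, family="multinomial")

pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])
model = pipeline.fit(trainingData)
predictions = model.transform(testData)

A frequent source of errors in this setup is feeding mllib-style SparseVector objects (pyspark.mllib.linalg) into pyspark.ml stages, which expect pyspark.ml.linalg vectors inside a DataFrame.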

Spark: OneHot encoder and storing Pipeline (feature dimension issue)

最后都变了- submitted on 2019-11-29 15:32:47
We have a pipeline (2.0.1) consisting of multiple feature transformation stages. Some of these stages are OneHot encoders. Idea: encode an integer-based category into n independent features. When training the pipeline model and using it to predict, everything works fine. However, storing the trained pipeline model and reloading it causes issues: the stored 'trained' OneHot encoder does not keep track of how many categories there are. Loading it now causes issues: when the loaded model is used to predict, it re-determines how many categories there are, causing the training feature space and prediction
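In later Spark versions (2.3+), one way to make the category count part of the fitted, saved model is OneHotEncoderEstimator, which learns the number of categories during fit and persists it with the pipeline (it was renamed back to OneHotEncoder in Spark 3.0). A minimal sketch, assuming a hypothetical DataFrame df with a "category" column:

from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator

indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
encoder = OneHotEncoderEstimator(inputCols=["categoryIndex"], outputCols=["categoryVec"])

model = Pipeline(stages=[indexer, encoder]).fit(df)
model.write().overwrite().save("/tmp/onehot_pipeline_model")

# Reloading restores the learned category sizes instead of re-deriving them from the data.
reloaded = PipelineModel.load("/tmp/onehot_pipeline_model")
reloaded.transform(df).show()

On Spark 2.0.1 itself, the usual mitigation is to make sure the encoder's input column carries nominal attribute metadata (e.g. from a StringIndexer fitted on the training data), so the number of categories does not depend on whatever data happens to be present at prediction time.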

How to access individual trees in a model created by RandomForestClassifier (spark.ml-version)?

纵然是瞬间 submitted on 2019-11-29 15:14:34
Question: How to access individual trees in a model generated by Spark ML's RandomForestClassifier? I am using the Scala version of RandomForestClassifier.

Answer 1: Actually it has a trees attribute:

import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.ml.classification.{
  RandomForestClassificationModel, RandomForestClassifier, DecisionTreeClassificationModel
}

val meta = NominalAttribute
  .defaultAttr
  .withName("label")
  .withValues("0.0", "1.0")
  .toMetadata

val data = sqlContext.read
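The answer above is Scala; the same trees attribute exists in PySpark. A small sketch, assuming a hypothetical labeled DataFrame train with "features" and "label" columns:

from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(numTrees=10, labelCol="label", featuresCol="features")
model = rf.fit(train)

# Each element is a DecisionTreeClassificationModel that can be inspected on its own.
for i, tree in enumerate(model.trees):
    print(i, tree.numNodes, tree.depth)
    print(tree.toDebugString)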

StandardScaler in Spark not working as expected

試著忘記壹切 submitted on 2019-11-29 11:16:12
Any idea why Spark would be doing this for StandardScaler? As per the definition of StandardScaler: the StandardScaler standardizes a set of features to have zero mean and a standard deviation of 1. The flag withStd will scale the data to unit standard deviation, while the flag withMean (false by default) will center the data prior to scaling it.

>>> tmpdf.show(4)
+----+----+----+------------+
|int1|int2|int3|temp_feature|
+----+----+----+------------+
|   1|   2|   3|       [2.0]|
|   7|   8|   9|       [8.0]|
|   4|   5|   6|       [5.0]|
+----+----+----+------------+

>>> sScaler = StandardScaler(withMean=True, withStd=True)
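One detail that often explains the "unexpected" output: Spark's StandardScaler divides by the corrected sample standard deviation (the N-1 denominator), not the population standard deviation. A sketch reproducing the example above, assuming tmpdf has the vector column temp_feature:

from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol="temp_feature", outputCol="scaled",
                        withMean=True, withStd=True)
scaler_model = scaler.fit(tmpdf)
scaler_model.transform(tmpdf).select("temp_feature", "scaled").show(truncate=False)

# For the values [2.0], [8.0], [5.0]: mean = 5.0 and sample std = 3.0,
# so the scaled column comes out as [-1.0], [1.0], [0.0]
# (with the population std of ~2.449 one would have expected ~±1.22 instead).

Note also that StandardScaler standardizes each feature across rows; it does not normalize each row's vector (that would be Normalizer).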

How to print the decision path / rules used to predict sample of a specific row in PySpark?

冷暖自知 submitted on 2019-11-29 11:15:30
How can I print the decision path of a specific sample in a Spark DataFrame? Spark version: '2.3.1'. The code below prints the decision path of the whole model; how do I make it print the decision path of a specific sample, for example the row where tagvalue ball equals 2?

import pyspark.sql.functions as F
from pyspark.ml import Pipeline, Transformer
from pyspark.sql import DataFrame
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import VectorAssembler
import findspark
findspark.init()
from pyspark import SparkConf
from pyspark.sql import
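PySpark's DecisionTreeClassificationModel has no built-in per-row decision-path API (unlike sklearn's decision_path), so the usual approach is to print the full rule set and trace the sample's feature values through it by hand. A sketch with hypothetical tag columns ("ball", "keep", "hall") and a "label" column on a DataFrame df:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

assembler = VectorAssembler(inputCols=["ball", "keep", "hall"], outputCol="features")
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features")
model = Pipeline(stages=[assembler, dt]).fit(df)

# Rules for the whole tree; feature indices follow the VectorAssembler input order.
print(model.stages[-1].toDebugString)

# Prediction (and assembled features) for the specific row(s) of interest.
model.transform(df).filter("ball = 2").select("features", "prediction").show()

Tracing by hand means taking the row where ball = 2, starting at the root rule printed by toDebugString, and following whichever branch its feature values satisfy until a leaf is reached.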

SparkException: Values to assemble cannot be null

女生的网名这么多〃 submitted on 2019-11-29 02:07:48
I want to use StandardScaler to normalize the features. Here is my code:

val Array(trainingData, testData) = dataset.randomSplit(Array(0.7, 0.3))
val vectorAssembler = new VectorAssembler().setInputCols(inputCols).setOutputCol("features").transform(trainingData)
val stdscaler = new StandardScaler().setInputCol("features").setOutputCol("scaledFeatures").setWithStd(true).setWithMean(false).fit(vectorAssembler)

But it threw an exception when I tried to use StandardScaler:

[Stage 151:==> (9 + 2) / 200]16/12/28 20:13:57 WARN scheduler.TaskSetManager: Lost task 31.0 in stage 151.0 (TID 8922, slave1
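The exception comes from VectorAssembler rather than StandardScaler: "Values to assemble cannot be null" is raised as soon as any of the input columns contains a null. A PySpark sketch of the usual fix (dropping, or alternatively imputing, the null rows before assembling), with hypothetical input column names:

from pyspark.ml.feature import VectorAssembler, StandardScaler

input_cols = ["f1", "f2", "f3"]

# Drop rows with nulls in the feature columns (or use Imputer to fill them in).
clean = dataset.na.drop(subset=input_cols)
trainingData, testData = clean.randomSplit([0.7, 0.3])

assembled = VectorAssembler(inputCols=input_cols, outputCol="features").transform(trainingData)
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=True, withMean=False).fit(assembled)
scaled = scaler.transform(assembled)

From Spark 2.4 on, VectorAssembler also accepts handleInvalid="skip" or "keep" to deal with invalid values without a separate cleaning step.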

Is it possible to access estimator attributes in spark.ml pipelines?

与世无争的帅哥 submitted on 2019-11-29 01:42:32
I have a spark.ml pipeline in Spark 1.5.1 which consists of a series of transformers followed by a k-means estimator. I want to be able to access the KMeansModel.clusterCenters after fitting the pipeline, but can't figure out how. Is there a spark.ml equivalent of sklearn's pipeline.named_steps feature? I found this answer, which gives two options. The first works if I take the k-means model out of my pipeline and fit it separately, but that kinda defeats the purpose of a pipeline. The second option doesn't work: I get error: value getModel is not a member of org.apache.spark.ml.PipelineModel
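The question is about Scala on Spark 1.5.1, but the underlying idea is the same in either API: a fitted PipelineModel exposes its fitted stages in order. A PySpark sketch, with a hypothetical DataFrame df and feature columns:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")
kmeans = KMeans(k=3, seed=1)
pipeline_model = Pipeline(stages=[assembler, kmeans]).fit(df)

# The last fitted stage is the KMeansModel produced by the k-means estimator.
kmeans_model = pipeline_model.stages[-1]
print(kmeans_model.clusterCenters())

In Scala the equivalent is pipelineModel.stages.last.asInstanceOf[KMeansModel]; there is no getModel method on PipelineModel, which is what the error above is saying.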