apache-spark-ml

Create labeledPoints from Spark DataFrame in Python

一世执手 submitted on 2019-11-28 06:04:25
What .map() function in Python do I use to create a set of LabeledPoints from a Spark DataFrame? What is the notation if the label/outcome is not the first column, but I can refer to it by its column name, 'status'? I create the Python DataFrame with this .map() function:

def parsePoint(line):
    listmp = list(line.split('\t'))
    dataframe = pd.DataFrame(pd.get_dummies(listmp[1:]).sum()).transpose()
    dataframe.insert(0, 'status', dataframe['accepted'])
    if 'NULL' in dataframe.columns:
        dataframe = dataframe.drop('NULL', axis=1)
    if '' in dataframe.columns:
        dataframe = dataframe.drop('', axis=1)
    if 'rejected' …
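A minimal sketch of one common approach (not taken from the question itself): select the label column by name and map every row to an MLlib LabeledPoint. The DataFrame and column names below are assumptions for illustration only.

```python
from pyspark.mllib.regression import LabeledPoint

# Hypothetical DataFrame: 'status' is the numeric label, the rest are numeric features.
df = spark.createDataFrame(
    [(1.0, 0.2, 3.1), (0.0, 1.4, 0.7)],
    ["status", "f1", "f2"])

feature_cols = [c for c in df.columns if c != "status"]

# Label does not need to be the first column; it is picked out by name here.
labeled_points = df.rdd.map(
    lambda row: LabeledPoint(row["status"], [row[c] for c in feature_cols]))

labeled_points.take(2)
```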

Spark Structured Streaming and Spark-Ml Regression

给你一囗甜甜゛ submitted on 2019-11-28 02:22:17
Is it possible to apply Spark ML regression to streaming sources? I see there is StreamingLogisticRegressionWithSGD, but it targets the older RDD API and I couldn't use it with Structured Streaming sources. How am I supposed to apply regressions on Structured Streaming sources? (A little off-topic:) If I cannot use the streaming API for regression, how can I commit offsets or the like back to the source in a batch-processing way? (Kafka sink)

user8371915: Today (Spark 2.2 / 2.3) there is no support for machine learning in Structured Streaming, and there is no ongoing work in this direction. Please follow SPARK-16424 to track …

Why does StandardScaler not attach metadata to the output column?

妖精的绣舞 submitted on 2019-11-28 01:38:36
I noticed that the ml StandardScaler does not attach metadata to the output column:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature._

val df = spark.read.option("header", true)
  .option("inferSchema", true)
  .csv("/path/to/cars.data")

val strId1 = new StringIndexer()
  .setInputCol("v7")
  .setOutputCol("v7_IDX")

val strId2 = new StringIndexer()
  .setInputCol("v8")
  .setOutputCol("v8_IDX")

val assmbleFeatures: VectorAssembler = new VectorAssembler()
  .setInputCols(Array("v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7_IDX"))
  .setOutputCol("featuresRaw")

val scalerModel = new …
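For reference, a hedged PySpark sketch (not from the question, which is Scala) that shows how to inspect whatever metadata a stage attaches to its vector column. The data and column names are made up; the point is only that the VectorAssembler output carries an 'ml_attr' entry while the scaled column typically carries none, which is the behaviour the question describes.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StandardScaler

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1.0, 2.0), (3.0, 4.0)], ["v0", "v1"])
assembled = VectorAssembler(inputCols=["v0", "v1"], outputCol="featuresRaw").transform(df)
scaled = StandardScaler(inputCol="featuresRaw", outputCol="features") \
    .fit(assembled).transform(assembled)

# Metadata lives on the schema fields; compare the two vector columns.
print(scaled.schema["featuresRaw"].metadata)   # usually contains 'ml_attr'
print(scaled.schema["features"].metadata)      # typically empty after scaling
```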

How to train an ML model in sparklyr and predict new values on another dataframe?

一世执手 submitted on 2019-11-27 22:33:38
Consider the following example:

dtrain <- data_frame(text = c("Chinese Beijing Chinese",
                              "Chinese Chinese Shanghai",
                              "Chinese Macao",
                              "Tokyo Japan Chinese"),
                     doc_id = 1:4,
                     class = c(1, 1, 1, 0))

dtrain_spark <- copy_to(sc, dtrain, overwrite = TRUE)

> dtrain_spark
# Source:   table<dtrain> [?? x 3]
# Database: spark_connection
  text                     doc_id class
  <chr>                     <int> <dbl>
1 Chinese Beijing Chinese       1     1
2 Chinese Chinese Shanghai      2     1
3 Chinese Macao                 3     1
4 Tokyo Japan Chinese           4     0

Here I have the classic Naive Bayes example, where class identifies documents falling into the China category. I am able to run a …
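The question is about sparklyr; purely to illustrate the fit-on-one-dataframe, transform-another pattern it is asking for, here is a hedged sketch of the equivalent workflow in PySpark (not sparklyr). All names and the test sentence are invented.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, CountVectorizer
from pyspark.ml.classification import NaiveBayes

spark = SparkSession.builder.getOrCreate()

train = spark.createDataFrame([
    ("Chinese Beijing Chinese", 1.0),
    ("Chinese Chinese Shanghai", 1.0),
    ("Chinese Macao", 1.0),
    ("Tokyo Japan Chinese", 0.0),
], ["text", "label"])

test = spark.createDataFrame([("Chinese Chinese Tokyo Japan",)], ["text"])

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    CountVectorizer(inputCol="words", outputCol="features"),
    NaiveBayes(featuresCol="features", labelCol="label"),
])

model = pipeline.fit(train)                      # train on one dataframe
model.transform(test).select("text", "prediction").show(truncate=False)  # predict on another
```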

How to extract model hyper-parameters from spark.ml in PySpark?

别来无恙 submitted on 2019-11-27 20:32:52
Question: I'm tinkering with some cross-validation code from the PySpark documentation, and trying to get PySpark to tell me what model was selected:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0), …
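A hedged sketch of one common way to read back the winning hyper-parameters, updated for Spark 2.x+ (the question's snippet is from the Spark 1.x era): fit the CrossValidator, take bestModel, and query the underlying Java model. The toy dataset and grid below are invented.

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

spark = SparkSession.builder.getOrCreate()

dataset = spark.createDataFrame(
    [(Vectors.dense([0.0]), 0.0), (Vectors.dense([0.1]), 0.0),
     (Vectors.dense([0.2]), 0.0), (Vectors.dense([0.3]), 0.0),
     (Vectors.dense([0.5]), 1.0), (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([0.8]), 1.0), (Vectors.dense([1.0]), 1.0)],
    ["features", "label"])

lr = LogisticRegression(maxIter=10)
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01]).build()
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=2)

best = cv.fit(dataset).bestModel

# Hyper-parameters of the selected LogisticRegressionModel, read via the Java side.
# (On Spark 3.x the Python getters, e.g. best.getRegParam(), are also available.)
print(best._java_obj.getRegParam())
print(best._java_obj.getMaxIter())
```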

Serialize a custom transformer using python to be used within a Pyspark ML pipeline

自古美人都是妖i submitted on 2019-11-27 19:55:16
I found the same discussion in the comments section of "Create a custom Transformer in PySpark ML", but there is no clear answer. There is also an unresolved JIRA corresponding to it: https://issues.apache.org/jira/browse/SPARK-17025. Given that the PySpark ML pipeline provides no option for saving a custom transformer written in Python, what are the other options to get it done? How can I implement the _to_java method in my Python class so that it returns a compatible Java object?

Answer: As of Spark 2.3.0 there's a much, much better way to do this. Simply extend DefaultParamsWritable and …
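A minimal sketch of the DefaultParamsWritable route the answer refers to, assuming Spark 2.3+. The transformer itself (UpperCaser) and its column names are invented for illustration.

```python
from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql import functions as F


class UpperCaser(Transformer, HasInputCol, HasOutputCol,
                 DefaultParamsReadable, DefaultParamsWritable):
    """Toy transformer that upper-cases a string column."""

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        super(UpperCaser, self).__init__()
        # keyword_only stores the explicitly passed kwargs in _input_kwargs
        self._set(**self._input_kwargs)

    def _transform(self, df):
        return df.withColumn(self.getOutputCol(),
                             F.upper(F.col(self.getInputCol())))
```

With the mixins above, UpperCaser(inputCol="text", outputCol="text_upper") gets .save()/.load() for free, and on Spark 2.3+ a Pipeline containing such Python-only stages can usually be persisted as well, without writing a _to_java shim.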

Pyspark and PCA: How can I extract the eigenvectors of this PCA? How can I calculate how much variance they are explaining?

偶尔善良 submitted on 2019-11-27 18:02:04
I am reducing the dimensionality of a Spark DataFrame with a PCA model in pyspark (using the spark ml library) as follows:

pca = PCA(k=3, inputCol="features", outputCol="pca_features")
model = pca.fit(data)

where data is a Spark DataFrame with one column labeled features, which is a DenseVector of 3 dimensions:

data.take(1)
Row(features=DenseVector([0.4536, -0.43218, 0.9876]), label=u'class1')

After fitting, I transform the data:

transformed = model.transform(data)
transformed.first()
Row(features=DenseVector([0.4536, -0.43218, 0.9876]), label=u'class1', pca_features=DenseVector([-0.33256, 0.8668, 0 …
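A hedged note rather than the question's own code: on Spark 2.0 and later, the fitted PCAModel exposes both the principal components and the explained variance directly, so (assuming `model` is the PCAModel fitted above) the extraction can be as simple as:

```python
# `model` is the fitted pyspark.ml.feature.PCAModel from the snippet above.
print(model.pc)                  # DenseMatrix whose columns are the principal components (eigenvectors)
print(model.explainedVariance)   # DenseVector with the proportion of variance explained per component
```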

How do I convert an array (i.e. list) column to Vector

筅森魡賤 submitted on 2019-11-27 17:35:40
Short version of the question! Consider the following snippet (assuming spark is already set to some SparkSession):

from pyspark.sql import Row

source_data = [
    Row(city="Chicago", temperatures=[-1.0, -2.0, -3.0]),
    Row(city="New York", temperatures=[-7.0, -7.0, -5.0]),
]
df = spark.createDataFrame(source_data)

Notice that the temperatures field is a list of floats. I would like to convert these lists of floats to the MLlib type Vector, and I'd like this conversion to be expressed using the basic DataFrame API rather than going via RDDs (which is inefficient because it sends all data from the …
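A sketch of one common approach (a UDF rather than a pure-SQL expression, so treat it as one option, not the definitive answer): wrap each array in a DenseVector. It reuses the df built in the snippet above; the output column name is an assumption.

```python
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf

# UDF that turns an array<double> column into an ml Vector column.
to_vector = udf(lambda xs: Vectors.dense(xs), VectorUDT())

df_with_vectors = df.withColumn("features", to_vector("temperatures"))
df_with_vectors.printSchema()
```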

Is it possible to access estimator attributes in spark.ml pipelines?

陌路散爱 submitted on 2019-11-27 16:23:00
Question: I have a spark.ml pipeline in Spark 1.5.1 which consists of a series of transformers followed by a k-means estimator. I want to be able to access KMeansModel.clusterCenters after fitting the pipeline, but can't figure out how. Is there a spark.ml equivalent of sklearn's pipeline.named_steps feature? I found this answer, which gives two options. The first works if I take the k-means model out of my pipeline and fit it separately, but that kinda defeats the purpose of a pipeline. The second …
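A hedged sketch (not from the question): a fitted PipelineModel keeps its fitted stages in order in .stages, so the KMeansModel can be pulled out by position. `pipeline` and `df` stand in for the objects from the question.

```python
# Fit the whole pipeline, then reach into the fitted stages.
model = pipeline.fit(df)
kmeans_model = model.stages[-1]          # last stage: the fitted KMeansModel
print(kmeans_model.clusterCenters())
```

There is no direct named_steps equivalent; indexing .stages (or scanning it by type) is the usual workaround.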

How to map variable names to features after pipeline

我的梦境 submitted on 2019-11-27 14:07:52
Question: I have modified the OneHotEncoder example to actually train a LogisticRegression. My question is: how do I map the generated weights back to the categorical variables?

def oneHotEncoderExample(sqlContext: SQLContext): Unit = {
  val df = sqlContext.createDataFrame(Seq(
    (0, "a", 1.0),
    (1, "b", 1.0),
    (2, "c", 0.0),
    (3, "d", 1.0),
    (4, "e", 1.0),
    (5, "f", 0.0)
  )).toDF("id", "category", "label")
  df.show()

  val indexer = new StringIndexer()
    .setInputCol("category")
    .setOutputCol("categoryIndex")
    .fit(df) …
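The question is in Scala; as a hedged illustration of the usual trick, here is a PySpark sketch that reads the 'ml_attr' metadata attached to the assembled features column (which records which vector position came from which one-hot-encoded category) and pairs it with the model coefficients. `transformed`, the "features" column name, and `lr_model` are assumptions, not names from the question.

```python
# `transformed` is a DataFrame that went through StringIndexer / OneHotEncoder /
# VectorAssembler, and `lr_model` is the fitted LogisticRegressionModel.
attrs = transformed.schema["features"].metadata["ml_attr"]["attrs"]

# 'attrs' groups attributes by kind (e.g. 'binary', 'numeric'); flatten to idx -> name.
names = {}
for group in attrs.values():
    for attr in group:
        names[attr["idx"]] = attr["name"]

for idx, coef in enumerate(lr_model.coefficients):
    print(names.get(idx, "feature_%d" % idx), coef)
```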