apache-spark-ml

Creating and applying ml_lib pipeline with external parameter in sparklyr

Submitted by 柔情痞子 on 2019-12-24 11:24:20
Question: I am trying to create and apply a Spark ml_pipeline object that can handle an external parameter that will vary (typically a date). According to the Spark documentation it seems possible: see the part about ParamMap here. However, I haven't found exactly how to do it. I was thinking of something like this:

table.df <- data.frame("a" = c(1,2,3))
table.sdf <- sdf_copy_to(sc, table.df)
param = 5
param2 = 4
# operation declaration
table2.sdf <- table.sdf %>% mutate(test = param)
# pipeline creation
pipeline_1
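
The excerpt cuts off before the pipeline is built, but the ParamMap mechanism the Spark documentation refers to can be sketched on the JVM side. A minimal Scala sketch, not a sparklyr answer: the stage carries a default parameter value, and a ParamMap supplied at call time overrides it, which is how an external value could change on every run (the column names, the Binarizer stage, and the table2 argument are illustrative assumptions).

import org.apache.spark.ml.feature.Binarizer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.DataFrame

// the stage has a default threshold; a ParamMap passed to transform() overrides
// it per call, so the "external" parameter does not need to be baked into the stage
def applyWithParam(table2: DataFrame, externalValue: Double): DataFrame = {
  val binarizer = new Binarizer()
    .setInputCol("a")
    .setOutputCol("test")
    .setThreshold(0.0)
  binarizer.transform(table2, ParamMap(binarizer.threshold -> externalValue))
}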

How to transform a csv string into a Spark-ML compatible Dataset<Row> format?

Submitted by 坚强是说给别人听的谎言 on 2019-12-24 09:52:49
Question: I have a Dataset<Row> df that contains two columns ("key" and "value") of type string. df.printSchema(); gives me the following output:

root
 |-- key: string (nullable = true)
 |-- value: string (nullable = true)

The content of the value column is actually a CSV-formatted line (coming from a Kafka topic), with the last entry of that line representing the class label and all the previous entries being the features (the first row is not included in the dataset):

feature0,feature1,label
0
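
The excerpt is truncated, but a rough Scala sketch of the usual approach follows: split the CSV payload, cast the entries to double, and assemble the feature columns into the Vector column Spark ML expects. A two-feature layout and the column names feature0/feature1 are assumptions taken from the header shown above.

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, split}

// split the CSV string in "value", cast each entry, keep the last one as the label,
// then assemble the feature columns into a single "features" Vector column
def toMlDataset(df: DataFrame): DataFrame = {
  val parsed = df
    .withColumn("parts", split(col("value"), ","))
    .select(
      col("parts").getItem(0).cast("double").as("feature0"),
      col("parts").getItem(1).cast("double").as("feature1"),
      col("parts").getItem(2).cast("double").as("label"))

  new VectorAssembler()
    .setInputCols(Array("feature0", "feature1"))
    .setOutputCol("features")
    .transform(parsed)
}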

Spark: OneHot encoder and storing Pipeline (feature dimension issue)

Submitted by 元气小坏坏 on 2019-12-24 06:25:09
Question: We have a pipeline (2.0.1) consisting of multiple feature transformation stages. Some of these stages are OneHot encoders. Idea: classify an integer-based category into n independent features. When training the pipeline model and using it to predict, everything works fine. However, storing the trained pipeline model and reloading it causes issues: the stored 'trained' OneHot encoder does not keep track of how many categories there are. Loading it now causes issues: when the loaded model is used to
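
The excerpt stops mid-sentence; for context, a minimal Scala sketch of persisting and reloading a fitted pipeline (the path and the fitted argument are placeholders). Note that the pre-2.3 OneHotEncoder is a plain Transformer, so the reloaded stage still re-derives the category count from whatever data it sees at transform time; the OneHotEncoderEstimator introduced in Spark 2.3 (the estimator-style OneHotEncoder in 3.0) learns the category sizes at fit time, which is the usual way around the dimension issue described here.

import org.apache.spark.ml.PipelineModel

// persist the fitted PipelineModel (not just the unfitted Pipeline) and reload it
def saveAndReload(fitted: PipelineModel, path: String): PipelineModel = {
  fitted.write.overwrite().save(path)
  PipelineModel.load(path)
}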

Initializing logistic regression coefficients when using the Spark dataset-based ML APIs?

Submitted by China☆狼群 on 2019-12-24 02:05:14
Question: By default, logistic regression training initializes the coefficients to all zeros. However, I would like to initialize the coefficients myself. This would be useful, for example, if a previous training run crashed after several iterations -- I could simply restart training with the last known set of coefficients. Is this possible with any of the dataset/dataframe-based APIs, preferably in Scala? Looking at the Spark source code, it seems that there is a method setInitialModel to initialize
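
The question is cut off at setInitialModel, which does not appear to be exposed publicly in the DataFrame-based API. One commonly mentioned workaround, sketched here under the assumption that falling back to the older RDD-based spark.mllib API is acceptable, is to pass the previous coefficients as initial weights to run() (previousWeights is a placeholder for the saved coefficient vector):

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// warm-start: optimization begins from the supplied coefficient vector instead of zeros
def resumeTraining(training: RDD[LabeledPoint], previousWeights: Vector) =
  new LogisticRegressionWithLBFGS().setNumClasses(2).run(training, previousWeights)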

How to make binary classification in Spark ML without StringIndexer

Submitted by 爷,独闯天下 on 2019-12-23 04:24:09
Question: I am trying to use the Spark ML DecisionTreeClassifier in a Pipeline without a StringIndexer, because my feature is already indexed as (0.0; 1.0). DecisionTreeClassifier requires double values for the label, so this code should work:

def trainDecisionTreeModel(training: RDD[LabeledPoint], sqlc: SQLContext): Unit = {
  import sqlc.implicits._
  val trainingDF = training.toDF()
  // format of this dataframe: [label: double, features: vector]
  val featureIndexer = new VectorIndexer()
    .setInputCol("features")
    .setOutputCol
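
The snippet is truncated, but a commonly reported stumbling block with this setup is that DecisionTreeClassifier reads the number of classes from label-column metadata that StringIndexer would normally attach. A hedged Scala sketch of attaching that metadata by hand (the two-class count and the "label" column name are assumptions taken from the question):

import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// mark the existing double-valued label column as nominal with two values, so the
// classifier can read the number of classes from the column metadata itself
def withLabelMetadata(trainingDF: DataFrame): DataFrame = {
  val labelMeta = NominalAttribute.defaultAttr
    .withName("label")
    .withNumValues(2)
    .toMetadata()
  trainingDF.withColumn("label", col("label").as("label", labelMeta))
}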

Apache Spark MLlib with DataFrame API gives java.net.URISyntaxException from createDataFrame() or read().csv(…)

Submitted by 与世无争的帅哥 on 2019-12-23 03:46:06
Question: In a standalone application (running on Java 8 and Windows 10 with spark-xxx_2.11:2.0.0 as jar dependencies), the following code gives an error:

/* this: */
Dataset<Row> logData = spark_session.createDataFrame(Arrays.asList(
    new LabeledPoint(1.0, Vectors.dense(4.9,3,1.4,0.2)),
    new LabeledPoint(1.0, Vectors.dense(4.7,3.2,1.3,0.2))
), LabeledPoint.class);

/* or this: */
/* logFile: "C:\files\project\file.csv", "C:\\files\\project\\file.csv", "C:/files/project/file.csv", "file:/C:/files/project/file.csv", "file:
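
The excerpt ends mid-path. For Spark 2.0.0 on Windows, a URISyntaxException at session/DataFrame creation is commonly traced to the default spark.sql.warehouse.dir not being a well-formed URI. A hedged Scala sketch of the usual workaround (the warehouse path here is purely illustrative):

import org.apache.spark.sql.SparkSession

// give the SQL warehouse directory an explicit, well-formed file: URI so that
// Spark 2.0.0 does not derive an invalid one from the Windows working directory
val spark = SparkSession.builder()
  .appName("example")
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "file:///C:/tmp/spark-warehouse")
  .getOrCreate()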

Convert Sparse Vector to Dense Vector in Pyspark

Submitted by 我的梦境 on 2019-12-22 08:10:19
Question: I have a sparse vector like this:

>>> countVectors.rdd.map(lambda vector: vector[1]).collect()
[SparseVector(13, {0: 1.0, 2: 1.0, 3: 1.0, 6: 1.0, 8: 1.0, 9: 1.0, 10: 1.0, 12: 1.0}), SparseVector(13, {0: 1.0, 1: 1.0, 2: 1.0, 4: 1.0}), SparseVector(13, {0: 1.0, 1: 1.0, 3: 1.0, 4: 1.0, 7: 1.0}), SparseVector(13, {1: 1.0, 2: 1.0, 5: 1.0, 11: 1.0})]

I am trying to convert this into a dense vector in PySpark 2.0.0, like this:

>>> frequencyVectors = countVectors.rdd.map(lambda vector: vector[1])
>>>
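
The attempt is cut off here. The same idea expressed in the Scala API, sketched under the assumption that the vectors have been pulled out as an RDD[Vector], is simply to rebuild each vector from its complete value array:

import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// rebuild every vector from its full array of values, yielding dense vectors
// (frequencyVectors stands in for the RDD of vectors extracted above)
def toDense(frequencyVectors: RDD[Vector]) =
  frequencyVectors.map(v => Vectors.dense(v.toArray))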

How can I train a random forest with a sparse matrix in Spark?

Submitted by 假如想象 on 2019-12-22 07:45:06
Question: Consider this simple example that uses sparklyr:

library(sparklyr)
library(janeaustenr) # to get some text data
library(stringr)
library(dplyr)

mytext <- austen_books() %>%
  mutate(label = as.integer(str_detect(text, 'great'))) # create a fake label variable

mytext_spark <- copy_to(sc, mytext, name = 'mytext_spark', overwrite = TRUE)

# Source: table<mytext_spark> [?? x 3]
# Database: spark_connection
  text                  book                label
  <chr>                 <chr>               <int>
1 SENSE AND SENSIBILITY Sense & Sensibility     0
2 ""                    Sense &
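
The printed tibble is truncated, but the point behind the question can be sketched on the Spark ML side in Scala (the column names mirror the example above and numFeatures is an arbitrary choice): text featurizers such as HashingTF emit sparse vectors, and the tree learners consume that Vector column directly, sparse or dense.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// text -> tokens -> sparse term-frequency vectors -> random forest;
// the forest reads the sparse "features" Vector column as-is
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("features")
  .setNumFeatures(1024)
val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, rf))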

Dealing with dynamic columns with VectorAssembler

Submitted by 你说的曾经没有我的故事 on 2019-12-22 00:33:40
Question: When using Spark's VectorAssembler, the columns to be assembled need to be defined up front. However, if the VectorAssembler is used in a pipeline where the previous steps modify the columns of the data frame, how can I specify the columns without hard-coding all the values manually? Since df.columns will not contain the right values when the VectorAssembler's constructor is called, I currently do not see another way to handle that, other than splitting the pipeline - which is bad as well because
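
The question trails off here. One way to sketch the idea in Scala (featureStages is a hypothetical array of already-configured upstream encoder stages): instead of reading df.columns at construction time, derive the assembler's inputs from the output column names the earlier stages declare, which are known before any data flows.

import org.apache.spark.ml.feature.{OneHotEncoder, VectorAssembler}

// wire the assembler from the output columns the upstream stages declare,
// rather than from df.columns at pipeline-construction time
def assemblerFor(featureStages: Array[OneHotEncoder]): VectorAssembler =
  new VectorAssembler()
    .setInputCols(featureStages.map(_.getOutputCol))
    .setOutputCol("features")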

How to pass params to a ML Pipeline.fit method?

Submitted by 帅比萌擦擦* on 2019-12-21 20:20:07
Question: I am trying to build a clustering mechanism using Google Dataproc + Spark and Google BigQuery, creating a job that uses a Spark ML KMeans pipeline, as follows:

1. Create a user-level feature table in BigQuery. Example of how the feature table looks:

userid | x1   | x2 | x3 | x4 | x5 | x6 | x7 | x8   | x9   | x10
00013  | 0.01 | 0  | 0  | 0  | 0  | 0  | 0  | 0.06 | 0.09 | 0.001

2. Spin up a cluster with default settings; I am using the gcloud command line interface to create the cluster and run jobs as shown here.

3. Using the starter code provided, I
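
The excerpt is cut off, but the question in the title can be sketched directly in Scala: parameters can be handed to Pipeline.fit through a ParamMap instead of being hard-coded on the stages. The x1..x10 column names come from the table above; featureTable and the parameter values are illustrative assumptions.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.DataFrame

// assemble the x1..x10 user features and pass the KMeans settings to fit()
// through a ParamMap rather than setting them on the stage definitions
def clusterUsers(featureTable: DataFrame) = {
  val assembler = new VectorAssembler()
    .setInputCols((1 to 10).map(i => s"x$i").toArray)
    .setOutputCol("features")
  val kmeans = new KMeans().setFeaturesCol("features")
  val pipeline = new Pipeline().setStages(Array(assembler, kmeans))
  pipeline.fit(featureTable, ParamMap(kmeans.k -> 4, kmeans.maxIter -> 20))
}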