apache-spark-ml

Creating and applying ml_lib pipeline with external parameter in sparklyr

Submitted by 柔情痞子 on 2019-12-24 11:24:20
Question: I am trying to create and apply a Spark ml_pipeline object that can handle an external parameter that will vary (typically a date). According to the Spark documentation it seems possible: see the part about ParamMap here. However, I haven't found exactly how to do it. I was thinking of something like this:

table.df <- data.frame("a" = c(1,2,3))
table.sdf <- sdf_copy_to(sc, table.df)
param = 5
param2 = 4
# operation declaration
table2.sdf <- table.sdf %>% mutate(test = param)
# pipeline creation
pipeline_1
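
The excerpt cuts off before the pipeline is built, but the ParamMap mechanism the Spark documentation refers to can be sketched on the JVM side. A minimal Scala sketch, not a sparklyr answer: the stage carries a default parameter value, and a ParamMap supplied at call time overrides it, which is how an external value could change on every run (the column names, the Binarizer stage, and the table2 argument are illustrative assumptions).

import org.apache.spark.ml.feature.Binarizer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.DataFrame

// the stage has a default threshold; a ParamMap passed to transform() overrides
// it per call, so the "external" parameter does not need to be baked into the stage
def applyWithParam(table2: DataFrame, externalValue: Double): DataFrame = {
  val binarizer = new Binarizer()
    .setInputCol("a")
    .setOutputCol("test")
    .setThreshold(0.0)
  binarizer.transform(table2, ParamMap(binarizer.threshold -> externalValue))
}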

How to transform a csv string into a Spark-ML compatible Dataset<Row> format?

Submitted by 坚强是说给别人听的谎言 on 2019-12-24 09:52:49
Question: I have a Dataset<Row> df that contains two columns ("key" and "value") of type string. df.printSchema(); gives me the following output:

root
 |-- key: string (nullable = true)
 |-- value: string (nullable = true)

The content of the value column is actually a CSV-formatted line (coming from a Kafka topic), with the last entry of that line representing the class label and all the previous entries being the features (the first row is not included in the dataset):

feature0,feature1,label
0
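
The excerpt is truncated, but a rough Scala sketch of the usual approach follows: split the CSV payload, cast the entries to double, and assemble the feature columns into the Vector column Spark ML expects. A two-feature layout and the column names feature0/feature1 are assumptions taken from the header shown above.

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, split}

// split the CSV string in "value", cast each entry, keep the last one as the label,
// then assemble the feature columns into a single "features" Vector column
def toMlDataset(df: DataFrame): DataFrame = {
  val parsed = df
    .withColumn("parts", split(col("value"), ","))
    .select(
      col("parts").getItem(0).cast("double").as("feature0"),
      col("parts").getItem(1).cast("double").as("feature1"),
      col("parts").getItem(2).cast("double").as("label"))

  new VectorAssembler()
    .setInputCols(Array("feature0", "feature1"))
    .setOutputCol("features")
    .transform(parsed)
}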

Spark: OneHot encoder and storing Pipeline (feature dimension issue)

Submitted by 元气小坏坏 on 2019-12-24 06:25:09
Question: We have a pipeline (2.0.1) consisting of multiple feature transformation stages. Some of these stages are OneHot encoders. Idea: classify an integer-based category into n independent features. When training the pipeline model and using it to predict, everything works fine. However, storing the trained pipeline model and reloading it causes issues: the stored 'trained' OneHot encoder does not keep track of how many categories there are. Loading it now causes issues: when the loaded model is used to
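
The excerpt stops mid-sentence; for context, a minimal Scala sketch of persisting and reloading a fitted pipeline (the path and the fitted argument are placeholders). Note that the pre-2.3 OneHotEncoder is a plain Transformer, so the reloaded stage still re-derives the category count from whatever data it sees at transform time; the OneHotEncoderEstimator introduced in Spark 2.3 (the estimator-style OneHotEncoder in 3.0) learns the category sizes at fit time, which is the usual way around the dimension issue described here.

import org.apache.spark.ml.PipelineModel

// persist the fitted PipelineModel (not just the unfitted Pipeline) and reload it
def saveAndReload(fitted: PipelineModel, path: String): PipelineModel = {
  fitted.write.overwrite().save(path)
  PipelineModel.load(path)
}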

Initializing logistic regression coefficients when using the Spark dataset-based ML APIs?

Submitted by China☆狼群 on 2019-12-24 02:05:14
Question: By default, logistic regression training initializes the coefficients to all zeros. However, I would like to initialize the coefficients myself. This would be useful, for example, if a previous training run crashed after several iterations -- I could simply restart training with the last known set of coefficients. Is this possible with any of the dataset/dataframe-based APIs, preferably in Scala? Looking at the Spark source code, it seems that there is a method setInitialModel to initialize
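
The question is cut off at setInitialModel, which does not appear to be exposed publicly in the DataFrame-based API. One commonly mentioned workaround, sketched here under the assumption that falling back to the older RDD-based spark.mllib API is acceptable, is to pass the previous coefficients as initial weights to run() (previousWeights is a placeholder for the saved coefficient vector):

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// warm-start: optimization begins from the supplied coefficient vector instead of zeros
def resumeTraining(training: RDD[LabeledPoint], previousWeights: Vector) =
  new LogisticRegressionWithLBFGS().setNumClasses(2).run(training, previousWeights)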

How to make binary classification in Spark ML without StringIndexer

Submitted by 爷,独闯天下 on 2019-12-23 04:24:09
Question: I am trying to use the Spark ML DecisionTreeClassifier in a Pipeline without a StringIndexer, because my feature is already indexed as (0.0; 1.0). DecisionTreeClassifier requires double values for the label, so this code should work:

def trainDecisionTreeModel(training: RDD[LabeledPoint], sqlc: SQLContext): Unit = {
  import sqlc.implicits._
  val trainingDF = training.toDF()
  // format of this dataframe: [label: double, features: vector]
  val featureIndexer = new VectorIndexer()
    .setInputCol("features")
    .setOutputCol
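
The snippet is truncated, but a commonly reported stumbling block with this setup is that DecisionTreeClassifier reads the number of classes from label-column metadata that StringIndexer would normally attach. A hedged Scala sketch of attaching that metadata by hand (the two-class count and the "label" column name are assumptions taken from the question):

import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// mark the existing double-valued label column as nominal with two values, so the
// classifier can read the number of classes from the column metadata itself
def withLabelMetadata(trainingDF: DataFrame): DataFrame = {
  val labelMeta = NominalAttribute.defaultAttr
    .withName("label")
    .withNumValues(2)
    .toMetadata()
  trainingDF.withColumn("label", col("label").as("label", labelMeta))
}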

Apache Spark MLlib with DataFrame API gives java.net.URISyntaxException from createDataFrame() or read().csv(…)

Submitted by 与世无争的帅哥 on 2019-12-23 03:46:06
Question: In a standalone application (running on Java 8 and Windows 10 with spark-xxx_2.11:2.0.0 as jar dependencies), the following code gives an error:

/* this: */
Dataset<Row> logData = spark_session.createDataFrame(Arrays.asList(
    new LabeledPoint(1.0, Vectors.dense(4.9,3,1.4,0.2)),
    new LabeledPoint(1.0, Vectors.dense(4.7,3.2,1.3,0.2))
), LabeledPoint.class);

/* or this: */
/* logFile: "C:\files\project\file.csv", "C:\\files\\project\\file.csv", "C:/files/project/file.csv", "file:/C:/files/project/file.csv", "file:
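
The excerpt ends mid-path. For Spark 2.0.0 on Windows, a URISyntaxException at session/DataFrame creation is commonly traced to the default spark.sql.warehouse.dir not being a well-formed URI. A hedged Scala sketch of the usual workaround (the warehouse path here is purely illustrative):

import org.apache.spark.sql.SparkSession

// give the SQL warehouse directory an explicit, well-formed file: URI so that
// Spark 2.0.0 does not derive an invalid one from the Windows working directory
val spark = SparkSession.builder()
  .appName("example")
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "file:///C:/tmp/spark-warehouse")
  .getOrCreate()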

Convert Sparse Vector to Dense Vector in Pyspark

Submitted by 我的梦境 on 2019-12-22 08:10:19
Question: I have a sparse vector like this:

>>> countVectors.rdd.map(lambda vector: vector[1]).collect()
[SparseVector(13, {0: 1.0, 2: 1.0, 3: 1.0, 6: 1.0, 8: 1.0, 9: 1.0, 10: 1.0, 12: 1.0}), SparseVector(13, {0: 1.0, 1: 1.0, 2: 1.0, 4: 1.0}), SparseVector(13, {0: 1.0, 1: 1.0, 3: 1.0, 4: 1.0, 7: 1.0}), SparseVector(13, {1: 1.0, 2: 1.0, 5: 1.0, 11: 1.0})]

I am trying to convert this into a dense vector in PySpark 2.0.0, like this:

>>> frequencyVectors = countVectors.rdd.map(lambda vector: vector[1])
>>>
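
The attempt is cut off here. The same idea expressed in the Scala API, sketched under the assumption that the vectors have been pulled out as an RDD[Vector], is simply to rebuild each vector from its complete value array:

import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// rebuild every vector from its full array of values, yielding dense vectors
// (frequencyVectors stands in for the RDD of vectors extracted above)
def toDense(frequencyVectors: RDD[Vector]) =
  frequencyVectors.map(v => Vectors.dense(v.toArray))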

How can I train a random forest with a sparse matrix in Spark?

Submitted by 假如想象 on 2019-12-22 07:45:06
Question: Consider this simple example that uses sparklyr:

library(sparklyr)
library(janeaustenr) # to get some text data
library(stringr)
library(dplyr)

mytext <- austen_books() %>%
  mutate(label = as.integer(str_detect(text, 'great'))) # create a fake label variable

mytext_spark <- copy_to(sc, mytext, name = 'mytext_spark', overwrite = TRUE)

# Source: table<mytext_spark> [?? x 3]
# Database: spark_connection
  text                  book                label
  <chr>                 <chr>               <int>
1 SENSE AND SENSIBILITY Sense & Sensibility     0
2 ""                    Sense &
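
The printed tibble is truncated, but the point behind the question can be sketched on the Spark ML side in Scala (the column names mirror the example above and numFeatures is an arbitrary choice): text featurizers such as HashingTF emit sparse vectors, and the tree learners consume that Vector column directly, sparse or dense.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// text -> tokens -> sparse term-frequency vectors -> random forest;
// the forest reads the sparse "features" Vector column as-is
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("features")
  .setNumFeatures(1024)
val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, rf))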

Dealing with dynamic columns with VectorAssembler

Submitted by 你说的曾经没有我的故事 on 2019-12-22 00:33:40
Question: When using Spark's VectorAssembler, the columns to be assembled need to be defined up front. However, if the VectorAssembler is used in a pipeline where the previous steps modify the columns of the data frame, how can I specify the columns without hard-coding all the values manually? Since df.columns will not contain the right values when the VectorAssembler's constructor is called, I currently do not see another way to handle that, other than splitting the pipeline - which is bad as well because
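
The question trails off here. One way to sketch the idea in Scala (featureStages is a hypothetical array of already-configured upstream encoder stages): instead of reading df.columns at construction time, derive the assembler's inputs from the output column names the earlier stages declare, which are known before any data flows.

import org.apache.spark.ml.feature.{OneHotEncoder, VectorAssembler}

// wire the assembler from the output columns the upstream stages declare,
// rather than from df.columns at pipeline-construction time
def assemblerFor(featureStages: Array[OneHotEncoder]): VectorAssembler =
  new VectorAssembler()
    .setInputCols(featureStages.map(_.getOutputCol))
    .setOutputCol("features")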

How to pass params to a ML Pipeline.fit method?

Submitted by 帅比萌擦擦* on 2019-12-21 20:20:07
Question: I am trying to build a clustering mechanism using Google Dataproc + Spark and Google BigQuery, creating a job that uses a Spark ML KMeans pipeline, as follows:

1. Create a user-level feature table in BigQuery. Example of how the feature table looks:

userid | x1   | x2 | x3 | x4 | x5 | x6 | x7 | x8   | x9   | x10
00013  | 0.01 | 0  | 0  | 0  | 0  | 0  | 0  | 0.06 | 0.09 | 0.001

2. Spin up a cluster with default settings; I am using the gcloud command line interface to create the cluster and run jobs as shown here.

3. Using the starter code provided, I
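
The excerpt is cut off, but the question in the title can be sketched directly in Scala: parameters can be handed to Pipeline.fit through a ParamMap instead of being hard-coded on the stages. The x1..x10 column names come from the table above; featureTable and the parameter values are illustrative assumptions.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.DataFrame

// assemble the x1..x10 user features and pass the KMeans settings to fit()
// through a ParamMap rather than setting them on the stage definitions
def clusterUsers(featureTable: DataFrame) = {
  val assembler = new VectorAssembler()
    .setInputCols((1 to 10).map(i => s"x$i").toArray)
    .setOutputCol("features")
  val kmeans = new KMeans().setFeaturesCol("features")
  val pipeline = new Pipeline().setStages(Array(assembler, kmeans))
  pipeline.fit(featureTable, ParamMap(kmeans.k -> 4, kmeans.maxIter -> 20))
}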