apache-spark

PySpark: Getting output layer neuron values for Spark ML Multilayer Perceptron Classifier

Submitted by 半城伤御伤魂 on 2021-02-07 09:07:43
Question: I am doing binary classification using the Spark ML Multilayer Perceptron Classifier.

    mlp = MultilayerPerceptronClassifier(labelCol="evt", featuresCol="features",
                                         layers=[inputneurons, (inputneurons * 2) + 1, 2])

The output layer has two neurons, since this is a binary classification problem. Now I would like to get the values of those two output neurons for each row in the test set, instead of just the prediction column containing either 0 or 1. I could not find anything for this in the API documentation.

Answer 1:
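
If you are on Spark 2.3 or later, the fitted MLP model extends ProbabilisticClassificationModel, so the transformed output already carries rawPrediction and probability columns holding the per-class values of the output layer. A minimal Scala sketch of that idea (the PySpark API mirrors it); inputNeurons, train and test are placeholders:

    // Sketch, assuming Spark >= 2.3, where MultilayerPerceptronClassificationModel
    // exposes rawPrediction and probability columns.
    import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

    val mlp = new MultilayerPerceptronClassifier()
      .setLabelCol("evt")
      .setFeaturesCol("features")
      .setLayers(Array(inputNeurons, inputNeurons * 2 + 1, 2))

    val model = mlp.fit(train)

    // "probability" is the softmax over the two output neurons,
    // "rawPrediction" the corresponding raw output-layer values.
    model.transform(test)
      .select("rawPrediction", "probability", "prediction")
      .show(false)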

Connecting oracle database using spark with Kerberos authentication?

Submitted by 本小妞迷上赌 on 2021-02-07 09:01:15
Question: My JDBC client connects to the Oracle database through Krb5LoginModule without any issue, using either a keytab file or a ticket cache. However, for performance reasons, I want to connect to the Oracle database from Spark. With a plain username and password I can connect my Spark application to Oracle using the snippet below:

    Dataset<Row> empDF = sparkSession.read().format("jdbc")
        .option("url", "jdbc:oracle:thin:hr/1234@//127.0.0.1:1522/orcl")
        .option("dbtable", "hr.employees")
        //.option("user", "hr")
        //
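
For the Kerberos case, one approach often suggested (sketched below, untested) is to drop the credentials from the URL and pass the Oracle thin driver's Kerberos connection properties through the JDBC reader options; a valid ticket cache or keytab login must then be visible on the driver and on every executor, since each task opens its own connection. The URL, cache path and property values are assumptions taken from Oracle's JDBC documentation:

    // Untested sketch (Scala; the Java reader chain is analogous). Spark forwards
    // extra reader options to the JDBC driver as connection properties, so the
    // Oracle thin driver's Kerberos settings can be passed this way.
    val empDF = sparkSession.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//127.0.0.1:1522/orcl")
      .option("dbtable", "hr.employees")
      .option("driver", "oracle.jdbc.OracleDriver")
      .option("oracle.net.authentication_services", "(KERBEROS5)")
      .option("oracle.net.kerberos5_cc_name", "/tmp/krb5cc_1000")
      .option("oracle.net.kerberos5_mutual_authentication", "true")
      .load()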

Delete functionality with spark sql dataframe

Submitted by 泪湿孤枕 on 2021-02-07 08:47:36
Question: I have a requirement to load and delete specific records from a Postgres database in my Spark application. For loading, I am using a Spark DataFrame as shown below:

    sqlContext.read.format("jdbc").options(Map(
      "url" -> "postgres url",
      "user" -> "user",
      "password" -> "xxxxxx",
      "table" -> "(select * from employee where emp_id > 1000) as filtered_emp")).load()

To delete the data, I am issuing SQL directly instead of using DataFrames:

    delete from employee where emp_id > 1000

The question is: is there
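
As far as the delete goes, Spark's DataFrame API has no delete operation for JDBC sources, so issuing the DELETE through plain JDBC from the driver is a reasonable pattern. A minimal sketch; the URL and credentials are placeholders matching the read options above:

    // Sketch: run the DELETE with a plain JDBC statement from the driver,
    // since DataFrames cannot express deletes against a JDBC source.
    import java.sql.DriverManager

    val conn = DriverManager.getConnection("jdbc:postgresql://host:5432/db", "user", "xxxxxx")
    try {
      val deleted = conn.createStatement()
        .executeUpdate("delete from employee where emp_id > 1000")
      println(s"Deleted $deleted rows")
    } finally {
      conn.close()
    }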

Iterate each row in a dataframe, store it in val and pass as parameter to Spark SQL query

Submitted by 浪尽此生 on 2021-02-07 08:44:35
Question: I am trying to fetch the rows of a lookup table (3 rows, 3 columns), iterate over them row by row, and pass the values from each row as parameters to a Spark SQL query.

    DB | TBL   | COL
    ----------------
    db | txn   | ID
    db | sales | ID
    db | fee   | ID

I tried this in the spark-shell for one row and it worked, but I am finding it difficult to iterate over the rows.

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val db_name: String = "db"
    val tbl_name: String = "transaction"
    val unique_col: String = "transaction_number"
    val
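
For a lookup table this small, one option (sketched below) is to collect() it to the driver and loop over the rows, substituting each row's values into the query text; lookup_db.lookup_tbl and the aggregate query are placeholders:

    // Sketch: collect() the small lookup table and build one Spark SQL query per row.
    val lookup = sqlContext.table("lookup_db.lookup_tbl")   // columns: DB, TBL, COL

    lookup.collect().foreach { row =>
      val db  = row.getAs[String]("DB")
      val tbl = row.getAs[String]("TBL")
      val col = row.getAs[String]("COL")

      // Substitute the row's values into the query text.
      val result = sqlContext.sql(s"SELECT COUNT(DISTINCT $col) AS cnt FROM $db.$tbl")
      result.show()
    }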

Spark CSV 2.1 File Names

Submitted by 假装没事ソ on 2021-02-07 08:32:28
Question: I'm trying to save a DataFrame to CSV using the new Spark 2.1 CSV writer:

    df.select(myColumns: _*).write
      .mode(SaveMode.Overwrite)
      .option("header", "true")
      .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
      .csv(absolutePath)

Everything works fine and I don't mind having the part-000XX prefix, but now it seems a UUID has been added as a suffix, i.e. I get part-00032-10309cf5-a373-4233-8b28-9e10ed279d2b.csv.gz where I want part-00032.csv.gz. Does anyone know how I can remove this suffix and stay only with
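
Spark itself does not let you choose the part-file names, so one workaround (sketched below) is to rename the written files with the Hadoop FileSystem API once the write finishes; df and absolutePath are the same names as above:

    // Sketch: strip the UUID block from each part file, turning
    // "part-00032-<uuid>.csv.gz" into "part-00032.csv.gz".
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(df.sparkSession.sparkContext.hadoopConfiguration)
    fs.listStatus(new Path(absolutePath))
      .map(_.getPath)
      .filter(_.getName.startsWith("part-"))
      .foreach { p =>
        val cleaned = p.getName.replaceAll("^(part-\\d+)-[^.]+", "$1")
        fs.rename(p, new Path(p.getParent, cleaned))
      }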

Mocking SparkSession for unit testing

Submitted by 本小妞迷上赌 on 2021-02-07 08:15:24
Question: I have a method in my Spark application that loads data from a MySQL database. The method looks something like this:

    trait DataManager {
      val session: SparkSession

      def loadFromDatabase(input: Input): DataFrame = {
        session.read.jdbc(input.jdbcUrl, s"(${input.selectQuery}) T0", input.columnName,
          0L, input.maxId, input.parallelism, input.connectionProperties)
      }
    }

The method does nothing other than call the jdbc method and load data from the database. How can I test this method? The
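
One pragmatic alternative to mocking SparkSession, sketched below, is to instantiate the trait in the test with a local-mode session and point the JDBC options at an embedded H2 database. ScalaTest, the H2 driver on the test classpath, and an Input case class with the fields used above are all assumptions here:

    // Sketch: exercise loadFromDatabase against a real local SparkSession and an
    // in-memory H2 database instead of a mocked SparkSession. The Input case class
    // and its field names are assumed from the snippet above.
    import java.sql.DriverManager
    import java.util.Properties

    import org.apache.spark.sql.SparkSession
    import org.scalatest.funsuite.AnyFunSuite

    class DataManagerSuite extends AnyFunSuite {

      test("loadFromDatabase returns the rows behind the select query") {
        val url = "jdbc:h2:mem:testdb;DB_CLOSE_DELAY=-1"

        // Seed the embedded database with plain JDBC.
        val conn = DriverManager.getConnection(url)
        conn.createStatement().execute("CREATE TABLE users (id BIGINT, name VARCHAR(50))")
        conn.createStatement().execute("INSERT INTO users VALUES (1, 'a'), (2, 'b')")
        conn.close()

        val manager = new DataManager {
          val session: SparkSession =
            SparkSession.builder().master("local[2]").appName("test").getOrCreate()
        }

        val input = Input(
          jdbcUrl = url,
          selectQuery = "SELECT id, name FROM users",
          columnName = "id",
          maxId = 2L,
          parallelism = 1,
          connectionProperties = new Properties())

        assert(manager.loadFromDatabase(input).count() == 2)
      }
    }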
