apache-spark

Load Dataset from Dynamically generated Case Class

Submitted by 蹲街弑〆低调 on 2021-02-04 08:06:24
Question: What is needed: the number of tables in the source database changes rapidly, so I don't want to edit case classes by hand; instead I generate them dynamically with Scala code and put them in a package. But now I am not able to read them dynamically. If this works, I would parse "com.example.datasources.fileSystemSource.schema.{}" as object schema members in a loop. What has already been done: I have some case classes dynamically generated from the schema of the database tables, as below: object schema { case class Users…
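The excerpt is cut off above, but for orientation, this is roughly how a Dataset is normally bound to a statically known case class, which is what the question is trying to generalize to generated classes. A minimal sketch, not from the thread: the Users fields, the source path, and the parquet format are assumptions for illustration.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Stand-in for the generated definitions in the question's schema object
object schema {
  case class Users(id: Long, name: String)
}

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// .as[T] needs an Encoder, which is resolved at compile time -- this is why a
// class that only exists as generated source can't simply be referenced by name
val users: Dataset[schema.Users] =
  spark.read.parquet("/data/users").as[schema.Users]
```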

Why does SparkSQL require two literal escape backslashes in the SQL query?

Submitted by 旧巷老猫 on 2021-02-04 07:13:39
Question: When I run the Scala code below from the Spark 2.0 REPL (spark-shell), it runs as I intended, splitting the string with a simple regular expression. import org.apache.spark.sql.SparkSession // Create session val sparkSession = SparkSession.builder.master("local").getOrCreate() // Use SparkSQL to split a string val query = "SELECT split('What is this? A string I think', '\\\\?') AS result" println("The query is: " + query) val dataframe = sparkSession.sql(query) // Show the result dataframe…
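As a side-by-side sketch of what the question describes (not an answer from the thread): the pattern goes through two rounds of unescaping when it is embedded in a SQL string, but only one when the DataFrame API is called directly. sparkSession is the session created in the excerpt above.

```scala
import org.apache.spark.sql.functions.{lit, split}

// Via the SQL parser the pattern is unescaped twice:
// Scala source "\\\\?"  ->  SQL text \\?  ->  regex \?  (a literal question mark)
val viaSql = sparkSession.sql(
  "SELECT split('What is this? A string I think', '\\\\?') AS result")

// Calling split() directly skips the SQL parser, so one round of escaping is
// enough: the Scala source "\\?" is already the regex \?
val viaApi = sparkSession.range(1)
  .select(split(lit("What is this? A string I think"), "\\?").as("result"))
```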

Spark Shell Add Multiple Drivers/Jars to Classpath using spark-defaults.conf

Submitted by 前提是你 on 2021-02-04 06:51:17
Question: We are using spark-shell REPL mode to test various use cases and connect to multiple sources/sinks. We need to add custom drivers/jars in the spark-defaults.conf file. I have tried to add multiple jars separated by commas, like: spark.driver.extraClassPath = /home/sandeep/mysql-connector-java-5.1.36.jar spark.executor.extraClassPath = /home/sandeep/mysql-connector-java-5.1.36.jar But it's not working. Can anyone please provide the correct syntax? Answer 1: As an example in addition to Prateek's…
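The answer is cut off above; as a general reference (not necessarily the thread's exact fix), extraClassPath entries are joined with the platform's path separator (a colon on Linux) rather than commas, while spark.jars takes a comma-separated list. A minimal spark-defaults.conf sketch, where the second jar name is only a placeholder:

```
# spark-defaults.conf -- keys and values separated by whitespace
spark.driver.extraClassPath    /home/sandeep/mysql-connector-java-5.1.36.jar:/home/sandeep/some-other-driver.jar
spark.executor.extraClassPath  /home/sandeep/mysql-connector-java-5.1.36.jar:/home/sandeep/some-other-driver.jar

# Alternatively, spark.jars takes a comma-separated list and puts the jars on
# both the driver and executor classpaths
spark.jars  /home/sandeep/mysql-connector-java-5.1.36.jar,/home/sandeep/some-other-driver.jar
```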

Running spark-submit programs on a different cluster (1**.1*.0.21) from Airflow (1**.1*.0.35): how to connect to the other cluster remotely from Airflow

Submitted by 走远了吗. on 2021-01-29 22:41:16
Question: I have been trying to spark-submit programs from Airflow, but the Spark files are on a different cluster (1**.1*.0.21) and Airflow is on (1**.1*.0.35). I am looking for a detailed explanation of this topic with examples. I can't copy or download any XML files or other files to my Airflow cluster. When I try the SSH hook it fails, and I have many doubts about using SSHOperator and BashOperator: Broken DAG: [/opt/airflow/dags/s.py] No module named paramiko Answer 1: You can try using Livy. In the following…
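The answer is cut off above, but for context, Livy exposes an HTTP API on the Spark cluster, so Airflow only needs network access to that endpoint instead of SSH. A minimal sketch of a batch submission, assuming Livy is running on the remote cluster on its default port 8998; the jar path and class name are placeholders:

```
curl -X POST http://1**.1*.0.21:8998/batches \
  -H 'Content-Type: application/json' \
  -d '{
        "file": "hdfs:///apps/my-spark-app.jar",
        "className": "com.example.MySparkJob",
        "args": ["arg1"]
      }'
```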

In Spark, how to do One Hot Encoding for top N frequent values only?

Submitted by ◇◆丶佛笑我妖孽 on 2021-01-29 22:22:16
Question: Say that in my dataframe df I have a column my_category with different values, and I can view the value counts using: df.groupBy("my_category").count().show() value count a 197 b 166 c 210 d 5 e 2 f 9 g 3 Now I'd like to apply One Hot Encoding (OHE) on this column, but for the top N most frequent values only (say N = 3), and put all the remaining infrequent values in a dummy column (say "default"). E.g., the output should be something like: a b c default 0 0 1 0 1 0 0 0 0 1 0 0 1 0 0 0 ... 0
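The thread's answers are not shown here; a minimal Scala sketch of one way to do it, assuming df and my_category from the question (topN, the bucket column name, and the output column names are illustrative):

```scala
import org.apache.spark.sql.functions._

val topN = 3

// The N most frequent values of my_category
val topValues = df.groupBy("my_category").count()
  .orderBy(desc("count"))
  .limit(topN)
  .collect()
  .map(_.getString(0))

// Bucket everything outside the top N into "default"
val bucketed = df.withColumn("category_bucket",
  when(col("my_category").isin(topValues: _*), col("my_category"))
    .otherwise(lit("default")))

// One 0/1 column per kept value, plus the "default" column
val encoded = (topValues :+ "default").foldLeft(bucketed) { (acc, v) =>
  acc.withColumn(v, when(col("category_bucket") === v, 1).otherwise(0))
}
```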

Discard bad records and load only good records into a dataframe from a JSON file in pyspark

Submitted by 不羁的心 on 2021-01-29 21:35:52
Question: The JSON file generated by the API looks like below. The format of the JSON file is not correct. Can we discard the bad records and load only the good rows into a dataframe using pyspark? { "name": "PowerAmplifier", "Component": "12uF Capacitor\n1/21Resistor\n3 Inductor In Henry\PowerAmplifier\n ", "url": "https://www.onsemi.com/products/amplifiers-comparators/", "image": "https://www.onsemi.com/products/amplifiers-comparators/", "ThresholdTime": "48min", "MFRDate": "2019-05-08", "FallTime":…
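Not from the thread's answers, but for reference, Spark's JSON reader has a mode option that controls how malformed records are handled; a minimal Scala sketch (pyspark's spark.read accepts the same options), with the input path as a placeholder:

```scala
// Drop records that fail to parse; only well-formed rows reach the dataframe
val good = spark.read
  .option("multiLine", "true")        // each JSON object may span several lines
  .option("mode", "DROPMALFORMED")    // silently discard malformed records
  .json("/path/to/input.json")

// Alternative: keep malformed records in a side column for inspection
val withCorrupt = spark.read
  .option("multiLine", "true")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .json("/path/to/input.json")
```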

How to extract JSON data from Zeppelin SQL

Submitted by 家住魔仙堡 on 2021-01-29 21:31:38
Question: I query the test_tbl table in Zeppelin. The table structure looks like this: %sql desc stg.test_tbl col_name | data_type | comment id | string | title | string | tags | string | The tags column holds JSON data such as: {"name":[{"family": null, "first": "nelson"}, {"pos_code":{"house":"tlv", "id":"A12YR"}}]} and I want to see the JSON data as columns, so my query is: select *, tag.* from stg.test_tbl as t lateral view explode(t.tags.name) name as name lateral view explode…
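Not taken from the thread's answers: since tags is declared as string, the dotted access t.tags.name in the query above has nothing to explode until the JSON is parsed. One common workaround is get_json_object (or from_json with an explicit schema); a minimal Zeppelin %sql sketch, with the JSON paths following the sample value in the question:

```sql
%sql
select id,
       title,
       get_json_object(tags, '$.name[0].first')          as first_name,
       get_json_object(tags, '$.name[1].pos_code.house') as pos_house,
       get_json_object(tags, '$.name[1].pos_code.id')    as pos_id
from stg.test_tbl
```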
