apache-spark

How to join two JDBC tables and avoid Exchange?

╄→гoц情女王★ submitted on 2021-02-09 03:00:56

Question: I have an ETL-like scenario in which I read data from multiple JDBC tables and files, perform some aggregations, and join the sources. In one step I have to join two JDBC tables. I tried something like:

    val df1 = spark.read.format("jdbc")
      .option("url", Database.DB_URL)
      .option("user", Database.DB_USER)
      .option("password", Database.DB_PASSWORD)
      .option("dbtable", tableName)
      .option("driver", Database.DB_DRIVER)
      .option("upperBound", data.upperBound)
      .option("lowerBound", data
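Since the question's excerpt (and its answer) is cut off here, the following is only a sketch of one common workaround, not the accepted answer: Spark does not know how a JDBC source is partitioned, so a sort-merge join of two JDBC tables normally forces a shuffle Exchange on the join key; if one table is small enough, broadcasting it replaces the shuffle with a broadcast exchange. The snippet is in PySpark for consistency with the other excerpts on this page; the connection settings, table names, bounds, and the join key "id" are all placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("jdbc-join").getOrCreate()

    # Hypothetical connection settings; substitute your own URL, credentials and driver.
    jdbc_opts = {
        "url": "jdbc:postgresql://db-host:5432/mydb",
        "user": "user",
        "password": "secret",
        "driver": "org.postgresql.Driver",
    }

    df1 = (spark.read.format("jdbc")
           .options(**jdbc_opts)
           .option("dbtable", "big_table")
           .option("partitionColumn", "id")   # parallel read of the large table
           .option("lowerBound", "1")
           .option("upperBound", "1000000")
           .option("numPartitions", "16")
           .load())

    df2 = (spark.read.format("jdbc")
           .options(**jdbc_opts)
           .option("dbtable", "small_table")
           .load())

    # Broadcasting the small side avoids a shuffle Exchange on df1:
    # the large table is joined in place instead of being repartitioned.
    joined = df1.join(broadcast(df2), "id")
    joined.explain()

If neither table fits in memory, the usual alternatives are to pre-aggregate in the database itself or to persist both sides as bucketed tables on the join key before joining.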

Spark - set null when a column does not exist in the DataFrame

假如想象 submitted on 2021-02-09 02:51:04

Question: I'm loading many versions of JSON files into a Spark DataFrame. Some of the files hold columns A and B, and some hold A, B, C or A, C. If I run this command:

    from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)
    df = sqlContext.sql("SELECT A,B,C FROM table")

then after loading several files I can get the error "column not exist", because I loaded only files that do not hold column C. How can I set this value to null instead of getting the error?

Answer 1: The DataFrameReader.json method provides an optional schema argument you can use
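Following the answer's pointer to the optional schema argument, here is a minimal sketch: declaring an explicit schema that already contains A, B and C means files missing C still load, with that column filled with nulls. The file path and the use of StringType for all columns are assumptions for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.appName("json-schema").getOrCreate()

    # Declare every column you want up front; files that lack C simply
    # get null in that column instead of failing the SELECT.
    schema = StructType([
        StructField("A", StringType(), True),
        StructField("B", StringType(), True),
        StructField("C", StringType(), True),
    ])

    df = spark.read.json("/path/to/json/files", schema=schema)  # hypothetical path
    df.createOrReplaceTempView("table")
    spark.sql("SELECT A, B, C FROM table").show()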

Spark packages flag vs jars dir?

别来无恙 submitted on 2021-02-09 02:48:37

Question: In Spark, what's the difference between adding JARs to the classpath via the --packages argument and just adding the JARs directly to the $SPARK_HOME/jars directory?

Answer 1: TL;DR --jars is for local or remote jar files specified by URL and does not resolve dependencies; --packages is for Maven coordinates and does resolve dependencies. From the docs:

--jars When using spark-submit, the application jar along with any jars included with the --jars option will be automatically transferred to the
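To make the distinction concrete, here is a sketch of the equivalent session-level configuration, using the spark.jars.packages and spark.jars properties (these mirror --packages and --jars when the session is created from a fresh Python process; with spark-submit the command-line flags are the usual route). The Maven coordinate and jar paths below are placeholders.

    from pyspark.sql import SparkSession

    # --packages equivalent: a Maven coordinate; transitive dependencies are
    # resolved from Maven Central (and the local ~/.ivy2 cache) at startup.
    spark = (SparkSession.builder
             .appName("with-packages")
             .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.1.2")
             .getOrCreate())

    # --jars equivalent: explicit jar files (local path or URL); nothing else is
    # pulled in, so every transitive dependency has to be listed by hand.
    # (Shown commented out because one process only gets one SparkSession.)
    # spark = (SparkSession.builder
    #          .appName("with-jars")
    #          .config("spark.jars", "/opt/libs/my-lib.jar,/opt/libs/its-dep.jar")
    #          .getOrCreate())

Dropping jars into $SPARK_HOME/jars is different again: those jars land on the classpath of every application on that installation, with no per-job isolation and no dependency resolution.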

How to sort a column with Date and time values in Spark?

假如想象 submitted on 2021-02-08 15:12:08

Question: Note: I have this as a DataFrame in Spark. These time/date values constitute a single column in the DataFrame.

Input:
04-NOV-16 03.36.13.000000000 PM
06-NOV-15 03.42.21.000000000 PM
05-NOV-15 03.32.05.000000000 PM
06-NOV-15 03.32.14.000000000 AM

Expected output:
05-NOV-15 03.32.05.000000000 PM
06-NOV-15 03.32.14.000000000 AM
06-NOV-15 03.42.21.000000000 PM
04-NOV-16 03.36.13.000000000 PM

Answer 1: As this format is not standard, you need to use the unix_timestamp function to parse the string and
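The answer is truncated above, so here is a sketch of the unix_timestamp approach it starts to describe. The column name "dt" and the format pattern are assumptions based on the sample values; on Spark 3.x the stricter datetime parser may reject uppercase month abbreviations like "NOV", so the legacy parser policy is enabled below and the pattern may need adjusting for your data.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import unix_timestamp, col

    spark = SparkSession.builder.appName("sort-dates").getOrCreate()
    # Spark 3.x is strict about pattern letters and case; fall back to the
    # legacy SimpleDateFormat parser if the new one rejects the pattern.
    spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

    df = spark.createDataFrame(
        [("04-NOV-16 03.36.13.000000000 PM",),
         ("06-NOV-15 03.42.21.000000000 PM",),
         ("05-NOV-15 03.32.05.000000000 PM",),
         ("06-NOV-15 03.32.14.000000000 AM",)],
        ["dt"])

    # Parse the custom format into epoch seconds and sort on that value,
    # keeping the original string column untouched.
    pattern = "dd-MMM-yy hh.mm.ss.SSSSSSSSS a"
    df.orderBy(unix_timestamp(col("dt"), pattern)).show(truncate=False)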

Can Dataframe joins in Spark preserve order?

余生颓废 submitted on 2021-02-08 13:50:43

Question: I'm currently trying to join two DataFrames together while retaining the order of one of them. From "Which operations preserve RDD order?", it seems (correct me if this is inaccurate, because I'm new to Spark) that joins do not preserve order: rows are joined and "arrive" at the final DataFrame in no particular order, because the data sit in different partitions. How could one perform a join of two DataFrames while preserving the order of one table? E.g.:

+------------+---------
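The excerpt (and any answer) is cut off here, so the following is only one common workaround rather than the accepted answer: tag the side whose order matters with an index before the join, then sort by that index afterwards. The column names and sample data are made up for illustration; monotonically_increasing_id only reflects the existing partition order, so for strict positional numbering an RDD zipWithIndex is sometimes used instead.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import monotonically_increasing_id

    spark = SparkSession.builder.appName("ordered-join").getOrCreate()

    left = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["key", "left_val"])
    right = spark.createDataFrame([("c", 30), ("a", 10), ("b", 20)], ["key", "right_val"])

    # Remember each row's original position on the side whose order matters.
    # The ids are increasing but not consecutive, which is enough for sorting.
    left_indexed = left.withColumn("_order", monotonically_increasing_id())

    # The join itself may reshuffle rows; restore the order explicitly afterwards.
    result = (left_indexed.join(right, "key")
              .orderBy("_order")
              .drop("_order"))
    result.show()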

Creating datetime from string column in Pyspark [duplicate]

守給你的承諾、 submitted on 2021-02-08 12:11:23

Question: This question already has answers here: Convert pyspark string to date format (6 answers). Closed 3 years ago.

Suppose I have the datetime column shown below. I want to convert the column from string to a datetime type so I can extract months, days, years and so on.

+---+------------+
|agg|    datetime|
+---+------------+
|  A|1/2/17 12:00|
|  B|        null|
|  C|1/4/17 15:00|
+---+------------+

I have tried the code below, but the values returned in the datetime column are nulls,
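The attempted code is truncated above; per the linked duplicate, parsing a non-ISO string needs an explicit format. Here is a sketch assuming the values look like "1/2/17 12:00" (month/day/two-digit-year 24-hour time); the exact pattern may need adjusting, and null inputs simply stay null.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import to_timestamp, year, month, dayofmonth, col

    spark = SparkSession.builder.appName("parse-datetime").getOrCreate()

    df = spark.createDataFrame(
        [("A", "1/2/17 12:00"), ("B", None), ("C", "1/4/17 15:00")],
        ["agg", "datetime"])

    # to_timestamp returns null when the string does not match the pattern,
    # which is why parsing without an explicit format produced nulls.
    parsed = df.withColumn("ts", to_timestamp(col("datetime"), "M/d/yy H:mm"))

    parsed.select("agg", "ts",
                  year("ts").alias("year"),
                  month("ts").alias("month"),
                  dayofmonth("ts").alias("day")).show()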