apache-spark

How to join two JDBC tables and avoid Exchange?

╄→гoц情女王★ submitted on 2021-02-09 03:00:56

Question: I have an ETL-like scenario in which I read data from multiple JDBC tables and files, perform some aggregations, and join the sources. In one step I have to join two JDBC tables. I tried something like:

    val df1 = spark.read.format("jdbc")
      .option("url", Database.DB_URL)
      .option("user", Database.DB_USER)
      .option("password", Database.DB_PASSWORD)
      .option("dbtable", tableName)
      .option("driver", Database.DB_DRIVER)
      .option("upperBound", data.upperBound)
      .option("lowerBound", data
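Since the question's excerpt (and its answer) is cut off here, the following is only a sketch of one common workaround, not the accepted answer: Spark does not know how a JDBC source is partitioned, so a sort-merge join of two JDBC tables normally forces a shuffle Exchange on the join key; if one table is small enough, broadcasting it replaces the shuffle with a broadcast exchange. The snippet is in PySpark for consistency with the other excerpts on this page; the connection settings, table names, bounds, and the join key "id" are all placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("jdbc-join").getOrCreate()

    # Hypothetical connection settings; substitute your own URL, credentials and driver.
    jdbc_opts = {
        "url": "jdbc:postgresql://db-host:5432/mydb",
        "user": "user",
        "password": "secret",
        "driver": "org.postgresql.Driver",
    }

    df1 = (spark.read.format("jdbc")
           .options(**jdbc_opts)
           .option("dbtable", "big_table")
           .option("partitionColumn", "id")   # parallel read of the large table
           .option("lowerBound", "1")
           .option("upperBound", "1000000")
           .option("numPartitions", "16")
           .load())

    df2 = (spark.read.format("jdbc")
           .options(**jdbc_opts)
           .option("dbtable", "small_table")
           .load())

    # Broadcasting the small side avoids a shuffle Exchange on df1:
    # the large table is joined in place instead of being repartitioned.
    joined = df1.join(broadcast(df2), "id")
    joined.explain()

If neither table fits in memory, the usual alternatives are to pre-aggregate in the database itself or to persist both sides as bucketed tables on the join key before joining.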

Spark - set null when a column does not exist in the DataFrame

假如想象 submitted on 2021-02-09 02:51:04

Question: I'm loading many versions of JSON files into a Spark DataFrame. Some of the files hold columns A and B, and some hold A, B, C or A, C. If I run this command:

    from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)
    df = sqlContext.sql("SELECT A,B,C FROM table")

then after loading several files I can get the error "column not exist", because I loaded only files that do not hold column C. How can I set this value to null instead of getting the error?

Answer 1: The DataFrameReader.json method provides an optional schema argument you can use
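Following the answer's pointer to the optional schema argument, here is a minimal sketch: declaring an explicit schema that already contains A, B and C means files missing C still load, with that column filled with nulls. The file path and the use of StringType for all columns are assumptions for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.appName("json-schema").getOrCreate()

    # Declare every column you want up front; files that lack C simply
    # get null in that column instead of failing the SELECT.
    schema = StructType([
        StructField("A", StringType(), True),
        StructField("B", StringType(), True),
        StructField("C", StringType(), True),
    ])

    df = spark.read.json("/path/to/json/files", schema=schema)  # hypothetical path
    df.createOrReplaceTempView("table")
    spark.sql("SELECT A, B, C FROM table").show()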

Spark packages flag vs jars dir?

别来无恙 submitted on 2021-02-09 02:48:37

Question: In Spark, what's the difference between adding JARs to the classpath via the --packages argument and just adding the JARs directly to the $SPARK_HOME/jars directory?

Answer 1: TL;DR --jars is for local or remote jar files specified by URL and does not resolve dependencies; --packages is for Maven coordinates and does resolve dependencies. From the docs:

--jars When using spark-submit, the application jar along with any jars included with the --jars option will be automatically transferred to the
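To make the distinction concrete, here is a sketch of the equivalent session-level configuration, using the spark.jars.packages and spark.jars properties (these mirror --packages and --jars when the session is created from a fresh Python process; with spark-submit the command-line flags are the usual route). The Maven coordinate and jar paths below are placeholders.

    from pyspark.sql import SparkSession

    # --packages equivalent: a Maven coordinate; transitive dependencies are
    # resolved from Maven Central (and the local ~/.ivy2 cache) at startup.
    spark = (SparkSession.builder
             .appName("with-packages")
             .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.1.2")
             .getOrCreate())

    # --jars equivalent: explicit jar files (local path or URL); nothing else is
    # pulled in, so every transitive dependency has to be listed by hand.
    # (Shown commented out because one process only gets one SparkSession.)
    # spark = (SparkSession.builder
    #          .appName("with-jars")
    #          .config("spark.jars", "/opt/libs/my-lib.jar,/opt/libs/its-dep.jar")
    #          .getOrCreate())

Dropping jars into $SPARK_HOME/jars is different again: those jars land on the classpath of every application on that installation, with no per-job isolation and no dependency resolution.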

How to sort a column with Date and time values in Spark?

假如想象 submitted on 2021-02-08 15:12:08

Question: Note: I have this as a DataFrame in Spark. These time/date values constitute a single column in the DataFrame.

Input:
04-NOV-16 03.36.13.000000000 PM
06-NOV-15 03.42.21.000000000 PM
05-NOV-15 03.32.05.000000000 PM
06-NOV-15 03.32.14.000000000 AM

Expected output:
05-NOV-15 03.32.05.000000000 PM
06-NOV-15 03.32.14.000000000 AM
06-NOV-15 03.42.21.000000000 PM
04-NOV-16 03.36.13.000000000 PM

Answer 1: As this format is not standard, you need to use the unix_timestamp function to parse the string and
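The answer is truncated above, so here is a sketch of the unix_timestamp approach it starts to describe. The column name "dt" and the format pattern are assumptions based on the sample values; on Spark 3.x the stricter datetime parser may reject uppercase month abbreviations like "NOV", so the legacy parser policy is enabled below and the pattern may need adjusting for your data.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import unix_timestamp, col

    spark = SparkSession.builder.appName("sort-dates").getOrCreate()
    # Spark 3.x is strict about pattern letters and case; fall back to the
    # legacy SimpleDateFormat parser if the new one rejects the pattern.
    spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

    df = spark.createDataFrame(
        [("04-NOV-16 03.36.13.000000000 PM",),
         ("06-NOV-15 03.42.21.000000000 PM",),
         ("05-NOV-15 03.32.05.000000000 PM",),
         ("06-NOV-15 03.32.14.000000000 AM",)],
        ["dt"])

    # Parse the custom format into epoch seconds and sort on that value,
    # keeping the original string column untouched.
    pattern = "dd-MMM-yy hh.mm.ss.SSSSSSSSS a"
    df.orderBy(unix_timestamp(col("dt"), pattern)).show(truncate=False)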

Can Dataframe joins in Spark preserve order?

余生颓废 submitted on 2021-02-08 13:50:43

Question: I'm currently trying to join two DataFrames together while retaining the order of one of them. From "Which operations preserve RDD order?", it seems (correct me if this is inaccurate, because I'm new to Spark) that joins do not preserve order: rows are joined and "arrive" at the final DataFrame in no particular order, because the data sit in different partitions. How could one perform a join of two DataFrames while preserving the order of one table? E.g.:

+------------+---------
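The excerpt (and any answer) is cut off here, so the following is only one common workaround rather than the accepted answer: tag the side whose order matters with an index before the join, then sort by that index afterwards. The column names and sample data are made up for illustration; monotonically_increasing_id only reflects the existing partition order, so for strict positional numbering an RDD zipWithIndex is sometimes used instead.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import monotonically_increasing_id

    spark = SparkSession.builder.appName("ordered-join").getOrCreate()

    left = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["key", "left_val"])
    right = spark.createDataFrame([("c", 30), ("a", 10), ("b", 20)], ["key", "right_val"])

    # Remember each row's original position on the side whose order matters.
    # The ids are increasing but not consecutive, which is enough for sorting.
    left_indexed = left.withColumn("_order", monotonically_increasing_id())

    # The join itself may reshuffle rows; restore the order explicitly afterwards.
    result = (left_indexed.join(right, "key")
              .orderBy("_order")
              .drop("_order"))
    result.show()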

Creating datetime from string column in Pyspark [duplicate]

守給你的承諾、 submitted on 2021-02-08 12:11:23

Question: This question already has answers here: Convert pyspark string to date format (6 answers). Closed 3 years ago.

Suppose I have the datetime column shown below. I want to convert the column from string to a datetime type so I can extract months, days, years and so on.

+---+------------+
|agg|    datetime|
+---+------------+
|  A|1/2/17 12:00|
|  B|        null|
|  C|1/4/17 15:00|
+---+------------+

I have tried the code below, but the values returned in the datetime column are nulls,
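The attempted code is truncated above; per the linked duplicate, parsing a non-ISO string needs an explicit format. Here is a sketch assuming the values look like "1/2/17 12:00" (month/day/two-digit-year 24-hour time); the exact pattern may need adjusting, and null inputs simply stay null.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import to_timestamp, year, month, dayofmonth, col

    spark = SparkSession.builder.appName("parse-datetime").getOrCreate()

    df = spark.createDataFrame(
        [("A", "1/2/17 12:00"), ("B", None), ("C", "1/4/17 15:00")],
        ["agg", "datetime"])

    # to_timestamp returns null when the string does not match the pattern,
    # which is why parsing without an explicit format produced nulls.
    parsed = df.withColumn("ts", to_timestamp(col("datetime"), "M/d/yy H:mm"))

    parsed.select("agg", "ts",
                  year("ts").alias("year"),
                  month("ts").alias("month"),
                  dayofmonth("ts").alias("day")).show()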