apache-spark-sql

pyspark : Flattening of records coming from input file

Submitted by 一世执手 on 2021-02-05 08:10:35
Question: I have an input CSV file like the one below:

plant_id, system1_id, system2_id, system3_id
A1 s1-111 s2-111 s3-111
A2 s1-222 s2-222 s3-222
A3 s1-333 s2-333 s3-333

I want to flatten the records like this:

plant_id system_id system_name
A1 s1-111 system1
A1 s2-111 system2
A1 s3-111 system3
A2 s1-222 system1
A2 s2-222 system2
A2 s3-222 system3
A3 s1-333 system1
A3 s2-333 system2
A3 s3-333 system3

Currently I am able to achieve it by creating a transposed PySpark DataFrame for each system column and then
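A minimal sketch of one way to do this unpivot in a single pass, using the SQL stack function instead of building one transposed DataFrame per column. The DataFrame contents and names below are assumptions based on the question's layout:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()

# Hypothetical input matching the question's layout.
df = spark.createDataFrame(
    [("A1", "s1-111", "s2-111", "s3-111"),
     ("A2", "s1-222", "s2-222", "s3-222"),
     ("A3", "s1-333", "s2-333", "s3-333")],
    ["plant_id", "system1_id", "system2_id", "system3_id"],
)

# stack(n, name1, col1, name2, col2, ...) unpivots the three system columns
# into (system_name, system_id) pairs, producing one output row per pair.
flattened = df.select(
    "plant_id",
    expr("stack(3, 'system1', system1_id, 'system2', system2_id, "
         "'system3', system3_id) as (system_name, system_id)"),
).select("plant_id", "system_id", "system_name")

flattened.show()
```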

How to get year and week number aligned for a date

Submitted by 自闭症网瘾萝莉.ら on 2021-02-04 21:00:06
Question: While trying to get the year and week number for a range of dates spanning multiple years, I am running into some issues around the start/end of the year. I understand the logic for the week number and for the year when they run separately. However, when they are combined, in some cases they don't give consistent results, and I was wondering what the best way in Spark is to make sure those scenarios are handled with a consistent year for the given week number. For example, running: spark.sql(
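A sketch of one way to keep the two aligned, assuming ISO semantics are wanted: weekofyear already returns the ISO week, and the matching week-based year can be derived as the calendar year of the Thursday in the same ISO week. The sample dates and column names are made up for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr, weekofyear

spark = SparkSession.builder.getOrCreate()

# Dates around year boundaries, where year() and weekofyear() disagree.
df = spark.createDataFrame(
    [("2019-12-30",), ("2020-01-01",), ("2021-01-03",)], ["d"]
).withColumn("d", col("d").cast("date"))

result = (
    df
    # ISO week number (Monday-based weeks; week 1 contains the first Thursday).
    .withColumn("week", weekofyear("d"))
    # ISO day of week: dayofweek() is 1=Sunday..7=Saturday, remap to 1=Monday..7=Sunday.
    .withColumn("iso_dow", expr("((dayofweek(d) + 5) % 7) + 1"))
    # The week-based year is the calendar year of that week's Thursday.
    .withColumn("week_year", expr("year(date_add(d, 4 - iso_dow))"))
    .drop("iso_dow")
)

result.show()
# 2019-12-30 -> week 1 of week_year 2020; 2021-01-03 -> week 53 of week_year 2020.
```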

Spark treating null values in csv column as null datatype

Submitted by 点点圈 on 2021-02-04 18:07:22
Question: My Spark application reads a CSV file, transforms it to a different format with SQL, and writes the resulting DataFrame to a different CSV file. For example, I have input CSV as follows:

Id|FirstName|LastName|LocationId
1|John|Doe|123
2|Alex|Doe|234

My transformation is:

Select Id, FirstName, LastName, LocationId as PrimaryLocationId, null as SecondaryLocationId from Input

(I can't answer why the null is being used as SecondaryLocationId; it is a business use case.) Now Spark can't figure out the
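A bare null literal gets NullType, which the CSV writer cannot map to a real column type. A minimal sketch of the usual workaround, casting the literal to a concrete type; the column names follow the question, while the target type of string and the file paths are assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input matching the question's pipe-delimited layout.
df = spark.read.csv("input.csv", sep="|", header=True, inferSchema=True)
df.createOrReplaceTempView("Input")

# CAST the null literal so the column gets a real data type (string here is an
# assumption; use whatever type SecondaryLocationId should eventually hold).
out = spark.sql("""
    SELECT Id, FirstName, LastName,
           LocationId AS PrimaryLocationId,
           CAST(NULL AS STRING) AS SecondaryLocationId
    FROM Input
""")

out.printSchema()  # SecondaryLocationId: string (nullable = true)
out.write.csv("output", sep="|", header=True, mode="overwrite")
```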

How to speed up spark df.write jdbc to postgres database?

Submitted by 最后都变了- on 2021-02-04 12:16:14
Question: I am new to Spark and am attempting to speed up appending the contents of a DataFrame (which can have between 200k and 2M rows) to a Postgres database using df.write:

df.write.format('jdbc').options(
    url=psql_url_spark,
    driver=spark_env['PSQL_DRIVER'],
    dbtable="{schema}.{table}".format(schema=schema, table=table),
    user=spark_env['PSQL_USER'],
    password=spark_env['PSQL_PASS'],
    batchsize=2000000,
    queryTimeout=690
).mode(mode).save()

I tried increasing the batchsize but that didn't help, as
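A sketch of the knobs that usually matter for JDBC writes to Postgres, reusing the question's variables where possible: write from several partitions in parallel, keep batchsize in the low thousands, and let the PostgreSQL JDBC driver rewrite batches into multi-row inserts (reWriteBatchedInserts is a driver URL option, not a Spark option). The partition count and URL below are illustrative assumptions:

```python
# Hypothetical values standing in for the question's variables.
num_partitions = 8  # parallel JDBC connections; tune to what Postgres can absorb
psql_url_spark = "jdbc:postgresql://dbhost:5432/mydb?reWriteBatchedInserts=true"

(df.repartition(num_partitions)          # each partition writes over its own connection
   .write
   .format("jdbc")
   .options(
       url=psql_url_spark,
       driver="org.postgresql.Driver",
       dbtable="{schema}.{table}".format(schema=schema, table=table),
       user=spark_env["PSQL_USER"],
       password=spark_env["PSQL_PASS"],
       numPartitions=num_partitions,     # upper bound on concurrent JDBC connections
       batchsize=10000,                  # rows per JDBC batch; millions per batch buys nothing
   )
   .mode("append")
   .save())
```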

How do I check for equality using Spark Dataframe without SQL Query?

Submitted by 守給你的承諾、 on 2021-02-04 09:14:41
Question: I want to select a column that equals a certain value. I am doing this in Scala and having a little trouble. Here's my code:

df.select(df("state")==="TX").show()

This returns the state column with boolean values instead of just TX. I've also tried

df.select(df("state")=="TX").show()

but this doesn't work either.

Answer 1: I had the same issue, and the following syntax worked for me:

df.filter(df("state")==="TX").show()

I'm using Spark 1.6.

Answer 2: There is another simple SQL-like option. With Spark 1
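The question and answers are in Scala; for reference, a minimal PySpark sketch of the same idea, filtering rows rather than selecting a boolean column. The DataFrame contents are made up:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("TX", 1), ("CA", 2), ("TX", 3)], ["state", "value"]
)

# select(condition) would return a boolean column; filter(condition) keeps
# only the rows where the condition holds, which is what the question wants.
df.filter(col("state") == "TX").show()

# Equivalent SQL-like form:
df.where("state = 'TX'").show()
```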

Why does SparkSQL require two literal escape backslashes in the SQL query?

Submitted by 旧巷老猫 on 2021-02-04 07:13:39
Question: When I run the Scala code below from the Spark 2.0 REPL (spark-shell), it runs as I intended, splitting the string with a simple regular expression.

import org.apache.spark.sql.SparkSession

// Create session
val sparkSession = SparkSession.builder.master("local").getOrCreate()

// Use SparkSQL to split a string
val query = "SELECT split('What is this? A string I think', '\\\\?') AS result"
println("The query is: " + query)
val dataframe = sparkSession.sql(query)

// Show the result dataframe
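A sketch of why the source code needs four backslashes: the string literal then contains two, the Spark SQL parser unescapes them to one, and the regex engine finally sees an escaped, literal question mark. The same behaviour can be reproduced from PySpark with the question's example string:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").getOrCreate()

# Four backslashes in source -> "\\?" in the query string -> "\?" after the
# SQL parser unescapes the literal -> the regex matches a literal '?'.
query = "SELECT split('What is this? A string I think', '\\\\?') AS result"
spark.sql(query).show(truncate=False)
# -> [What is this,  A string I think]

# A lone '?' without escaping would not work, because '?' by itself is not a
# valid regular expression (the quantifier has nothing to repeat).
```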
