apache-spark-sql

pyspark : Flattening of records coming from input file

Submitted by 一世执手 on 2021-02-05 08:10:35
Question: I have an input CSV file like the one below:

plant_id, system1_id, system2_id, system3_id
A1 s1-111 s2-111 s3-111
A2 s1-222 s2-222 s3-222
A3 s1-333 s2-333 s3-333

I want to flatten the records like this:

plant_id system_id system_name
A1 s1-111 system1
A1 s2-111 system2
A1 s3-111 system3
A2 s1-222 system1
A2 s2-222 system2
A2 s3-222 system3
A3 s1-333 system1
A3 s2-333 system2
A3 s3-333 system3

Currently I am able to achieve it by creating a transposed PySpark DataFrame for each system column and then
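A minimal sketch of one way to do this unpivot in a single pass, using the SQL stack function instead of building one transposed DataFrame per column. The DataFrame contents and names below are assumptions based on the question's layout:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()

# Hypothetical input matching the question's layout.
df = spark.createDataFrame(
    [("A1", "s1-111", "s2-111", "s3-111"),
     ("A2", "s1-222", "s2-222", "s3-222"),
     ("A3", "s1-333", "s2-333", "s3-333")],
    ["plant_id", "system1_id", "system2_id", "system3_id"],
)

# stack(n, name1, col1, name2, col2, ...) unpivots the three system columns
# into (system_name, system_id) pairs, producing one output row per pair.
flattened = df.select(
    "plant_id",
    expr("stack(3, 'system1', system1_id, 'system2', system2_id, "
         "'system3', system3_id) as (system_name, system_id)"),
).select("plant_id", "system_id", "system_name")

flattened.show()
```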

How to get year and week number aligned for a date

Submitted by 自闭症网瘾萝莉.ら on 2021-02-04 21:00:06
Question: While trying to get the year and week number for a range of dates spanning multiple years, I am running into some issues around the start/end of the year. I understand the logic for the week number and for the year when they run separately. However, when they are combined, in some cases they don't give consistent results, and I was wondering what the best way in Spark is to make sure those scenarios are handled with a consistent year for the given week number. For example, running: spark.sql(
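A sketch of one way to keep the two aligned, assuming ISO semantics are wanted: weekofyear already returns the ISO week, and the matching week-based year can be derived as the calendar year of the Thursday in the same ISO week. The sample dates and column names are made up for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr, weekofyear

spark = SparkSession.builder.getOrCreate()

# Dates around year boundaries, where year() and weekofyear() disagree.
df = spark.createDataFrame(
    [("2019-12-30",), ("2020-01-01",), ("2021-01-03",)], ["d"]
).withColumn("d", col("d").cast("date"))

result = (
    df
    # ISO week number (Monday-based weeks; week 1 contains the first Thursday).
    .withColumn("week", weekofyear("d"))
    # ISO day of week: dayofweek() is 1=Sunday..7=Saturday, remap to 1=Monday..7=Sunday.
    .withColumn("iso_dow", expr("((dayofweek(d) + 5) % 7) + 1"))
    # The week-based year is the calendar year of that week's Thursday.
    .withColumn("week_year", expr("year(date_add(d, 4 - iso_dow))"))
    .drop("iso_dow")
)

result.show()
# 2019-12-30 -> week 1 of week_year 2020; 2021-01-03 -> week 53 of week_year 2020.
```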

Spark treating null values in csv column as null datatype

Submitted by 点点圈 on 2021-02-04 18:07:22
Question: My Spark application reads a CSV file, transforms it to a different format with SQL, and writes the resulting DataFrame to a different CSV file. For example, I have input CSV as follows:

Id|FirstName|LastName|LocationId
1|John|Doe|123
2|Alex|Doe|234

My transformation is:

Select Id, FirstName, LastName, LocationId as PrimaryLocationId, null as SecondaryLocationId from Input

(I can't answer why the null is being used as SecondaryLocationId; it is a business use case.) Now Spark can't figure out the
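A bare null literal gets NullType, which the CSV writer cannot map to a real column type. A minimal sketch of the usual workaround, casting the literal to a concrete type; the column names follow the question, while the target type of string and the file paths are assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input matching the question's pipe-delimited layout.
df = spark.read.csv("input.csv", sep="|", header=True, inferSchema=True)
df.createOrReplaceTempView("Input")

# CAST the null literal so the column gets a real data type (string here is an
# assumption; use whatever type SecondaryLocationId should eventually hold).
out = spark.sql("""
    SELECT Id, FirstName, LastName,
           LocationId AS PrimaryLocationId,
           CAST(NULL AS STRING) AS SecondaryLocationId
    FROM Input
""")

out.printSchema()  # SecondaryLocationId: string (nullable = true)
out.write.csv("output", sep="|", header=True, mode="overwrite")
```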

How to speed up spark df.write jdbc to postgres database?

Submitted by 最后都变了- on 2021-02-04 12:16:14
Question: I am new to Spark and am attempting to speed up appending the contents of a DataFrame (which can have between 200k and 2M rows) to a Postgres database using df.write:

df.write.format('jdbc').options(
    url=psql_url_spark,
    driver=spark_env['PSQL_DRIVER'],
    dbtable="{schema}.{table}".format(schema=schema, table=table),
    user=spark_env['PSQL_USER'],
    password=spark_env['PSQL_PASS'],
    batchsize=2000000,
    queryTimeout=690
).mode(mode).save()

I tried increasing the batchsize but that didn't help, as
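A sketch of the knobs that usually matter for JDBC writes to Postgres, reusing the question's variables where possible: write from several partitions in parallel, keep batchsize in the low thousands, and let the PostgreSQL JDBC driver rewrite batches into multi-row inserts (reWriteBatchedInserts is a driver URL option, not a Spark option). The partition count and URL below are illustrative assumptions:

```python
# Hypothetical values standing in for the question's variables.
num_partitions = 8  # parallel JDBC connections; tune to what Postgres can absorb
psql_url_spark = "jdbc:postgresql://dbhost:5432/mydb?reWriteBatchedInserts=true"

(df.repartition(num_partitions)          # each partition writes over its own connection
   .write
   .format("jdbc")
   .options(
       url=psql_url_spark,
       driver="org.postgresql.Driver",
       dbtable="{schema}.{table}".format(schema=schema, table=table),
       user=spark_env["PSQL_USER"],
       password=spark_env["PSQL_PASS"],
       numPartitions=num_partitions,     # upper bound on concurrent JDBC connections
       batchsize=10000,                  # rows per JDBC batch; millions per batch buys nothing
   )
   .mode("append")
   .save())
```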

How do I check for equality using Spark Dataframe without SQL Query?

Submitted by 守給你的承諾、 on 2021-02-04 09:14:41
Question: I want to select a column that equals a certain value. I am doing this in Scala and having a little trouble. Here's my code:

df.select(df("state")==="TX").show()

This returns the state column with boolean values instead of just TX. I've also tried

df.select(df("state")=="TX").show()

but this doesn't work either.

Answer 1: I had the same issue, and the following syntax worked for me:

df.filter(df("state")==="TX").show()

I'm using Spark 1.6.

Answer 2: There is another simple SQL-like option. With Spark 1
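The question and answers are in Scala; for reference, a minimal PySpark sketch of the same idea, filtering rows rather than selecting a boolean column. The DataFrame contents are made up:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("TX", 1), ("CA", 2), ("TX", 3)], ["state", "value"]
)

# select(condition) would return a boolean column; filter(condition) keeps
# only the rows where the condition holds, which is what the question wants.
df.filter(col("state") == "TX").show()

# Equivalent SQL-like form:
df.where("state = 'TX'").show()
```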

Why does SparkSQL require two literal escape backslashes in the SQL query?

Submitted by 旧巷老猫 on 2021-02-04 07:13:39
Question: When I run the Scala code below from the Spark 2.0 REPL (spark-shell), it runs as I intended, splitting the string with a simple regular expression.

import org.apache.spark.sql.SparkSession

// Create session
val sparkSession = SparkSession.builder.master("local").getOrCreate()

// Use SparkSQL to split a string
val query = "SELECT split('What is this? A string I think', '\\\\?') AS result"
println("The query is: " + query)
val dataframe = sparkSession.sql(query)

// Show the result dataframe
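A sketch of why the source code needs four backslashes: the string literal then contains two, the Spark SQL parser unescapes them to one, and the regex engine finally sees an escaped, literal question mark. The same behaviour can be reproduced from PySpark with the question's example string:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").getOrCreate()

# Four backslashes in source -> "\\?" in the query string -> "\?" after the
# SQL parser unescapes the literal -> the regex matches a literal '?'.
query = "SELECT split('What is this? A string I think', '\\\\?') AS result"
spark.sql(query).show(truncate=False)
# -> [What is this,  A string I think]

# A lone '?' without escaping would not work, because '?' by itself is not a
# valid regular expression (the quantifier has nothing to repeat).
```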
