apache-spark-sql

PySpark explode list into multiple columns based on name

Submitted by 若如初见. on 2021-02-15 12:01:02
Question: Hi, I'm dealing with a slightly difficult file format which I'm trying to clean for some future processing. I've been using PySpark to process the data into a dataframe. The file looks similar to this:

AA 1234 ZXYW
BB A 890
CC B 321
AA 1234 LMNO
BB D 123
CC E 321
AA 1234 ZXYW
CC E 456

Each 'AA' record defines the start of a logical group of records, and the data on each line is fixed length and has information encoded in it that I want to extract. There are at least 20-30 different record …
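The full answer depends on the actual record layouts, but as a starting point, here is a minimal PySpark sketch of reading such a file as plain text and slicing fixed-width fields per record type; the file path and the character offsets are illustrative assumptions, not part of the original question.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical file path; each line starts with a 2-character record type
raw = spark.read.text("records.txt")

# Slice fixed-width fields with substring(); the offsets below are examples
# only and must be adjusted to the real record layout
parsed = raw.select(
    F.substring("value", 1, 2).alias("record_type"),   # e.g. AA / BB / CC
    F.trim(F.substring("value", 4, 4)).alias("field_1"),
    F.trim(F.substring("value", 9, 8)).alias("field_2"),
)
parsed.show(truncate=False)
```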

Spark Data set transformation to array [duplicate]

Submitted by ↘锁芯ラ on 2021-02-11 18:16:14
Question: This question already has answers here: How to aggregate values into collection after groupBy? (3 answers). Closed 8 months ago. I have a dataset like the one below, with values of col1 repeating multiple times and unique values of col2. The original dataset can be almost a billion rows, so I do not want to use collect or collect_list, as it will not scale out for my use case. Original dataset:

+------+------+
| col1 | col2 |
+------+------+
|   AA |   11 |
|   BB |   21 |
|   AA |   12 |
|   AA |   13 |
…
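Per the linked duplicate, the usual approach is a groupBy aggregation with collect_list, which runs as a distributed aggregation on the executors, unlike df.collect(), which pulls every row back to the driver. A minimal sketch with made-up sample rows:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample matching the shape of the original dataset
df = spark.createDataFrame(
    [("AA", 11), ("BB", 21), ("AA", 12), ("AA", 13)],
    ["col1", "col2"],
)

# collect_list inside agg() is evaluated per group on the executors;
# only the grouped result is returned when an action is run
result = df.groupBy("col1").agg(F.collect_list("col2").alias("col2_list"))
result.show()
```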

Computing First Day of Previous Quarter in Spark SQL

Submitted by 不想你离开。 on 2021-02-11 17:55:52
Question: How do I derive the first day of the last quarter pertaining to any given date in a Spark SQL query using the SQL API? A few required samples are below:

input_date | start_date
-----------------------
2020-01-21 | 2019-10-01
2020-02-06 | 2019-10-01
2020-04-15 | 2020-01-01
2020-07-10 | 2020-04-01
2020-10-20 | 2020-07-01
2021-02-04 | 2020-10-01

The quarters generally are:

1 | Jan - Mar
2 | Apr - Jun
3 | Jul - Sep
4 | Oct - Dec

Note: I am using Spark SQL v2.4. Any help is appreciated. Thanks.
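One approach that reproduces the sample output with Spark 2.4 SQL functions is to truncate the date to the start of its quarter with date_trunc and then step back three months with add_months. A sketch; the "dates" temp view and its sample rows are only for the example:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data registered as a temp view named 'dates'
spark.createDataFrame([("2020-01-21",), ("2020-07-10",)], ["input_date"]) \
     .selectExpr("to_date(input_date) AS input_date") \
     .createOrReplaceTempView("dates")

spark.sql("""
    SELECT input_date,
           add_months(date_trunc('quarter', input_date), -3) AS start_date
    FROM dates
""").show()
```

For 2020-01-21 this truncates to 2020-01-01 and then yields 2019-10-01, matching the first sample row.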

How can I make pyspark and SparkSQL execute Hive on Spark?

Submitted by 独自空忆成欢 on 2021-02-11 16:59:59
Question: I've installed and set up Spark on YARN, together with integrating Spark with Hive tables. Using spark-shell / pyspark, I also followed the simple tutorial and managed to create a Hive table, load data, and then select properly. Then I moved on to the next step, setting up Hive on Spark. Using hive / beeline, I also managed to create a Hive table, load data, and then select properly. Hive is executed on YARN/Spark properly. How do I know it works? The hive shell displays the following: - hive> select …
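For the pyspark / Spark SQL side, reading Hive tables only requires Hive support on the SparkSession; note that this is Spark's own engine querying the Hive metastore, which is not the same thing as Hive on Spark (where hive/beeline submits Spark jobs). A minimal sketch; the app name and table name are hypothetical:

```python
from pyspark.sql import SparkSession

# Spark SQL with Hive support: Spark uses the Hive metastore but executes
# queries with its own engine (this is not Hive-on-Spark)
spark = (SparkSession.builder
         .appName("pyspark-hive-tables")      # hypothetical app name
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SHOW TABLES").show()
spark.sql("SELECT * FROM some_hive_table LIMIT 10").show()  # hypothetical table
```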

Converting dataframe to dictionary in pyspark without using pandas

Submitted by 大城市里の小女人 on 2021-02-11 16:55:20
Question: Following up on this question and its dataframes, I am trying to convert a dataframe into a dictionary. In pandas I was using this:

dictionary = df_2.unstack().to_dict(orient='index')

However, I need to convert this code to pyspark. Can anyone help me with this? As I understand from previous questions such as this one, I would indeed need to use pandas, but the dataframe is way too big for me to be able to do this. How can I solve this? EDIT: I have now tried the following approach:

dictionary_list = map …
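A hedged sketch of two pandas-free options; note that building a Python dict still materializes the result on the driver, so neither fully sidesteps the size concern. The column names and sample rows below are made up for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical two-column frame standing in for df_2
df = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])

# Option 1: pair RDD collected straight into a dict on the driver
as_dict = df.rdd.map(lambda row: (row["key"], row["value"])).collectAsMap()

# Option 2: stream rows partition by partition instead of collecting all at once
as_dict_streamed = {row["key"]: row["value"] for row in df.toLocalIterator()}

print(as_dict)
print(as_dict_streamed)
```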

Spark SQL lag: result gets different rows when I change column

Submitted by 允我心安 on 2021-02-11 14:32:41
Question: I'm trying to lag a field when it matches certain conditions, and because I need to use filters, I'm using the MAX function to lag it, as the LAG function itself doesn't work the way I need it to. I have been able to do it with the code below for ID_EVENT_LOG, but when I change the ID_EVENT_LOG inside the MAX to the column ENSAIO, so that I would lag the column ENSAIO, it doesn't work properly. Example below. Dataset:

+------------+---------+------+
|ID_EVENT_LOG|ID_PAINEL|ENSAIO|
+------------+ …
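A minimal PySpark sketch of the MAX-over-a-window pattern the question describes, with made-up sample rows; the condition inside when() and the partition/order columns are illustrative assumptions, not the question's real logic:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample rows; column names follow the question's dataset
df = spark.createDataFrame(
    [(1, "P1", "E1"), (2, "P1", None), (3, "P1", "E2")],
    ["ID_EVENT_LOG", "ID_PAINEL", "ENSAIO"],
)

# Window over all preceding rows; MAX + when() keeps only rows matching the
# condition (here, ENSAIO IS NOT NULL, purely as an example filter)
w = (Window.partitionBy("ID_PAINEL")
           .orderBy("ID_EVENT_LOG")
           .rowsBetween(Window.unboundedPreceding, -1))

df = df.withColumn(
    "LAGGED_ENSAIO",
    F.max(F.when(F.col("ENSAIO").isNotNull(), F.col("ENSAIO"))).over(w),
)
df.show()
```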

Spark Scala Checkpointing Data Set showing .isCheckpointed = false after Action but directories written

Submitted by 元气小坏坏 on 2021-02-11 14:21:49
Question: There seem to be a few postings on this, but none seem to answer what I understand. The following code, run on Databricks:

spark.sparkContext.setCheckpointDir("/dbfs/FileStore/checkpoint/cp1/loc7")
val checkpointDir = spark.sparkContext.getCheckpointDir.get
val ds = spark.range(10).repartition(2)
ds.cache()
ds.checkpoint()
ds.count()
ds.rdd.isCheckpointed

Added an improvement of sorts:

...
val ds2 = ds.checkpoint(eager=true)
println(ds2.queryExecution.toRdd.toDebugString)
...

returns: (2) …
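For comparison, the same sequence expressed in PySpark with an eager checkpoint; the checkpoint directory below is hypothetical, and eager=True materializes the checkpoint at the call site rather than at the next action:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical checkpoint location; use a directory the cluster can write to
spark.sparkContext.setCheckpointDir("/tmp/checkpoint/loc7")

ds = spark.range(10).repartition(2)
ds2 = ds.checkpoint(eager=True)   # returns a new DataFrame backed by the checkpoint
ds2.count()
print(ds2.rdd.toDebugString())    # lineage of the checkpointed result
```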

How to parse dynamic Json with dynamic keys inside it in Scala

Submitted by 一个人想着一个人 on 2021-02-11 12:56:58
Question: I am trying to parse a JSON structure which is dynamic in nature and load it into a database. But I am facing difficulty where the JSON has dynamic keys inside it. Below is my sample JSON. I have tried using the explode function, but it didn't help; a mostly similar thing is described here: How to parse a dynamic JSON key in a Nested JSON result?

{
  "_id": {
    "planId": "5f34dab0c661d8337097afb9",
    "version": { "$numberLong": "1" },
    "period": { "name": "3Q20", "startDate": 20200629, "endDate": 20200927 },
    "line": "b443e9c0 …
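One common way to cope with dynamic keys (shown here in PySpark for illustration; from_json, explode, and MapType also exist in the Scala API) is to parse the object as a map<string,string> and explode the map into key/value rows. A sketch with a hypothetical input path:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, from_json
from pyspark.sql.types import MapType, StringType

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: one JSON document per line
raw = spark.read.text("plans.json")

# Parse the top-level object as map<string,string> so unknown keys survive;
# nested values remain JSON strings that can be parsed further if needed
parsed = raw.select(
    from_json(col("value"), MapType(StringType(), StringType())).alias("doc")
)
kv = parsed.select(explode(col("doc")).alias("key", "json_value"))
kv.show(truncate=False)
```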