pyspark

PySpark, aggregate complex function (difference of consecutive events)

こ雲淡風輕ζ Submitted on 2021-02-05 11:15:48
Question: I have a DataFrame (df) whose columns are userid (the user id) and day (the day). I'm interested in computing, for every user, the average time interval between the days he/she was active. For instance, for a given user the DataFrame may look something like this:

```
userid  day
1       2016-09-18
1       2016-09-20
1       2016-09-25
```

If the DataFrame were a Pandas DataFrame, I could compute the quantity I'm interested in like this:

```python
import numpy as np
np.mean(np.diff(df[df.userid == 1].day))
```

However, this is quite …
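A minimal PySpark sketch of one way to compute this per-user average gap, using a window with lag; it assumes `day` is a date column (or an ISO-formatted string Spark can cast to a date), and the output column name is illustrative.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("userid").orderBy("day")

avg_gaps = (
    df.withColumn("prev_day", F.lag("day").over(w))       # previous active day per user
      .withColumn("gap", F.datediff("day", "prev_day"))   # days between consecutive events
      .groupBy("userid")
      .agg(F.avg("gap").alias("avg_interval_days"))       # mean gap per user
)
avg_gaps.show()
```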

How to count in PySpark? [closed]

你。 Submitted on 2021-02-05 09:46:36
Question: I have a huge list of titles and I want to count how often each title occurs in the whole data set. For example:

```
title
A
b
A
c
c
c
```

Expected output:

```
title  fre
A      2
b      1
c      3
```

Answer 1: You can just groupBy title and then count:

```python
import pyspark.sql.functions as f
df.groupBy('title').agg(f.count('*').alias('fre')).show()
```
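A self-contained sketch of the same groupBy/count approach, with the sample data from the question recreated inline; the `fre` column name simply mirrors the expected output above.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('A',), ('b',), ('A',), ('c',), ('c',), ('c',)], ['title'])

df.groupBy('title').agg(f.count('*').alias('fre')).show()
# Equivalent shorthand: df.groupBy('title').count().show()
```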

Normalize a complex nested JSON file

心不动则不痛 Submitted on 2021-02-05 09:37:41
Question: I'm trying to normalize the JSON file below into 4 tables: "content", "Modules", "Images", and everything else in another table.

```json
{
  "id": "0000050a",
  "revision": 1580225050941,
  "slot": "product-description",
  "type": "E",
  "create_date": 1580225050941,
  "modified_date": 1580225050941,
  "creator": "Auto",
  "modifier": "Auto",
  "audit_info": {
    "date": 1580225050941,
    "source": "AutoService",
    "username": "Auto"
  },
  "total_ID": 1,
  "name": "Auto_A1AM78C64UM0Y8_B07JCJR5HW",
  "content": [{
    "ID": ["B01"],
…
```
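The full schema is truncated above, so this is only a hedged sketch of how the split might start in PySpark: one flat table for the top-level scalars plus the `audit_info` struct, and one table per element of the `content` array. Anything nested below `content` (such as modules or images arrays) is an assumption, and the input path is a placeholder.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
raw = spark.read.json("input.json", multiLine=True)  # hypothetical path, one JSON document per file

# "Everything else" table: top-level scalar fields plus the flattened audit_info struct.
everything_else = raw.select("id", "revision", "slot", "type", "name", "total_ID", "audit_info.*")

# "content" table: one row per element of the content array, keyed by the document id.
content = (raw.select("id", F.explode("content").alias("content"))
              .select("id", "content.*"))
```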

pyspark: groupby and aggregate avg and first on multiple columns

我的梦境 Submitted on 2021-02-05 09:26:33
Question: I have the following sample PySpark dataframe, and after a groupby I want to calculate the mean and the first value of multiple columns. In the real case I have hundreds of columns, so I can't do it individually:

```python
sp = spark.createDataFrame([['a', 2, 4, 'cc', 'anc'],
                            ['a', 4, 7, 'cd', 'abc'],
                            ['b', 6, 0, 'as', 'asd'],
                            ['b', 2, 4, 'ad', 'acb'],
                            ['c', 4, 4, 'sd', 'acc']],
                           ['id', 'col1', 'col2', 'col3', 'col4'])
```

```
+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
|  a|   2|   4|  cc| anc|
|  a|   4|   7|  cd| abc|
|  b|   6|   0|  as| asd|
|  b|   2|   4|  ad| acb|
|  c|   4|   4|  sd| acc|
+---+----+----+----+----+
```
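A sketch of building the aggregation expressions programmatically instead of column by column; applying avg to the numeric columns and first to the string columns is one reading of "mean and first of multiple columns", and the output aliases are illustrative.

```python
from pyspark.sql import functions as F

numeric_cols = [c for c, t in sp.dtypes if t in ('int', 'bigint', 'double') and c != 'id']
string_cols = [c for c, t in sp.dtypes if t == 'string' and c != 'id']

aggs = ([F.avg(c).alias('avg_' + c) for c in numeric_cols]
        + [F.first(c).alias('first_' + c) for c in string_cols])

sp.groupBy('id').agg(*aggs).show()
```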

Spark structured streaming with Kafka leads to only one batch (PySpark)

扶醉桌前 Submitted on 2021-02-05 08:47:26
Question: I have the following code and I'm wondering why it generates only one batch:

```python
df = (spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "IP")
      .option("subscribe", "Topic")
      .option("startingOffsets", "earliest")
      .load())

# groupby on sliding windows

query = (slidingWindowsDF.writeStream.queryName("bla")
         .outputMode("complete").format("memory").start())
```

The application is launched with the following parameters:

```
spark.streaming.backpressure.initialRate 5
spark.streaming.backpressure…
```
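The sliding-window aggregation is omitted in the question, so this is only a hedged sketch of what `slidingWindowsDF` could look like; the `timestamp` column comes from the Kafka source schema, but the window and slide durations are assumptions.

```python
from pyspark.sql import functions as F

slidingWindowsDF = (
    df.selectExpr("CAST(value AS STRING) AS value", "timestamp")
      .groupBy(F.window("timestamp", "10 minutes", "5 minutes"))  # 10-minute windows sliding every 5 minutes
      .count()
)
```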

Unable to write PySpark Dataframe created from two zipped dataframes

有些话、适合烂在心里 Submitted on 2021-02-05 08:32:40
Question: I am trying to follow the example given here for combining two dataframes without a shared join key (combining by "index" as in a database table or pandas dataframe, except that PySpark does not have that concept).

My code:

```python
left_df = left_df.repartition(right_df.rdd.getNumPartitions())  # FWIW, num of partitions = 303
joined_schema = StructType(left_df.schema.fields + right_df.schema.fields)
interim_rdd = left_df.rdd.zip(right_df.rdd).map(lambda x: x[0] + x[1])
full_data = spark.createDataFrame(interim_rdd, joined_schema)
```
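An alternative sketch that avoids rdd.zip()'s requirement that both RDDs have identical partitioning and per-partition row counts: attach an explicit row index to each side with zipWithIndex and join on it. It assumes `spark` is the active session (as in the question) and that left_df and right_df have no overlapping column names.

```python
from pyspark.sql.types import LongType, StructField, StructType

def with_row_index(df, idx_col="row_idx"):
    # zipWithIndex pairs every row with its position; append that position as a column
    indexed_rdd = df.rdd.zipWithIndex().map(lambda pair: tuple(pair[0]) + (pair[1],))
    schema = StructType(list(df.schema.fields) + [StructField(idx_col, LongType(), False)])
    return spark.createDataFrame(indexed_rdd, schema)

full_data = (with_row_index(left_df)
             .join(with_row_index(right_df), on="row_idx", how="inner")
             .drop("row_idx"))
```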

pyspark : Flattening of records coming from input file

一世执手 Submitted on 2021-02-05 08:10:35
Question: I have an input CSV file like the one below:

```
plant_id, system1_id, system2_id, system3_id
A1        s1-111      s2-111      s3-111
A2        s1-222      s2-222      s3-222
A3        s1-333      s2-333      s3-333
```

I want to flatten the records like this:

```
plant_id  system_id  system_name
A1        s1-111     system1
A1        s2-111     system2
A1        s3-111     system3
A2        s1-222     system1
A2        s2-222     system2
A2        s3-222     system3
A3        s1-333     system1
A3        s2-333     system2
A3        s3-333     system3
```

Currently I am able to achieve it by creating a transposed PySpark dataframe for each system column and then …
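A sketch of doing the unpivot in a single pass with stack(); the sample data from the question is recreated inline so the snippet is self-contained, and the column names match the CSV header above.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("A1", "s1-111", "s2-111", "s3-111"),
     ("A2", "s1-222", "s2-222", "s3-222"),
     ("A3", "s1-333", "s2-333", "s3-333")],
    ["plant_id", "system1_id", "system2_id", "system3_id"])

# stack() emits one (system_name, system_id) row per system column.
flattened = df.selectExpr(
    "plant_id",
    "stack(3, 'system1', system1_id, 'system2', system2_id, 'system3', system3_id) "
    "AS (system_name, system_id)"
).select("plant_id", "system_id", "system_name")
flattened.show()
```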

Load XML string from Column in PySpark

点点圈 Submitted on 2021-02-05 07:20:25
Question: I have a JSON file in which one of the columns is an XML string. I tried extracting this field and writing it to a file in a first step and reading that file in a second step, but each row has an XML header tag, so the resulting file is not a valid XML file. How can I use the PySpark XML parser ('com.databricks.spark.xml') to read this string and parse out the values?

The following doesn't work:

```python
tr = spark.read.json("my-file-path")
trans_xml = sqlContext.read.format('com.databricks.spark.xml')…
```
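One workaround sketch that sidesteps spark-xml entirely: parse the XML column inside a UDF with Python's standard-library ElementTree. The column name `xml_payload` and the `value` tag extracted here are assumptions about the data, not taken from the question.

```python
import xml.etree.ElementTree as ET

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.udf(returnType=StringType())
def extract_value(xml_str):
    if xml_str is None:
        return None
    root = ET.fromstring(xml_str)   # each row holds one complete XML document
    node = root.find(".//value")    # hypothetical tag of interest
    return node.text if node is not None else None

tr = spark.read.json("my-file-path")
parsed = tr.withColumn("value", extract_value(F.col("xml_payload")))
```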

How to import referenced files in ETL scripts?

被刻印的时光 ゝ Submitted on 2021-02-05 07:11:32
Question: I have a script which I'd like to pass a configuration file into. On the Glue jobs page, I see that there is a "Referenced files path" which points to my configuration file. How do I then use that file within my ETL script? I've tried from configuration import *, where the referenced file name is configuration.py, but no luck (ImportError: No module named configuration).

Answer 1: I noticed the same issue. I believe there is already a ticket to address it, but here is what AWS support suggests …
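The answer is cut off above, so the following is only a general PySpark sketch (not necessarily what AWS support recommends): ship the file to the job with SparkContext.addFile and read it back via SparkFiles. The S3 path is a placeholder.

```python
from pyspark import SparkContext, SparkFiles

sc = SparkContext.getOrCreate()
sc.addFile("s3://my-bucket/configuration.py")      # hypothetical location of the referenced file

config_path = SparkFiles.get("configuration.py")   # local path where Spark copied the file
with open(config_path) as fh:
    config_source = fh.read()                      # parse or exec the configuration as needed
```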

Pivot row to column level

只愿长相守 Submitted on 2021-02-04 21:48:22
Question: I have a Spark dataframe t which is the result of a spark.sql("...") query. Here are the first few rows from t:

| yyyy_mm_dd | x_id | x_name      | b_app   | status      | has_policy | count |
|------------|------|-------------|---------|-------------|------------|-------|
| 2020-08-18 | 1    | first_name  | content | no_contact  | 1          | 23    |
| 2020-08-18 | 1    | first_name  | content | no_contact  | 0          | 346   |
| 2020-08-18 | 2    | second_name | content | implemented | 1          | 64    |
| 2020-08-18 | 2    | second_name | …
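The question is truncated above, so this is only a hedged sketch of a typical pivot on this data: turning the two has_policy values into their own count columns per group. Whether that is the exact reshaping being asked for is an assumption.

```python
from pyspark.sql import functions as F

pivoted = (
    t.groupBy("yyyy_mm_dd", "x_id", "x_name", "b_app", "status")
     .pivot("has_policy", [0, 1])   # values taken from the sample rows; listing them avoids an extra pass
     .agg(F.sum("count"))
)
pivoted.show()
```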