pyspark

PySpark, aggregate complex function (difference of consecutive events)

こ雲淡風輕ζ Submitted on 2021-02-05 11:15:48
Question: I have a DataFrame (df) whose columns are userid (the user id) and day (the day). I'm interested in computing, for every user, the average time interval between the days he/she was active. For instance, for a given user the DataFrame may look something like this:

```
userid  day
1       2016-09-18
1       2016-09-20
1       2016-09-25
```

If the DataFrame were a Pandas DataFrame, I could compute the quantity I'm interested in like this:

```python
import numpy as np
np.mean(np.diff(df[df.userid == 1].day))
```

However, this is quite …
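A minimal PySpark sketch of one way to compute this per-user average gap, using a window with lag; it assumes `day` is a date column (or an ISO-formatted string Spark can cast to a date), and the output column name is illustrative.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("userid").orderBy("day")

avg_gaps = (
    df.withColumn("prev_day", F.lag("day").over(w))       # previous active day per user
      .withColumn("gap", F.datediff("day", "prev_day"))   # days between consecutive events
      .groupBy("userid")
      .agg(F.avg("gap").alias("avg_interval_days"))       # mean gap per user
)
avg_gaps.show()
```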

How to count in PySpark? [closed]

你。 Submitted on 2021-02-05 09:46:36
Question: I have a huge list of titles and I want to count how often each title occurs in the whole data set. For example:

```
title
A
b
A
c
c
c
```

Expected output:

```
title  fre
A      2
b      1
c      3
```

Answer 1: You can just groupBy title and then count:

```python
import pyspark.sql.functions as f
df.groupBy('title').agg(f.count('*').alias('fre')).show()
```
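A self-contained sketch of the same groupBy/count approach, with the sample data from the question recreated inline; the `fre` column name simply mirrors the expected output above.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('A',), ('b',), ('A',), ('c',), ('c',), ('c',)], ['title'])

df.groupBy('title').agg(f.count('*').alias('fre')).show()
# Equivalent shorthand: df.groupBy('title').count().show()
```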

Normalize a complex nested JSON file

心不动则不痛 Submitted on 2021-02-05 09:37:41
Question: I'm trying to normalize the JSON file below into 4 tables: "content", "Modules", "Images", and everything else in another table.

```json
{
  "id": "0000050a",
  "revision": 1580225050941,
  "slot": "product-description",
  "type": "E",
  "create_date": 1580225050941,
  "modified_date": 1580225050941,
  "creator": "Auto",
  "modifier": "Auto",
  "audit_info": {
    "date": 1580225050941,
    "source": "AutoService",
    "username": "Auto"
  },
  "total_ID": 1,
  "name": "Auto_A1AM78C64UM0Y8_B07JCJR5HW",
  "content": [{
    "ID": ["B01"],
…
```
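The full schema is truncated above, so this is only a hedged sketch of how the split might start in PySpark: one flat table for the top-level scalars plus the `audit_info` struct, and one table per element of the `content` array. Anything nested below `content` (such as modules or images arrays) is an assumption, and the input path is a placeholder.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
raw = spark.read.json("input.json", multiLine=True)  # hypothetical path, one JSON document per file

# "Everything else" table: top-level scalar fields plus the flattened audit_info struct.
everything_else = raw.select("id", "revision", "slot", "type", "name", "total_ID", "audit_info.*")

# "content" table: one row per element of the content array, keyed by the document id.
content = (raw.select("id", F.explode("content").alias("content"))
              .select("id", "content.*"))
```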

pyspark: groupby and aggregate avg and first on multiple columns

我的梦境 Submitted on 2021-02-05 09:26:33
Question: I have the following sample PySpark dataframe, and after a groupby I want to calculate the mean and the first value of multiple columns. In the real case I have hundreds of columns, so I can't do it individually:

```python
sp = spark.createDataFrame([['a', 2, 4, 'cc', 'anc'],
                            ['a', 4, 7, 'cd', 'abc'],
                            ['b', 6, 0, 'as', 'asd'],
                            ['b', 2, 4, 'ad', 'acb'],
                            ['c', 4, 4, 'sd', 'acc']],
                           ['id', 'col1', 'col2', 'col3', 'col4'])
```

```
+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
|  a|   2|   4|  cc| anc|
|  a|   4|   7|  cd| abc|
|  b|   6|   0|  as| asd|
|  b|   2|   4|  ad| acb|
|  c|   4|   4|  sd| acc|
+---+----+----+----+----+
```
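A sketch of building the aggregation expressions programmatically instead of column by column; applying avg to the numeric columns and first to the string columns is one reading of "mean and first of multiple columns", and the output aliases are illustrative.

```python
from pyspark.sql import functions as F

numeric_cols = [c for c, t in sp.dtypes if t in ('int', 'bigint', 'double') and c != 'id']
string_cols = [c for c, t in sp.dtypes if t == 'string' and c != 'id']

aggs = ([F.avg(c).alias('avg_' + c) for c in numeric_cols]
        + [F.first(c).alias('first_' + c) for c in string_cols])

sp.groupBy('id').agg(*aggs).show()
```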

Spark structured streaming with Kafka leads to only one batch (PySpark)

扶醉桌前 Submitted on 2021-02-05 08:47:26
Question: I have the following code and I'm wondering why it generates only one batch:

```python
df = (spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "IP")
      .option("subscribe", "Topic")
      .option("startingOffsets", "earliest")
      .load())

# groupby on sliding windows

query = (slidingWindowsDF.writeStream.queryName("bla")
         .outputMode("complete").format("memory").start())
```

The application is launched with the following parameters:

```
spark.streaming.backpressure.initialRate 5
spark.streaming.backpressure…
```
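The sliding-window aggregation is omitted in the question, so this is only a hedged sketch of what `slidingWindowsDF` could look like; the `timestamp` column comes from the Kafka source schema, but the window and slide durations are assumptions.

```python
from pyspark.sql import functions as F

slidingWindowsDF = (
    df.selectExpr("CAST(value AS STRING) AS value", "timestamp")
      .groupBy(F.window("timestamp", "10 minutes", "5 minutes"))  # 10-minute windows sliding every 5 minutes
      .count()
)
```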

Unable to write PySpark Dataframe created from two zipped dataframes

有些话、适合烂在心里 Submitted on 2021-02-05 08:32:40
Question: I am trying to follow the example given here for combining two dataframes without a shared join key (combining by "index" as in a database table or pandas dataframe, except that PySpark does not have that concept).

My code:

```python
left_df = left_df.repartition(right_df.rdd.getNumPartitions())  # FWIW, num of partitions = 303
joined_schema = StructType(left_df.schema.fields + right_df.schema.fields)
interim_rdd = left_df.rdd.zip(right_df.rdd).map(lambda x: x[0] + x[1])
full_data = spark.createDataFrame(interim_rdd, joined_schema)
```
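An alternative sketch that avoids rdd.zip()'s requirement that both RDDs have identical partitioning and per-partition row counts: attach an explicit row index to each side with zipWithIndex and join on it. It assumes `spark` is the active session (as in the question) and that left_df and right_df have no overlapping column names.

```python
from pyspark.sql.types import LongType, StructField, StructType

def with_row_index(df, idx_col="row_idx"):
    # zipWithIndex pairs every row with its position; append that position as a column
    indexed_rdd = df.rdd.zipWithIndex().map(lambda pair: tuple(pair[0]) + (pair[1],))
    schema = StructType(list(df.schema.fields) + [StructField(idx_col, LongType(), False)])
    return spark.createDataFrame(indexed_rdd, schema)

full_data = (with_row_index(left_df)
             .join(with_row_index(right_df), on="row_idx", how="inner")
             .drop("row_idx"))
```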

pyspark : Flattening of records coming from input file

一世执手 Submitted on 2021-02-05 08:10:35
Question: I have an input CSV file like the one below:

```
plant_id, system1_id, system2_id, system3_id
A1        s1-111      s2-111      s3-111
A2        s1-222      s2-222      s3-222
A3        s1-333      s2-333      s3-333
```

I want to flatten the records like this:

```
plant_id  system_id  system_name
A1        s1-111     system1
A1        s2-111     system2
A1        s3-111     system3
A2        s1-222     system1
A2        s2-222     system2
A2        s3-222     system3
A3        s1-333     system1
A3        s2-333     system2
A3        s3-333     system3
```

Currently I am able to achieve it by creating a transposed PySpark dataframe for each system column and then …
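A sketch of doing the unpivot in a single pass with stack(); the sample data from the question is recreated inline so the snippet is self-contained, and the column names match the CSV header above.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("A1", "s1-111", "s2-111", "s3-111"),
     ("A2", "s1-222", "s2-222", "s3-222"),
     ("A3", "s1-333", "s2-333", "s3-333")],
    ["plant_id", "system1_id", "system2_id", "system3_id"])

# stack() emits one (system_name, system_id) row per system column.
flattened = df.selectExpr(
    "plant_id",
    "stack(3, 'system1', system1_id, 'system2', system2_id, 'system3', system3_id) "
    "AS (system_name, system_id)"
).select("plant_id", "system_id", "system_name")
flattened.show()
```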

Load XML string from Column in PySpark

点点圈 Submitted on 2021-02-05 07:20:25
Question: I have a JSON file in which one of the columns is an XML string. I tried extracting this field and writing it to a file in a first step and reading that file in a second step, but each row has an XML header tag, so the resulting file is not a valid XML file. How can I use the PySpark XML parser ('com.databricks.spark.xml') to read this string and parse out the values?

The following doesn't work:

```python
tr = spark.read.json("my-file-path")
trans_xml = sqlContext.read.format('com.databricks.spark.xml')…
```
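One workaround sketch that sidesteps spark-xml entirely: parse the XML column inside a UDF with Python's standard-library ElementTree. The column name `xml_payload` and the `value` tag extracted here are assumptions about the data, not taken from the question.

```python
import xml.etree.ElementTree as ET

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.udf(returnType=StringType())
def extract_value(xml_str):
    if xml_str is None:
        return None
    root = ET.fromstring(xml_str)   # each row holds one complete XML document
    node = root.find(".//value")    # hypothetical tag of interest
    return node.text if node is not None else None

tr = spark.read.json("my-file-path")
parsed = tr.withColumn("value", extract_value(F.col("xml_payload")))
```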

How to import referenced files in ETL scripts?

被刻印的时光 ゝ Submitted on 2021-02-05 07:11:32
Question: I have a script which I'd like to pass a configuration file into. On the Glue jobs page, I see that there is a "Referenced files path" which points to my configuration file. How do I then use that file within my ETL script? I've tried from configuration import *, where the referenced file name is configuration.py, but no luck (ImportError: No module named configuration).

Answer 1: I noticed the same issue. I believe there is already a ticket to address it, but here is what AWS support suggests …
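The answer is cut off above, so the following is only a general PySpark sketch (not necessarily what AWS support recommends): ship the file to the job with SparkContext.addFile and read it back via SparkFiles. The S3 path is a placeholder.

```python
from pyspark import SparkContext, SparkFiles

sc = SparkContext.getOrCreate()
sc.addFile("s3://my-bucket/configuration.py")      # hypothetical location of the referenced file

config_path = SparkFiles.get("configuration.py")   # local path where Spark copied the file
with open(config_path) as fh:
    config_source = fh.read()                      # parse or exec the configuration as needed
```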

Pivot row to column level

只愿长相守 Submitted on 2021-02-04 21:48:22
Question: I have a Spark dataframe t which is the result of a spark.sql("...") query. Here are the first few rows from t:

| yyyy_mm_dd | x_id | x_name      | b_app   | status      | has_policy | count |
|------------|------|-------------|---------|-------------|------------|-------|
| 2020-08-18 | 1    | first_name  | content | no_contact  | 1          | 23    |
| 2020-08-18 | 1    | first_name  | content | no_contact  | 0          | 346   |
| 2020-08-18 | 2    | second_name | content | implemented | 1          | 64    |
| 2020-08-18 | 2    | second_name | …
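The question is truncated above, so this is only a hedged sketch of a typical pivot on this data: turning the two has_policy values into their own count columns per group. Whether that is the exact reshaping being asked for is an assumption.

```python
from pyspark.sql import functions as F

pivoted = (
    t.groupBy("yyyy_mm_dd", "x_id", "x_name", "b_app", "status")
     .pivot("has_policy", [0, 1])   # values taken from the sample rows; listing them avoids an extra pass
     .agg(F.sum("count"))
)
pivoted.show()
```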