pyspark

Question about joining dataframes in Spark

柔情痞子 submitted on 2019-12-21 17:28:55

Question: Suppose I have two partitioned dataframes:

    df1 = spark.createDataFrame(
        [(x, x, x) for x in range(5)], ['key1', 'key2', 'time']
    ).repartition(3, 'key1', 'key2')

    df2 = spark.createDataFrame(
        [(x, x, x) for x in range(7)], ['key1', 'key2', 'time']
    ).repartition(3, 'key1', 'key2')

(scenario 1) If I join them on [key1, key2], the join operation is performed within each partition without a shuffle (the number of partitions in the resulting dataframe is the same):

    x = df1.join(df2, on=['key1', 'key2'], how='left')
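
A quick way to confirm whether the co-partitioned join really avoids a shuffle is to inspect the physical plan. A minimal sketch, assuming the df1 and df2 defined above and an active SparkSession:

    # Sketch: check the physical plan for an Exchange (shuffle) step.
    # Assumes df1 and df2 are the repartitioned dataframes from the question.
    joined = df1.join(df2, on=['key1', 'key2'], how='left')

    # If the join reuses the existing partitioning, no extra Exchange should
    # appear above the join in the plan.
    joined.explain()

    # The partition counts can also be compared directly.
    print(df1.rdd.getNumPartitions(), joined.rdd.getNumPartitions())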

pyspark; check if an element is in collect_list [duplicate]

試著忘記壹切 submitted on 2019-12-21 16:55:11

Question: This question already has answers here: How to filter based on array value in PySpark? (2 answers). Closed last year.

I am working on a dataframe df, for instance the following dataframe:

    df.show()

Output:

    +----+------+
    |keys|values|
    +----+------+
    |  aa| apple|
    |  bb|orange|
    |  bb|  desk|
    |  bb|orange|
    |  bb|  desk|
    |  aa|   pen|
    |  bb|pencil|
    |  aa| chair|
    +----+------+

I use collect_set to aggregate and get a set of objects with duplicate elements eliminated (or collect_list to get a list of objects).
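
After the aggregation, membership in the collected array can usually be checked with array_contains. A minimal sketch, assuming the df above with columns keys/values and the target value 'apple' as an example:

    from pyspark.sql import functions as F

    # Sketch: aggregate values per key, then test membership in the array.
    agg = df.groupBy('keys').agg(F.collect_set('values').alias('value_set'))

    # Flag rows whose set contains 'apple', or filter on the condition directly.
    agg = agg.withColumn('has_apple', F.array_contains('value_set', 'apple'))
    agg.filter(F.array_contains('value_set', 'apple')).show()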

How to calculate rolling median in PySpark using Window()?

北城以北 submitted on 2019-12-21 14:07:29

Question: How do I calculate the rolling median of the dollars column over a window of the previous 3 values?

Input data:

    dollars  timestampGMT
    25       2017-03-18 11:27:18
    17       2017-03-18 11:27:19
    13       2017-03-18 11:27:20
    27       2017-03-18 11:27:21
    13       2017-03-18 11:27:22
    43       2017-03-18 11:27:23
    12       2017-03-18 11:27:24

Expected output data:

    dollars  timestampGMT         rolling_median_dollar
    25       2017-03-18 11:27:18  median(25)
    17       2017-03-18 11:27:19  median(17,25)
    13       2017-03-18 11:27:20  median(13,17,25)
    27       2017-03-18 11:27:21  median(27,13,17)
    13
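
One workable approach, offered only as a sketch, is to collect the trailing 3-row window into an array and compute the median in a UDF. The column names match the question; everything else (dataframe name df, use of statistics.median) is an assumption:

    import statistics
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType
    from pyspark.sql.window import Window

    # Rolling median over the current row and the 2 preceding rows
    # (a 3-value window), ordered by timestampGMT.
    w = Window.orderBy('timestampGMT').rowsBetween(-2, 0)

    median_udf = F.udf(lambda xs: float(statistics.median(xs)), DoubleType())

    result = df.withColumn(
        'rolling_median_dollar',
        median_udf(F.collect_list('dollars').over(w))
    )
    result.show()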

Apache Spark ALS - how to perform live recommendations / fold-in an anonymous user

送分小仙女□ submitted on 2019-12-21 13:42:31

Question: I am using Apache Spark's ALS from MLlib (via the PySpark API) to develop a service that performs live recommendations for anonymous users (users not in the training set) on my site. In my use case I train the model on the user ratings in this way:

    from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

    ratings = df.map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))
    rank = 10
    numIterations = 10
    model = ALS.trainImplicit(ratings, rank, numIterations)

Now, each time an
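
A common approximation for folding in a user who is not in the factor model is to build a pseudo user vector from the item factors of the items that user interacted with, then score all items against it. The sketch below assumes the trained model above and a hypothetical list new_user_items of item ids; it illustrates the idea only and is not an MLlib fold-in API:

    import numpy as np

    # Sketch of a fold-in approximation for an anonymous user.
    # new_user_items is a hypothetical list of item ids the user interacted with.
    item_factors = dict(model.productFeatures().collect())  # {item_id: factor vector}

    # Pseudo user vector: mean of the interacted items' factors.
    user_vec = np.mean([np.array(item_factors[i]) for i in new_user_items], axis=0)

    # Score every item by dot product and keep the top 10 unseen items.
    scores = {i: float(np.dot(user_vec, np.array(f))) for i, f in item_factors.items()}
    seen = set(new_user_items)
    top = sorted((i for i in scores if i not in seen), key=scores.get, reverse=True)[:10]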

PySpark: Add a column to DataFrame when column is a list

社会主义新天地 submitted on 2019-12-21 13:00:01

Question: I have read similar questions but couldn't find a solution to my specific problem. I have a list

    l = [1, 2, 3]

and a DataFrame

    df = sc.parallelize([
        ['p1', 'a'],
        ['p2', 'b'],
        ['p3', 'c'],
    ]).toDF(('product', 'name'))

I would like to obtain a new DataFrame where the list l is added as a further column, namely:

    +-------+----+---------+
    |product|name| new_col |
    +-------+----+---------+
    |     p1|   a|       1 |
    |     p2|   b|       2 |
    |     p3|   c|       3 |
    +-------+----+---------+

Approaches with JOIN, where I was joining df
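
One way to attach a positional list like this, sketched under the assumptions that the df and l above exist, that an active SparkSession named spark is available, and that the desired pairing follows the ordering of the product column, is to give both sides an index and join on it:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Sketch: index df by row order, index the list, then join the two.
    w = Window.orderBy('product')
    df_idx = df.withColumn('idx', F.row_number().over(w))

    l_df = spark.createDataFrame(
        [(i + 1, v) for i, v in enumerate(l)], ['idx', 'new_col']
    )

    result = df_idx.join(l_df, on='idx', how='inner').drop('idx')
    result.show()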

How to calculate lag difference in Spark Structured Streaming?

和自甴很熟 submitted on 2019-12-21 12:28:39

Question: I am writing a Spark Structured Streaming program. I need to create an additional column with the lag difference. To reproduce my issue, I provide a code snippet. This code consumes a data.json file stored in the data folder:

    [
      {"id": 77, "type": "person", "timestamp": 1532609003},
      {"id": 77, "type": "person", "timestamp": 1532609005},
      {"id": 78, "type": "crane",  "timestamp": 1532609005}
    ]

Code:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as func
    from pyspark.sql.window import
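
For context, on a plain (batch) DataFrame the lag difference itself is straightforward with a window function; the sketch below shows that batch version for the sample data. The caveat is that non-time-based window functions such as lag are not supported on streaming DataFrames, so a streaming job needs a different approach (for example stateful processing or a stream-stream self-join). File path and app name are assumptions:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as func
    from pyspark.sql.window import Window

    # Sketch: lag difference on a static DataFrame (batch mode only).
    spark = SparkSession.builder.appName('lag-diff-sketch').getOrCreate()
    df = spark.read.option('multiLine', True).json('data/data.json')

    # Previous timestamp per id, ordered by timestamp, then the difference.
    w = Window.partitionBy('id').orderBy('timestamp')
    df = df.withColumn('prev_ts', func.lag('timestamp').over(w)) \
           .withColumn('lag_diff', func.col('timestamp') - func.col('prev_ts'))
    df.show()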

How to load CSV file with records on multiple lines?

白昼怎懂夜的黑 submitted on 2019-12-21 12:03:09

Question: I use Spark 2.3.0. For an Apache Spark project I am using this data set to work on. When trying to read the CSV using Spark, the rows in the Spark dataframe do not correspond to the correct rows in the CSV file (see the sample CSV here). The code looks like the following:

    answer_df = sparkSession.read.csv(
        './stacksample/Answers_sample.csv',
        header=True, inferSchema=True, multiLine=True)
    answer_df.show(2)

Output:

    +--------------------+-------------+--------------------+--------+-----+--------------------+
    |                  Id|  OwnerUserId|
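
A frequent cause with the Stack Overflow dumps is that quoted fields contain embedded newlines and escaped quotes, so besides multiLine the quote/escape characters often need to be set explicitly. A hedged sketch; the option values assume RFC-4180-style quoting ("" inside quoted fields), which is not confirmed in the question:

    # Sketch: read a CSV whose records span multiple lines inside quoted fields.
    answer_df = (sparkSession.read
                 .option('header', True)
                 .option('inferSchema', True)
                 .option('multiLine', True)
                 .option('quote', '"')
                 .option('escape', '"')
                 .csv('./stacksample/Answers_sample.csv'))

    answer_df.show(2, truncate=False)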

Emrfs file sync with s3 not working

不羁的心 submitted on 2019-12-21 07:56:46

Question: After running a Spark job on an Amazon EMR cluster, I deleted the output files directly from S3 and tried to rerun the job. When trying to write in Parquet format to S3 using sqlContext.write, I received the following error:

    'bucket/folder' present in the metadata but not s3
        at com.amazon.ws.emr.hadoop.fs.consistency.ConsistencyCheckerS3FileSystem.getFileStatus(ConsistencyCheckerS3FileSystem.java:455)

I tried running emrfs sync s3://bucket/folder, which did not appear to resolve the
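
For reference, when the EMRFS consistent-view metadata has gone stale like this, the EMRFS CLI on the master node is usually the tool used to reconcile it. A hedged sketch of the commonly suggested commands; the bucket/folder path is a placeholder, and whether delete-then-import is appropriate depends on the cluster setup:

    emrfs diff s3://bucket/folder      # show where metadata and S3 disagree
    emrfs delete s3://bucket/folder    # drop the stale metadata entries
    emrfs import s3://bucket/folder    # re-import what actually exists in S3
    emrfs sync s3://bucket/folder      # or sync to reconcile in one step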

Pyspark dataframe LIKE operator

无人久伴 submitted on 2019-12-21 07:09:36

Question: What is the PySpark equivalent of the LIKE operator? For example, I would like to do:

    SELECT * FROM table WHERE column LIKE "*somestring*";

I am looking for something easy like this (but it is not working):

    df.select('column').where(col('column').like("*s*")).show()

Answer 1: You can use the where and col functions to do the same. where is used to filter the data based on a condition (here, whether a column is like '%string%'). col('col_name') is used to represent the condition, and like is
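
A minimal working sketch of the usual pattern, using SQL-style % wildcards rather than *; the column name and string literal are just examples:

    from pyspark.sql.functions import col

    # like() uses SQL wildcards: % for any sequence of characters, _ for one.
    df.filter(col('column').like('%somestring%')).show()

    # Alternatives: contains() for plain substrings, rlike() for regular expressions.
    df.filter(col('column').contains('somestring')).show()
    df.filter(col('column').rlike('somestring')).show()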

AWS Glue pushdown predicate not working properly

微笑、不失礼 submitted on 2019-12-21 06:26:41

Question: I'm trying to optimize my Glue/PySpark job by using push down predicates.

    start = date(2019, 2, 13)
    end = date(2019, 2, 27)
    print(">>> Generate data frame for ", start, " to ", end, "... ")

    relaventDatesDf = spark.createDataFrame([Row(start=start, stop=end)])
    relaventDatesDf.createOrReplaceTempView("relaventDates")

    relaventDatesDf = spark.sql("SELECT explode(generate_date_series(start, stop)) AS querydatetime FROM relaventDates")
    relaventDatesDf.createOrReplaceTempView("relaventDates")
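
For reference, the usual way to apply a pushdown predicate in Glue is through the push_down_predicate argument when reading from the Data Catalog. A hedged sketch; the database, table, and partition column names are placeholders, and the predicate syntax assumes the table is partitioned by year/month/day:

    # Sketch: limit the partitions Glue reads using a pushdown predicate.
    # glueContext is assumed to be an initialized GlueContext.
    predicate = "(year == '2019' and month == '02' and day >= '13' and day <= '27')"

    flights_dyf = glueContext.create_dynamic_frame.from_catalog(
        database="my_database",        # placeholder
        table_name="my_table",         # placeholder
        push_down_predicate=predicate,
    )

    flights_df = flights_dyf.toDF()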