pyspark

Question about joining dataframes in Spark

柔情痞子 submitted on 2019-12-21 17:28:55

Question: Suppose I have two partitioned dataframes:

    df1 = spark.createDataFrame(
        [(x, x, x) for x in range(5)], ['key1', 'key2', 'time']
    ).repartition(3, 'key1', 'key2')

    df2 = spark.createDataFrame(
        [(x, x, x) for x in range(7)], ['key1', 'key2', 'time']
    ).repartition(3, 'key1', 'key2')

(scenario 1) If I join them on [key1, key2], the join operation is performed within each partition without a shuffle (the number of partitions in the resulting dataframe is the same):

    x = df1.join(df2, on=['key1', 'key2'], how='left')
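
A quick way to confirm whether the co-partitioned join really avoids a shuffle is to inspect the physical plan. A minimal sketch, assuming the df1 and df2 defined above and an active SparkSession:

    # Sketch: check the physical plan for an Exchange (shuffle) step.
    # Assumes df1 and df2 are the repartitioned dataframes from the question.
    joined = df1.join(df2, on=['key1', 'key2'], how='left')

    # If the join reuses the existing partitioning, no extra Exchange should
    # appear above the join in the plan.
    joined.explain()

    # The partition counts can also be compared directly.
    print(df1.rdd.getNumPartitions(), joined.rdd.getNumPartitions())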

pyspark; check if an element is in collect_list [duplicate]

試著忘記壹切 submitted on 2019-12-21 16:55:11

Question: This question already has answers here: How to filter based on array value in PySpark? (2 answers). Closed last year.

I am working on a dataframe df, for instance the following dataframe:

    df.show()

Output:

    +----+------+
    |keys|values|
    +----+------+
    |  aa| apple|
    |  bb|orange|
    |  bb|  desk|
    |  bb|orange|
    |  bb|  desk|
    |  aa|   pen|
    |  bb|pencil|
    |  aa| chair|
    +----+------+

I use collect_set to aggregate and get a set of objects with duplicate elements eliminated (or collect_list to get a list of objects).
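
After the aggregation, membership in the collected array can usually be checked with array_contains. A minimal sketch, assuming the df above with columns keys/values and the target value 'apple' as an example:

    from pyspark.sql import functions as F

    # Sketch: aggregate values per key, then test membership in the array.
    agg = df.groupBy('keys').agg(F.collect_set('values').alias('value_set'))

    # Flag rows whose set contains 'apple', or filter on the condition directly.
    agg = agg.withColumn('has_apple', F.array_contains('value_set', 'apple'))
    agg.filter(F.array_contains('value_set', 'apple')).show()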

How to calculate rolling median in PySpark using Window()?

北城以北 submitted on 2019-12-21 14:07:29

Question: How do I calculate the rolling median of the dollars column over a window of the previous 3 values?

Input data:

    dollars  timestampGMT
    25       2017-03-18 11:27:18
    17       2017-03-18 11:27:19
    13       2017-03-18 11:27:20
    27       2017-03-18 11:27:21
    13       2017-03-18 11:27:22
    43       2017-03-18 11:27:23
    12       2017-03-18 11:27:24

Expected output data:

    dollars  timestampGMT         rolling_median_dollar
    25       2017-03-18 11:27:18  median(25)
    17       2017-03-18 11:27:19  median(17,25)
    13       2017-03-18 11:27:20  median(13,17,25)
    27       2017-03-18 11:27:21  median(27,13,17)
    13
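
One workable approach, offered only as a sketch, is to collect the trailing 3-row window into an array and compute the median in a UDF. The column names match the question; everything else (dataframe name df, use of statistics.median) is an assumption:

    import statistics
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType
    from pyspark.sql.window import Window

    # Rolling median over the current row and the 2 preceding rows
    # (a 3-value window), ordered by timestampGMT.
    w = Window.orderBy('timestampGMT').rowsBetween(-2, 0)

    median_udf = F.udf(lambda xs: float(statistics.median(xs)), DoubleType())

    result = df.withColumn(
        'rolling_median_dollar',
        median_udf(F.collect_list('dollars').over(w))
    )
    result.show()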

Apache Spark ALS - how to perform live recommendations / fold-in an anonymous user

送分小仙女□ submitted on 2019-12-21 13:42:31

Question: I am using Apache Spark's ALS from MLlib (via the PySpark API) to develop a service that performs live recommendations for anonymous users (users not in the training set) on my site. In my use case I train the model on the user ratings in this way:

    from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

    ratings = df.map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))
    rank = 10
    numIterations = 10
    model = ALS.trainImplicit(ratings, rank, numIterations)

Now, each time an
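
A common approximation for folding in a user who is not in the factor model is to build a pseudo user vector from the item factors of the items that user interacted with, then score all items against it. The sketch below assumes the trained model above and a hypothetical list new_user_items of item ids; it illustrates the idea only and is not an MLlib fold-in API:

    import numpy as np

    # Sketch of a fold-in approximation for an anonymous user.
    # new_user_items is a hypothetical list of item ids the user interacted with.
    item_factors = dict(model.productFeatures().collect())  # {item_id: factor vector}

    # Pseudo user vector: mean of the interacted items' factors.
    user_vec = np.mean([np.array(item_factors[i]) for i in new_user_items], axis=0)

    # Score every item by dot product and keep the top 10 unseen items.
    scores = {i: float(np.dot(user_vec, np.array(f))) for i, f in item_factors.items()}
    seen = set(new_user_items)
    top = sorted((i for i in scores if i not in seen), key=scores.get, reverse=True)[:10]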

PySpark: Add a column to DataFrame when column is a list

社会主义新天地 submitted on 2019-12-21 13:00:01

Question: I have read similar questions but couldn't find a solution to my specific problem. I have a list

    l = [1, 2, 3]

and a DataFrame

    df = sc.parallelize([
        ['p1', 'a'],
        ['p2', 'b'],
        ['p3', 'c'],
    ]).toDF(('product', 'name'))

I would like to obtain a new DataFrame where the list l is added as a further column, namely:

    +-------+----+---------+
    |product|name| new_col |
    +-------+----+---------+
    |     p1|   a|       1 |
    |     p2|   b|       2 |
    |     p3|   c|       3 |
    +-------+----+---------+

Approaches with JOIN, where I was joining df
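
One way to attach a positional list like this, sketched under the assumptions that the df and l above exist, that an active SparkSession named spark is available, and that the desired pairing follows the ordering of the product column, is to give both sides an index and join on it:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Sketch: index df by row order, index the list, then join the two.
    w = Window.orderBy('product')
    df_idx = df.withColumn('idx', F.row_number().over(w))

    l_df = spark.createDataFrame(
        [(i + 1, v) for i, v in enumerate(l)], ['idx', 'new_col']
    )

    result = df_idx.join(l_df, on='idx', how='inner').drop('idx')
    result.show()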

How to calculate lag difference in Spark Structured Streaming?

和自甴很熟 submitted on 2019-12-21 12:28:39

Question: I am writing a Spark Structured Streaming program. I need to create an additional column with the lag difference. To reproduce my issue, I provide a code snippet. This code consumes a data.json file stored in the data folder:

    [
      {"id": 77, "type": "person", "timestamp": 1532609003},
      {"id": 77, "type": "person", "timestamp": 1532609005},
      {"id": 78, "type": "crane",  "timestamp": 1532609005}
    ]

Code:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as func
    from pyspark.sql.window import
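
For context, on a plain (batch) DataFrame the lag difference itself is straightforward with a window function; the sketch below shows that batch version for the sample data. The caveat is that non-time-based window functions such as lag are not supported on streaming DataFrames, so a streaming job needs a different approach (for example stateful processing or a stream-stream self-join). File path and app name are assumptions:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as func
    from pyspark.sql.window import Window

    # Sketch: lag difference on a static DataFrame (batch mode only).
    spark = SparkSession.builder.appName('lag-diff-sketch').getOrCreate()
    df = spark.read.option('multiLine', True).json('data/data.json')

    # Previous timestamp per id, ordered by timestamp, then the difference.
    w = Window.partitionBy('id').orderBy('timestamp')
    df = df.withColumn('prev_ts', func.lag('timestamp').over(w)) \
           .withColumn('lag_diff', func.col('timestamp') - func.col('prev_ts'))
    df.show()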

How to load CSV file with records on multiple lines?

白昼怎懂夜的黑 submitted on 2019-12-21 12:03:09

Question: I use Spark 2.3.0. For an Apache Spark project I am using this data set to work on. When trying to read the CSV using Spark, the rows in the Spark dataframe do not correspond to the correct rows in the CSV file (see the sample CSV here). The code looks like the following:

    answer_df = sparkSession.read.csv(
        './stacksample/Answers_sample.csv',
        header=True, inferSchema=True, multiLine=True)
    answer_df.show(2)

Output:

    +--------------------+-------------+--------------------+--------+-----+--------------------+
    |                  Id|  OwnerUserId|
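
A frequent cause with the Stack Overflow dumps is that quoted fields contain embedded newlines and escaped quotes, so besides multiLine the quote/escape characters often need to be set explicitly. A hedged sketch; the option values assume RFC-4180-style quoting ("" inside quoted fields), which is not confirmed in the question:

    # Sketch: read a CSV whose records span multiple lines inside quoted fields.
    answer_df = (sparkSession.read
                 .option('header', True)
                 .option('inferSchema', True)
                 .option('multiLine', True)
                 .option('quote', '"')
                 .option('escape', '"')
                 .csv('./stacksample/Answers_sample.csv'))

    answer_df.show(2, truncate=False)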

Emrfs file sync with s3 not working

不羁的心 submitted on 2019-12-21 07:56:46

Question: After running a Spark job on an Amazon EMR cluster, I deleted the output files directly from S3 and tried to rerun the job. When trying to write in Parquet format to S3 using sqlContext.write, I received the following error:

    'bucket/folder' present in the metadata but not s3
        at com.amazon.ws.emr.hadoop.fs.consistency.ConsistencyCheckerS3FileSystem.getFileStatus(ConsistencyCheckerS3FileSystem.java:455)

I tried running emrfs sync s3://bucket/folder, which did not appear to resolve the
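
For reference, when the EMRFS consistent-view metadata has gone stale like this, the EMRFS CLI on the master node is usually the tool used to reconcile it. A hedged sketch of the commonly suggested commands; the bucket/folder path is a placeholder, and whether delete-then-import is appropriate depends on the cluster setup:

    emrfs diff s3://bucket/folder      # show where metadata and S3 disagree
    emrfs delete s3://bucket/folder    # drop the stale metadata entries
    emrfs import s3://bucket/folder    # re-import what actually exists in S3
    emrfs sync s3://bucket/folder      # or sync to reconcile in one step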

Pyspark dataframe LIKE operator

无人久伴 submitted on 2019-12-21 07:09:36

Question: What is the PySpark equivalent of the LIKE operator? For example, I would like to do:

    SELECT * FROM table WHERE column LIKE "*somestring*";

I am looking for something easy like this (but it is not working):

    df.select('column').where(col('column').like("*s*")).show()

Answer 1: You can use the where and col functions to do the same. where is used to filter the data based on a condition (here, whether a column is like '%string%'). col('col_name') is used to represent the condition, and like is
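
A minimal working sketch of the usual pattern, using SQL-style % wildcards rather than *; the column name and string literal are just examples:

    from pyspark.sql.functions import col

    # like() uses SQL wildcards: % for any sequence of characters, _ for one.
    df.filter(col('column').like('%somestring%')).show()

    # Alternatives: contains() for plain substrings, rlike() for regular expressions.
    df.filter(col('column').contains('somestring')).show()
    df.filter(col('column').rlike('somestring')).show()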

AWS Glue pushdown predicate not working properly

微笑、不失礼 submitted on 2019-12-21 06:26:41

Question: I'm trying to optimize my Glue/PySpark job by using push down predicates.

    start = date(2019, 2, 13)
    end = date(2019, 2, 27)
    print(">>> Generate data frame for ", start, " to ", end, "... ")

    relaventDatesDf = spark.createDataFrame([Row(start=start, stop=end)])
    relaventDatesDf.createOrReplaceTempView("relaventDates")

    relaventDatesDf = spark.sql("SELECT explode(generate_date_series(start, stop)) AS querydatetime FROM relaventDates")
    relaventDatesDf.createOrReplaceTempView("relaventDates")
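
For reference, the usual way to apply a pushdown predicate in Glue is through the push_down_predicate argument when reading from the Data Catalog. A hedged sketch; the database, table, and partition column names are placeholders, and the predicate syntax assumes the table is partitioned by year/month/day:

    # Sketch: limit the partitions Glue reads using a pushdown predicate.
    # glueContext is assumed to be an initialized GlueContext.
    predicate = "(year == '2019' and month == '02' and day >= '13' and day <= '27')"

    flights_dyf = glueContext.create_dynamic_frame.from_catalog(
        database="my_database",        # placeholder
        table_name="my_table",         # placeholder
        push_down_predicate=predicate,
    )

    flights_df = flights_dyf.toDF()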