pyspark

“resolved attribute(s) missing” when performing join on pySpark

Submitted by 假装没事ソ on 2020-01-01 02:10:09
Question: I have the following two PySpark DataFrames:

> df_lag_pre.columns
['date','sku','name','country','ccy_code','quantity','usd_price','usd_lag','lag_quantity']
> df_unmatched.columns
['alt_sku', 'alt_lag_quantity', 'country', 'ccy_code', 'name', 'usd_price']

Now I want to join them on their common columns, so I try the following:

> df_lag_pre.join(df_unmatched, on=['name','country','ccy_code','usd_price'])

And I get the following error message:

AnalysisException: u'resolved attribute(s) price#3424
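
The excerpt is cut off here, but a common workaround for this class of error (not necessarily the poster's exact fix) is to make every column reference unambiguous before the join: rename the join keys on one side, join on explicit conditions, then drop the duplicates. A minimal sketch, assuming df_lag_pre and df_unmatched exist as shown above:

# Sketch of one common workaround for "resolved attribute(s) ... missing":
# rename the join keys on one side so every attribute reference is unambiguous.
keys = ['name', 'country', 'ccy_code', 'usd_price']

right = df_unmatched
for c in keys:
    right = right.withColumnRenamed(c, c + '_r')

cond = [df_lag_pre[c] == right[c + '_r'] for c in keys]
joined = df_lag_pre.join(right, on=cond, how='inner')

for c in keys:
    joined = joined.drop(c + '_r')   # drop the renamed duplicate keys

joined.show()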

How to use lag and rangeBetween functions on timestamp values?

Submitted by 孤者浪人 on 2019-12-31 23:08:32
Question: I have data that looks like this:

userid,eventtime,location_point
4e191908,2017-06-04 03:00:00,18685891
4e191908,2017-06-04 03:04:00,18685891
3136afcb,2017-06-04 03:03:00,18382821
661212dd,2017-06-04 03:06:00,80831484
40e8a7c3,2017-06-04 03:12:00,18825769

I would like to add a new boolean column that is true if there are 2 or more userid values within a 5-minute window at the same location_point. I had the idea of using the lag function to look up over a window partitioned by the userid and with the
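
The question breaks off here, but a sketch of one way to express this kind of check, using a range-based window over the event time in seconds rather than lag, might look as follows. The DataFrame construction is only illustrative and the output column name "crowded" is made up:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Illustrative DataFrame built from the sample rows above.
df = spark.createDataFrame(
    [("4e191908", "2017-06-04 03:00:00", 18685891),
     ("4e191908", "2017-06-04 03:04:00", 18685891),
     ("3136afcb", "2017-06-04 03:03:00", 18382821),
     ("661212dd", "2017-06-04 03:06:00", 80831484),
     ("40e8a7c3", "2017-06-04 03:12:00", 18825769)],
    ["userid", "eventtime", "location_point"])

# rangeBetween needs a numeric ordering column, so order by the timestamp in
# seconds and look back 5 minutes (300 seconds) from each row.
w = (Window.partitionBy("location_point")
           .orderBy(F.unix_timestamp("eventtime"))
           .rangeBetween(-300, 0))

flagged = df.withColumn(
    "crowded",
    F.size(F.collect_set("userid").over(w)) >= 2)

flagged.show()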

How to conditionally replace value in a column based on evaluation of expression based on another column in Pyspark?

Submitted by こ雲淡風輕ζ on 2019-12-31 10:18:21
Question:

import numpy as np
df = spark.createDataFrame(
    [(1, 1, None), (1, 2, float(5)), (1, 3, np.nan), (1, 4, None),
     (0, 5, float(10)), (1, 6, float('nan')), (0, 6, float('nan'))],
    ('session', "timestamp1", "id2"))

+-------+----------+----+
|session|timestamp1| id2|
+-------+----------+----+
|      1|         1|null|
|      1|         2| 5.0|
|      1|         3| NaN|
|      1|         4|null|
|      0|         5|10.0|
|      1|         6| NaN|
|      0|         6| NaN|
+-------+----------+----+

How do I replace the value of the timestamp1 column with 999 when session == 0? Expected output:
+---
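
The expected output is truncated above, but a minimal sketch of the usual approach with when/otherwise (the column names and the literal 999 come straight from the question) would be:

from pyspark.sql import functions as F

# Replace timestamp1 with 999 on rows where session == 0; leave other rows unchanged.
df_out = df.withColumn(
    "timestamp1",
    F.when(F.col("session") == 0, F.lit(999)).otherwise(F.col("timestamp1")))

df_out.show()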

Save a large Spark Dataframe as a single json file in S3

Submitted by 蹲街弑〆低调 on 2019-12-31 09:33:09
Question: I'm trying to save a Spark DataFrame (of more than 20 GB) to a single JSON file in Amazon S3. My code to save the dataframe is like this:

dataframe.repartition(1).save("s3n://mybucket/testfile","json")

But I'm getting an error from S3: "Your proposed upload exceeds the maximum allowed size". I know that the maximum file size allowed by Amazon is 5 GB. Is it possible to use S3 multipart upload with Spark? Or is there another way to solve this? By the way, I need the data in a single file because another
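
The question is cut off here, but as a sketch: writing through the s3a:// connector (which performs multipart uploads) with a single output partition is one commonly suggested route. The bucket and path are the placeholders from the question, and note that Spark still produces a directory containing one part file rather than a single named object:

# Sketch: one output partition written via s3a, which uses multipart uploads and
# is not subject to the 5 GB single-PUT limit hit with s3n. Paths are placeholders.
(dataframe
    .coalesce(1)
    .write
    .mode("overwrite")
    .json("s3a://mybucket/testfile"))

Funnelling 20 GB through a single task is slow and memory-hungry, so writing many parts and concatenating them afterwards (or relaxing the single-file requirement) is often the more practical choice.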

How to create an empty DataFrame? Why “ValueError: RDD is empty”?

Submitted by 一个人想着一个人 on 2019-12-31 08:59:09
Question: I am trying to create an empty DataFrame in Spark (PySpark). I am using a similar approach to the one discussed here, but it is not working. This is my code:

df = sqlContext.createDataFrame(sc.emptyRDD(), schema)

This is the error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/context.py", line 404, in createDataFrame
    rdd, schema = self._createFromRDD(data, schema,
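
The traceback is truncated above, but one thing worth noting: createDataFrame only samples the RDD when it has to infer the schema, so passing an explicit StructType (rather than, say, a list of column names) avoids touching the empty RDD at all. A sketch with a made-up one-column schema:

from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical schema; with an explicit StructType nothing needs to be inferred
# from the (empty) RDD.
schema = StructType([StructField("value", StringType(), True)])

empty_df = sqlContext.createDataFrame(sc.emptyRDD(), schema)
empty_df.printSchema()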

OutOfMemoryError when using PySpark to read files in local mode

Submitted by 狂风中的少年 on 2019-12-31 06:41:48
Question: I have about a dozen gpg-encrypted files containing data I'd like to analyze using PySpark. My strategy is to apply a decryption function as a flat map to each file and then proceed with processing at the record level:

def read_fun_generator(filename):
    with gpg_open(filename[0].split(':')[-1], 'r') as f:
        for line in f:
            yield line.strip()

gpg_files = sc.wholeTextFiles("/path/to/files/*.gpg")
rdd_from_gpg = gpg_files.flatMap(read_fun_generator).map(lambda x: x.split('|'))
rdd_from_gpg.count()  # <--
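
The question breaks off at the count() call, but wholeTextFiles materialises each entire file as a single value, which is a frequent cause of OutOfMemoryError in local mode. A sketch of an alternative, assuming the files are reachable as local paths and reusing the gpg_open helper from the question, is to parallelize the file names and let each task stream its file through the decrypting generator:

import glob

# Placeholder pattern; gpg_open is the helper defined in the question.
paths = glob.glob("/path/to/files/*.gpg")

def read_lines(path):
    with gpg_open(path, 'r') as f:
        for line in f:
            yield line.strip()

rdd = (sc.parallelize(paths, len(paths))   # one partition per file
         .flatMap(read_lines)
         .map(lambda x: x.split('|')))
rdd.count()

Raising spark.driver.memory also helps in local mode, since driver and executors share one JVM there.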

TypeError: object of type 'map' has no len() Python3

Submitted by 廉价感情. on 2019-12-31 05:33:27
Question: I'm trying to implement the KMeans algorithm using PySpark. It gives me the above error in the last line of the while loop. It works fine outside the loop, but after I created the loop it gave me this error. How do I fix this?

# Find K Means of Loudacre device status locations
#
# Input data: file(s) with device status data (delimited by '|')
# including latitude (13th field) and longitude (14th field) of device locations
# (lat,lon of 0,0 indicates unknown location)
# NOTE: Copy to pyspark using
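
The loop itself is cut off above, but the error message points at a Python 3 change: map() now returns a lazy iterator, which has no len(). Wherever the loop takes len() of (or indexes into) a map result, wrapping it in list() restores the Python 2 behaviour. The sample string below is only illustrative:

# Python 3: map() is lazy and has no len().
fields = map(float, "0.0|1.5|2.5".split('|'))
# len(fields)               # TypeError: object of type 'map' has no len()

fields = list(map(float, "0.0|1.5|2.5".split('|')))
len(fields)                 # 3 -- works once materialised as a list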

passing value of RDD to another RDD as variable - Spark #Pyspark [duplicate]

Submitted by ≡放荡痞女 on 2019-12-31 05:18:05
Question: This question already has answers here: How to get a value from the Row object in Spark Dataframe? (3 answers). Closed last year.

I am currently exploring how to call big HQL files (containing a 100-line insert-into-select statement) via sqlContext. Another thing is that the HQL files are parameterized, so while calling them from sqlContext I want to pass the parameters as well. I have gone through loads of blogs and posts but have not found any answers to this. Another thing I was trying, to store
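
The question is truncated, but the linked duplicate points at the usual pattern: collect the scalar onto the driver (first()/collect() plus ordinary Row indexing) and substitute it into the HQL text before executing it. The table, column, file path and {load_date} placeholder below are all made up for illustration:

# Pull the scalar out of the Row on the driver.
row = sqlContext.sql("SELECT max(load_date) AS load_date FROM ctrl_table").first()
load_date = row["load_date"]          # equivalently row.load_date or row[0]

# Read the parameterised HQL file and fill in the value
# (assumes {load_date} markers in the file).
with open("/path/to/big_insert.hql") as f:
    hql_text = f.read().format(load_date=load_date)

# sqlContext.sql runs one statement at a time, so split on ';'.
for stmt in (s.strip() for s in hql_text.split(";")):
    if stmt:
        sqlContext.sql(stmt)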

PySpark: read, map and reduce from multiline record textfile with newAPIHadoopFile

Submitted by 情到浓时终转凉″ on 2019-12-31 04:45:09
Question: I'm trying to solve a problem that is similar to this post. My original data is a text file that contains the values (observations) of several sensors. Each observation is given with a timestamp, but the sensor name is given only once rather than on each line, and there are several sensors in one file:

Time MHist::852-YF-007
2016-05-10 00:00:00 0
2016-05-09 23:59:00 0
2016-05-09 23:58:00 0
2016-05-09 23:57:00 0
2016-05-09 23:56:00 0
2016-05-09 23:55:00 0
2016-05-09 23:54:00 0
2016-05-09 23:53
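
The sample is cut off above, but as a sketch of the newAPIHadoopFile approach named in the title: setting textinputformat.record.delimiter to the sensor-header marker splits the file into one multi-line block per sensor, which can then be parsed block by block. The delimiter string, path and parsing details below are assumptions about the exact file layout:

# Assumed delimiter: each sensor block starts with "MHist::<sensor-name>".
conf = {"textinputformat.record.delimiter": "MHist::"}

blocks = sc.newAPIHadoopFile(
    "/path/to/sensor_file.txt",                                   # placeholder path
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf=conf)

def parse_block(text):
    lines = text.strip().splitlines()
    if len(lines) < 2:
        return []                                 # header fragment without data rows
    sensor = lines[0].strip()                     # e.g. "852-YF-007"
    records = []
    for line in lines[1:]:
        parts = line.split()
        if len(parts) >= 3:                       # date, time, value
            records.append((sensor, parts[0] + " " + parts[1], float(parts[2])))
    return records

records = blocks.map(lambda kv: kv[1]).flatMap(parse_block)
records.take(5)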

Spark 2.1.1: How to predict topics in unseen documents on already trained LDA model in Spark 2.1.1?

Submitted by 谁说胖子不能爱 on 2019-12-31 04:39:08
Question: I am training an LDA model in PySpark (Spark 2.1.1) on a dataset of customer reviews. Now, based on that model, I want to predict the topics in new unseen text. I am using the following code to build the model:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext, Row
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, CountVectorizer, StopWordsRemover
from pyspark.mllib.clustering
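
The code is cut off at the imports, but since the question is about scoring unseen documents, it is worth noting that the DataFrame-based LDA in pyspark.ml exposes transform(), which can be applied to new data after fitting. A sketch with made-up column names, hyperparameters and input DataFrames (train_df, new_docs_df):

from pyspark.ml import Pipeline
from pyspark.ml.clustering import LDA
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer

# train_df / new_docs_df are assumed DataFrames with a string column "review".
tokenizer = Tokenizer(inputCol="review", outputCol="words")
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
vectorizer = CountVectorizer(inputCol="filtered", outputCol="features")
lda = LDA(k=10, maxIter=50, featuresCol="features")

pipeline = Pipeline(stages=[tokenizer, remover, vectorizer, lda])
model = pipeline.fit(train_df)

# topicDistribution holds the per-topic weights for each (unseen) document.
topics = model.transform(new_docs_df).select("topicDistribution")
topics.show(truncate=False)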