pyspark

PySpark Row objects: accessing row elements by variable names

末鹿安然 submitted on 2021-02-16 05:16:46
Question: One can access PySpark Row elements using dot notation: given r = Row(name="Alice", age=11), one can get the name or the age using r.name or r.age respectively. What happens when one needs to get an element whose name is stored in a variable element? One option is to do r.asDict()[element]. However, consider a situation where we have a large DataFrame and we wish to map a function on each row of that data frame. We can certainly do something like def f(row, element1, element2): row = …
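A minimal sketch of the kind of access the question is after; r and element are the names used above, and no SparkSession is needed because Row is a plain Python object. Row objects also support dictionary-style indexing directly, so the asDict() round trip can be skipped:

from pyspark.sql import Row

r = Row(name="Alice", age=11)
element = "name"

# Rows support lookup by field name, so no conversion is needed
print(r[element])            # 'Alice'

# Equivalent alternatives
print(r.asDict()[element])   # convert to a dict once, then index
print(getattr(r, element))   # attribute lookup by string name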

Joining two spark dataframes on time (TimestampType) in python

烂漫一生 submitted on 2021-02-16 03:30:32
Question: I have two dataframes and I would like to join them based on one column, with the caveat that this column is a timestamp and the timestamps have to be within a certain offset (5 seconds) of each other in order to join the records. More specifically, a record in dates_df with date=1/3/2015:00:00:00 should be joined with events_df with time=1/3/2015:00:00:01 because both timestamps are within 5 seconds of each other. I'm trying to get this logic working with PySpark, and it is extremely painful. How do …
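One common way to express the ±5-second condition is a join on an interval predicate. A sketch, assuming dates_df.date and events_df.time are already TimestampType as in the question (note that a non-equi join like this cannot use an equi-join strategy and can be expensive on large data):

from pyspark.sql import functions as F

# Keep pairs whose timestamps differ by at most 5 seconds.
# Casting a timestamp to long yields seconds since the epoch.
joined = dates_df.join(
    events_df,
    F.abs(dates_df["date"].cast("long") - events_df["time"].cast("long")) <= 5,
)
joined.show()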

PySpark explode list into multiple columns based on name

若如初见. submitted on 2021-02-15 12:01:02
Question: Hi, I'm dealing with a slightly difficult file format which I'm trying to clean for some future processing. I've been using PySpark to process the data into a dataframe. The file looks similar to this:

AA 1234 ZXYW
BB A 890
CC B 321
AA 1234 LMNO
BB D 123
CC E 321
AA 1234 ZXYW
CC E 456

Each 'AA' record defines the start of a logical group of records, and the data on each line is fixed length and has information encoded in it that I want to extract. There are at least 20-30 different record …
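A sketch of one common first step for this kind of fixed-width, multi-record-type file, assuming an active SparkSession named spark; the file path and field positions are illustrative, not taken from the question:

from pyspark.sql import functions as F

# Read the raw lines, tag each with its record type (first two characters),
# then slice out fixed-width fields with substring (positions are 1-based).
raw = spark.read.text("records.txt")  # hypothetical path
parsed = (
    raw.withColumn("rec_type", F.substring("value", 1, 2))
       .withColumn("field_1", F.substring("value", 4, 4))
       .withColumn("field_2", F.substring("value", 9, 4))
)
parsed.show(truncate=False)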

PySpark Sql with column name containing dash/hyphen in it

时间秒杀一切 submitted on 2021-02-11 17:37:33
Question: I have a PySpark dataframe df:

data = {'Passenger-Id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}, 'Age': {0: 22, 1: 38, 2: 26, 3: 35, 4: 35}}
df_pd = pd.DataFrame(data, columns=data.keys())
df = spark.createDataFrame(df_pd)

+------------+---+
|Passenger-Id|Age|
+------------+---+
|           1| 22|
|           2| 38|
|           3| 26|
|           4| 35|
|           5| 35|
+------------+---+

This works:

df.filter(df.Age == 22).show()

But the following doesn't, due to the - in the column name:

df.filter(df.Passenger-Id == 2).show()
AttributeError: 'DataFrame' object …
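The standard workaround is to avoid attribute access for column names that are not valid Python identifiers; a short sketch using the df from the question:

from pyspark.sql import functions as F

# Bracket or col() lookup works where df.Passenger-Id cannot
df.filter(df["Passenger-Id"] == 2).show()
df.filter(F.col("Passenger-Id") == 2).show()

# In SQL expression strings, wrap the name in backticks
df.filter("`Passenger-Id` = 2").show()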

How can I make pyspark and SparkSQL execute Hive on Spark?

独自空忆成欢 submitted on 2021-02-11 16:59:59
Question: I've installed and set up Spark on YARN, and integrated Spark with Hive tables. Using spark-shell / pyspark, I followed the simple tutorial and managed to create a Hive table, load data, and then select it properly. Then I moved on to the next step, setting up Hive on Spark. Using hive / beeline, I also managed to create a Hive table, load data, and then select it properly, and Hive executes on YARN/Spark correctly. How do I know it works? The hive shell displays the following:

hive> select …
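For the PySpark side of this setup, the piece that is usually needed is a Hive-enabled SparkSession so that Spark SQL reads the same metastore tables that hive/beeline uses. A hedged sketch; the table name is a placeholder, and this shows Spark querying Hive tables rather than configuring the Hive-on-Spark execution engine itself:

from pyspark.sql import SparkSession

# Build a session with Hive support so spark.sql() sees the Hive metastore
spark = (
    SparkSession.builder
    .appName("hive-from-pyspark")
    .enableHiveSupport()
    .getOrCreate()
)
spark.sql("SELECT * FROM my_hive_table LIMIT 10").show()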

Converting dataframe to dictionary in pyspark without using pandas

大城市里の小女人 submitted on 2021-02-11 16:55:20
Question: Following up on this question and dataframes, I am trying to convert a dataframe into a dictionary. In pandas I was using this: dictionary = df_2.unstack().to_dict(orient='index'). However, I need to convert this code to pyspark. Can anyone help me with this? As I understand from previous questions such as this, I would indeed need to use pandas, but the dataframe is way too big for me to be able to do that. How can I solve this? EDIT: I have now tried the following approach: dictionary_list = map …
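A sketch of one pandas-free route, assuming the frame fits on the driver once collected; key_col is a placeholder for whichever column should become the dictionary key:

# Collect the rows to the driver and build the dictionary there
dictionary = {row["key_col"]: row.asDict() for row in df_2.collect()}

# For a simple two-column key/value frame this also works:
# dictionary = df_2.rdd.map(lambda r: (r[0], r[1])).collectAsMap()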

Reading from S3 in EMR

萝らか妹 submitted on 2021-02-11 15:23:17
Question: I'm having trouble reading CSV files stored in my bucket on AWS S3 from EMR. I have read quite a few posts about it and have done the following to make it work: added an IAM policy allowing read & write access to S3, and tried to pass the URIs in the Arguments section of the spark-submit request. I thought querying S3 from EMR on a common account was straightforward (because it works locally after defining a fileSystem and providing AWS credentials), but when I run: df = spark.read.option( …
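For reference, a minimal sketch of the read itself, assuming an active SparkSession named spark; the bucket and key are placeholders. On EMR the s3:// scheme (EMRFS) is normally used, and credentials come from the cluster's instance profile rather than keys embedded in code:

df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3://my-bucket/path/to/file.csv")  # hypothetical bucket/key
)
df.show(5)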

How to UnPivot COLUMNS into ROWS in an AWS Glue / PySpark script

心已入冬 submitted on 2021-02-11 15:05:00
Question: I have a large nested JSON document for each year (say 2018, 2017), which has data aggregated by month (Jan-Dec) and by day (1-31):

{
  "2018": {
    "Jan": {
      "1": { "u": 1, "n": 2 },
      "2": { "u": 4, "n": 7 }
    },
    "Feb": {
      "1": { "u": 3, "n": 2 },
      "4": { "u": 4, "n": 5 }
    }
  }
}

I have used the AWS Glue Relationalize.apply function to convert the above hierarchical data into a flat structure:

dfc = Relationalize.apply(frame = datasource0, staging_path = my_temp_bucket, name = my_ref_relationalize_table, …
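A hedged sketch of the usual unpivot pattern in Spark SQL, shown on a plain DataFrame rather than the Glue DynamicFrame from the question (which would first be converted with .toDF()); the flattened column names are illustrative, not the actual Relationalize output:

# Assumes an active SparkSession named spark
df = spark.createDataFrame(
    [(2018, 1, 2, 3, 2)],
    ["year", "Jan_1_u", "Jan_1_n", "Feb_1_u", "Feb_1_n"],
)

# stack(n, label1, col1, label2, col2, ...) turns columns into (key, value) rows
unpivoted = df.selectExpr(
    "year",
    "stack(4, 'Jan_1_u', Jan_1_u, 'Jan_1_n', Jan_1_n, "
    "'Feb_1_u', Feb_1_u, 'Feb_1_n', Feb_1_n) as (metric, value)",
)
unpivoted.show()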