pyspark

PySpark Row objects: accessing row elements by variable names

末鹿安然 submitted on 2021-02-16 05:16:46
Question: One can access PySpark Row elements using dot notation: given r = Row(name="Alice", age=11), one can get the name or the age using r.name or r.age respectively. What happens when one needs to get an element whose name is stored in a variable element? One option is to do r.asDict()[element]. However, consider a situation where we have a large DataFrame and we wish to map a function on each row of that data frame. We can certainly do something like def f(row, element1, element2): row = …
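A minimal sketch of the kind of access the question is after; r and element are the names used above, and no SparkSession is needed because Row is a plain Python object. Row objects also support dictionary-style indexing directly, so the asDict() round trip can be skipped:

from pyspark.sql import Row

r = Row(name="Alice", age=11)
element = "name"

# Rows support lookup by field name, so no conversion is needed
print(r[element])            # 'Alice'

# Equivalent alternatives
print(r.asDict()[element])   # convert to a dict once, then index
print(getattr(r, element))   # attribute lookup by string name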

Joining two spark dataframes on time (TimestampType) in python

烂漫一生 submitted on 2021-02-16 03:30:32
Question: I have two dataframes and I would like to join them based on one column, with the caveat that this column is a timestamp and the timestamps have to be within a certain offset (5 seconds) of each other in order to join the records. More specifically, a record in dates_df with date=1/3/2015:00:00:00 should be joined with events_df with time=1/3/2015:00:00:01 because both timestamps are within 5 seconds of each other. I'm trying to get this logic working with PySpark, and it is extremely painful. How do …
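One common way to express the ±5-second condition is a join on an interval predicate. A sketch, assuming dates_df.date and events_df.time are already TimestampType as in the question (note that a non-equi join like this cannot use an equi-join strategy and can be expensive on large data):

from pyspark.sql import functions as F

# Keep pairs whose timestamps differ by at most 5 seconds.
# Casting a timestamp to long yields seconds since the epoch.
joined = dates_df.join(
    events_df,
    F.abs(dates_df["date"].cast("long") - events_df["time"].cast("long")) <= 5,
)
joined.show()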

PySpark explode list into multiple columns based on name

若如初见. submitted on 2021-02-15 12:01:02
Question: Hi, I'm dealing with a slightly difficult file format which I'm trying to clean for some future processing. I've been using PySpark to process the data into a dataframe. The file looks similar to this:

AA 1234 ZXYW
BB A 890
CC B 321
AA 1234 LMNO
BB D 123
CC E 321
AA 1234 ZXYW
CC E 456

Each 'AA' record defines the start of a logical group of records, and the data on each line is fixed length and has information encoded in it that I want to extract. There are at least 20-30 different record …
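A sketch of one common first step for this kind of fixed-width, multi-record-type file, assuming an active SparkSession named spark; the file path and field positions are illustrative, not taken from the question:

from pyspark.sql import functions as F

# Read the raw lines, tag each with its record type (first two characters),
# then slice out fixed-width fields with substring (positions are 1-based).
raw = spark.read.text("records.txt")  # hypothetical path
parsed = (
    raw.withColumn("rec_type", F.substring("value", 1, 2))
       .withColumn("field_1", F.substring("value", 4, 4))
       .withColumn("field_2", F.substring("value", 9, 4))
)
parsed.show(truncate=False)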

PySpark Sql with column name containing dash/hyphen in it

时间秒杀一切 submitted on 2021-02-11 17:37:33
Question: I have a PySpark dataframe df:

data = {'Passenger-Id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}, 'Age': {0: 22, 1: 38, 2: 26, 3: 35, 4: 35}}
df_pd = pd.DataFrame(data, columns=data.keys())
df = spark.createDataFrame(df_pd)

+------------+---+
|Passenger-Id|Age|
+------------+---+
|           1| 22|
|           2| 38|
|           3| 26|
|           4| 35|
|           5| 35|
+------------+---+

This works:

df.filter(df.Age == 22).show()

But the following doesn't, due to the - in the column name:

df.filter(df.Passenger-Id == 2).show()
AttributeError: 'DataFrame' object …
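The standard workaround is to avoid attribute access for column names that are not valid Python identifiers; a short sketch using the df from the question:

from pyspark.sql import functions as F

# Bracket or col() lookup works where df.Passenger-Id cannot
df.filter(df["Passenger-Id"] == 2).show()
df.filter(F.col("Passenger-Id") == 2).show()

# In SQL expression strings, wrap the name in backticks
df.filter("`Passenger-Id` = 2").show()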

How can I make pyspark and SparkSQL execute Hive on Spark?

独自空忆成欢 submitted on 2021-02-11 16:59:59
Question: I've installed and set up Spark on YARN, and integrated Spark with Hive tables. Using spark-shell / pyspark, I followed the simple tutorial and managed to create a Hive table, load data, and then select it properly. Then I moved on to the next step, setting up Hive on Spark. Using hive / beeline, I also managed to create a Hive table, load data, and then select it properly, and Hive executes on YARN/Spark correctly. How do I know it works? The hive shell displays the following:

hive> select …
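For the PySpark side of this setup, the piece that is usually needed is a Hive-enabled SparkSession so that Spark SQL reads the same metastore tables that hive/beeline uses. A hedged sketch; the table name is a placeholder, and this shows Spark querying Hive tables rather than configuring the Hive-on-Spark execution engine itself:

from pyspark.sql import SparkSession

# Build a session with Hive support so spark.sql() sees the Hive metastore
spark = (
    SparkSession.builder
    .appName("hive-from-pyspark")
    .enableHiveSupport()
    .getOrCreate()
)
spark.sql("SELECT * FROM my_hive_table LIMIT 10").show()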

Converting dataframe to dictionary in pyspark without using pandas

大城市里の小女人 submitted on 2021-02-11 16:55:20
Question: Following up on this question and dataframes, I am trying to convert a dataframe into a dictionary. In pandas I was using this: dictionary = df_2.unstack().to_dict(orient='index'). However, I need to convert this code to pyspark. Can anyone help me with this? As I understand from previous questions such as this, I would indeed need to use pandas, but the dataframe is way too big for me to be able to do that. How can I solve this? EDIT: I have now tried the following approach: dictionary_list = map …
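A sketch of one pandas-free route, assuming the frame fits on the driver once collected; key_col is a placeholder for whichever column should become the dictionary key:

# Collect the rows to the driver and build the dictionary there
dictionary = {row["key_col"]: row.asDict() for row in df_2.collect()}

# For a simple two-column key/value frame this also works:
# dictionary = df_2.rdd.map(lambda r: (r[0], r[1])).collectAsMap()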

Reading from S3 in EMR

萝らか妹 submitted on 2021-02-11 15:23:17
Question: I'm having trouble reading CSV files stored in my bucket on AWS S3 from EMR. I have read quite a few posts about it and have done the following to make it work: added an IAM policy allowing read & write access to S3, and tried to pass the URIs in the Arguments section of the spark-submit request. I thought querying S3 from EMR on a common account was straightforward (because it works locally after defining a fileSystem and providing AWS credentials), but when I run: df = spark.read.option( …
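For reference, a minimal sketch of the read itself, assuming an active SparkSession named spark; the bucket and key are placeholders. On EMR the s3:// scheme (EMRFS) is normally used, and credentials come from the cluster's instance profile rather than keys embedded in code:

df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3://my-bucket/path/to/file.csv")  # hypothetical bucket/key
)
df.show(5)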

How to UnPivot COLUMNS into ROWS in an AWS Glue / PySpark script

心已入冬 submitted on 2021-02-11 15:05:00
Question: I have a large nested JSON document for each year (say 2018, 2017), which has data aggregated by month (Jan-Dec) and by day (1-31):

{
  "2018": {
    "Jan": {
      "1": { "u": 1, "n": 2 },
      "2": { "u": 4, "n": 7 }
    },
    "Feb": {
      "1": { "u": 3, "n": 2 },
      "4": { "u": 4, "n": 5 }
    }
  }
}

I have used the AWS Glue Relationalize.apply function to convert the above hierarchical data into a flat structure:

dfc = Relationalize.apply(frame = datasource0, staging_path = my_temp_bucket, name = my_ref_relationalize_table, …
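A hedged sketch of the usual unpivot pattern in Spark SQL, shown on a plain DataFrame rather than the Glue DynamicFrame from the question (which would first be converted with .toDF()); the flattened column names are illustrative, not the actual Relationalize output:

# Assumes an active SparkSession named spark
df = spark.createDataFrame(
    [(2018, 1, 2, 3, 2)],
    ["year", "Jan_1_u", "Jan_1_n", "Feb_1_u", "Feb_1_n"],
)

# stack(n, label1, col1, label2, col2, ...) turns columns into (key, value) rows
unpivoted = df.selectExpr(
    "year",
    "stack(4, 'Jan_1_u', Jan_1_u, 'Jan_1_n', Jan_1_n, "
    "'Feb_1_u', Feb_1_u, 'Feb_1_n', Feb_1_n) as (metric, value)",
)
unpivoted.show()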