PySpark DataFrame: cast two columns into a new column of tuples based on the value of a third column
Question: As the title describes, I have a PySpark DataFrame in which I need to cast two columns into a new column that is a list of tuples, based on the value of a third column. This cast will reduce, or flatten, the DataFrame by a key value (the product id in this case), so the result is one row per key. The DataFrame has hundreds of millions of rows and 37M unique product ids. I therefore need a way to do the transformation on the Spark cluster without bringing any data back to the driver (Jupyter