问题
I want to take a column and split a string using a character. As per usual, I understood that the method split would return a list, but when coding I found that the returning object had only the methods getItem or getField with the following descriptions from the API:
@since(1.3) def getItem(self, key): """ An expression that gets an item at position ``ordinal`` out of a list, or gets an item by key out of a dict. @since(1.3) def getField(self, name): """ An expression that gets a field by name in a StructField.
Obviously this doesnt meet my requirements, for example for the text within the column "A_B_C_D" I would like to split between "A_B_C_" and "D" in two different columns.
This is the code I'm using
from pyspark.sql.functions import regexp_extract, col, split
df_test=spark.sql("SELECT * FROM db_test.table_test")
#Applying the transformations to the data
split_col=split(df_test['Full_text'],'_')
df_split=df_test.withColumn('Last_Item',split_col.getItem(3))
Find an example:
from pyspark.sql import Row
from pyspark.sql.functions import regexp_extract, col, split
l = [("Item1_Item2_ItemN"),("FirstItem_SecondItem_LastItem"),("ThisShouldBeInTheFirstColumn_ThisShouldBeInTheLastColumn")]
rdd = sc.parallelize(l)
datax = rdd.map(lambda x: Row(fullString=x))
df = sqlContext.createDataFrame(datax)
split_col=split(df['fullString'],'_')
df=df.withColumn('LastItemOfSplit',split_col.getItem(2))
Result:
fullString LastItemOfSplit
Item1_Item2_ItemN ItemN
FirstItem_SecondItem_LastItem LastItem
ThisShouldBeInTheFirstColumn_ThisShouldBeInTheLastColumn null
My expected result would be having always the last item
fullString LastItemOfSplit
Item1_Item2_ItemN ItemN
FirstItem_SecondItem_LastItem LastItem
ThisShouldBeInTheFirstColumn_ThisShouldBeInTheLastColumn ThisShouldBeInTheLastColumn
回答1:
You can use getItem(size - 1)
to get the last item from the arrays:
Example:
df = spark.createDataFrame([[['A', 'B', 'C', 'D']], [['E', 'F']]], ['split'])
df.show()
+------------+
| split|
+------------+
|[A, B, C, D]|
| [E, F]|
+------------+
import pyspark.sql.functions as F
df.withColumn('lastItem', df.split.getItem(F.size(df.split) - 1)).show()
+------------+--------+
| split|lastItem|
+------------+--------+
|[A, B, C, D]| D|
| [E, F]| F|
+------------+--------+
For your case:
from pyspark.sql.functions import regexp_extract, col, split, size
df_test=spark.sql("SELECT * FROM db_test.table_test")
#Applying the transformations to the data
split_col=split(df_test['Full_text'],'_')
df_split=df_test.withColumn('Last_Item',split_col.getItem(size(split_col) - 1))
回答2:
You can pass in a regular expression pattern to split.
The following would work for your example:
from pyspark.sql.functions split
split_col=split(df['fullString'], r"_(?=.+$)")
df = df.withColumn('LastItemOfSplit', split_col.getItem(1))
df.show(truncate=False)
#+--------------------------------------------------------+---------------------------+
#|fullString |LastItemOfSplit |
#+--------------------------------------------------------+---------------------------+
#|Item1_Item2_ItemN |Item2 |
#|FirstItem_SecondItem_LastItem |SecondItem |
#|ThisShouldBeInTheFirstColumn_ThisShouldBeInTheLastColumn|ThisShouldBeInTheLastColumn|
#+--------------------------------------------------------+---------------------------+
The pattern means the following:
_
the literal underscore(?=.+$)
positive look-ahead for anything (.
) until the end of the string$
This will split the string on the last underscore. Then call .getItem(1)
to get the item at index 1 in the resultant list.
来源:https://stackoverflow.com/questions/55143035/pyspark-split-a-column-and-take-n-elements