Question
I have a PySpark DataFrame that contains 4 columns. I want to extract some strings from one column whose type is array of strings.
I used the regexp_extract function, but it returned an error because regexp_extract accepts only strings.
Example DataFrame:
id | last_name | age | Identificator
------------------------------------------------------------------
12 | AA | 23 | "[""AZE","POI","76759","T86420","ADAPT"]"
------------------------------------------------------------------
24 | BB | 24 | "[""SDN","34","35","AZE","21054","20126"]"
------------------------------------------------------------------
I want to extract all numbers that:
- contain 4, 5 or 6 digits
- are not attached to letters
- except for the letter Z: a number attached to Z should still be extracted
and save them in a new column in my DataFrame.
I started like this, but it doesn't work because the column is an array of strings:
expression = r'([0-9]){4,6}'
df = df.withColumn("extract", F.regexp_extract(F.col("Identificator"), expression, 1))
How can I extract these numbers using regexp_extract or another solution? Thank you.
Answer 1:
Here is what I can do using the Spark SQL 2.4.0+ built-in function filter:
from pyspark.sql.functions import expr
df.withColumn('text_new', expr('filter(text, x -> x rlike "^Z?[0-9]{4,6}$")')) \
.show(truncate=False)
#+-----------------------------------+---------------------+
#|text |text_new |
#+-----------------------------------+---------------------+
#|[AZE, POI, 76759, T86420, ADAPT] |[76759] |
#|[SDN, 34, Z8735, AZE, 21054, 20126]|[Z8735, 21054, 20126]|
#+-----------------------------------+---------------------+
The result is an array containing the matched items. The regex ^Z?[0-9]{4,6}$ matches 4 to 6 digits, optionally preceded by the character 'Z'.
Edit: for older versions of Apache Spark, use udf():
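The pattern can be checked outside Spark first. Below is a minimal sketch in plain Python, using the sample rows from the output above; keep_matches and UNANCHORED are hypothetical names introduced only for this illustration. It also shows why the ^ and $ anchors matter: as far as I know, rlike succeeds when the pattern is found anywhere in the string, so an unanchored pattern would also accept T86420.

```python
import re

# Same pattern as in the answer: optional leading 'Z', then 4 to 6 digits,
# anchored at both ends so the whole array element has to match.
PATTERN = re.compile(r'^Z?[0-9]{4,6}$')

# The same pattern without anchors, for comparison only.
UNANCHORED = re.compile(r'Z?[0-9]{4,6}')


def keep_matches(items):
    """Return the elements of `items` that match PATTERN (hypothetical helper)."""
    return [x for x in items if PATTERN.match(x)]


row1 = ["AZE", "POI", "76759", "T86420", "ADAPT"]
row2 = ["SDN", "34", "Z8735", "AZE", "21054", "20126"]

print(keep_matches(row1))  # ['76759']
print(keep_matches(row2))  # ['Z8735', '21054', '20126']

# Without anchors, the digits inside 'T86420' are found as a substring.
print(bool(UNANCHORED.search("T86420")))  # True
```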
import re
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
# regex pattern:
ptn = re.compile('^Z?[0-9]{4,6}$')
# create an udf to filter array
array_filter = udf(lambda arr: [ x for x in arr if re.match(ptn, x) ] if type(arr) is list else arr, ArrayType(StringType()))
df.withColumn('text_new', array_filter('text')) \
.show(truncate=False)
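Edit 2: based on your comment, to change the prefix from 'Z' to 'MOD' and remove the leading MOD from the result, use lstrip(). Note that lstrip('MOD') strips any leading M, O or D characters rather than the literal substring; that is safe here only because the regex guarantees that everything after the prefix is digits. Adjust the code as follows: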
ptn = re.compile(r'^(?:MOD)?[0-9]{4,6}$')
array_filter = udf(lambda arr: [ x.lstrip('MOD') for x in arr if re.match(ptn, x) ] if type(arr) is list else arr, ArrayType(StringType()))
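If relying on lstrip() feels fragile, one possible alternative is to capture the digits in the regex and return the captured group, so that the prefix never has to be stripped separately. This is only a sketch under that assumption, not part of the original answer; MOD_PATTERN and extract_digits are hypothetical names, and the sample values are made up for illustration.

```python
import re

# Optional literal 'MOD' prefix (non-capturing), then 4 to 6 digits (captured).
MOD_PATTERN = re.compile(r'^(?:MOD)?([0-9]{4,6})$')


def extract_digits(arr):
    """Return only the digit part of each matching element (hypothetical helper).

    Non-list input (for example None) is returned unchanged, mirroring the
    type check in the lambda above.
    """
    if not isinstance(arr, list):
        return arr
    out = []
    for x in arr:
        m = MOD_PATTERN.match(x)
        if m:
            out.append(m.group(1))  # digits only, prefix dropped by the regex
    return out


sample = ["MOD12345", "AZE", "9876", "MOD12", "DOM1234"]
print(extract_digits(sample))  # ['12345', '9876']

# lstrip() works on a set of characters, not on a substring:
print("DOM1234".lstrip("MOD"))  # 1234
```

The function can then be wrapped the same way as the lambda, for example udf(extract_digits, ArrayType(StringType())).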
Source: https://stackoverflow.com/questions/58374905/how-use-on-array