I am trying to extract 60 ML and 0.5 ML from the string \"60 ML of paracetomol and 0.5 ML of XYZ\" . This string is part of a column X in spark dataframe. Though I am able t
Here is how you can do it with a python UDF:
from pyspark.sql.types import *
from pyspark.sql.functions import *
import re
data = [('60 ML of paracetomol and 0.5 ML of XYZ',)]
df = sc.parallelize(data).toDF('str:string')
# Define the function you want to return
def extract(s)
all_matches = re.findall(r'\d+(?:.\d+)? ML', s)
return all_matches
# Create the UDF, note that you need to declare the return schema matching the returned type
extract_udf = udf(extract, ArrayType(StringType()))
# Apply it
df2 = df.withColumn('extracted', extract_udf('str'))
Python UDFs take a significant performance hit over native DataFrame operations. After thinking about it a little more, here is another way to do it without using a UDF. The general idea is replace all the text that isn't what you want with commas, then split on comma to create your array of final values. If you only want the numbers you can update the regex's to take 'ML' out of the capture group.
pattern = r'\d+(?:\.\d+)? ML'
split_pattern = r'.*?({pattern})'.format(pattern=pattern)
end_pattern = r'(.*{pattern}).*?$'.format(pattern=pattern)
df2 = df.withColumn('a', regexp_replace('str', split_pattern, '$1,'))
df3 = df2.withColumn('a', regexp_replace('a', end_pattern, '$1'))
df4 = df3.withColumn('a', split('a', r','))