I have an issue with regex extract with multiple matches

I am trying to extract 60 ML and 0.5 ML from the string "60 ML of paracetomol and 0.5 ML of XYZ". This string is part of a column X in a Spark DataFrame. Though I am able t

1 answer

感情败类

2020-12-20 04:15

Here is how you can do it with a Python UDF:

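# Explicit imports avoid shadowing Python builtins with a wildcard import
from pyspark.sql.types import ArrayType, StringType
from pyspark.sql.functions import udf, regexp_replace, split
import re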

data = [('60 ML of paracetomol and 0.5 ML of XYZ',)]
df = sc.parallelize(data).toDF('str:string')

# Define the function that returns every match as a list of strings
def extract(s):
    # The dot is escaped so that it only matches a literal decimal point
    all_matches = re.findall(r'\d+(?:\.\d+)? ML', s)
    return all_matches

# Create the UDF, note that you need to declare the return schema matching the returned type
extract_udf = udf(extract, ArrayType(StringType()))

# Apply it
df2 = df.withColumn('extracted', extract_udf('str'))
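
To sanity-check the pattern before wiring it into a UDF, it may help to run it in plain Python first. The snippet below is only a sketch: the `None` guard assumes the column can contain nulls (Spark hands those to a Python UDF as `None`, which `re.findall` would reject), and `extract_numbers` is a hypothetical helper that is not part of the answer above. It also shows why the decimal point should be written as `\.`: an unescaped `.` matches any character.

```python
import re

# Same pattern as in the UDF, with the decimal point escaped.
PATTERN = r'\d+(?:\.\d+)? ML'

def extract(s):
    """Return every '<number> ML' substring; empty list for None (assumed null handling)."""
    if s is None:
        return []
    return re.findall(PATTERN, s)

def extract_numbers(s):
    """Hypothetical variant: capture only the number, leaving ' ML' outside the group."""
    if s is None:
        return []
    return re.findall(r'(\d+(?:\.\d+)?) ML', s)

sample = '60 ML of paracetomol and 0.5 ML of XYZ'
print(extract(sample))          # ['60 ML', '0.5 ML']
print(extract_numbers(sample))  # ['60', '0.5']

# An unescaped dot also accepts a non-numeric separator such as 'x'.
loose = re.findall(r'\d+(?:.\d+)? ML', '1x5 ML')
strict = re.findall(PATTERN, '1x5 ML')
print(loose, strict)            # ['1x5 ML'] ['5 ML']
```

Because `re.findall` returns the group contents when the pattern has exactly one capturing group, moving ` ML` outside the parentheses is enough to get the bare numbers.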

Python UDFs take a significant performance hit compared with native DataFrame operations. After thinking about it a little more, here is another way to do it without a UDF. The general idea is to replace all the text that isn't what you want with commas, then split on the commas to build the array of final values. If you only want the numbers, you can update the regexes to take 'ML' out of the capture group.

pattern = r'\d+(?:\.\d+)? ML'
split_pattern = r'.*?({pattern})'.format(pattern=pattern)
end_pattern = r'(.*{pattern}).*?$'.format(pattern=pattern)

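# Replace everything up to and including each match with the match plus a comma
df2 = df.withColumn('a', regexp_replace('str', split_pattern, '$1,'))
# Drop whatever trails the last match, including the final comma
df3 = df2.withColumn('a', regexp_replace('a', end_pattern, '$1'))
# Split on the commas to get the array of matches
df4 = df3.withColumn('a', split('a', r','))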
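
To see what each of the three steps does to the sample string, here is a rough pure-Python emulation. It is a sketch rather than the Spark code itself: `re.sub` stands in for `regexp_replace`, `\1` replaces the Java-style `$1`, and `trace` is a hypothetical helper name. Spark uses Java regular expressions, so treating the two engines as equivalent for these patterns is an assumption.

```python
import re

PATTERN = r'\d+(?:\.\d+)? ML'
SPLIT_PATTERN = r'.*?({})'.format(PATTERN)
END_PATTERN = r'(.*{}).*?$'.format(PATTERN)

def trace(s):
    """Return the two intermediate strings and the final list for one input."""
    # Step 1: each lazily matched prefix plus match becomes 'match,'
    step1 = re.sub(SPLIT_PATTERN, r'\1,', s)
    # Step 2: keep everything through the last match, drop the rest
    step2 = re.sub(END_PATTERN, r'\1', step1)
    # Step 3: split on the inserted commas
    step3 = step2.split(',')
    return step1, step2, step3

step1, step2, step3 = trace('60 ML of paracetomol and 0.5 ML of XYZ')
print(step1)  # 60 ML,0.5 ML, of XYZ
print(step2)  # 60 ML,0.5 ML
print(step3)  # ['60 ML', '0.5 ML']

# With no match, neither replacement fires and the whole string survives.
no_match = trace('no dosage here')[2]
print(no_match)  # ['no dosage here']
```

One thing the emulation makes visible: when a row contains no match at all, the result is a one-element array holding the original text rather than an empty array, so you may want to filter such rows separately.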
