I have an issue with regex extract with multiple matches

Submitted by 独自空忆成欢 on 2020-01-19 18:06:05

Question


I am trying to extract 60 ML and 0.5 ML from the string "60 ML of paracetomol and 0.5 ML of XYZ". This string is part of a column X in a Spark DataFrame. Although my regex works in a regex validator, I cannot extract both values with regexp_extract because it returns only the first match, so I get only 60 ML.

Can you suggest the best way of doing this with a UDF?


Answer 1:


Here is how you can do it with a Python UDF:

from pyspark.sql.types import *
from pyspark.sql.functions import *
import re

data = [('60 ML of paracetomol and 0.5 ML of XYZ',)]
df = sc.parallelize(data).toDF('str:string')

# Define the function you want to apply to each value
def extract(s):
    all_matches = re.findall(r'\d+(?:\.\d+)? ML', s)
    return all_matches

# Create the UDF, note that you need to declare the return schema matching the returned type
extract_udf = udf(extract, ArrayType(StringType()))

# Apply it
df2 = df.withColumn('extracted', extract_udf('str'))
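
To sanity-check the result you can display the new column. This is just a quick usage sketch assuming a standard PySpark session; the output in the comment is what the regex should produce for the sample row:

# Should display something like: [60 ML, 0.5 ML]
df2.select('str', 'extracted').show(truncate=False)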

Python UDFs take a significant performance hit compared to native DataFrame operations. After thinking about it a little more, here is another way to do it without a UDF. The general idea is to replace all the text that isn't what you want with commas, then split on the commas to create the array of final values. If you only want the numbers, you can update the regexes to move 'ML' out of the capture group (a sketch of that variant follows after the code below).

pattern = r'\d+(?:\.\d+)? ML'
split_pattern = r'.*?({pattern})'.format(pattern=pattern)
end_pattern = r'(.*{pattern}).*?$'.format(pattern=pattern)

# Replace everything before each match with the match itself plus a comma
df2 = df.withColumn('a', regexp_replace('str', split_pattern, '$1,'))
# Drop whatever trails the last match
df3 = df2.withColumn('a', regexp_replace('a', end_pattern, '$1'))
# Split on the commas to get the array of matches
df4 = df3.withColumn('a', split('a', r','))
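
As mentioned above, if you only want the numeric values you can keep just the digits in the capture group and match ' ML' outside it. Here is a minimal sketch of that variant, assuming the same df as before; num_pattern, split_num, end_num and the column name 'nums' are made-up names for illustration:

# Capture only the number; ' ML' is matched but not captured
num_pattern = r'(\d+(?:\.\d+)?) ML'
split_num = r'.*?{p}'.format(p=num_pattern)
end_num = r'(.*\d+(?:\.\d+)?).*?$'

# Same three steps as above: keep each number plus a comma, trim the tail, then split
df_num = df.withColumn('nums', regexp_replace('str', split_num, '$1,'))
df_num = df_num.withColumn('nums', regexp_replace('nums', end_num, '$1'))
df_num = df_num.withColumn('nums', split('nums', r','))
# Expected result for the sample row: [60, 0.5] (as an array of strings)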


Source: https://stackoverflow.com/questions/54597183/i-have-an-issue-with-regex-extract-with-multiple-matches
