I have an issue with regex extract with multiple matches

走了就别回头了 2020-12-20 03:32

I am trying to extract "60 ML" and "0.5 ML" from the string "60 ML of paracetomol and 0.5 ML of XYZ". This string is part of a column X in a Spark DataFrame. Though I am able to extract a single match with regexp_extract, I cannot get all the matches.

1 answer
  • 2020-12-20 04:15

    Here is how you can do it with a Python UDF:

    from pyspark.sql.types import *
    from pyspark.sql.functions import *
    import re
    
    data = [('60 ML of paracetomol and 0.5 ML of XYZ',)]
    df = sc.parallelize(data).toDF(['str'])  # pass column names as a list
    
    # Define the function you want to apply; escape the dot so it matches a literal '.'
    def extract(s):
        all_matches = re.findall(r'\d+(?:\.\d+)? ML', s)
        return all_matches
    
    # Create the UDF, note that you need to declare the return schema matching the returned type
    extract_udf = udf(extract, ArrayType(StringType()))
    
    # Apply it
    df2 = df.withColumn('extracted', extract_udf('str'))
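As a quick sanity check outside Spark, the same pattern can be run with plain `re` on the sample string (this only exercises the regex, not the UDF wiring):

```python
import re

# Same pattern the UDF uses; the dot is escaped to match a literal '.'
pattern = r'\d+(?:\.\d+)? ML'
sample = '60 ML of paracetomol and 0.5 ML of XYZ'
matches = re.findall(pattern, sample)
print(matches)  # → ['60 ML', '0.5 ML']
```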
    

    Python UDFs take a significant performance hit compared to native DataFrame operations. After thinking about it a little more, here is another way to do it without a UDF. The general idea is to rewrite the string so that each match is kept and followed by a comma while the surrounding text is dropped, then split on the comma to create your array of final values. If you only want the numbers, you can update the regexes to take 'ML' out of the capture group.

    pattern = r'\d+(?:\.\d+)? ML'
    split_pattern = r'.*?({pattern})'.format(pattern=pattern)
    end_pattern = r'(.*{pattern}).*?$'.format(pattern=pattern)
    
    df2 = df.withColumn('a', regexp_replace('str', split_pattern, '$1,'))
    df3 = df2.withColumn('a', regexp_replace('a', end_pattern, '$1'))
    df4 = df3.withColumn('a', split('a', r','))
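To see what the two regexp_replace calls are doing, here is the same replace-then-split chain mirrored with Python's `re.sub` (an illustration of the logic only, not the Spark execution; Spark's `$1` backreference becomes `\1` in Python):

```python
import re

pattern = r'\d+(?:\.\d+)? ML'
split_pattern = r'.*?({pattern})'.format(pattern=pattern)
end_pattern = r'(.*{pattern}).*?$'.format(pattern=pattern)

s = '60 ML of paracetomol and 0.5 ML of XYZ'
s = re.sub(split_pattern, r'\1,', s)  # each match kept, followed by a comma: '60 ML,0.5 ML, of XYZ'
s = re.sub(end_pattern, r'\1', s)     # trim any trailing text after the last match
result = s.split(',')
print(result)  # → ['60 ML', '0.5 ML']
```

As an aside, if you are on Spark 3.1 or later, the built-in SQL function regexp_extract_all can return all matches in one step, avoiding both the UDF and the replace-then-split trick.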
    