Question
I have a dataframe with the following schema:
root
|-- SOURCE: string (nullable = true)
|-- SYSTEM_NAME: string (nullable = true)
|-- BUCKET_NAME: string (nullable = true)
|-- LOCATION: string (nullable = true)
|-- FILE_NAME: string (nullable = true)
|-- LAST_MOD_DATE: string (nullable = true)
|-- FILE_SIZE: string (nullable = true)
I would like to derive a new column by extracting values from certain columns. The data in the LOCATION column looks like the following:
example 1: prod/docs/Folder1/AA160039/Folder2/XXX.pdf
example 2: prod/docs/Folder1/FolderX/Folder3/355/Folder2/zzz.docx
Question 1: I would like to derive a new column called "folder_num" by extracting the following:
1. The two characters followed by six digits between the slashes. The output is "AA160039" from example 1. This expression (mask) will not change: it is always 2 characters followed by 6 digits.
2. Digits alone between slashes. The output is "355" from example 2. The number could be a single digit such as "8", two digits "55", three digits "444", and up to 5 digits "12345". As long as the digits sit between slashes, they need to be extracted into the new column.
How can I achieve this in Spark? I'm new to this technology, so your help is much appreciated.
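Before wiring anything into Spark, the two masks can be sanity-checked with plain Python re against the example paths (a quick sketch, nothing Spark-specific):

import re

paths = [
    "prod/docs/Folder1/AA160039/Folder2/XXX.pdf",
    "prod/docs/Folder1/FolderX/Folder3/355/Folder2/zzz.docx",
]
for p in paths:
    # mask 1: two uppercase letters followed by six digits, between slashes
    m = re.search(r"/([A-Z]{2}[0-9]{6})/", p)
    # mask 2: one to five digits on their own, between slashes
    m = m or re.search(r"/([0-9]{1,5})/", p)
    print(m.group(1) if m else None)  # AA160039, then 355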
df1 = df0.withColumn("LOCATION", trim(col('LOCATION')))
# if LOCATION matches '/[A-Z]{2}[0-9]{6}/' -- extract the value into a new derived column
# if LOCATION matches '/[0-9]{1,5}/'      -- extract the value into a new derived column
Thank you for the help.
Added code (the derived FOLDER_NUM column comes out empty, as the output below shows):
from pyspark.sql.functions import col, lit, regexp_extract, trim, when

df1 = df0.withColumn("LAST_MOD_DATE", col("LAST_MOD_DATE").cast("timestamp"))\
    .withColumn("FILE_SIZE", col("FILE_SIZE").cast("integer"))\
    .withColumn("LOCATION", trim(col('LOCATION')))\
    .withColumn("FOLDER_NUM", when(regexp_extract(col("FILE_NAME"), "([A-Z]{2}[0-9]{6}).*", 1) != lit(""),
                                   regexp_extract(col("LOCATION"), ".*/([A-Z]{2}[0-9]{6})/.*", 1))
                              .otherwise(regexp_extract(col("LOCATION"), ".*/([0-9]{1,5})/.*", 1)))
+------+-----------+-----------+--------------------+-------------------+-------------------+---------+----------+
|SOURCE|SYSTEM_NAME|BUCKET_NAME|            LOCATION|          FILE_NAME|      LAST_MOD_DATE|FILE_SIZE|FOLDER_NUM|
+------+-----------+-----------+--------------------+-------------------+-------------------+---------+----------+
|    s3|        xxx|    bucket1|production/Notifi...|AA120068_Letter.pdf|2020-07-20 15:51:21|    13124|          |
|    s3|        xxx|    bucket1|production/Notifi...|ZZ120093_Letter.pdf|2020-07-20 15:51:21|    61290|          |
|    s3|        xxx|    bucket1|production/Notifi...|XC120101_Letter.pdf|2020-07-20 15:51:21|    61700|          |
+------+-----------+-----------+--------------------+-------------------+-------------------+---------+----------+
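The blank FOLDER_NUM above comes from a mismatch in the when branch: the condition tests FILE_NAME, but the value is extracted from LOCATION, and regexp_extract returns an empty string when its pattern does not match. A minimal illustration of that behavior (assuming a running spark session):

from pyspark.sql.functions import lit, regexp_extract

# no digits-only segment between slashes, so group 1 comes back as ""
spark.range(1).select(
    regexp_extract(lit("prod/docs/FolderX/file.pdf"), ".*/([0-9]{1,5})/.*", 1).alias("folder_num")
).show()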
Answer 1:
Well, you are on the right track:
from pyspark.sql.functions import regexp_extract, trim

df = spark.createDataFrame([{"old_column": "ex@mple trimed"}], 'old_column string')
df.withColumn('new_column', regexp_extract(trim('old_column'), '(e.*@)', 1)).show()
This will trim the column and then extract the first capture group that matches the regex.
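With the sample row above, the call should print something like:

+--------------+----------+
|    old_column|new_column|
+--------------+----------+
|ex@mple trimed|       ex@|
+--------------+----------+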
Answer 2:
You can use regexp_extract and when. Refer to the sample Scala Spark code below.
df.withColumn("folder_num",
when(regexp_extract(col("LOCATION"),".*/[A-Z]{2}([0-9]{6})/.*" ,1) =!= lit(""),
regexp_extract(col("LOCATION"),".*/[A-Z]{2}([0-9]{6})/.*" , 1))
.otherwise(regexp_extract(col("LOCATION"),".*/([0-9]{1,5})/.*" , 1))
).show(false)
+------------------------------------------------------+----------+
|LOCATION                                              |folder_num|
+------------------------------------------------------+----------+
|prod/docs/Folder1/AA160039/Folder2/XXX.pdf            |160039    |
|prod/docs/Folder1/FolderX/Folder3/355/Folder2/zzz.docx|355       |
+------------------------------------------------------+----------+
If you need the output of the first row to be AA160039, just change the grouping in the regex as below.
regexp_extract(col("LOCATION"), ".*/([A-Z]{2}[0-9]{6})/.*", 1)
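Since the question itself is in PySpark, the same logic translates roughly as follows (a sketch; Python's != plays the role of Scala's =!=):

from pyspark.sql.functions import col, lit, regexp_extract, when

df.withColumn("folder_num",
    when(regexp_extract(col("LOCATION"), ".*/([A-Z]{2}[0-9]{6})/.*", 1) != lit(""),
         regexp_extract(col("LOCATION"), ".*/([A-Z]{2}[0-9]{6})/.*", 1))
    .otherwise(regexp_extract(col("LOCATION"), ".*/([0-9]{1,5})/.*", 1))
).show(truncate=False)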
Answer 3:
The info provided was really helpful. I appreciate everyone for putting me on the right track. The final version of the code is below.
from pyspark.sql.functions import col, lit, regexp_extract, trim, when

df1 = df0.withColumn("LAST_MOD_DATE", col("LAST_MOD_DATE").cast("timestamp"))\
    .withColumn("FILE_SIZE", col("FILE_SIZE").cast("integer"))\
    .withColumn("LOCATION", trim(col('LOCATION')))\
    .withColumn("FOLDER_NUM",
        when(regexp_extract(trim(col("FILE_NAME")), "([A-Z]{2}[0-9]{6}).*", 1) != lit(""),  # mask in FILE_NAME
             regexp_extract(trim(col("FILE_NAME")), "([A-Z]{2}[0-9]{6}).*", 1))
        .when(regexp_extract(trim(col("LOCATION")), ".*/([A-Z]{2}[0-9]{6})/.*", 1) != lit(""),  # mask between slashes
              regexp_extract(trim(col("LOCATION")), ".*/([A-Z]{2}[0-9]{6})/.*", 1))
        .when(regexp_extract(trim(col("LOCATION")), ".*/([0-9]{1,5})/.*", 1) != lit(""),  # 1-5 digits between slashes
              regexp_extract(trim(col("LOCATION")), ".*/([0-9]{1,5})/.*", 1))
        .otherwise("Unknown"))  # sentinel when nothing matches
Thanks.
Source: https://stackoverflow.com/questions/64602504/extract-values-from-spark-dataframe-column-into-new-derived-column