Question
I have a dataframe with the following schema:
root
|-- SOURCE: string (nullable = true)
|-- SYSTEM_NAME: string (nullable = true)
|-- BUCKET_NAME: string (nullable = true)
|-- LOCATION: string (nullable = true)
|-- FILE_NAME: string (nullable = true)
|-- LAST_MOD_DATE: string (nullable = true)
|-- FILE_SIZE: string (nullable = true)
I would like to derive a new column by extracting values from certain columns. The data in the LOCATION column looks like the following:
example 1: prod/docs/Folder1/AA160039/Folder2/XXX.pdf
example 2: prod/docs/Folder1/FolderX/Folder3/355/Folder2/zzz.docx
Question 1: I would like to derive a new column called "folder_num" by extracting the following:
1. The two characters followed by six digits between the slashes. The output is "AA160039" from example 1. This expression (mask) will not change: it is always 2 characters followed by 6 digits.
2. Digits alone between slashes. The output is "355" from example 2. The number could be a single digit such as "8", two digits "55", three digits "444", and up to 5 digits "12345". As long as the digits sit between slashes, they need to be extracted into the new column.
How can I achieve this in Spark? I'm new to this technology, so your help is much appreciated.
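Before wiring anything into Spark, the two masks can be sanity-checked with plain Python re against the example paths (a quick sketch, nothing Spark-specific):

import re

paths = [
    "prod/docs/Folder1/AA160039/Folder2/XXX.pdf",
    "prod/docs/Folder1/FolderX/Folder3/355/Folder2/zzz.docx",
]
for p in paths:
    # mask 1: two uppercase letters followed by six digits, between slashes
    m = re.search(r"/([A-Z]{2}[0-9]{6})/", p)
    # mask 2: one to five digits on their own, between slashes
    m = m or re.search(r"/([0-9]{1,5})/", p)
    print(m.group(1) if m else None)  # AA160039, then 355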
df1 = df0.withColumn("LOCATION", trim(col('LOCATION')))
# if LOCATION matches '/[A-Z]{2}[0-9]{6}/' -- extract the value into a new derived column
# if LOCATION matches '/[0-9]{1,5}/'      -- extract the value into a new derived column
Thank you for the help.
Added code (the derived FOLDER_NUM column comes out empty, as the output below shows):
from pyspark.sql.functions import col, lit, regexp_extract, trim, when

df1 = df0.withColumn("LAST_MOD_DATE", col("LAST_MOD_DATE").cast("timestamp"))\
    .withColumn("FILE_SIZE", col("FILE_SIZE").cast("integer"))\
    .withColumn("LOCATION", trim(col('LOCATION')))\
    .withColumn("FOLDER_NUM", when(regexp_extract(col("FILE_NAME"), "([A-Z]{2}[0-9]{6}).*", 1) != lit(""),
                                   regexp_extract(col("LOCATION"), ".*/([A-Z]{2}[0-9]{6})/.*", 1))
                              .otherwise(regexp_extract(col("LOCATION"), ".*/([0-9]{1,5})/.*", 1)))
+------+-----------+-----------+--------------------+-------------------+-------------------+---------+----------+
|SOURCE|SYSTEM_NAME|BUCKET_NAME|            LOCATION|          FILE_NAME|      LAST_MOD_DATE|FILE_SIZE|FOLDER_NUM|
+------+-----------+-----------+--------------------+-------------------+-------------------+---------+----------+
|    s3|        xxx|    bucket1|production/Notifi...|AA120068_Letter.pdf|2020-07-20 15:51:21|    13124|          |
|    s3|        xxx|    bucket1|production/Notifi...|ZZ120093_Letter.pdf|2020-07-20 15:51:21|    61290|          |
|    s3|        xxx|    bucket1|production/Notifi...|XC120101_Letter.pdf|2020-07-20 15:51:21|    61700|          |
+------+-----------+-----------+--------------------+-------------------+-------------------+---------+----------+
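The blank FOLDER_NUM above comes from a mismatch in the when branch: the condition tests FILE_NAME, but the value is extracted from LOCATION, and regexp_extract returns an empty string when its pattern does not match. A minimal illustration of that behavior (assuming a running spark session):

from pyspark.sql.functions import lit, regexp_extract

# no digits-only segment between slashes, so group 1 comes back as ""
spark.range(1).select(
    regexp_extract(lit("prod/docs/FolderX/file.pdf"), ".*/([0-9]{1,5})/.*", 1).alias("folder_num")
).show()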
Answer 1:
Well, you are on the right track:
from pyspark.sql.functions import regexp_extract, trim

df = spark.createDataFrame([{"old_column": "ex@mple trimed"}], 'old_column string')
df.withColumn('new_column', regexp_extract(trim('old_column'), '(e.*@)', 1)).show()
This will trim the column and then extract the first capture group that matches the regex.
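With the sample row above, the call should print something like:

+--------------+----------+
|    old_column|new_column|
+--------------+----------+
|ex@mple trimed|       ex@|
+--------------+----------+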
Answer 2:
You can use regexp_extract and when. Refer to the sample Scala Spark code below.
df.withColumn("folder_num",
when(regexp_extract(col("LOCATION"),".*/[A-Z]{2}([0-9]{6})/.*" ,1) =!= lit(""),
regexp_extract(col("LOCATION"),".*/[A-Z]{2}([0-9]{6})/.*" , 1))
.otherwise(regexp_extract(col("LOCATION"),".*/([0-9]{1,5})/.*" , 1))
).show(false)
+------------------------------------------------------+----------+
|LOCATION                                              |folder_num|
+------------------------------------------------------+----------+
|prod/docs/Folder1/AA160039/Folder2/XXX.pdf            |160039    |
|prod/docs/Folder1/FolderX/Folder3/355/Folder2/zzz.docx|355       |
+------------------------------------------------------+----------+
If you need the output of the first row to be AA160039, just change the grouping in the regex as below.
regexp_extract(col("LOCATION"), ".*/([A-Z]{2}[0-9]{6})/.*", 1)
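Since the question itself is in PySpark, the same logic translates roughly as follows (a sketch; Python's != plays the role of Scala's =!=):

from pyspark.sql.functions import col, lit, regexp_extract, when

df.withColumn("folder_num",
    when(regexp_extract(col("LOCATION"), ".*/([A-Z]{2}[0-9]{6})/.*", 1) != lit(""),
         regexp_extract(col("LOCATION"), ".*/([A-Z]{2}[0-9]{6})/.*", 1))
    .otherwise(regexp_extract(col("LOCATION"), ".*/([0-9]{1,5})/.*", 1))
).show(truncate=False)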
Answer 3:
The info provided was really helpful. I appreciate everyone for putting me on the right track. The final version of the code is below.
from pyspark.sql.functions import col, lit, regexp_extract, trim, when

df1 = df0.withColumn("LAST_MOD_DATE", col("LAST_MOD_DATE").cast("timestamp"))\
    .withColumn("FILE_SIZE", col("FILE_SIZE").cast("integer"))\
    .withColumn("LOCATION", trim(col('LOCATION')))\
    .withColumn("FOLDER_NUM",
        when(regexp_extract(trim(col("FILE_NAME")), "([A-Z]{2}[0-9]{6}).*", 1) != lit(""),  # mask in FILE_NAME
             regexp_extract(trim(col("FILE_NAME")), "([A-Z]{2}[0-9]{6}).*", 1))
        .when(regexp_extract(trim(col("LOCATION")), ".*/([A-Z]{2}[0-9]{6})/.*", 1) != lit(""),  # mask between slashes
              regexp_extract(trim(col("LOCATION")), ".*/([A-Z]{2}[0-9]{6})/.*", 1))
        .when(regexp_extract(trim(col("LOCATION")), ".*/([0-9]{1,5})/.*", 1) != lit(""),  # 1-5 digits between slashes
              regexp_extract(trim(col("LOCATION")), ".*/([0-9]{1,5})/.*", 1))
        .otherwise("Unknown"))  # sentinel when nothing matches
Thanks.
Source: https://stackoverflow.com/questions/64602504/extract-values-from-spark-dataframe-column-into-new-derived-column