Spark filtering with regex

Submitted by 做~自己de王妃 on 2019-11-30 23:31:36

There are 2 issues with the code:

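  1. The delimiter used to split the lines of data.txt is wrong. It should be the Char '|' instead of the String "|": split with a String argument treats it as a regular expression, and in a regex | is the alternation operator rather than a literal pipe.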
  2. The regex singleReg is wrong.
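
The first issue can be checked without Spark. The following is a minimal sketch in plain Scala; the sample line is an assumption built from one record of the output further down, not a line copied from the actual data.txt.

```scala
// Assumed sample line in the pipe-delimited layout of data.txt.
val line = "John|11-08-2017|5"

// String argument: "|" is compiled as a regex (an alternation of two
// empty patterns), so it matches between characters and the line is
// broken into single-character pieces.
val byString = line.split("|")

// Char argument: the pipe is taken literally as the delimiter.
val byChar = line.split('|')

println(byString.length)      // far more than 3 pieces
println(byChar.mkString(", ")) // John, 11-08-2017, 5
```

Escaping the pipe in the String form, as in line.split("\\|"), should give the same three fields as the Char form.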

The correct code is as follows:

Load an RDD

scala> val file = sc.textFile("test/data.txt")
scala> val fileRDD = file.map(x => x.split('|'))

RegEx

scala> val singleReg = """\w{3}\s\d{2},\s\d{4}|\d{2}\s\w{3},\s\d{4}|\d{1}\/\d{2}\/\d{4}|\d{2}-\d{2}-\d{4}""".r
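
As a quick sanity check, the regex can be run against individual date strings before it is used in a Spark job. The sketch below reuses the date values that appear in the output of this answer; "10/17/2017" is a hypothetical extra value added only to illustrate a limitation.

```scala
val singleReg = """\w{3}\s\d{2},\s\d{4}|\d{2}\s\w{3},\s\d{4}|\d{1}\/\d{2}\/\d{4}|\d{2}-\d{2}-\d{4}""".r

// matches() requires the whole string to match one of the alternatives.
def isValid(s: String): Boolean = singleReg.pattern.matcher(s).matches

val samples = Seq(
  "Jan 11, 2017", // \w{3}\s\d{2},\s\d{4}
  "11 Jan, 2017", // \d{2}\s\w{3},\s\d{4}
  "6/17/2017",    // \d{1}\/\d{2}\/\d{4}
  "11-08-2017",   // \d{2}-\d{2}-\d{4}
  "2016",         // year only: rejected
  "",             // empty field: rejected
  "10/17/2017"    // hypothetical: rejected, the pattern allows a one-digit month only
)

samples.foreach(s => println(s"'$s' -> ${isValid(s)}"))
```

Note that the pattern is strict about digit counts: a two-digit month in the slash format or a one-digit day in the "Jan 11, 2017" format would land in the bad records. If such values are expected, the quantifiers would need loosening, for example \d{1,2}.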

Filter

scala> val validSingleRecords = fileRDD.filter(x => (singleReg.pattern.matcher(x(1)).matches))
scala> val badSingleRecords = fileRDD.filter(x => !(singleReg.pattern.matcher(x(1)).matches))
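
One thing to watch for: x(1) throws an ArrayIndexOutOfBoundsException when a line has fewer than two fields. The sketch below shows a guarded predicate on a local Seq standing in for fileRDD; the rows are assumptions modelled on the output of this answer, and "Broken" is a hypothetical malformed line.

```scala
val singleReg = """\w{3}\s\d{2},\s\d{4}|\d{2}\s\w{3},\s\d{4}|\d{1}\/\d{2}\/\d{4}|\d{2}-\d{2}-\d{4}""".r

// Local stand-in for fileRDD (assumed sample lines, not the real file).
val rows: Seq[Array[String]] =
  Seq("Christopher|Jan 11, 2017|5 ", "Neli|2016|5 ", "Bilu||5", "Broken")
    .map(_.split('|'))

// Check the field count before indexing, so a short line is treated
// as a bad record instead of raising an exception.
def hasValidDate(x: Array[String]): Boolean =
  x.length > 1 && singleReg.pattern.matcher(x(1)).matches

// partition returns (matching, non-matching) in one pass on a local Seq.
val (valid, bad) = rows.partition(hasValidDate)

println(valid.map(_(0))) // names of valid rows
println(bad.map(_(0)))   // names of rejected rows
```

The same predicate should work in the two filter calls above, as fileRDD.filter(hasValidDate) and fileRDD.filter(x => !hasValidDate(x)). The later step that reads x(2) would still need a similar length check for short lines.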

Turn array into tuple

scala> val validSingle = validSingleRecords.map(x => (x(0),x(1),x(2)))
scala> val badSingle = badSingleRecords.map(x => (x(0),x(1),x(2)))
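
Because each record is mapped to a tuple, the saved text is the tuple's toString, which is why every output line is wrapped in parentheses and separated by commas. If the original pipe-delimited layout is preferred, mkString is one alternative. A small sketch, using one record from the output as sample data:

```scala
// Sample record taken from the output shown in this answer.
val fields = Array("Christopher", "Jan 11, 2017", "5 ")

// What the tuple mapping produces when written as text.
val asTuple = (fields(0), fields(1), fields(2)).toString

// Alternative: join the fields back with the original delimiter.
val asLine = fields.mkString("|")

println(asTuple) // (Christopher,Jan 11, 2017,5 )
println(asLine)  // Christopher|Jan 11, 2017|5 
```

In the Spark code this would correspond to validSingleRecords.map(x => x.mkString("|")) in place of the tuple mapping.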

Write file

scala> validSingle.repartition(1).saveAsTextFile("data/singValid")
scala> badSingle.repartition(1).saveAsTextFile("data/singBad")

The above code gives the following output:

data/singValid

(Christopher,Jan 11, 2017,5 )
(Justin,11 Jan, 2017,5 )
(Thomas,6/17/2017,5 )
(John,11-08-2017,5 )

data/singBad

(Neli,2016,5 )
(Bilu,,5)