Less than comparison for date in spark scala rdd

问题

I want to print data of employees who joined before 1991. Below is my sample data:

69062,FRANK,ANALYST,5646,1991-12-03,3100.00,,2001
63679,SANDRINE,CLERK,69062,1990-12-18,900.00,,2001

Initial RDD for loading data:

val rdd=sc.textFile("file:////home/hduser/Desktop/Employees/employees.txt").filter(p=>{p!=null && p.trim.length>0})

UDF for converting string column to date column:

def convertStringToDate(s: String): Date = {
        val dateFormat = new SimpleDateFormat("yyyy-MM-dd")
        dateFormat.parse(s)
    }

Mapping each and every column to its datatype:

val dateRdd=rdd.map(_.split(",")).map(p=>(if(p(0).length >0 )p(0).toLong else 0L,p(1),p(2),if(p(3).length > 0)p(3).toLong else 0L,convertStringToDate(p(4)),if(p(5).length >0)p(5).toDouble else 0D,if(p(6).length > 0)p(6).toDouble else 0D,if(p(7).length> 0)p(7).toInt else 0))

Now I get data in tuples as below:

(69062,FRANK,ANALYST,5646,Tue Dec 03 00:00:00 IST 1991,3100.0,0.0,2001)
(63679,SANDRINE,CLERK,69062,Tue Dec 18 00:00:00 IST 1990,900.0,0.0,2001)

Now when I execute command I am getting below error:

scala> dateRdd.map(p=>(!(p._5.before("1991")))).foreach(println)
<console>:36: error: type mismatch;
 found   : String("1991")
 required: java.util.Date
              dateRdd.map(p=>(!(p._5.before("1991")))).foreach(println)

                                        ^

So where am I going wrong ???

回答1:

Since you are working with rdd's and no df's and you have date strings with simple date checking, the following non-complicated way for an RDD:

val rdd = sc.parallelize(Seq((69062,"FRANK","ANALYST",5646, "1991-12-03",3100.00,2001),(63679,"SANDRINE","CLERK",69062,"1990-12-18",900.00,2001)))
rdd.filter(p=>(p._5 < "1991-01-01")).foreach(println)

回答2:

No need to convert the date to legacy SimpleDate formats. Use Java.time. Since the 4th column is in the ISO expected format, you can simply use the below rdd step. Check this out

val rdd=spark.sparkContext.textFile("in\\employees.txt").filter( x => {val y = x.split(","); java.time.LocalDate.parse(y(4)).isBefore(java.time.LocalDate.parse("1991-01-01")) } )

the

rdd.collect.foreach(println)

gave the below result

63679,SANDRINE,CLERK,69062,1990-12-18,900.00,,2001

hope, this answers your question.

EDIT1:

Using Java 7 and SimpleFormat libraries

import java.util.Date
import java.text.SimpleDateFormat
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark._
import org.apache.spark.sql.types._
import org.apache.spark.sql._
object DTCheck{
  def main(args:Array[String]): Unit = {

    def convertStringToDate(s: String): Date = {
      val dateFormat = new SimpleDateFormat("yyyy-MM-dd")
      dateFormat.parse(s)
    }
    Logger.getLogger("org").setLevel(Level.ERROR)
    val spark = SparkSession.builder().appName("Employee < 1991").master("local[*]").getOrCreate()

    val  sdf = new SimpleDateFormat("yyyy-MM-dd")
    val dt_1991 = sdf.parse("1991-01-01")

    import spark.implicits._
    val rdd=spark.sparkContext.textFile("in\\employees.txt").filter( x => {val y = x.split(","); convertStringToDate(y(4)).before(dt_1991 ) } )
    rdd.collect.foreach(println)
  }
}

来源：https://stackoverflow.com/questions/52610767/less-than-comparison-for-date-in-spark-scala-rdd

标签

scala

apache-spark

Hadoop

bigdata