Less-than comparison for dates in a Spark Scala RDD

Question


I want to print data of employees who joined before 1991. Below is my sample data:

69062,FRANK,ANALYST,5646,1991-12-03,3100.00,,2001
63679,SANDRINE,CLERK,69062,1990-12-18,900.00,,2001 

Initial RDD for loading data:

val rdd = sc.textFile("file:///home/hduser/Desktop/Employees/employees.txt")
  .filter(p => p != null && p.trim.length > 0)

Helper function for converting the date string column to a Date:

import java.text.SimpleDateFormat
import java.util.Date

def convertStringToDate(s: String): Date = {
  // Parse an ISO-style yyyy-MM-dd string into a java.util.Date
  val dateFormat = new SimpleDateFormat("yyyy-MM-dd")
  dateFormat.parse(s)
}
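Note that this helper returns a java.util.Date, not a String. A minimal check (sketch):

val joined: Date = convertStringToDate("1991-12-03")  // java.util.Date at midnight, local time zone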

Mapping each column to its datatype:

val dateRdd = rdd.map(_.split(",")).map(p => (
  if (p(0).length > 0) p(0).toLong else 0L,
  p(1),
  p(2),
  if (p(3).length > 0) p(3).toLong else 0L,
  convertStringToDate(p(4)),
  if (p(5).length > 0) p(5).toDouble else 0D,
  if (p(6).length > 0) p(6).toDouble else 0D,
  if (p(7).length > 0) p(7).toInt else 0
))

Now I get data in tuples as below:

(69062,FRANK,ANALYST,5646,Tue Dec 03 00:00:00 IST 1991,3100.0,0.0,2001)
(63679,SANDRINE,CLERK,69062,Tue Dec 18 00:00:00 IST 1990,900.0,0.0,2001)

Now when I execute the command below I get this error:

scala> dateRdd.map(p=>(!(p._5.before("1991")))).foreach(println)
<console>:36: error: type mismatch;
 found   : String("1991")
 required: java.util.Date
       dateRdd.map(p=>(!(p._5.before("1991")))).foreach(println)
                                     ^

So where am I going wrong?


Answer 1:


Since you are working with RDDs rather than DataFrames, and your dates are simple strings in ISO format, the following uncomplicated check works on an RDD:

val rdd = sc.parallelize(Seq(
  (69062, "FRANK", "ANALYST", 5646, "1991-12-03", 3100.00, 2001),
  (63679, "SANDRINE", "CLERK", 69062, "1990-12-18", 900.00, 2001)
))
rdd.filter(p => p._5 < "1991-01-01").foreach(println)
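This works because ISO yyyy-MM-dd strings sort lexicographically in the same order as the dates they represent, so a plain string comparison is enough. If you prefer to keep your original dateRdd of parsed dates, the fix for the error above is to compare a Date against a Date, parsing the cutoff once (a sketch reusing your convertStringToDate helper):

val cutoff = convertStringToDate("1991-01-01")  // java.util.Date cutoff, parsed once
dateRdd.filter(p => p._5.before(cutoff)).foreach(println)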



Answer 2:


No need to convert the date with the legacy SimpleDateFormat API; use java.time. Since the fifth field (index 4) is already in the ISO format that LocalDate.parse expects, you can simply use the RDD step below. Check this out:

val rdd = spark.sparkContext
  .textFile("in\\employees.txt")
  .filter { x =>
    val y = x.split(",")
    java.time.LocalDate.parse(y(4)).isBefore(java.time.LocalDate.parse("1991-01-01"))
  }

Then

rdd.collect.foreach(println)

gives the result below:

63679,SANDRINE,CLERK,69062,1990-12-18,900.00,,2001
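As an aside, if the dates were not in ISO format, LocalDate.parse can be given an explicit formatter. A sketch, assuming a hypothetical dd/MM/yyyy input layout:

import java.time.LocalDate
import java.time.format.DateTimeFormatter

val fmt = DateTimeFormatter.ofPattern("dd/MM/yyyy")  // hypothetical non-ISO layout
val cutoff = LocalDate.parse("1991-01-01")           // ISO strings need no formatter
LocalDate.parse("18/12/1990", fmt).isBefore(cutoff)  // true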

Hope this answers your question.

EDIT1:

Using Java 7 and the SimpleDateFormat API:

import java.util.Date
import java.text.SimpleDateFormat
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object DTCheck {
  def main(args: Array[String]): Unit = {

    // Parse an ISO-style yyyy-MM-dd string into a java.util.Date
    def convertStringToDate(s: String): Date = {
      val dateFormat = new SimpleDateFormat("yyyy-MM-dd")
      dateFormat.parse(s)
    }

    Logger.getLogger("org").setLevel(Level.ERROR)
    val spark = SparkSession.builder().appName("Employee < 1991").master("local[*]").getOrCreate()

    // Parse the cutoff date once; it is captured by the filter closure below.
    val sdf = new SimpleDateFormat("yyyy-MM-dd")
    val dt_1991 = sdf.parse("1991-01-01")

    val rdd = spark.sparkContext
      .textFile("in\\employees.txt")
      .filter { x =>
        val y = x.split(",")
        convertStringToDate(y(4)).before(dt_1991)
      }
    rdd.collect.foreach(println)
  }
}
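One design note on the code above: convertStringToDate builds a fresh SimpleDateFormat on every call, which costs a little but sidesteps the fact that SimpleDateFormat is not thread-safe, while dt_1991 is a plain java.util.Date that is safe to parse once on the driver and capture in the filter closure.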


Source: https://stackoverflow.com/questions/52610767/less-than-comparison-for-date-in-spark-scala-rdd
