Question
I want to print data of employees who joined before 1991. Below is my sample data:
69062,FRANK,ANALYST,5646,1991-12-03,3100.00,,2001
63679,SANDRINE,CLERK,69062,1990-12-18,900.00,,2001
Initial RDD for loading data:
val rdd=sc.textFile("file:////home/hduser/Desktop/Employees/employees.txt").filter(p=>{p!=null && p.trim.length>0})
Helper function for converting the date string to a java.util.Date:
import java.text.SimpleDateFormat
import java.util.Date

def convertStringToDate(s: String): Date = {
  val dateFormat = new SimpleDateFormat("yyyy-MM-dd")
  dateFormat.parse(s)
}
Mapping each and every column to its datatype:
val dateRdd=rdd.map(_.split(",")).map(p=>(if(p(0).length >0 )p(0).toLong else 0L,p(1),p(2),if(p(3).length > 0)p(3).toLong else 0L,convertStringToDate(p(4)),if(p(5).length >0)p(5).toDouble else 0D,if(p(6).length > 0)p(6).toDouble else 0D,if(p(7).length> 0)p(7).toInt else 0))
Now I get data in tuples as below:
(69062,FRANK,ANALYST,5646,Tue Dec 03 00:00:00 IST 1991,3100.0,0.0,2001)
(63679,SANDRINE,CLERK,69062,Tue Dec 18 00:00:00 IST 1990,900.0,0.0,2001)
Now when I execute the command below, I get the following error:
scala> dateRdd.map(p=>(!(p._5.before("1991")))).foreach(println)
<console>:36: error: type mismatch;
found : String("1991")
required: java.util.Date
dateRdd.map(p=>(!(p._5.before("1991")))).foreach(println)
^
So where am I going wrong?
Answer 1:
Since you are working with RDDs rather than DataFrames, and your dates are ISO-formatted strings (yyyy-MM-dd), a plain string comparison is enough, because such strings sort in chronological order. A simple way to do this on an RDD:
val rdd = sc.parallelize(Seq((69062,"FRANK","ANALYST",5646, "1991-12-03",3100.00,2001),(63679,"SANDRINE","CLERK",69062,"1990-12-18",900.00,2001)))
rdd.filter(p=>(p._5 < "1991-01-01")).foreach(println)
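The string comparison above relies on the dates being zero-padded yyyy-MM-dd. A minimal sketch of that point, assuming a plain Scala Seq as a stand-in for the RDD so that it runs without Spark; the unpadded date is hypothetical data, not taken from the question:

```scala
// Plain Scala collection standing in for the RDD; RDD.filter accepts the same predicate.
val joinDates = Seq("1991-12-03", "1990-12-18")

// Zero-padded yyyy-MM-dd strings sort in the same order as the dates they denote.
val before1991 = joinDates.filter(_ < "1991-01-01")
println(before1991) // List(1990-12-18)

// Hypothetical unpadded date: February 1990 is earlier than 18 December 1990,
// but the character '2' sorts after '1', so the string comparison says otherwise.
val unpaddedLooksEarlier = "1990-2-01" < "1990-12-18"
println(unpaddedLooksEarlier) // false
```

If the input may contain unpadded or differently formatted dates, parsing them into a date type before comparing is the safer route.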
Answer 2:
There is no need to convert the date with the legacy SimpleDateFormat class. Use java.time instead. Since the fifth column (index 4) is already in the ISO format that LocalDate.parse expects, you can simply use the RDD step below:
val rdd=spark.sparkContext.textFile("in\\employees.txt").filter( x => {val y = x.split(","); java.time.LocalDate.parse(y(4)).isBefore(java.time.LocalDate.parse("1991-01-01")) } )
Running
rdd.collect.foreach(println)
gave the following result:
63679,SANDRINE,CLERK,69062,1990-12-18,900.00,,2001
Hope this answers your question.
EDIT1:
Using Java 7 and the SimpleDateFormat class:
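LocalDate.parse throws a DateTimeParseException on input it cannot parse, which would fail the whole job on a single bad row. A minimal sketch of a more defensive predicate, assuming a plain Scala Seq in place of the RDD; the helper name joinedBefore and the malformed row are hypothetical, added only for illustration:

```scala
import java.time.LocalDate
import scala.util.Try

val cutoff = LocalDate.parse("1991-01-01")

// Hypothetical helper: true only when the fifth field parses as an ISO date
// and that date falls strictly before the given limit.
def joinedBefore(line: String, limit: LocalDate): Boolean = {
  val fields = line.split(",", -1) // -1 keeps trailing empty fields
  fields.length > 4 &&
    Try(LocalDate.parse(fields(4))).toOption.exists(_.isBefore(limit))
}

val lines = Seq(
  "69062,FRANK,ANALYST,5646,1991-12-03,3100.00,,2001",
  "63679,SANDRINE,CLERK,69062,1990-12-18,900.00,,2001",
  "00000,BROKEN,CLERK,1,not-a-date,0.00,,2001" // hypothetical malformed row
)

// Rows with an unparseable date are skipped instead of raising an exception.
val matched = lines.filter(joinedBefore(_, cutoff))
matched.foreach(println)
```

The same predicate could be passed to the RDD's filter; LocalDate is Serializable, so capturing the cutoff in the closure should not be a problem.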
import java.util.Date
import java.text.SimpleDateFormat
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark._
import org.apache.spark.sql.types._
import org.apache.spark.sql._
object DTCheck {
  def main(args: Array[String]): Unit = {
    def convertStringToDate(s: String): Date = {
      val dateFormat = new SimpleDateFormat("yyyy-MM-dd")
      dateFormat.parse(s)
    }
    Logger.getLogger("org").setLevel(Level.ERROR)
    val spark = SparkSession.builder().appName("Employee < 1991").master("local[*]").getOrCreate()
    val sdf = new SimpleDateFormat("yyyy-MM-dd")
    val dt_1991 = sdf.parse("1991-01-01")
    import spark.implicits._
    val rdd = spark.sparkContext.textFile("in\\employees.txt").filter { x =>
      val y = x.split(",")
      convertStringToDate(y(4)).before(dt_1991)
    }
    rdd.collect.foreach(println)
  }
}
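The compiler error in the question comes from passing the String "1991" to Date.before, which only accepts a java.util.Date. In addition, map with a Boolean expression yields a collection of Booleans, whereas filter keeps the matching rows. A minimal sketch of the corrected comparison on typed tuples, assuming a plain Scala Seq in place of dateRdd and a tuple shortened to (id, name, joinDate), so the date is _3 here rather than _5:

```scala
import java.text.SimpleDateFormat
import java.util.Date

def convertStringToDate(s: String): Date =
  new SimpleDateFormat("yyyy-MM-dd").parse(s)

// Stand-in for dateRdd, reduced to three fields for brevity (an assumption).
val employees = Seq(
  (69062L, "FRANK", convertStringToDate("1991-12-03")),
  (63679L, "SANDRINE", convertStringToDate("1990-12-18"))
)

// Parse the cutoff once so that before() receives a Date, not a String.
val cutoffDate: Date = convertStringToDate("1991-01-01")

// filter keeps the rows; map would only produce true/false values.
val joinedEarly = employees.filter(_._3.before(cutoffDate))
println(joinedEarly.map(_._2)) // List(SANDRINE)
```

With the full eight-field tuples from the question, the equivalent call would be dateRdd.filter(_._5.before(cutoffDate)).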
Source: https://stackoverflow.com/questions/52610767/less-than-comparison-for-date-in-spark-scala-rdd