Spark RDD find by key


Question


I have an RDD transformed from HBase:

val hbaseRDD: RDD[(String, Array[String])], where tuple._1 is the rowkey and the array holds the values from HBase.

4929101-ACTIVE, ["4929101","2015-05-20 10:02:44","dummy1","dummy2"]
4929102-ACTIVE, ["4929102","2015-05-20 10:02:44","dummy1","dummy2"]
4929103-ACTIVE, ["4929103","2015-05-20 10:02:44","dummy1","dummy2"]

I also have a SchemaRDD (id,date1,col1,col2,col3) transformed to

val refDataRDD: RDD[(String, Array[String])], which I will iterate over, checking whether each key exists in hbaseRDD:

4929103, ["2015-05-21 10:03:44","EV01","col2","col3"]
4929104, ["2015-05-21 10:03:44","EV02","col2","col3"]

Question is,

  • How do I check whether a key (tuple._1, e.g. "4929103") exists in hbaseRDD and get the corresponding values (tuple._2)? I can't use PairRDD's lookup function inside rdd.filter; it throws "scala.MatchError: null", though it works outside:

    val filteredRDD = rdd.filter(sqlRow => {
      val hbaseLookup = hbaseRDD.lookup(sqlRow(0).toString + "-ACTIVE")
      // if found, check if date1 of hbaseRDD < sqlRow(1)
      // else if not found, retain row
      true
    })
    

    I'm not sure that's the actual problem, though, since I also get an NPE when I switch the lookup line to:

    val sqlRowHbase = hbaseRDD.filter(row => {
    

    Note: I am doing an hbaseRDD.count before these lines, and hbaseRDD.lookup works fine outside the rdd.filter.

So basically, I am trying to "find" by key in hbaseRDD and get the row/values. Joining the two RDDs is a little complicated since some values in both may be null, and which row is retained with what data depends on a lot of scenarios.


Answer 1:


Assuming the set of a_id values you need to look up is contained in an RDD, I think you could use a leftOuterJoin instead of iterating and looking up each value.

I saw your comment above regarding the potentially changeable position of date1. I'm not addressing it below, though; I think it should be handled before the lookup itself by some kind of specific mapping of each row.
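Purely as an illustration of what such a per-row mapping might look like (this is an assumption on my part, not something stated in the question), you could normalize each HBase row to (a_id, date1) before the join by recognizing date1 by its timestamp format instead of by position:

// hypothetical sketch: normalize each HBase value array to (a_id, date1),
// assuming a_id is always the first element and date1 is the first element
// that looks like a "yyyy-MM-dd HH:mm:ss" timestamp, whatever its position
val normalizedHbaseRDD = hbaseRDD.map { case (rowkey, values) =>
  val a_id  = values(0)
  val date1 = values
    .find(_.matches("""\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"""))
    .getOrElse("")
  (a_id, date1)
}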

If I read the pseudocode correctly, you have an RDD of (id, date) and want to update it by looking up data in HBase: update the date if a row is found in HBase for this id and its date is earlier than the one in refData. Is that correct?

If so, assuming you have some ref data like this:

val refData = sc.parallelize(Array(
 ("4929103","2015-05-21 10:03:44"),
 ("4929104","2015-05-21 10:03:44")
))

And some row data from Hbase:

val hbaseRDD = sc.parallelize(Array(
    ("4929101-ACTIVE", Array("4929101","2015-05-20 10:02:44")),
    ("4929102-ACTIVE", Array("4929102","2015-05-20 10:02:44")),
    ("4929103-ACTIVE", Array("4929103","2015-05-20 10:02:44"))
))

Then you can look up each id from refData in HBase with a simple leftOuterJoin and, for each row found, update the date if necessary:

refData
  // look up in HBase all rows whose a_id value matches an id in refData
  .leftOuterJoin(hbaseRDD.map { case (rowkey, Array(a_id, date1)) => (a_id, date1) })

  // update the date in refData if the date from HBase is earlier
  .map { case (id, (refDate, maybeRowDate)) => (id, chooseDate(refDate, maybeRowDate)) }
  .collect


def chooseDate(refDate: String, rowDate: Option[String]) = rowDate match {

  // if the row was not found in HBase: keep the ref date
  case None => refDate

  case Some(rDate) =>
    if (true) /* replace this by first parsing the dates, then checking if rDate < refDate */
      rDate
    else
      refDate
}
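For completeness, here is a hedged sketch of what that placeholder comparison could look like, assuming the "yyyy-MM-dd HH:mm:ss" format from the sample data and using java.text.SimpleDateFormat (with this format, plain lexicographic string comparison would also work):

import java.text.SimpleDateFormat

def chooseDate(refDate: String, rowDate: Option[String]): String = {
  // assumption: dates use the "yyyy-MM-dd HH:mm:ss" format shown in the sample data
  val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  rowDate match {
    // row not found in HBase: keep the ref date
    case None => refDate
    // row found: keep the HBase date if it is earlier than the ref date
    case Some(rDate) =>
      if (fmt.parse(rDate).before(fmt.parse(refDate))) rDate else refDate
  }
}

With the sample data above, the collect would then return something like Array(("4929103","2015-05-20 10:02:44"), ("4929104","2015-05-21 10:03:44")): 4929103 picks up the earlier HBase date, while 4929104 keeps its refData date because it has no matching HBase row.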


Source: https://stackoverflow.com/questions/30421484/spark-rdd-find-by-key
