Scan a Hadoop Database table in Spark using indices from an RDD

Submitted by 烈酒焚心 on 2019-12-19 12:51:16

Question


Suppose there is a table (table1) in the database, shown below:

Key   Date  Time  Age
AAA1  XXX   XXX   XXX
AAA2  XXX   XXX   XXX
AAA3  XXX   XXX   XXX
AAA4  XXX   XXX   XXX
AAA5  XXX   XXX   XXX
AAA6  XXX   XXX   XXX
AAA7  XXX   XXX   XXX
AAA8  XXX   XXX   XXX
BBB1  XXX   XXX   XXX
BBB2  XXX   XXX   XXX
BBB3  XXX   XXX   XXX
BBB4  XXX   XXX   XXX
BBB5  XXX   XXX   XXX
CCC1  XXX   XXX   XXX
CCC2  XXX   XXX   XXX
CCC3  XXX   XXX   XXX
CCC4  XXX   XXX   XXX
CCC5  XXX   XXX   XXX
CCC6  XXX   XXX   XXX
CCC7  XXX   XXX   XXX
DDD1  XXX   XXX   XXX
DDD2  XXX   XXX   XXX
DDD3  XXX   XXX   XXX
DDD4  XXX   XXX   XXX
DDD5  XXX   XXX   XXX
DDD6  XXX   XXX   XXX
DDD7  XXX   XXX   XXX

I have a second table, table2, given as:

1 AAA
2 DDD
3 CCC

Since AAA, DDD and CCC appear in table2, I want to scan all rows of table1 whose keys start with these three values, i.e. I want to get AAA1-AAA8, DDD1-DDD7 and CCC1-CCC7. Table1 is a Hadoop database (HBase) table with millions of rows, and I only want to scan it for the rows whose keys match the values I get from table2. Could anyone help me with an efficient way of doing this? I have table2 as an RDD, and table1 is in HBase.


Answer 1:


The difficult part is actually to set up the HBase connector, either the one from Hortonworks or the one from Huawei.

But anyway, I think you are asking about the query itself, so I quickly built a toy example using Hive (i.e. I created the HBase table with the HBase shell and then added a CREATE EXTERNAL TABLE for it in Hive).
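For reference, that setup step can be sketched roughly as below; the column family `cf` and column qualifier `c1` are assumptions for illustration, not details from the original answer:

```sql
-- In the HBase shell (assumed schema with a single column family 'cf'):
--   create 'hbase_table_1', 'cf'

-- In Hive, map that HBase table as an external table so the
-- HiveContext in Spark can query it:
CREATE EXTERNAL TABLE hbase_table_1 (key string, column_1 string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:c1")
TBLPROPERTIES ("hbase.table.name" = "hbase_table_1");
```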

Then I create an SQL context from the Hive context:

from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)

The full toy table has 3 rows:

df = sqlContext.sql("select * from hbase_table_1")
df.show(3)
+----+--------+
| key|column_1|
+----+--------+
|AAA1|    abcd|
|AAA2|    efgh|
|BBB1|    jklm|
+----+--------+

and to access a subset of the HBase rowkeys:

df = sqlContext.sql("select * from hbase_table_1 where key >= 'AAA' and key < 'BBB'")
df.show(3)
+----+--------+
| key|column_1|
+----+--------+
|AAA1|    abcd|
|AAA2|    efgh|
+----+--------+

For performance you should definitely go for one of the HBase connectors, but once you have it set up (at least for Hortonworks') the query should be the same.
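To drive the range scan from the keys in the RDD (table2) instead of hard-coding them, one option is to collect the distinct key prefixes and build the `where` clause programmatically. A minimal sketch; the helper name and the increment of the prefix's last character are my own illustration, not from the original answer:

```python
def prefix_ranges_predicate(prefixes, key_col="key"):
    """Turn key prefixes like 'AAA' into SQL rowkey-range conditions.

    'AAA' matches keys in ['AAA', 'AAB'), relying on the lexicographic
    ordering of HBase rowkeys so each clause becomes a range scan
    rather than a full table scan.
    """
    clauses = []
    for p in sorted(prefixes):
        upper = p[:-1] + chr(ord(p[-1]) + 1)  # next prefix: 'AAA' -> 'AAB'
        clauses.append("({0} >= '{1}' and {0} < '{2}')".format(key_col, p, upper))
    return " or ".join(clauses)

# In practice the prefixes would come from the small RDD, e.g.
#   prefixes = table2_rdd.map(lambda row: row[1]).distinct().collect()
prefixes = ["AAA", "DDD", "CCC"]
predicate = prefix_ranges_predicate(prefixes)
# df = sqlContext.sql("select * from hbase_table_1 where " + predicate)
```

Because table2 is small, collecting it to the driver is cheap, and the resulting query still lets HBase serve only the requested rowkey ranges.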



Source: https://stackoverflow.com/questions/43548858/scan-a-hadoop-database-table-in-spark-using-indices-from-an-rdd
