Scan a Hadoop Database table in Spark using indices from an RDD


The difficult part is actually setting up the HBase connector, either Hortonworks' or Huawei's.

But anyway, I think you are asking about the query itself, so I quickly built a toy example using Hive (i.e. creating the HBase table in the HBase shell and then defining an external table over it in Hive).

Then I create a SQL context using the Hive context:

from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
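
For reference, the Hive mapping looked roughly like the DDL below. Treat it as a sketch: the column family name cf1 is an assumption about how the table was created in the HBase shell, and depending on your Spark version the STORED BY statement may have to be run from the Hive CLI rather than through sqlContext.

# sketch of the external table mapping; 'cf1' is an assumed column family
sqlContext.sql("""
    CREATE EXTERNAL TABLE hbase_table_1 (key string, column_1 string)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:column_1")
    TBLPROPERTIES ("hbase.table.name" = "hbase_table_1")
""")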

The full toy table has 3 rows:

df = sqlContext.sql("select * from hbase_table_1")
df.show(3)
+----+--------+
| key|column_1|
+----+--------+
|AAA1|    abcd|
|AAA2|    efgh|
|BBB1|    jklm|
+----+--------+

and to access a subset of the HBase rowkeys:

df = sqlContext.sql("select * from hbase_table_1 where key >= 'AAA' and key < 'BBB'")
df.show(3)
+----+--------+
| key|column_1|
+----+--------+
|AAA1|    abcd|
|AAA2|    efgh|
+----+--------+
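Since the question mentions taking the indices from an RDD: a minimal sketch, assuming the set of key prefixes is small enough to collect to the driver, is to build the predicate from the RDD's contents (the prefixes RDD below is hypothetical):

# hypothetical RDD holding the row-key prefixes to scan
prefixes = sc.parallelize(["AAA"])
# collect() assumes the prefix set is small; build one LIKE clause per prefix
predicate = " or ".join("key like '%s%%'" % p for p in prefixes.distinct().collect())
df = sqlContext.sql("select * from hbase_table_1 where " + predicate)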

For performance you should definitely go with one of the HBase connectors, but once you have one set up (at least for Hortonworks'), the query should be the same.
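
As an illustration, with the Hortonworks connector (shc) the same range filter can be expressed through the DataFrame API. The catalog below is a sketch: the namespace and the cf1 column family are assumptions carried over from the toy example, so check the mapping against your own table.

import json

# sketch of an shc catalog; namespace/table/column-family names are assumptions
catalog = json.dumps({
    "table": {"namespace": "default", "name": "hbase_table_1"},
    "rowkey": "key",
    "columns": {
        "key": {"cf": "rowkey", "col": "key", "type": "string"},
        "column_1": {"cf": "cf1", "col": "column_1", "type": "string"}
    }
})

df = sqlContext.read \
    .options(catalog=catalog) \
    .format("org.apache.spark.sql.execution.datasources.hbase") \
    .load()

# same rowkey range as the SQL version; the connector can push this
# down to an HBase range scan instead of reading the whole table
df.filter((df.key >= "AAA") & (df.key < "BBB")).show()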
