Question
We are reading an HBase table with PySpark using the following commands.
from pyspark.sql.types import *
host=<Host Name>
port=<Port Number>
keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"
cmdata_conf = {
    "hbase.zookeeper.property.clientPort": port,
    "hbase.zookeeper.quorum": host,
    "hbase.mapreduce.inputtable": "CMData",
    "hbase.mapreduce.scan.columns": "info:Tenure info:Age"}
cmdata_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter=keyConv,
    valueConverter=valueConv,
    conf=cmdata_conf)
output = cmdata_rdd.collect()
output
I am getting the result below (Key and Age only):
[(u'123', u'5'), (u'234', u'4'), (u'345', u'3'), (u'456', u'4'), (u'567', u'7'), (u'678', u'7'), (u'789', u'8')]
Instead, I am expecting Key, Tenure, and Age. If I request only the Tenure column, it returns Key and Tenure. But if I add more columns, the result always has only the Key and Age columns.
Can anyone help us solve this?
Note: We are new to these tools.
Thank you in advance.
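For context, depending on the Spark version, the example HBaseResultToStringConverter may hand each cell back to Python as a separate record (one record per column, repeating the row key), so multiple columns have to be regrouped by row key on the Python side. This is an assumption about the converter's behavior; check the hbase_inputformat.py example shipped with your Spark distribution. A minimal pure-Python sketch of that regrouping step, using made-up per-cell records:

```python
from collections import OrderedDict

def group_cells_by_row(records):
    """Group (row_key, {column: value}) cell records into one dict per row.

    `records` mimics per-cell converter output: the same row key
    can appear once per column.
    """
    rows = OrderedDict()
    for row_key, cell in records:
        rows.setdefault(row_key, {}).update(cell)
    return list(rows.items())

# Hypothetical per-cell records for two rows of the CMData table.
records = [
    ('123', {'info:Tenure': '10'}),
    ('123', {'info:Age': '5'}),
    ('234', {'info:Tenure': '8'}),
    ('234', {'info:Age': '4'}),
]

grouped = group_cells_by_row(records)
# grouped[0] == ('123', {'info:Tenure': '10', 'info:Age': '5'})
```

On a real RDD the same idea would be expressed with map/groupByKey, but the grouping logic is the part shown here.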
Answer 1:
If you're prototyping and don't want to update your cluster, it can be useful to have a look at happybase (https://happybase.readthedocs.org/en/latest/).
The following code does the trick to get my small (9 GB) HBase table 'name_Hbase_Table' from my cluster in under a second.
import happybase
connection = happybase.Connection(host='your.ip.cluster')  # don't specify :port
table = connection.table('name_Hbase_Table')

def hbaseAccelerationParser(table):  # format the scan output as a list of rows
    finalTable = []
    for key, data in table.scan():  # don't need the key in my case
        line = []
        for value in data.values():
            line.append(value)
        finalTable.append(line)
    return finalTable

data = hbaseAccelerationParser(table)  # capture data in desired format
rdd = sc.parallelize(data, 4)  # put it in an RDD
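To see the shape of what hbaseAccelerationParser returns without a live cluster, here is a sketch using a stand-in table object with a scan() method; the FakeTable class and the sample rows are invented for the demo:

```python
class FakeTable:
    """Minimal stand-in for a happybase Table: scan() yields (key, data) pairs."""
    def __init__(self, rows):
        self.rows = rows

    def scan(self):
        for key, data in self.rows:
            yield key, data

def hbaseAccelerationParser(table):
    # Same idea as in the answer: keep only the cell values, drop the keys.
    finalTable = []
    for key, data in table.scan():
        finalTable.append(list(data.values()))
    return finalTable

demo = FakeTable([
    ('123', {'info:Tenure': '10', 'info:Age': '5'}),
    ('234', {'info:Tenure': '8', 'info:Age': '4'}),
])
parsed = hbaseAccelerationParser(demo)
# parsed == [['10', '5'], ['8', '4']]
```

From there, sc.parallelize(parsed, 4) would give the same RDD as in the answer above.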
Source: https://stackoverflow.com/questions/31009988/rdd-is-having-only-first-column-value-hbase-pyspark