RDD is having only first column value : Hbase, PySpark

Submitted by 佐手 on 2019-12-10 10:36:01

Question


We are reading an HBase table with PySpark using the following commands.

from pyspark.sql.types import *
host=<Host Name>
port=<Port Number>

keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"

cmdata_conf = {
    "hbase.zookeeper.property.clientPort": port,
    "hbase.zookeeper.quorum": host,
    "hbase.mapreduce.inputtable": "CMData",
    "hbase.mapreduce.scan.columns": "info:Tenure info:Age",
}

cmdata_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter=keyConv,
    valueConverter=valueConv,
    conf=cmdata_conf,
)

output = cmdata_rdd.collect()

output

I am getting the result below (Key and Age only):

[(u'123', u'5'), (u'234', u'4'), (u'345', u'3'), (u'456', u'4'), (u'567', u'7'), (u'678', u'7'), (u'789', u'8')]

Instead, I am expecting Key, Tenure, and Age. If I request only the Tenure column, it returns Key and Tenure. But if I add more columns, the result always contains only the Key and Age columns.

Can anyone help us solve this?

Note: We are new to these tools.

Thank you in advance.


Answer 1:


If you're prototyping and don't want to update your cluster, it can be useful to have a look at happybase (https://happybase.readthedocs.org/en/latest/).

The following code does the trick: it fetches my small (9 GB) HBase table 'name_Hbase_Table' from my cluster in under a second.

import happybase

connection = happybase.Connection(host='your.ip.cluster')  # don't specify :port
table = connection.table('name_Hbase_Table')

def hbaseAccelerationParser(table):  # helper to format the scanned data
    finalTable = []
    for key, data in table.scan():  # don't need the key in my case
        line = []
        for value in data.values():  # data is a dict of {column: value}
            line.append(value)
        finalTable.append(line)
    return finalTable

data = hbaseAccelerationParser(table)  # capture data in desired format
rdd = sc.parallelize(data, 4)  # put it in an RDD
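Since the parser above just flattens each row's column dictionary into a list, you can sanity-check that logic without a live cluster. Below is a minimal sketch under my own assumptions: `fake_scan` is a hypothetical stand-in for happybase's `Table.scan()`, which yields `(row_key, {column: value})` pairs, and `parse_rows` is my name for the flattening step.

```python
# Minimal sketch: verify the row-flattening logic without a live HBase cluster.
# fake_scan is a hypothetical stand-in for happybase's Table.scan(), which
# yields (row_key, {column: value}) pairs of bytes.

def fake_scan():
    yield (b'123', {b'info:Tenure': b'12', b'info:Age': b'5'})
    yield (b'234', {b'info:Tenure': b'8', b'info:Age': b'4'})

def parse_rows(scan_iter, keep_key=True):
    """Flatten each scanned row into a flat list of cell values."""
    rows = []
    for key, data in scan_iter:
        # Sort by column qualifier so column order is deterministic;
        # a raw dict iteration order is not something to rely on here.
        line = [data[col] for col in sorted(data)]
        if keep_key:
            line.insert(0, key)
        rows.append(line)
    return rows

rows = parse_rows(fake_scan())
# rows == [[b'123', b'5', b'12'], [b'234', b'4', b'8']]
```

With rows in this shape, every record carries all requested columns, so `sc.parallelize(rows, 4)` gives an RDD where Tenure and Age both survive, unlike the converter-based RDD in the question.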


Source: https://stackoverflow.com/questions/31009988/rdd-is-having-only-first-column-value-hbase-pyspark
