hbase

Spark Direct Stream Kafka order of events

社会主义新天地 submitted on 2019-12-11 23:46:08
Question: I have a question about reading data from Kafka 0.9 with Spark Direct Streaming (Spark 1.6) and saving it in HBase. I am trying to update specific row keys in an HBase table as they are received from Kafka, and I need to ensure the order of events is kept (data received at t0 must be saved in HBase before data received at t1). The row key represents a UUID, which is also the key of the message in Kafka, so at the Kafka level I am sure that the events corresponding to a specific UUID are ordered.
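The excerpt is cut off, so here is a minimal sketch (not the asker's code) of one common way to preserve per-key ordering with the direct stream: each Kafka partition maps to exactly one Spark partition, so records for a given UUID arrive in offset order and stay ordered as long as each partition is written to HBase sequentially. The broker address, topic, table and column names below are placeholders, and an HBase 1.x client API is assumed.

    import kafka.serializer.StringDecoder
    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object OrderedKafkaToHBase {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("kafka-to-hbase"), Seconds(5))
        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, Map("metadata.broker.list" -> "broker:9092"), Set("events"))

        stream.foreachRDD { rdd =>
          // One Kafka partition == one Spark partition, so iterating a partition
          // sequentially preserves the per-UUID order guaranteed by Kafka.
          rdd.foreachPartition { records =>
            val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
            val table = conn.getTable(TableName.valueOf("events"))
            records.foreach { case (uuid, value) =>
              val put = new Put(Bytes.toBytes(uuid))
              put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("payload"), Bytes.toBytes(value))
              table.put(put)   // synchronous put: the t0 record is persisted before t1 is attempted
            }
            table.close()
            conn.close()
          }
        }
        ssc.start()
        ssc.awaitTermination()
      }
    }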

HBase Cluster: org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative()V

99封情书 submitted on 2019-12-11 22:59:27
Question: I'm new to HBase. I'm running an HBase cluster on 2 machines (1 master on one machine and 1 regionserver on the second). When I start the HBase shell using bin/hbase shell and create a table with create 't1', 'f1', I get the following errors: SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/home/hduser/hbase-0.98.8-hadoop2/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/usr/local

Hadoop HBase Pseudo mode - RegionServer disconnects after some time

淺唱寂寞╮ submitted on 2019-12-11 21:26:47
Question: Please find the attached screenshot of the HBase master log. I have tried all sorts of settings, yet I couldn't overcome this issue. I made sure I don't have 127.0.1.1 in my /etc/hosts. I am using Apache Hadoop 0.20.205.0 and Apache HBase 0.90.6 in pseudo-distributed mode. I am using Nutch 2.2.1 and trying to store crawled data in HBase in pseudo-distributed mode. I am using the bin/crawl all-in-one command. Please help! Answer 1: Try killing the master then start it up again... The dead server state is in memory... hope

Hadoop MapReduce NoSuchElementException

安稳与你 submitted on 2019-12-11 19:54:07
Question: I wanted to run a MapReduce job on my FreeBSD cluster with two nodes, but I get the following exception: 14/08/27 14:23:04 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/08/27 14:23:04 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id 14/08/27 14:23:04 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId= 14/08/27 14:23:04 WARN

Master not running

别说谁变了你拦得住时间么 submitted on 2019-12-11 18:42:59
Question: Hi folks, I am getting this exception in my master logs while running HBase, and HMaster is not running. 2012-05-20 11:54:38,206 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server localhost/23.21.190.123:2181 2012-05-20 11:54:38,236 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to localhost/23.21.190.123:2181, initiating session 2012-05-20 11:54:38,291 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server localhost/23.21.190

Big Data Storage Frameworks: HBase (3) Namespace/Schema

邮差的信 submitted on 2019-12-11 18:12:41
Big Data Storage Frameworks: HBase (3) Namespace/Schema. We all know that HBase has the notion of a namespace, which plays the role of a database in a relational database system and of a schema in Phoenix. This post is mainly about how Phoenix schemas and HBase namespaces relate to each other. First, configure HBase to enable the mapping between HBase namespaces and Phoenix schemas:

    <property>
      <name>phoenix.schema.isNamespaceMappingEnabled</name>
      <value>true</value>
    </property>
    <property>
      <name>phoenix.schema.mapSystemTablesToNamespace</name>
      <value>true</value>
    </property>

Namespace operations in HBase:

    -- create a namespace
    create_namespace 'CI123'
    -- drop a namespace
    drop_namespace 'CI123'
    -- list namespaces
    list_namespace
    -- create a table under the namespace
    create 'CI123:table_name', 'family1'
    -- list the tables under the namespace
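On the Phoenix side, once the two properties above are set on both client and server, a Phoenix schema maps onto the HBase namespace of the same name. A minimal sketch using the Phoenix JDBC driver from Scala, assuming Phoenix 4.8+ and a placeholder ZooKeeper quorum:

    import java.sql.DriverManager

    object PhoenixSchemaDemo {
      def main(args: Array[String]): Unit = {
        // Phoenix thick-driver JDBC URL; the ZooKeeper quorum is a placeholder.
        Class.forName("org.apache.phoenix.jdbc.PhoenixDriver")
        val conn = DriverManager.getConnection("jdbc:phoenix:zkhost:2181")
        val stmt = conn.createStatement()

        // With namespace mapping enabled, this schema maps to the HBase namespace CI123
        // (unquoted identifiers are upper-cased by Phoenix).
        stmt.execute("CREATE SCHEMA IF NOT EXISTS CI123")

        // Tables created under the schema land in the CI123 namespace in HBase;
        // family1.col puts the column into the HBase column family 'family1'.
        stmt.execute("CREATE TABLE IF NOT EXISTS CI123.table_name (id VARCHAR PRIMARY KEY, family1.col VARCHAR)")

        stmt.close()
        conn.close()
      }
    }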

Scanning for row keys between a start and an end time

﹥>﹥吖頭↗ submitted on 2019-12-11 17:27:31
Question: I have an HBase table where the row-key pattern is {id1},{id2},{millisec}. I need to get all the row keys between a start and an end millisec, keeping either id1 or id2 constant. How do I accomplish this in HBase? I am using a Java client. Thanks. Answer 1: a. For a known {id1} you have to perform a scan and provide the start & stop rows. Take a look at this example extracted from the HBase reference guide: public static final byte[] CF = "cf".getBytes(); public static final byte[] ATTR = "attr".getBytes(); ...
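The answer is cut off, so here is a minimal sketch of the start/stop-row approach for the {id1},{id2},{millisec} key, assuming an HBase 1.x client and that the key is stored as the literal comma-separated string; the table name and id values are placeholders. With id1 and id2 fixed, the matching rows are contiguous, so a start/stop row pair is enough and no filter is needed.

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
    import org.apache.hadoop.hbase.util.Bytes
    import scala.collection.JavaConverters._

    object TimeRangeRowKeyScan {
      def main(args: Array[String]): Unit = {
        val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
        val table = conn.getTable(TableName.valueOf("t1"))

        val (id1, id2)       = ("id1", "id2")
        val (startMs, endMs) = (1500000000000L, 1500003600000L)

        // The start row is inclusive and the stop row is exclusive, so bump the end by one.
        val scan = new Scan()
          .setStartRow(Bytes.toBytes(s"$id1,$id2,$startMs"))
          .setStopRow(Bytes.toBytes(s"$id1,$id2,${endMs + 1}"))

        val scanner = table.getScanner(scan)
        scanner.asScala.foreach(result => println(Bytes.toString(result.getRow)))

        scanner.close()
        table.close()
        conn.close()
      }
    }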

Why is the RDD always empty during real-time Kafka data ingestion into HBase via PySpark?

眉间皱痕 submitted on 2019-12-11 17:19:02
Question: I am trying to do real-time Kafka data ingestion into HBase via PySpark, following this tutorial. Everything seems to be working fine. I start Kafka with sudo /usr/local/kafka/bin/kafka-server-start.sh /usr/local/kafka/config/server.properties, then I run the producer /usr/local/kafka/bin/kafka-console-producer.sh --broker-list=myserver:9092 --topic test. Then I run the source code shown below. I send messages in the producer, however rdd.isEmpty() is always true, so I never reach the line with print("=some

Using Apache Phoenix and Spark to save a CSV in HBase #spark2.2 #intelliJIdea

五迷三道 submitted on 2019-12-11 17:10:31
Question: I have been trying to load data from a CSV using Spark and write it to HBase. I can do it easily in Spark 1.6 but not in Spark 2.2. I have tried multiple approaches, and ultimately everything leads me to the same error with Spark 2.2: Exception in thread "main" java.lang.IllegalArgumentException: Can not create a Path from an empty string Any idea why this is happening? Sharing a code snippet: def main(args : Array[String]) { val spark = SparkSession.builder .appName("PhoenixSpark
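The snippet is truncated, so here is a minimal sketch of the CSV-to-HBase path via the phoenix-spark connector on Spark 2.x, assuming the connector jar is on the classpath; the input path, table name and zkUrl are placeholders rather than the asker's values, and the Phoenix table must already exist.

    import org.apache.spark.sql.{SaveMode, SparkSession}

    object PhoenixSparkCsv {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder
          .appName("PhoenixSpark")
          .getOrCreate()

        // Read the CSV; header/inferSchema options depend on the actual file.
        val df = spark.read
          .option("header", "true")
          .csv("/data/input.csv")               // placeholder input path

        // Write through the phoenix-spark connector; it requires SaveMode.Overwrite
        // and an existing Phoenix table whose columns match the DataFrame.
        df.write
          .format("org.apache.phoenix.spark")
          .mode(SaveMode.Overwrite)
          .option("table", "MY_TABLE")          // placeholder Phoenix table
          .option("zkUrl", "zkhost:2181")       // placeholder ZooKeeper quorum
          .save()

        spark.stop()
      }
    }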

A brief look at my path to transitioning into big data

微笑、不失礼 submitted on 2019-12-11 17:01:20
I. Background
I am currently a big data engineer. The project I work on holds about 50 TB of data, growing by roughly 20 GB per day. I started out as a Java back-end developer and, after three months of part-time self-study, successfully moved into big data engineering.
II. An introduction to big data
Big data is still data in essence, but it has new characteristics: data comes from a wide range of sources; formats are diverse (structured data, unstructured data, Excel files, text files, and so on); volumes are large (at least TB scale, possibly PB scale); and the data grows quickly.
Given these four main characteristics, we need to think about the following questions:
- With data coming from so many sources, how do we collect and aggregate it? Tools such as Sqoop, Camel and DataX appeared for this.
- Once the data is collected, how do we store it? Distributed file systems such as GFS, HDFS and TFS appeared for this.
- Because the data grows quickly, storage must be able to scale horizontally.
- Once the data is stored, how do we quickly transform it into a consistent format and quickly compute the results we want? Distributed computing frameworks such as MapReduce solved this; but writing MapReduce takes a lot of Java code, so engines such as Hive and Pig appeared to translate SQL into MapReduce.
- Plain MapReduce can only process data batch by batch, with too much latency; to get a result for every record as it arrives, low-latency stream-processing frameworks such as Storm/JStorm appeared.
- But if you need both batch and stream processing, the approach above means running two clusters: a Hadoop cluster (including HDFS + MapReduce + Yarn