HDFS

How does HDFS manage block size?

徘徊边缘 submitted on 2020-01-04 05:58:34
Question: My file is 65 MB and the default HDFS block size is 64 MB. How many blocks will be allotted to my file: one 64 MB block plus one 1 MB block, or two 64 MB blocks? If it is two 64 MB blocks, is the remaining 63 MB wasted, or will it be allocated to another file?

Answer 1: A block size of 64 MB is an upper bound on the size of a block, not a fixed allocation. A file block smaller than 64 MB does not consume 64 MB; storing a 1 MB chunk consumes only 1 MB. So a 65 MB file is stored as one full 64 MB block plus one 1 MB block. If the file is 160 megabytes, it is stored as two full 64 MB blocks plus one 32 MB block. Hope this helps.
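A quick sketch of the arithmetic (an illustration only, not Hadoop API code; the BlockMath class and its describe helper are made up for this example):

// Illustration: how a file of a given size maps onto HDFS blocks.
public class BlockMath {
    static void describe(long fileBytes, long blockBytes) {
        long fullBlocks = fileBytes / blockBytes;  // blocks filled to the limit
        long tailBytes = fileBytes % blockBytes;   // last, partially filled block
        System.out.printf("%d full %d MB block(s)", fullBlocks, blockBytes >> 20);
        if (tailBytes > 0) System.out.printf(" + one %d MB block", tailBytes >> 20);
        System.out.println();
    }
    public static void main(String[] args) {
        long mb = 1L << 20;
        describe(65 * mb, 64 * mb);   // 1 full 64 MB block + one 1 MB block
        describe(160 * mb, 64 * mb);  // 2 full 64 MB blocks + one 32 MB block
    }
}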

UserGroupInformation: No groups available for user

蓝咒 submitted on 2020-01-04 04:36:24
Question: I am trying to submit a remote MapReduce job, but I get the error [1]. I have even set the content [2] in hdfs-site.xml on the remote Hadoop cluster and changed the permissions [3], but the problem remains. The client user is xeon, and the superuser is xubuntu. How do I give a remote user permission to submit MapReduce jobs? How do I set a group for xeon?

[1] 2015-04-23 05:57:35,648 WARN org.apache.hadoop.security.UserGroupInformation: No groups available for user xeon
[2] <property> <name>dfs.web.ugi</name>
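No answer is excerpted for this entry. A common remedy (my suggestion, not from this thread) is to make the submitting user resolvable to a group on the cluster side: either create a matching Unix user and group on the NameNode host, or map the user statically in core-site.xml:

<!-- Sketch, not from the original question: statically map user "xeon"
     to a group so UserGroupInformation can resolve one. The property name
     and its user=group;user=group value format are standard Hadoop. -->
<property>
  <name>hadoop.user.group.static.mapping.overrides</name>
  <value>xeon=hadoop</value>
</property>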

Can Flume's spool dir be on a remote machine?

烈酒焚心 submitted on 2020-01-04 02:45:06
Question: I am trying to fetch files from a remote machine into my HDFS whenever a new file arrives in a particular folder. I came across Flume's spooling directory source, and it works fine when the spool dir is on the same machine where the Flume agent is running. Is there any way to configure a spool dir on a remote machine? Please help.

Answer 1: You might be aware that Flume can spawn multiple instances, i.e. you can install several Flume agents which pass the data between them. So to
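The answer is cut off, but it is pointing at a two-tier topology: one Flume agent on the remote machine reads the spool dir and forwards events over Avro to a second agent that writes to HDFS. A minimal sketch of the two agent configurations (the hostnames, ports, and paths are placeholders; the property names are standard Flume 1.x):

# Agent a1, on the remote machine: spooldir source -> avro sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /data/incoming          # the watched folder
a1.sources.r1.channels = c1
a1.channels.c1.type = memory
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hdfs-edge.example.com     # host running agent a2
a1.sinks.k1.port = 4141
a1.sinks.k1.channel = c1

# Agent a2, on the HDFS side: avro source -> hdfs sink
a2.sources = r2
a2.channels = c2
a2.sinks = k2
a2.sources.r2.type = avro
a2.sources.r2.bind = 0.0.0.0
a2.sources.r2.port = 4141
a2.sources.r2.channels = c2
a2.channels.c2.type = memory
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://namenode:8020/flume/incoming
a2.sinks.k2.channel = c2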

Hadoop configuration notes

[亡魂溺海] submitted on 2020-01-03 18:24:34
core-site.xml:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://earth</value>
    <final>true</final>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data1/tmp-security</value>
    <final>true</final>
  </property>
  <property>
    <name>ha.zookeeper.quorum</name>
    <value>hadoop-btzk0001.eniot.io:2181,hadoop-btzk0002.eniot.io:2181,hadoop-btzk0003.eniot.io:2181</value>
  </property>
  <property>
    <name>ha.failover-controller.active-standby-elector.zk.op.retries</name>
    <value>120</value>
  </property>
  <property>
    <!-- Interval between disk du runs; du has a significant impact on disk I/O -->
    <name>fs.du.interval</name>
    <value>1200000</value>
  </property>
</configuration>

Hadoop datanode fails to start throwing org.apache.hadoop.hdfs.server.common.Storage: Cannot lock storage

三世轮回 submitted on 2020-01-03 17:14:21
Question: I have some problems trying to start a datanode in Hadoop; from the log I can see that the datanode is started twice (partial log follows):

2012-05-22 16:25:00,369 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting DataNode
STARTUP_MSG:   host = master/192.168.0.1
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 1.0.1
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1
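No answer is excerpted for this entry. For what it is worth (my note, not the thread's), "Cannot lock storage" generally means another process already holds the in_use.lock file inside the dfs.data.dir, most often a DataNode still running from an earlier start. Checking for and stopping the stray process is the usual first step:

jps                                   # look for an already-running DataNode
bin/hadoop-daemon.sh stop datanode    # stop it before starting a new one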

Impala paper notes (1)

倾然丶 夕夏残阳落幕 submitted on 2020-01-03 15:22:23
Not original writing; just an English rendering of someone else's work.

Contents:
Abstract
Introduction
Impala from the user's perspective
Physical schema design
SQL support
Architecture
State distribution
Catalog service

Link to the Impala paper: http://cidrdb.org/cidr2015/Papers/CIDR15_Paper28.pdf

Abstract: Impala is a modern, open-source MPP SQL engine architected from the ground up to process data in a Hadoop environment. Unlike batch frameworks such as Hive, Impala provides low-latency, high-concurrency queries for BI/OLAP workloads on Hadoop. This paper presents Impala's overall architecture and components from a user's perspective and briefly explains Impala's advantages over other SQL-on-Hadoop systems.

Introduction: Impala is an open-source, state-of-the-art MPP SQL engine, highly integrated with Hadoop, highly scalable, and highly flexible. Its goal is to combine SQL support with the multi-user high performance (high concurrency) of a traditional database, on top of Hadoop. Unlike other systems (e.g. those derived from Postgres), Impala is a brand-new engine written in C++ and Java. It achieves Hadoop-like flexibility by integrating with components such as HDFS, HBase, and the Hive Metastore, and it can read data in common storage formats such as Parquet, RCFile, and Avro. To reduce latency, it does not use anything like MapReduce or remote data fetching.

NoSuchMethodError writing Avro object to HDFS using Builder

怎甘沉沦 submitted on 2020-01-03 10:19:32
Question: I'm getting this exception when writing an object to HDFS:

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.avro.Schema$Parser.parse(Ljava/lang/String;[Ljava/lang/String;)Lorg/apache/avro/Schema;
    at com.blah.SomeType.<clinit>(SomeType.java:10)

The line it references in the generated code is this:

public class SomeType extends org.apache.avro.specific.SpecificRecordBase implements org.apache.avro.specific.SpecificRecord {
    public static final org.apache.avro.Schema SCHEMA$
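No answer is excerpted for this entry, but a NoSuchMethodError of this shape usually means the Avro version on the runtime classpath is older than the one the code was generated against: the Schema.Parser.parse(String, String...) varargs overload named in the error does not exist in early Avro releases. A quick diagnostic (standard Java, my suggestion rather than the thread's answer) is to print which jar the Schema class was actually loaded from:

public class WhichAvro {
    public static void main(String[] args) {
        // Prints the jar that Schema came from, to spot a stale Avro
        // version on the classpath (e.g. one bundled with Hadoop).
        System.out.println(org.apache.avro.Schema.class
                .getProtectionDomain().getCodeSource().getLocation());
    }
}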

HDFS: the Hadoop Distributed File System

不羁岁月 submitted on 2020-01-03 05:34:16
Reposted from: https://blog.csdn.net/bingduanlbd/article/details/51914550#t24

1. Introduction

In a modern enterprise environment, a single machine often cannot store the full volume of data, so storage must span machines. A file system that is spread across a cluster but managed as a unit is called a distributed file system. Once a network is introduced into the system, all the complexity of network programming inevitably comes with it; one challenge, for example, is how to guarantee that data is not lost when a node becomes unavailable.

Although the traditional Network File System (NFS) is also called a distributed file system, it has some limitations. Because files in NFS are stored on a single machine, it cannot provide reliability guarantees, and when many clients access the NFS server at once the server is easily overloaded, creating a performance bottleneck. Moreover, to operate on a file in NFS it must first be synchronized locally, and until those modifications are synchronized back to the server they are invisible to other clients. In a sense, NFS is not a typical distributed system, even though its files do live on a remote (single) server. From the NFS protocol stack one can see that it is in fact a VFS implementation (the operating system's abstraction over files).

HDFS, short for Hadoop Distributed File System, is one implementation of Hadoop's abstract file system. The Hadoop abstract file system can also integrate with the local file system, Amazon S3, and others, and can even be operated over a web protocol (webhdfs). HDFS files are distributed across the machines of a cluster, with replicas providing fault tolerance and reliability.
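A small sketch of what that abstraction looks like in practice: the same org.apache.hadoop.fs.FileSystem client API serves HDFS, the local file system, S3, and webhdfs, selected by the URI scheme (standard Hadoop API; the hdfs://namenode:8020 address and the listing of / are placeholders):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The scheme selects the implementation: hdfs://, file://, s3a://, webhdfs://
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
        for (FileStatus st : fs.listStatus(new Path("/"))) {
            System.out.println(st.getPath() + "  " + st.getLen() + " bytes");
        }
        fs.close();
    }
}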