HDFS | 易学教程

Finding total number of lines in hdfs distributed file using command line

Question: I am working on a cluster where a dataset is stored in HDFS in a distributed manner. Here is what I have:

    [hmi@bdadev-5 ~]$ hadoop fs -ls /bdatest/clm/data/
    Found 1840 items
    -rw-r--r-- 3 bda supergroup 0 2015-08-11 00:32 /bdatest/clm/data/_SUCCESS
    -rw-r--r-- 3 bda supergroup 34404390 2015-08-11 00:32 /bdatest/clm/data/part-00000
    -rw-r--r-- 3 bda supergroup 34404062 2015-08-11 00:32 /bdatest/clm/data/part-00001
    -rw-r--r-- 3 bda supergroup 34404259 2015-08-11 00:32 /bdatest/clm/data/part-00002
    ....

How to move or copy file in HDFS by using JAVA API

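Question: I want to copy a file within the same HDFS, for example from HDFS://abc:9000/user/a.txt to HDFS://abc:9000/user/123/. Can I do that using the Java API? Thanks.

Answer 1: FileUtil provides a method for copying files.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    // Point the client at the cluster's NameNode.
    Configuration configuration = new Configuration();
    configuration.set("fs.defaultFS", "hdfs://abc:9000");
    FileSystem filesystem = FileSystem.get(configuration);
    // Source and destination use the same FileSystem; false means the source is not deleted.
    FileUtil.copy(filesystem, new Path("src/path"), filesystem, new Path("dst/path"), false, configuration);

If you need to copy it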

Hadoop HDFS - Cannot connect to port on master

Question: I've set up a small Hadoop cluster for testing. Setup went fairly well with the NameNode (1 machine), SecondaryNameNode (1) and all DataNodes (3). The machines are named "master", "secondary" and "data01", "data02" and "data03". DNS is properly set up for all of them, and passwordless SSH was configured from master/secondary to all machines and back. I formatted the cluster with bin/hadoop namenode -format, and then started all services using bin/start-all.sh. All processes on all nodes were checked

Fully distributed Hadoop installation

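This post records the process of installing Hadoop in fully distributed mode. The Hadoop version used is the Apache release, not CDH.

[Figure: fully distributed layout]

Fully distributed Hadoop is installed below on three nodes, with each server hosting several Hadoop-related roles. This is the result of compressing the deployment onto three machines; normally at least 13 service nodes would be used.

(1) ZooKeeper manages the NameNodes and handles failover between active and standby. It performs the NameNode switchover through the failoverController process.
(2) The active and standby NameNodes communicate through JournalNodes to synchronize data.
(3) There are also two ResourceManagers; if one fails, the other takes over.
(4) DataNodes store the data. Because MapReduce computation follows a data-locality strategy, a NodeManager is usually placed on the same machine as a DataNode.

The above is the node distribution after installation. Installation and deployment start below.

Prerequisites

The prerequisites are disabling the Linux firewall, changing the hostnames, setting up IP mapping, configuring the JDK, and setting up passwordless login; see https://www.cnblogs.com/youngchaolin/p/11992600.html for reference. The hostnames used here are hadoop01, hadoop02, and hadoop03. IP mapping requires editing /etc/hosts and adding the mapping between the three IPs and the node names. All three machines need these steps. Passwordless login is the step most likely to go wrong, so it is recorded below.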

Spark

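Apache Spark

What is Spark? Spark is a "Lightning-fast unified analytics engine" (it does not take part in data persistence).

Fast:
(1) Spark is an in-memory compute engine, so it is naturally faster than MapReduce's disk-based computation. This is the common understanding.
(2) Spark uses an advanced DAG (directed acyclic graph) computation model that splits a complex task into several stages, so Spark needs only one job to complete a complex task. (With the MapReduce model, several jobs might need to be chained together.)
(3) Besides dividing a task into stages through the DAG, Spark can cache the data computed in a stage, which greatly improves computational efficiency and fault tolerance.

Unified: Spark unifies the common kinds of big data computation: batch processing (replacing MapReduce), stream processing (replacing Storm), unified SQL (replacing Hive), machine learning (replacing Mahout, which was built on MapReduce), and graph-structured data with GraphX (replacing functionality of early Neo4j).

Spark vs. Hadoop: Spark was created only to replace Hadoop's early MapReduce compute engine. Spark has no storage solution of its own; in the Spark architecture, the underlying storage still relies on Hadoop's HDFS/HBase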

HBase architecture

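- Client: the HBase client.
  1. Contains the interfaces for accessing HBase, such as the Linux shell and the Java API.
  2. Maintains a cache to speed up access to HBase, for example region location information.
- Zookeeper:
  1. Monitors the state of HMaster and guarantees that exactly one HMaster is active, providing high availability.
  2. Stores the addressing entry point for all regions, such as which server hosts the root table.
  3. Monitors HRegionServer status in real time, detects HRegionServers going online or offline, and notifies HMaster.
  4. Stores part of HBase's metadata.
- HMaster:
  1. Assigns regions to HRegionServers (for example when a table is created).
  2. Handles load balancing across HRegionServers.
  3. Handles region reassignment (after an HRegionServer goes down, and after an HRegion split when a region grows too large).
  4. Performs garbage collection on HDFS.
  5. Handles schema update requests.
- HRegionServer:
  1. Maintains the regions assigned to it by HMaster (manages the regions on its own machine).
  2. Handles client read and write requests for these regions and interacts with HDFS.
  3. Splits regions that grow too large during operation.
- HLog:
  1. Records the operations performed on HBase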
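The client-side cache of region locations mentioned above can be pictured with a small toy model. The sketch below is not HBase's client code: the class name, the use of String row keys (HBase uses byte arrays), and the server names are all assumptions made for illustration. It only shows the idea that, once region start keys are cached, a row key can be mapped to a server locally without asking ZooKeeper or the meta table again.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model of a client-side region location cache (hypothetical, not the HBase API).
class RegionLocationCacheSketch {
    // Maps the start row key of each region to the server hosting it.
    private final TreeMap<String, String> startKeyToServer = new TreeMap<>();

    // Remember that the region starting at startKey lives on the given server.
    void put(String startKey, String server) {
        startKeyToServer.put(startKey, server);
    }

    // The region holding rowKey is the one with the greatest start key <= rowKey.
    // Returns null on a cache miss, where a real client would do a remote lookup.
    String locate(String rowKey) {
        Map.Entry<String, String> entry = startKeyToServer.floorEntry(rowKey);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        RegionLocationCacheSketch cache = new RegionLocationCacheSketch();
        cache.put("", "regionserver-1");   // assumed region covering keys before "m"
        cache.put("m", "regionserver-2");  // assumed region covering keys from "m" on
        System.out.println(cache.locate("apple"));
        System.out.println(cache.locate("zebra"));
    }
}
```

A sorted map is used here because a lookup by row key is a "nearest start key at or below" query, which `TreeMap.floorEntry` answers directly.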

Source code analysis of the HDFS DataNode

1. Overall flow

DataNode.main() // entry point
    |-- secureMain(args, null);
        |-- createDataNode(args, null, resources); // create the DataNode
            |-- instantiateDataNode(args, conf, resources);
                |-- getStorageLocations(conf); // read the local paths where HDFS blocks are actually stored, i.e. the dfs.datanode.data.dir property in hdfs-site.xml
                |-- UserGroupInformation.setConfiguration(conf); // apply the configuration
                |-- makeInstance(dataLocations, conf, resources); // instantiate the DataNode
            |-- dn.runDatanodeDaemon(); // start the DataNode's background daemon threads
        |-- datanode.join(); // wait for the DataNode to shut down

2. The makeInstance(dataLocations, conf, resources) method in detail

1. List<StorageLocation>

HDFS read and write flow

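1. Write flow

Detailed flow:

Creating the file:
- To write data to HDFS, the HDFS client first calls DistributedFileSystem.create(), which creates a new empty file in HDFS.
- An RPC (ClientProtocol.create()) remotely invokes create() on the NameNode (NameNodeRpcServer). It first adds the new file at the given path in the HDFS directory tree, then records the file-creation operation in the edits log.
- After NameNode.create finishes, DistributedFileSystem.create() returns an FSDataOutputStream, which is essentially a wrapper around a DFSOutputStream object.

Establishing the data pipeline:
- The client calls DFSOutputStream.write() to write data.
- DFSOutputStream calls ClientProtocol.addBlock() to request a new empty block from the NameNode.
- addBlock() returns a LocatedBlock object containing the locations of all DataNodes for the current block.
- The data pipeline is established from this location information.

Writing the current block's data into the pipeline:
- The client writes data into the pipeline by first filling a chunk of 512 bytes. Once the chunk is full, its checksum (4 bytes) is computed, and then the chunk data itself plus the checksum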
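The chunk-and-checksum step can be illustrated with a small standalone sketch. This is not the DFSOutputStream implementation: the class name and the interleaved chunk-then-checksum layout are simplifications chosen for illustration, and `java.util.zip.CRC32` is an assumed stand-in for whichever checksum type the cluster is configured to use. Only the 512-byte chunk size and the 4-byte checksum size come from the text above.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Toy model of per-chunk checksumming (hypothetical, not Hadoop code).
class ChunkChecksumSketch {
    static final int CHUNK_SIZE = 512;   // data bytes per chunk, as stated in the text
    static final int CHECKSUM_SIZE = 4;  // checksum bytes per chunk, as stated in the text

    // Returns the input with a 4-byte checksum appended after every chunk.
    // The last chunk may be shorter than CHUNK_SIZE.
    static byte[] withChecksums(byte[] data) {
        int chunkCount = (data.length + CHUNK_SIZE - 1) / CHUNK_SIZE;
        ByteBuffer out = ByteBuffer.allocate(data.length + chunkCount * CHECKSUM_SIZE);
        for (int offset = 0; offset < data.length; offset += CHUNK_SIZE) {
            int length = Math.min(CHUNK_SIZE, data.length - offset);
            CRC32 crc = new CRC32();  // assumed algorithm, see lead-in
            crc.update(data, offset, length);
            out.put(data, offset, length);
            out.putInt((int) crc.getValue());  // low 32 bits hold the CRC
        }
        return out.array();
    }

    public static void main(String[] args) {
        byte[] data = new byte[1300];  // two full chunks plus one partial chunk
        byte[] framed = withChecksums(data);
        System.out.println("input bytes: " + data.length + ", framed bytes: " + framed.length);
    }
}
```

With these sizes the checksum overhead is 4 bytes per 512 bytes of data, a little under 1%.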

Using the Parquet file format with Impala tables

Impala helps you create, manage, and query Parquet tables. Parquet is a column-oriented binary file format intended to be highly efficient for the types of large-scale queries that Impala is best at. Parquet is especially good for queries scanning particular columns within a table, for example to query "wide" tables with many columns, or to perform aggregation operations such as SUM() and AVG() that need to process most or all of

Large Block Size in HDFS! How is the unused space accounted for?

Question: We all know that the block size in HDFS is quite large (64 MB or 128 MB) compared to the block size in traditional file systems. This is done to reduce seek time as a percentage of transfer time. Transfer rates have improved far more than disk seek times, so the goal when designing a file system is always to reduce the number of seeks relative to the amount of data transferred. But this comes with an
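The seek-versus-transfer argument can be made concrete with a little arithmetic. The sketch below assumes one seek per block, a 10 ms seek time, and a 100 MB/s transfer rate; these figures, the 4 KB comparison block, and the class name are illustrative assumptions, not values from the question.

```java
// Back-of-the-envelope model: share of read time spent seeking, one seek per block.
class SeekOverheadSketch {
    // seekMs: time for one seek; transferMBps: sustained transfer rate; blockMB: block size.
    static double seekFraction(double seekMs, double transferMBps, double blockMB) {
        double transferMs = blockMB / transferMBps * 1000.0;
        return seekMs / (seekMs + transferMs);
    }

    public static void main(String[] args) {
        double seekMs = 10.0;         // assumed disk seek time
        double transferMBps = 100.0;  // assumed transfer rate
        double[] blockSizesMB = {0.004, 64.0, 128.0};  // 4 KB, 64 MB, 128 MB
        for (double blockMB : blockSizesMB) {
            double percent = 100.0 * seekFraction(seekMs, transferMBps, blockMB);
            System.out.printf("block %.3f MB: %.2f%% of time spent seeking%n", blockMB, percent);
        }
    }
}
```

Under these assumptions a 64 MB block takes 640 ms to transfer, so seeking is roughly 1.5% of the total, while a 4 KB block spends almost all of its time seeking.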
