Big Data Hadoop
HDFS
1 HDFS
1.1 Concept
HDFS, short for Hadoop Distributed File System, is used to store files and locates them through a directory tree. It is distributed: many servers work together to provide its functionality, and each server in the cluster plays its own role.
1.2 Architecture Components
(1) The HDFS architecture consists of the NameNode, the DataNodes, and the Secondary NameNode.
(2) NameNode: the master, which manages the file system namespace (metadata) and handles client requests for file access.
(3) DataNode: the worker nodes; each DataNode stores the actual data blocks of the files.
(4) Secondary NameNode: an auxiliary daemon used to monitor the state of HDFS; at regular intervals it takes snapshots of the HDFS metadata.
1.3 HDFS Block Size
Files in HDFS are physically stored in blocks. The block size is controlled by the configuration parameter dfs.blocksize; the default is 128 MB in Hadoop 2.x and 64 MB in older versions.
HDFS blocks are larger than disk blocks in order to minimize seek overhead: if a block is large enough, the time to transfer the data from disk is clearly longer than the time needed to locate the start of the block.
If the seek time is about 10 ms and the transfer rate is 100 MB/s, then for the seek time to be only 1% of the transfer time the block should be about 100 MB; the default block size of 128 MB is close to this.
Block size ≈ 10 ms × 100 × 100 MB/s = 1 s × 100 MB/s = 100 MB
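If a different block size is needed, it can be overridden in hdfs-site.xml. A minimal sketch (the value 134217728 bytes equals 128 MB; adjust to your workload):
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>
</property>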
2 HDFS Command Line Operations
(1) Basic syntax
bin/hadoop fs <specific command>
(2) Command options
Running bin/hadoop fs with no arguments prints the full list of options, which begins with:
[-appendToFile <localsrc> ... <dst>]
…
(3) Common commands in practice
(1) -help: show the help for a command
bin/hdfs dfs -help rm
(2) -ls: show directory information
hadoop fs -ls /
(3) -mkdir: create a directory on HDFS
(4) -moveFromLocal: move a file from the local file system to HDFS
hadoop fs -moveFromLocal <local path> /<hdfs path>
(5) -appendToFile: append a file to the end of an existing file
hadoop fs -appendToFile <local path> /<hdfs path>
(6) -cat: show the contents of a file
hadoop fs -cat /<hdfs file>
(7) -tail: show the end of a file
(8) -chmod, -chown: same usage as in the Linux file system; change a file's permissions or owner
(9) -cp: copy a file from one HDFS path to another HDFS path
hadoop fs -cp /<src hdfs path> /<dst hdfs path>
(10) -mv: move files within HDFS
(11) -get: equivalent to copyToLocal; download a file from HDFS to the local file system
(12) -getmerge: merge and download multiple files; for example, the HDFS directory /aaa/ contains several files (log.1, log.2, ...), which are merged into one local file:
hadoop fs -getmerge /aaa/log.* ./log.sum
Merging files from a different directory: hadoop fs -getmerge /<hdfs dir>/* ./<local file>
(13) -put: equivalent to copyFromLocal; upload a local file to HDFS
(14) -rm: delete a file or folder
hadoop fs -rm -r /<hdfs path>
(15) -df: show free space statistics of the file system
(16) -du: show the size of files and folders
(17) -count: count the number of directories, files, and bytes under a path
hadoop fs -count /aaa/
(18) -setrep: set the replication factor of a file in HDFS (for example 3)
hadoop fs -setrep 3 /<hdfs file>
The replication factor set here is only recorded in the NameNode's metadata; whether that many replicas actually exist depends on the number of DataNodes. With only 3 DataNodes there can be at most 3 replicas; only when the number of nodes grows to 10 can a replication factor of 10 be reached.
3 HDFS Client Operations
3.1 IDEA Environment Setup
Configure ${MAVEN_HOME}/conf/settings.xml:
<!-- Local repository where downloaded jar packages are stored -->
<localRepository>F:\m2\repository</localRepository>

<!-- Compile with JDK 8 by default -->
<profiles>
    <profile>
        <id>jdk-1.8</id>
        <activation>
            <activeByDefault>true</activeByDefault>
            <jdk>1.8</jdk>
        </activation>
        <properties>
            <maven.compiler.source>1.8</maven.compiler.source>
            <maven.compiler.target>1.8</maven.compiler.target>
            <maven.compiler.compilerVersion>1.8</maven.compiler.compilerVersion>
        </properties>
    </profile>
</profiles>
3.1.0 Maven Preparation
3.1.1 Maven Dependencies (pom.xml)
Add the Hadoop client dependencies to the project's pom.xml.
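A typical dependency set for this client project might look like the following sketch; the versions match the Hadoop 2.7.2 used throughout these notes and are an assumption to adjust as needed:
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.7.2</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
    </dependency>
</dependencies>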
3.1.2 Creating the client project in IDEA
(1) Configure the HADOOP_HOME environment variable.
(2) Use the bin and lib folders from a compiled Hadoop distribution (if the change does not take effect, restart IDEA).
(3) Create a Java project and write the following code:
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Class name is illustrative
public class HdfsClientDemo {

    public static void main(String[] args) throws Exception {
        // 1 Get the file system
        Configuration configuration = new Configuration();
        configuration.set("fs.defaultFS", "hdfs://bigdata111:9000");
        FileSystem fileSystem = FileSystem.get(configuration);

        // Alternatively, pass the URI and the user name directly:
        // FileSystem fileSystem = FileSystem.get(new URI("hdfs://bigdata111:9000"), configuration, "itstar");

        // 2 Upload a local file to HDFS (paths are examples)
        fileSystem.copyFromLocalFile(new Path("e:/hello.txt"), new Path("/user/itstar/hello.txt"));

        // 3 Close the resource
        fileSystem.close();
        System.out.println("over");
    }
}
(4) Run the program.
Note: when the client is run from Eclipse or IDEA, it may fail with a permission error, because the client operates on HDFS under some user identity. By default the HDFS client API takes the user identity from a JVM parameter: -DHADOOP_USER_NAME=itstar, where itstar is the user name.
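For example, the user identity can be supplied in either of two ways (a sketch; the user itstar and the URI follow the examples above):
// Option 1: pass the user as a JVM option when launching the client
//   -DHADOOP_USER_NAME=itstar
// Option 2: pass the user explicitly when obtaining the FileSystem
FileSystem fs = FileSystem.get(new URI("hdfs://bigdata111:9000"), new Configuration(), "itstar");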
3.2 Operating HDFS Through the API
3.2.1 Getting the file system object from HDFS
(1) Detailed code
@Test
public void getFileSystem() throws Exception {
    // 1 Create the configuration object
    Configuration configuration = new Configuration();
    // 2 Get the file system
    FileSystem fs = FileSystem.get(configuration);
    // 3 Print the file system object
    System.out.println(fs.toString());
}
3.2.2 Uploading a file to HDFS
@Test
public void putFileToHDFS() throws Exception {
    // 1 Create the configuration object
    //   new Configuration() first loads hdfs-default.xml from the jar,
    //   then hdfs-site.xml from the classpath
    Configuration configuration = new Configuration();

    // 2 Set the replication factor in code; priority:
    //   (1) values set in code > (2) user-defined config files on the classpath > (3) server defaults
    configuration.set("dfs.replication", "2");
    FileSystem fs = FileSystem.get(new URI("hdfs://bigdata111:9000"), configuration, "itstar");

    // 3 Local path of the file to upload (example path)
    Path src = new Path("e:/hello.txt");

    // 4 Destination path on HDFS (example path)
    Path dst = new Path("/user/itstar/hello.txt");

    // 5 Upload the file
    fs.copyFromLocalFile(src, dst);
    fs.close();
}
(2) Test the parameter priority
Parameter priority: (1) values set in the client code > (2) user-defined configuration files on the classpath > (3) the server's default configuration.
3.2.3 Downloading a file from HDFS
@Test
public void getFileFromHDFS() throws Exception {
    // 1 Create the configuration object and get the file system
    Configuration configuration = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://bigdata111:9000"), configuration, "itstar");

    // Simple form:
    // fs.copyToLocalFile(new Path("hdfs://bigdata111:9000/user/itstar/hello.txt"), new Path("d:/hello.txt"));

    // Full form; parameters:
    //   boolean delSrc                 whether to delete the source file
    //   Path src                       the HDFS path to download
    //   Path dst                       the local destination path
    //   boolean useRawLocalFileSystem  whether to use the raw local file system (no .crc checksum file)
    fs.copyToLocalFile(false, new Path("hdfs://bigdata111:9000/user/itstar/hello.txt"),
            new Path("d:/hellocopy.txt"), true);
    fs.close();
}
3.2.4 Creating a directory on HDFS
@Test
public void mkdirAtHDFS() throws Exception {
    // 1 Create the configuration object and get the file system
    Configuration configuration = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://bigdata111:9000"), configuration, "itstar");

    // 2 Create the directory (example path)
    fs.mkdirs(new Path("hdfs://bigdata111:9000/user/itstar/output"));
    fs.close();
}
3.2.5 Deleting a folder on HDFS
@Test
public void deleteAtHDFS() throws Exception {
    // 1 Create the configuration object and get the file system
    Configuration configuration = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://bigdata111:9000"), configuration, "itstar");

    // 2 Delete the folder; the second parameter means recursive delete (true = recursive); example path
    fs.delete(new Path("hdfs://bigdata111:9000/user/itstar/output"), true);
    fs.close();
}
3.2.6 Renaming a file on HDFS
@Test
public void renameAtHDFS() throws Exception {
    // 1 Create the configuration object and get the file system
    Configuration configuration = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://bigdata111:9000"), configuration, "itstar");

    // 2 Rename the file (example paths)
    fs.rename(new Path("hdfs://bigdata111:9000/user/itstar/hello.txt"),
              new Path("hdfs://bigdata111:9000/user/itstar/hellorename.txt"));
    fs.close();
}
3.2.7 Viewing file details on HDFS
View the file name, permissions, length, and block information.
@Test
public void readListFiles() throws Exception {
    // 1 Create the configuration object and get the file system
    Configuration configuration = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://bigdata111:9000"), configuration, "itstar");

    // 2 Iterate over all files under the root path recursively
    RemoteIterator<LocatedFileStatus> listFiles = fs.listFiles(new Path("/"), true);

    while (listFiles.hasNext()) {
        LocatedFileStatus fileStatus = listFiles.next();

        System.out.println(fileStatus.getPath().getName());
        System.out.println(fileStatus.getBlockSize());
        System.out.println(fileStatus.getPermission());
        System.out.println(fileStatus.getLen());

        BlockLocation[] blockLocations = fileStatus.getBlockLocations();
        for (BlockLocation bl : blockLocations) {
            System.out.println("block-offset:" + bl.getOffset());
            String[] hosts = bl.getHosts();
            for (String host : hosts) {
                System.out.println(host);
            }
        }
        System.out.println("--------------Andy--------------");
    }
}
3.2.8 Determining whether a path on HDFS is a file or a folder
@Test
public void findAtHDFS() throws Exception, IllegalArgumentException, IOException {
    // 1 Create the configuration object and get the file system
    Configuration configuration = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://bigdata111:9000"), configuration, "itstar");

    // 2 Get the status of all entries under the root directory
    FileStatus[] listStatus = fs.listStatus(new Path("/"));

    // 3 Decide whether each entry is a file or a folder
    for (FileStatus status : listStatus) {
        if (status.isFile()) {
            System.out.println("f--" + status.getPath().getName());
        } else {
            System.out.println("d--" + status.getPath().getName());
        }
    }
}
3.3 Operating HDFS with IO Streams
3.3.1 Uploading a file to HDFS
@Test
public void putFileToHDFSWithIO() throws Exception {
    // 1 Create the configuration object and get the file system
    Configuration configuration = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://bigdata111:9000"), configuration, "itstar");

    // 2 Create the input stream for the local file (example path)
    FileInputStream inStream = new FileInputStream(new File("d:/hello.txt"));

    // 3 The destination path on HDFS
    String putFileName = "hdfs://bigdata111:9000/user/itstar/hello1.txt";
    Path writePath = new Path(putFileName);

    // 4 Create the output stream
    FSDataOutputStream outStream = fs.create(writePath);

    // 5 Copy the stream
    try {
        IOUtils.copyBytes(inStream, outStream, 4096, false);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        IOUtils.closeStream(inStream);
        IOUtils.closeStream(outStream);
    }
}
3.3.2 Downloading a file from HDFS
@Test
public void getFileFromHDFSWithIO() throws Exception {
    // 1 Create the configuration object and get the file system
    Configuration configuration = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://bigdata111:9000"), configuration, "itstar");

    // 2 The HDFS path of the file to read
    String filename = "hdfs://bigdata111:9000/user/itstar/hello1.txt";

    // 3 Create the read path
    Path readPath = new Path(filename);

    // 4 Create the input stream
    FSDataInputStream inStream = fs.open(readPath);

    // 5 Copy the stream to the console
    try {
        IOUtils.copyBytes(inStream, System.out, 4096, false);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        IOUtils.closeStream(inStream);
    }
}
3.3.3 Seek-based reading of file blocks
(1) Download the first block
@Test
// Download the first block of a large file
public void readFileSeek1() throws Exception {
    // 1 Create the configuration object and get the file system
    Configuration configuration = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://bigdata111:9000"), configuration, "itstar");

    // 2 The HDFS path of the large file (example path)
    Path path = new Path("hdfs://bigdata111:9000/user/itstar/hadoop-2.7.2.tar.gz");

    // 3 Open the input stream
    FSDataInputStream fis = fs.open(path);

    // 4 Create the local output stream (example path)
    FileOutputStream fos = new FileOutputStream("e:/hadoop-2.7.2.tar.gz.part1");

    // 5 Copy only the first 128 MB (the first block): 1024 * 128 reads of 1 KB each
    byte[] buf = new byte[1024];
    for (int i = 0; i < 1024 * 128; i++) {
        fis.read(buf);
        fos.write(buf);
    }

    // 6 Close the streams
    IOUtils.closeStream(fis);
    IOUtils.closeStream(fos);
}
(2) Download the second block
@Test
// Download the second block of a large file
public void readFileSeek2() throws Exception {
    // 1 Create the configuration object and get the file system
    Configuration configuration = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://bigdata111:9000"), configuration, "itstar");

    // 2 The HDFS path of the large file (example path)
    Path path = new Path("hdfs://bigdata111:9000/user/itstar/hadoop-2.7.2.tar.gz");

    // 3 Open the input stream
    FSDataInputStream fis = fs.open(path);

    // 4 Create the local output stream (example path)
    FileOutputStream fos = new FileOutputStream("e:/hadoop-2.7.2.tar.gz.part2");

    // 5 Seek to the 128 MB offset (the start of the second block)
    fis.seek(1024 * 1024 * 128);

    // 6 Copy the rest of the stream
    IOUtils.copyBytes(fis, fos, 1024);

    // 7 Close the streams
    IOUtils.closeStream(fis);
    IOUtils.closeStream(fos);
}
(3) Merge the downloaded parts
In a Windows command window, append the second part to the end of the first part; the merged file is then identical to the original file on HDFS.
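For example (the file names follow the illustrative paths used above), in a Windows command window:
type hadoop-2.7.2.tar.gz.part2 >> hadoop-2.7.2.tar.gz.part1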
4 HDFS Data Flow
4.1 HDFS Write Data Flow
4.1.1 File write process
(1) The client requests the NameNode to upload a file; the NameNode checks whether the target file already exists and whether its parent directory exists.
(2) The NameNode responds whether the file can be uploaded.
(3) The client asks which DataNodes the first block should be uploaded to.
(4) The NameNode returns three DataNodes: dn1, dn2, dn3.
(5) The client requests dn1 to upload data; dn1, upon receiving the request, calls dn2, and dn2 calls dn3, so that the transmission pipeline is established.
(6) dn1, dn2, and dn3 acknowledge the client level by level.
(7) The client starts uploading the first block to dn1 in packets; dn1 forwards each packet it receives to dn2, and dn2 forwards it to dn3; for every packet it sends, dn1 puts it into a reply queue to wait for acknowledgements.
(8) When the first block has been transferred, the client again asks the NameNode where to upload the second block (repeating steps 3-7).
4.1.2 Network topology — node distance calculation
In the process of writing data, the NameNode chooses the DataNodes "closest" to the data being uploaded. How is this distance between two nodes measured?
Node distance: the sum of the distances from the two nodes to their nearest common ancestor.
For example, node n1 in rack r1 of data center d1 can be written as /d1/r1/n1. With this notation, the four typical cases are:
Distance(/d1/r1/n1, /d1/r1/n1) = 0 (processes on the same node)
Distance(/d1/r1/n1, /d1/r1/n2) = 2 (different nodes on the same rack)
Distance(/d1/r1/n1, /d1/r3/n2) = 4 (nodes on different racks in the same data center)
Distance(/d1/r1/n1, /d2/r4/n2) = 6 (nodes in different data centers)
Exercise: work out the distance between every pair of nodes for yourself.
4.1.3 Rack awareness (replica node placement)
(1) Official documentation:
http://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/RackAwareness.html
http://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html#Data_Replication
(2) Replica node selection in older Hadoop versions
The first replica is placed on the node where the client resides; if the client is outside the cluster, a node is chosen at random.
The second replica is placed on a random node on a rack different from the first replica's rack.
The third replica is placed on the same rack as the second replica, on a random node.
(3) Replica node selection in Hadoop 2.7.2
The first replica is placed on the node where the client resides; if the client is outside the cluster, a node is chosen at random.
The second replica is placed on the same rack as the first replica, on a random node.
The third replica is placed on a different rack, on a random node.
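Rack awareness depends on Hadoop knowing the node-to-rack mapping, which is normally supplied through a topology script configured in core-site.xml. A minimal sketch (the script path is an assumption):
<property>
  <name>net.topology.script.file.name</name>
  <value>/opt/module/hadoop-2.7.2/etc/hadoop/rack-topology.sh</value>
</property>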
4.2 HDFS Read Data Flow
(1) The client requests the NameNode to download a file; the NameNode looks up the metadata and returns the addresses of the DataNodes holding the file's blocks.
(2) The client picks a DataNode (nearest first, otherwise at random) and requests to read the data.
(3) The DataNode begins transmitting data to the client (it reads the data from disk into an input stream and sends it in packet units).
(4) The client receives the packets, first writing them to a local cache and then to the target file.
4.3 Consistency Model
(1) Debug the following code to observe the behavior:
@Test
public void writeFile() throws Exception {
    // 1 Create the configuration object and get the file system
    Configuration configuration = new Configuration();
    FileSystem fs = FileSystem.get(configuration);

    // 2 Create the output stream
    Path path = new Path("hdfs://bigdata111:9000/user/itstar/hello.txt");
    FSDataOutputStream fos = fs.create(path);

    // 3 Write data
    fos.write("hello".getBytes());

    // 4 Flush the client buffer so the data becomes visible
    fos.hflush();

    fos.close();
}
(2) Summary
When writing data, if you want the data to be immediately visible to other clients, call FSDataOutputStream.hflush(); this flushes the client's buffer so that the data written so far becomes visible to other clients.
5 NameNode Working Mechanism
5.1 NameNode & Secondary NameNode Working Mechanism
(1) First stage: NameNode startup
(1) After the NameNode is formatted for the first time, the fsimage and edits files are created. If this is not the first startup, the NameNode loads the edit log and the image file directly into memory.
(2) A client sends a request to add, modify, or delete metadata.
(3) The NameNode records the operation in the edit log and rolls the log.
(4) The NameNode then applies the change to the metadata in memory.
(2) Second stage: Secondary NameNode operation
(1) The Secondary NameNode asks the NameNode whether a checkpoint is needed; the NameNode returns directly whether to perform the checkpoint.
(2) The Secondary NameNode requests that a checkpoint be performed.
(3) The NameNode rolls the edits log it is currently writing to.
(4) The pre-roll edit log and the image file are copied to the Secondary NameNode.
(5) The Secondary NameNode loads the edit log and the image file into memory and merges them.
(6) A new image file named fsimage.chkpoint is generated.
(7) fsimage.chkpoint is copied back to the NameNode.
(8) The NameNode renames fsimage.chkpoint to fsimage.
(3) Viewing the SecondaryNameNode information in a web browser
(1) Start the cluster.
(2) Browse to: http://bigdata111:50090/status.html
(3) View the SecondaryNameNode information.
(4) chkpoint check interval parameter settings
(1) Normally, the SecondaryNameNode performs a checkpoint once an hour.
[hdfs-default.xml]
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>
</property>
(2) The number of operations is checked once per minute; when the number of operations reaches 1,000,000, the SecondaryNameNode performs a checkpoint.
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value>
  <description>Number of operations (transactions)</description>
</property>
<property>
  <name>dfs.namenode.checkpoint.check.period</name>
  <value>60</value>
  <description>Check the number of operations once per minute</description>
</property>
5.2 Image File and Edit Log Analysis
(1) Concept
After the NameNode is formatted, the following files are created in the /opt/module/hadoop-2.7.2/data/tmp/dfs/name/current directory:
edits_0000000000000000000
fsimage_0000000000000000000.md5
seen_txid
VERSION
(1) Fsimage file: a permanent checkpoint of the HDFS file system metadata. It contains the serialized information of all the directories and file inodes of the HDFS file system.
(2) Edits file: records every write operation performed against the HDFS file system; every write operation executed by a client is first recorded in the edits file.
(3) seen_txid: a file that stores a single number, the transaction id of the last edits_ file.
(4) Every time the NameNode starts, it reads the fsimage file into memory and then replays the update operations in the edits files, from 00001 up to the number recorded in seen_txid, so that the metadata in memory is up to date and synchronized. In effect, the NameNode merges the fsimage and edits files at startup.
(2) Viewing the fsimage file with oiv
(1) The oiv and oev commands: hdfs provides oiv (offline image viewer) for fsimage files and oev (offline edits viewer) for edits files.
[itstar@bigdata111 current]$ hdfs
(2) Basic syntax
hdfs oiv -p <processor, e.g. XML> -i <fsimage file to convert> -o <output path of the converted file>
(3) Practical example
[itstar@bigdata111 current]$ pwd
/opt/module/hadoop-2.7.2/data/tmp/dfs/name/current
[itstar@bigdata111 current]$ hdfs oiv -p XML -i fsimage_0000000000000000025 -o /opt/module/hadoop-2.7.2/fsimage.xml
[itstar@bigdata111 current]$ cat /opt/module/hadoop-2.7.2/fsimage.xml
Copy the displayed XML into an XML file created in IDEA and format it for easier viewing.
(3) Viewing the edits file with oev
(1) Basic syntax
hdfs oev -p <processor, e.g. XML> -i <edits file to convert> -o <output path of the converted file>
(2) Practical example
[itstar@bigdata111 current]$ hdfs oev -p XML -i edits_0000000000000000012-0000000000000000013 -o /opt/module/hadoop-2.7.2/edits.xml
[itstar@bigdata111 current]$ cat /opt/module/hadoop-2.7.2/edits.xml
Copy the displayed XML into an XML file created in IDEA and format it for easier viewing.
5.3 Rolling the Edit Log
Normally, the edit log is rolled whenever the HDFS file system performs update operations; it can also be rolled by force with a command.
(1) Roll the edit log (the cluster must be running):
[itstar@bigdata111 current]$ hdfs dfsadmin -rollEdits
(2) When is the image file generated?
When the NameNode starts, it loads the image file and the edit log.
5.4 NameNode Version Number
(1) View the NameNode version number
In the /opt/module/hadoop-2.7.2/data/tmp/dfs/name/current directory, view the VERSION file:
namespaceID=1933630176
clusterID=CID-1f2bf8d1-5ad2-4202-af1c-6713ab381175
cTime=0
storageType=NAME_NODE
blockpoolID=BP-97847618-192.168.10.102-1493726072779
layoutVersion=-63
(2) Explanation of the fields in the NameNode version file
(1) namespaceID: there can be multiple NameNodes on HDFS, so each NameNode has a different namespaceID and manages its own group of blockpoolIDs.
(2) clusterID: the cluster id, which is globally unique.
(3) The cTime property marks the creation time of the NameNode storage system. For a newly formatted storage system this value is always 0, but after a file system upgrade it is updated to the new timestamp.
(4) The storageType property indicates that this storage directory contains the data structures of the NameNode.
(5) blockpoolID: a block pool id identifies a block pool and is globally unique across clusters. When a new Namespace is created (as part of the format process), a unique ID is created and persisted. Building a globally unique BlockPoolID during creation is more reliable than configuring it by hand. The NN persists the BlockPoolID to disk and loads and reuses it on subsequent startups.
(6) layoutVersion: a negative integer. Normally it only changes when new features are added to HDFS.
5.5 SecondaryNameNode Directory Structure
The Secondary NameNode is an auxiliary daemon used to monitor the state of HDFS; at regular intervals it takes snapshots of the HDFS metadata.
The /opt/module/hadoop-2.7.2/data/tmp/dfs/namesecondary/current directory shows the SecondaryNameNode directory structure:
edits_0000000000000000001-0000000000000000002
fsimage_0000000000000000002
fsimage_0000000000000000002.md5
VERSION
The layout of the SecondaryNameNode's namesecondary/current directory is identical to that of the NameNode's current directory.
The benefit: if the NameNode fails (assuming no recent backup exists), the metadata can be recovered from the SecondaryNameNode.
Method 1: copy the data from the SecondaryNameNode into the NameNode's storage directory.
Method 2: start the NameNode daemon with the -importCheckpoint option, which copies the data from the SecondaryNameNode into the NameNode directory.
(1) Case study (1)
Simulate a NameNode failure and use method 1 to recover the NameNode data.
(1) Kill -9 the NameNode process.
(2) Delete the data stored by the NameNode (/opt/module/hadoop-2.7.2/data/tmp/dfs/name):
rm -rf /opt/module/hadoop-2.7.2/data/tmp/dfs/name/*
(3) Copy the data from the SecondaryNameNode into the NameNode's storage directory:
cp -R /opt/module/hadoop-2.7.2/data/tmp/dfs/namesecondary/* /opt/module/hadoop-2.7.2/data/tmp/dfs/name/
(4) Restart the NameNode:
sbin/hadoop-daemon.sh start namenode
(2) Case study (2)
Simulate a NameNode failure and use method 2 to recover the NameNode data.
(0) Modify the following settings in hdfs-site.xml:
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>120</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/opt/module/hadoop-2.7.2/data/tmp/dfs/name</value>
</property>
(1) Kill -9 the NameNode process.
(2) Delete the data stored by the NameNode (/opt/module/hadoop-2.7.2/data/tmp/dfs/name):
rm -rf /opt/module/hadoop-2.7.2/data/tmp/dfs/name/*
(3) If the SecondaryNameNode and the NameNode are not on the same host, copy the SecondaryNameNode's storage directory to the directory at the same level as the NameNode's storage directory, and delete the in_use.lock file.
[itstar@bigdata111 dfs]$ pwd
/opt/module/hadoop-2.7.2/data/tmp/dfs
[itstar@bigdata111 dfs]$ ls
(4) Import the checkpoint data (wait a while, then terminate with ctrl+c):
bin/hdfs namenode -importCheckpoint
(5) Start the NameNode:
sbin/hadoop-daemon.sh start namenode
(6) If it complains that the file is locked, delete in_use.lock:
rm -rf /opt/module/hadoop-2.7.2/data/tmp/dfs/namesecondary/in_use.lock
5.6 Safe Mode Operations
(1) Overview
When the NameNode starts, it first loads the image file (fsimage) into memory and then replays the operations in the edit log (edits). Once the image of the file system metadata has been successfully built in memory, it creates a new fsimage file and an empty edit log. At this point the NameNode starts listening for DataNode requests, but it is running in safe mode, i.e. the NameNode's file system is read-only for clients.
The locations of the data blocks are not maintained by the NameNode; they are stored on the DataNodes in the form of block lists. During normal operation the NameNode keeps a mapping of all block locations in memory. In safe mode, each DataNode sends its latest block list to the NameNode; once the NameNode has learned enough block locations, the file system can run efficiently.
If the "minimum replica condition" is met, the NameNode exits safe mode after 30 seconds. The minimum replica condition means that 99.9% of the blocks in the entire file system satisfy the minimum replication level (default: dfs.replication.min=1). When starting a freshly formatted HDFS cluster, the NameNode does not enter safe mode because there are no blocks in the system yet.
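The thresholds behind the minimum replica condition correspond to the following hdfs-site.xml properties, shown here as a sketch with their default values:
<property>
  <name>dfs.namenode.safemode.threshold-pct</name>
  <value>0.999f</value>
</property>
<property>
  <name>dfs.namenode.replication.min</name>
  <value>1</value>
</property>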
(2) Basic syntax
While the cluster is in safe mode, important (write) operations cannot be performed. After the cluster finishes starting up, it leaves safe mode automatically.
(1) bin/hdfs dfsadmin -safemode get     (view the safe mode state)
(2) bin/hdfs dfsadmin -safemode enter   (enter safe mode)
(3) bin/hdfs dfsadmin -safemode leave   (leave safe mode)
(4) bin/hdfs dfsadmin -safemode wait    (wait until safe mode ends)
(3) Case study
Simulate waiting for safe mode to end.
(1) First enter safe mode:
bin/hdfs dfsadmin -safemode enter
(2) Execute the following script:
Edit a script:
#!/bin/bash
bin/hdfs dfsadmin -safemode wait
bin/hdfs dfs -put ~/hello.txt /root/hello.txt
(3) Then open another window and execute:
bin/hdfs dfsadmin -safemode leave
5.7 NameNode Multi-directory Configuration
(1) The NameNode's local directory can be configured with multiple paths; each directory stores identical content, which increases reliability.
(2) The detailed configuration is as follows:
hdfs-site.xml
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///${hadoop.tmp.dir}/dfs/name1,file:///${hadoop.tmp.dir}/dfs/name2</value>
</property>
6 DataNode Working Mechanism
6.1 DataNode Working Mechanism
1) A data block is stored on the DataNode's disk as files: one file for the data itself and one for the metadata, which includes the block's length, the checksum of the block data, and a timestamp.
2) After a DataNode starts, it registers with the NameNode; thereafter it reports all its block information to the NameNode periodically (every hour).
3) A heartbeat is sent every 3 seconds; the heartbeat response may carry commands from the NameNode to that DataNode, such as copying a block to another machine or deleting a block. If no heartbeat is received from a DataNode for more than 10 minutes, the node is considered unavailable.
4) Machines can safely join or leave the cluster while it is running.
6.2 Data Integrity
1) When a DataNode reads a block, it computes the block's checksum.
2) If the computed checksum differs from the value recorded when the block was created, the block is considered damaged.
3) The client then reads the block from another DataNode.
4) The DataNode periodically verifies the checksums after its files are created.
6.3 Dead-node Timeout Parameter Settings
When the DataNode process dies or a network fault prevents the DataNode from communicating with the NameNode, the NameNode does not immediately declare the node dead; it waits for a period of time called the timeout. The default HDFS timeout is 10 minutes + 30 seconds. The timeout is calculated as:
timeout = 2 × dfs.namenode.heartbeat.recheck-interval + 10 × dfs.heartbeat.interval
The default dfs.namenode.heartbeat.recheck-interval is 5 minutes and the default dfs.heartbeat.interval is 3 seconds.
Note that in hdfs-site.xml the unit of heartbeat.recheck.interval is milliseconds and the unit of dfs.heartbeat.interval is seconds.
<property>
  <name>dfs.namenode.heartbeat.recheck-interval</name>
  <value>300000</value>
</property>
<property>
  <name>dfs.heartbeat.interval</name>
  <value>3</value>
</property>
6.4 DataNode Directory Structure
Unlike the NameNode, the DataNode's storage directory is created automatically at the initial stage and does not need extra formatting.
(1) View the VERSION file in the /opt/module/hadoop-2.7.2/data/tmp/dfs/data/current directory:
[itstar@bigdata111 current]$ cat VERSION
storageID=DS-1b998a1d-71a3-43d5-82dc-c0ff3294921b
clusterID=CID-1f2bf8d1-5ad2-4202-af1c-6713ab381175
cTime=0
datanodeUuid=970b2daf-63b8-4e17-a514-d81741392165
storageType=DATA_NODE
layoutVersion=-56
(2) Explanation of the fields
(1) storageID: the storage id.
(2) clusterID: the cluster id, which is globally unique.
(3) The cTime property marks the creation time of the DataNode storage system. For a newly formatted storage system this value is always 0, but after a file system upgrade it is updated to the new timestamp.
(4) datanodeUuid: the unique identifier of the DataNode.
(5) storageType: the storage type (DATA_NODE).
(6) layoutVersion: a negative integer. Normally it only changes when new features are added to HDFS.
(3) View the VERSION file in the block pool directory /opt/module/hadoop-2.7.2/data/tmp/dfs/data/current/BP-97847618-192.168.10.102-1493726072779/current:
[itstar@bigdata111 current]$ cat VERSION
#Mon May 08 16:30:19 CST 2017
namespaceID=1933630176
cTime=0
blockpoolID=BP-97847618-192.168.10.102-1493726072779
layoutVersion=-56
(4) Explanation of the fields
(1) namespaceID: obtained from the NameNode when the DataNode first accesses it. The storageID is unique for each DataNode (but identical across all storage directories of a single DataNode); the NameNode can use it to distinguish different DataNodes.
(2) The cTime property marks the creation time of the DataNode storage system. For a newly formatted storage system this value is always 0, but after a file system upgrade it is updated.
(3) blockpoolID: a block pool id identifies a block pool and is globally unique across clusters. When a new Namespace is created (as part of the format process), a unique ID is created and persisted. The NN persists the BlockPoolID to disk and loads and reuses it on subsequent startups.
(4) layoutVersion: a negative integer. Normally it only changes when new features are added to HDFS.
6.5 DataNode Multi-directory Configuration
(1) The DataNode can also be configured with multiple directories; unlike the NameNode, the directories do not store identical copies of the data.
(2) The detailed configuration is as follows:
hdfs-site.xml
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///${hadoop.tmp.dir}/dfs/data1,file:///${hadoop.tmp.dir}/dfs/data2</value>
</property>
7 Other HDFS Features
7.1 Copying Data Between Clusters
(1) Using scp between two remote hosts:
scp -r hello.txt root@bigdata112:/user/itstar/hello.txt // push
scp -r root@bigdata112:/user/itstar/hello.txt <local path>   // pull; copying between two remote hosts this way is relayed through the local host and can be used when ssh trust is not configured between the two remote hosts
(2) Using distcp between two Hadoop clusters:
bin/hadoop distcp hdfs://bigdata111:9000/user/itstar/hello.txt hdfs://bigdata112:9000/user/itstar/hello.txt
7.2 Hadoop Archives
(1) Why archive?
Every file is stored in blocks, and the metadata of every block is kept in the NameNode's memory, so storing many small files in Hadoop is very inefficient: a large number of small files will exhaust most of the NameNode's memory. Note, however, that the disk space needed to store small files is no more than the space needed for their original content: a 1 MB file stored in a 128 MB block uses 1 MB of disk space, not 128 MB.
A Hadoop archive file, or HAR file, is a more efficient file-archiving tool. It packs files into HDFS blocks, reducing the NameNode's memory usage while still allowing transparent access to the files. In particular, Hadoop archive files can be used as MapReduce input.
(2) Case study in practice
(1) Archiving requires a MapReduce job, so start the YARN cluster first:
start-yarn.sh
(2) Archive files:
The files are archived into a folder named xxx.har, which contains the corresponding data files. The xxx.har directory is treated as a whole and can be regarded as a single archive file.
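A typical archiving command takes the archive name, a parent directory, and a destination. A sketch consistent with the paths used below (the source directory /user/itstar is an assumption):
bin/hadoop archive -archiveName myhar.har -p /user/itstar /user/my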
(3) View the archive:
hadoop fs -lsr /user/my/myhar.har
hadoop fs -lsr har:///myhar.har
(4) Un-archive files (copy them back out of the archive):
hadoop fs -cp har:///user/my/myhar.har/* /user/itstar
7.3 Snapshot Management
A snapshot is like a backup of a directory. It does not immediately copy all files; instead it points to the same files, and new files are produced only when writes occur.
(1) Basic syntax
(1) hdfs dfsadmin -allowSnapshot <path>        (enable snapshots for the specified directory)
(2) hdfs dfsadmin -disallowSnapshot <path>     (disable snapshots for the specified directory; disabled by default)
(3) hdfs dfs -createSnapshot <path>            (create a snapshot of the directory)
(4) hdfs dfs -createSnapshot <path> <name>     (create a snapshot with the specified name)
(5) hdfs dfs -renameSnapshot <path> <oldName> <newName>   (rename a snapshot)
(6) hdfs lsSnapshottableDir                    (list all directories on which the current user may create snapshots)
(7) hdfs snapshotDiff <path> <fromSnapshot> <toSnapshot>  (compare the differences between two snapshots)
(8) hdfs dfs -deleteSnapshot <path> <snapshotName>        (delete a snapshot)
(2) Case study in practice
(1) Enable or disable the snapshot feature for a specified directory:
hdfs dfsadmin -allowSnapshot /user/itstar/data
(2) Create a snapshot of the directory:
hdfs dfs -createSnapshot /user/itstar/data
It can then be viewed via the web at hdfs://bigdata111:9000/user/itstar/data/.snapshot/s….. (the snapshot and the source files use the same data blocks):
hdfs dfs -lsr /user/itstar/data/.snapshot/
(3) Create a snapshot with a specified name, for example:
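A possible invocation (the snapshot name miao170508 is assumed so that it matches the rename example below):
hdfs dfs -createSnapshot /user/itstar/data miao170508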
(4) Rename a snapshot:
hdfs dfs -renameSnapshot /user/itstar/data/ miao170508 itstar170508
(5) List the directories on which the current user may create snapshots:
hdfs lsSnapshottableDir
(6) Compare the differences between two snapshot directories, for example:
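For instance, comparing the current state of the directory with the itstar170508 snapshot created above (a sketch; "." denotes the current state):
hdfs snapshotDiff /user/itstar/data . itstar170508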
(7) Restore from a snapshot:
hdfs dfs -cp /user/itstar/input/.snapshot/s20170708-134303.027 /user
7.4 Trash (Recycle Bin)
(1) Parameters for enabling the trash feature
Default: fs.trash.interval=0. A value of 0 means the trash is disabled; a positive value is the number of minutes a deleted file is kept.
Default: fs.trash.checkpoint.interval=0, the interval at which the trash is checked; if 0, it is set equal to fs.trash.interval.
It is required that fs.trash.checkpoint.interval <= fs.trash.interval.
(2) Enable the trash
Modify core-site.xml and set the trash retention time to 1 minute:
<property>
  <name>fs.trash.interval</name>
  <value>1</value>
</property>
(3) View the trash
The trash directory is located at /user/itstar/.Trash/….
(4) Change the user name that accesses the trash through the web interface
By default the web interface accesses HDFS as the user dr.who; change it to itstar:
[core-site.xml]
<property>
  <name>hadoop.http.staticuser.user</name>
  <value>itstar</value>
</property>
(5) Files deleted through a program do not go to the trash unless moveToTrash() is called:
Trash trash = new Trash(conf);
trash.moveToTrash(path);
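A more complete sketch in the style of the earlier client tests (the path and the fs.trash.interval value are illustrative; requires org.apache.hadoop.conf.Configuration and org.apache.hadoop.fs.Path / org.apache.hadoop.fs.Trash):
@Test
public void moveFileToTrash() throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://bigdata111:9000");
    conf.set("fs.trash.interval", "1");      // the trash must be enabled for the move to happen
    Trash trash = new Trash(conf);
    // Move an HDFS file into the current user's trash instead of deleting it outright
    boolean moved = trash.moveToTrash(new Path("/user/itstar/hello.txt"));
    System.out.println("moved to trash: " + moved);
}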
(6) Recover data from the trash, for example:
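Recovery is simply a move out of the trash directory; a sketch with illustrative paths:
hadoop fs -mv /user/itstar/.Trash/Current/user/itstar/input /user/itstar/input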
(7) Empty the trash:
hdfs dfs -expunge
8 HDFS High Availability (HA)
8.1 HA Overview
(1) HA (High Availability) means that the service is continuously available, 7×24 hours a day.
(2) The key strategy for achieving high availability is to eliminate single points of failure. Strictly speaking, HA is implemented per component: HDFS HA and YARN HA.
(3) Before Hadoop 2.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster.
(4) The NameNode affects HDFS cluster availability mainly in the following two ways:
If the NameNode machine crashes, the cluster is unusable until an administrator restarts it.
If the NameNode machine needs to be upgraded (hardware or software), the cluster is unusable during that period.
The HDFS HA feature solves these problems by configuring two NameNodes in Active/Standby mode as a hot standby for the NameNode. If a failure occurs, such as a machine crash, or if the machine needs maintenance or an upgrade, the NameNode role can be switched quickly to the other machine.
8.2 HDFS-HA Working Mechanism
(1) Eliminate the single point of failure by running two NameNodes.
8.2.1 Key points of HDFS-HA operation
1) The way metadata is managed has to change:
Each NameNode keeps its own copy of the metadata in memory;
Only the NameNode in Active state may write to the edits log;
Both NameNodes can read the edits;
The shared edits are kept in shared storage (qjournal and NFS are the two mainstream implementations);
2) A state management module is required:
A zkfailover process resides on every NameNode host; each zkfailover monitors its own NameNode and uses ZooKeeper to mark its state. When a state switch is required, zkfailover performs the switch, and split-brain must be prevented during the switch.
3) Password-free ssh login between the two NameNodes must be guaranteed.
4) Fencing: only one NameNode may provide service to the outside world at any moment.
8.2.2 HDFS-HA automatic failover mechanism
The previous sections used the command hdfs haadmin -failover to perform failover manually; in that mode the system does not automatically switch from the active NameNode to the standby NameNode even if the active NameNode has failed. Automatic failover adds two new components to an HDFS deployment: ZooKeeper and the ZKFailoverController (ZKFC) process. ZooKeeper is a highly available service that maintains a small amount of coordination data, notifies clients of changes to that data, and monitors clients for failures. The automatic failover of HA relies on the following ZooKeeper features:
1) Failure detection: every NameNode in the cluster maintains a persistent session in ZooKeeper; if the machine crashes, the session expires and ZooKeeper notifies the other NameNode that a failover needs to be triggered.
2) Active NameNode election: ZooKeeper provides a simple mechanism to elect exactly one node as active. If the current active NameNode crashes, another node can acquire a special exclusive lock in ZooKeeper indicating that it should become the active NameNode.
ZKFC is the other new component in automatic failover. It is a ZooKeeper client that also monitors and manages the state of the NameNode. Every host that runs a NameNode also runs a ZKFC process, and ZKFC is responsible for:
1) Health monitoring: ZKFC periodically pings the NameNode on the same host with a health-check command; as long as the NameNode replies with a healthy status in time, ZKFC considers the node healthy. If the node crashes, freezes, or otherwise enters an unhealthy state, the health monitor marks it as unhealthy.
2) ZooKeeper session management: when the local NameNode is healthy, ZKFC keeps a session open in ZooKeeper. If the local NameNode is active, ZKFC also holds a special znode lock; the lock uses ZooKeeper's support for ephemeral nodes, so if the session expires the lock node is automatically deleted.
3) ZooKeeper-based election: if the local NameNode is healthy and ZKFC finds that no other node currently holds the znode lock, it acquires the lock for itself. If it succeeds, it has won the election and is responsible for running the failover process to make its local NameNode active. The failover process is similar to the manual failover described earlier: first the previous active NameNode is fenced if necessary, then the local NameNode transitions to the active state.
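In terms of configuration, automatic failover is usually switched on with the following two properties (a sketch; the ZooKeeper hosts follow the cluster planned below, and the client port 2181 is assumed):
In hdfs-site.xml:
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
In core-site.xml:
<property>
  <name>ha.zookeeper.quorum</name>
  <value>bigdata111:2181,bigdata112:2181,bigdata113:2181</value>
</property>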
8.4 HDFS-HA Cluster Configuration
8.4.1 Environment Preparation
(1) Modify the IP address
(2) Modify the host name and the host name to IP address mapping
(3) Turn off the firewall
(4) Configure password-free ssh login
(5) Install the JDK and configure the environment variables
8.4.2 Cluster Planning
bigdata111        bigdata112        bigdata113
NameNode          NameNode
JournalNode       JournalNode       JournalNode
DataNode          DataNode          DataNode
ZK                ZK                ZK
                  ResourceManager
NodeManager       NodeManager       NodeManager
8.4.3 Configuring the Zookeeper Cluster
(0) Cluster planning
Deploy Zookeeper on the three nodes bigdata111, bigdata112, and bigdata113.
(1) Unpack and install
(1) Unpack the Zookeeper installation package into /opt/module/:
[itstar@bigdata111 software]$ tar -zxvf zookeeper-3.4.10.tar.gz -C /opt/module/
(2) Create a zkData directory under /opt/module/zookeeper-3.4.10/:
mkdir -p zkData
(3) In the /opt/module/zookeeper-3.4.10/conf directory, rename zoo_sample.cfg to zoo.cfg:
mv zoo_sample.cfg zoo.cfg
(2) Configure the zoo.cfg file
(1) Specific configuration:
dataDir=/opt/module/zookeeper-3.4.10/zkData
Add the following configuration:
#######################cluster##########################
server.2=bigdata111:2888:3888
server.3=bigdata112:2888:3888
server.4=bigdata113:2888:3888
(2) Explanation of the configuration parameters
Server.A=B:C:D, where:
A is a number indicating which server this is;
B is the IP address (or host name) of this server;
C is the port this server uses to exchange information with the Leader of the cluster;
D is the port used when the Leader fails and a new election is needed, i.e. the port the servers use to communicate with each other while electing a new Leader.
In cluster mode a file named myid is placed in the dataDir directory; this file contains the value of A. When Zookeeper starts, it reads this file and compares the value with the server entries in zoo.cfg to determine which server it is.
(3) Cluster operations
(1) Create a myid file in the /opt/module/zookeeper-3.4.10/zkData directory:
touch myid
Note: the myid file must be created in Linux; creating it in notepad++ is likely to produce garbled encoding.
(2) Edit the myid file:
vi myid
Add the number corresponding to this server in the file, e.g. 2.
(3) Copy the configured zookeeper to the other machines:
scp -r zookeeper-3.4.10/ root@bigdata112.itstar.com:/opt/app/
scp -r zookeeper-3.4.10/ root@bigdata113.itstar.com:/opt/app/
And change the content of myid to 3 and 4 respectively on the target machines.
(4) Start zookeeper on each node:
[root@bigdata111 zookeeper-3.4.10]# bin/zkServer.sh start
[root@bigdata112 zookeeper-3.4.10]# bin/zkServer.sh start
[root@bigdata113 zookeeper-3.4.10]# bin/zkServer.sh start
(5) Check the status:
[root@bigdata111 zookeeper-3.4.10]# bin/zkServer.sh status
JMX enabled by