HDFS

Hadoop High-Availability Cluster (HA)

℡╲_俬逩灬. Submitted on 2019-12-30 11:40:10
In Hadoop versions before 2.0, the HDFS NameNode was a single point of failure. HA means High Availability (uninterrupted 7x24 service). Strictly speaking, HA should be discussed per component: HDFS HA and YARN HA. HDFS HA resolves the single point of failure by configuring two NameNodes, Active and Standby, providing a hot standby of the NameNode inside the cluster. When a failure occurs (e.g., a machine crashes or needs maintenance or an upgrade), HA can quickly fail the NameNode over to the other machine.
HA cluster configuration. Environment preparation: configure hostnames and the hostname-to-IP mapping; disable the firewall; set up passwordless SSH login; install the JDK and configure the environment variables.
Configure the Zookeeper cluster: extract Zookeeper to the target directory: $ tar -zxvf zookeeper-3.4.10.tar.gz -C /export/servers. Create zkData under /export/servers/zookeeper-3.4.10/: mkdir -p zkData. Rename zoo_sample.cfg under /export/servers/zookeeper-3.4.10/conf to zoo.cfg and edit it: mv zoo_sample.cfg zoo.cfg // configuration: dataDir=/export/servers/zookeeper-3.4
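The excerpt cuts off inside zoo.cfg. As a rough sketch of what the remaining Zookeeper settings usually look like (the dataDir value and the node01/node02/node03 host names are assumptions, not taken from the original post):

```
# zoo.cfg (sketch; path and host names are assumed, not from the post)
dataDir=/export/servers/zookeeper-3.4.10/zkData
clientPort=2181
# one server.N entry per Zookeeper node: server.N=host:peerPort:electionPort
server.1=node01:2888:3888
server.2=node02:2888:3888
server.3=node03:2888:3888
```

Each node additionally needs a myid file inside zkData containing only its own server number (1, 2, or 3).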

About Hadoop HDFS filesystem rename

╄→гoц情女王★ Submitted on 2019-12-30 11:24:08
Question: I am storing lots of data in HDFS, and I need to move files from one folder to another. May I ask, generally, how expensive the filesystem's rename method is? Say I have to move terabytes of data. Thank you very much. Answer 1: Moving files in HDFS, or in any properly implemented file system, only involves changes to the namespace, not movement of the actual data. Going through the code, only namespace changes (in memory and in the edit log) are performed on the NameNode. From the NameNode.java class
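A minimal sketch of such a move through the Hadoop FileSystem API in Scala (the paths are made up for illustration). Because rename only rewrites namespace entries on the NameNode, it returns almost immediately even when the folder holds terabytes of data:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Picks up core-site.xml / hdfs-site.xml from the classpath
val conf = new Configuration()
val fs   = FileSystem.get(conf)

// Placeholder folders; substitute your own
val src = new Path("/data/incoming")
val dst = new Path("/data/archive/incoming")

// A namespace-only operation: the blocks stay where they are on the DataNodes
val ok = fs.rename(src, dst)
println(s"rename succeeded: $ok")
```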

The Hadoop Distributed File System (HDFS)

旧巷老猫 Submitted on 2019-12-30 10:28:49
As data volumes keep growing, a single operating system can no longer hold all the data, so the data is spread across the disks managed by more machines; but that is awkward to manage and maintain, and a system for managing files across multiple machines is urgently needed. That is a distributed file management system.
Definition of HDFS. HDFS is a file system for storing files, organized through a directory tree. It is also distributed: many servers cooperate to provide its functionality, and each server in the cluster plays its own role. Typical usage pattern: write once, read many times.
Strengths and weaknesses of HDFS. Strengths: high fault tolerance (divide and conquer): data is automatically stored as multiple replicas, and a lost replica can be recovered automatically; handles big data: in data scale it can handle GB-, TB-, even PB-level data, and in file scale it can handle more than a million files; it can be built on inexpensive machines. Weaknesses: not suitable for low-latency data access, e.g. millisecond-level access is not achievable; it cannot efficiently store large numbers of small files, because every file consumes NameNode memory for its metadata, so small files may consume disproportionate resources, and for small files the seek time can exceed the read time, which defeats HDFS's design goals; no concurrent writes or random file modification: a file can only have one writer at a time, and only appends are supported, not random modification.
Components of HDFS. NameNode (nn): the Master, the main manager of HDFS; it manages the HDFS namespace, configures the replication policy, manages the block mapping information, and handles client requests. DataNode (dn)
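As a concrete illustration of the roles listed above (a sketch with a made-up path, not part of the original post): creating a file is a namespace operation handled by the NameNode, which also applies the replication policy, while the bytes themselves are streamed to DataNodes.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())

// Namespace operation: recorded by the NameNode
fs.mkdirs(new Path("/demo"))

// The NameNode chooses DataNodes according to the replication policy;
// the client then streams the data to those DataNodes.
val out = fs.create(new Path("/demo/hello.txt"), true)
out.writeBytes("hello hdfs\n")
out.close()

// Metadata kept by the NameNode: replication factor and file length
val st = fs.getFileStatus(new Path("/demo/hello.txt"))
println(s"replication=${st.getReplication} length=${st.getLen}")
```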

How can one list all csv files in an HDFS location within the Spark Scala shell?

廉价感情. Submitted on 2019-12-30 08:29:42
Question: The purpose of this is to manipulate and save a copy of each data file in a second location in HDFS. I will be using RddName.coalesce(1).saveAsTextFile(pathName) to save the result to HDFS. This is why I want to handle each file separately, even though I am sure the performance will not be as efficient. However, I have yet to work out how to store the list of CSV file paths in an array of strings and then loop through each one with a separate RDD. Let us use the following anonymous
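One possible way to do this from the Spark Scala shell (a sketch, not the poster's code; the directory names are placeholders): collect the .csv paths through the Hadoop FileSystem API into an Array[String], then loop over them, building one RDD per file.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Reuse the Hadoop configuration already held by the SparkContext (`sc` exists in the shell)
val fs = FileSystem.get(sc.hadoopConfiguration)

// Collect every .csv file directly under a (hypothetical) HDFS directory
val csvPaths: Array[String] = fs
  .listStatus(new Path("/input/dir"))
  .filter(s => s.isFile && s.getPath.getName.endsWith(".csv"))
  .map(_.getPath.toString)

// Process each file with its own RDD and save a copy to a second location
csvPaths.foreach { p =>
  val rdd = sc.textFile(p)
  rdd.coalesce(1).saveAsTextFile("/output/dir/" + new Path(p).getName)
}
```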

Process Spark Streaming rdd and store to single HDFS file

主宰稳场 Submitted on 2019-12-30 07:28:31
Question: I am using Kafka Spark Streaming to get streaming data. val lines = KafkaUtils.createDirectStream[Array[Byte], String, DefaultDecoder, StringDecoder](ssc, kafkaConf, Set(topic)).map(_._2) I am using this DStream and processing the RDDs: val output = lines.foreachRDD(rdd => rdd.foreachPartition { partition => partition.foreach { file => runConfigParser(file)} }) runConfigParser is a Java method which parses a file and produces an output that I have to save in HDFS. So multiple nodes will process
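The excerpt stops mid-sentence, but a common pattern for this situation (a sketch, not the poster's solution; it assumes runConfigParser returns the parsed result as a String) is to keep the parsing inside a transformation and write each micro-batch with a single partition, so every batch yields exactly one part file under its own timestamped directory. Writing everything into one physical HDFS file would instead require an append-capable writer, since HDFS does not support concurrent writers to the same file.

```scala
// Assumes `lines` is the DStream[String] built above and that
// runConfigParser(line: String): String returns the output to persist.
val parsed = lines.map(line => runConfigParser(line))

parsed.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty()) {
    // coalesce(1) => a single part-00000 file per micro-batch
    rdd.coalesce(1).saveAsTextFile(s"/output/config-parsed/${time.milliseconds}")
  }
}
```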

Secondary NameNode usage and High availability in Hadoop 2.x

只谈情不闲聊 Submitted on 2019-12-30 05:23:57
Question: Can you please help me with the scenarios below? 1) When using Hadoop v2, do we still use a Secondary NameNode in a production environment? 2) In Hadoop v2, suppose we use multiple NameNodes in an active/passive configuration for high availability; when the edits log file grows huge, how does the edits log get applied to the fsimage? Wouldn't applying a huge edits log to the NameNode be time-consuming during NameNode startup? (We had the Secondary NameNode in Hadoop v1 to solve this problem.)

Accessing files in HDFS using Java

五迷三道 Submitted on 2019-12-30 03:25:08
Question: I am trying to access a file in HDFS using the Java APIs, but every time I get a File Not Found error. The code I am using to access it is: Configuration conf = new Configuration(); conf.addResource(FileUtilConstants.ENV_HADOOP_HOME + FileUtilConstants.REL_PATH_CORE_SITE); conf.addResource(FileUtilConstants.ENV_HADOOP_HOME + FileUtilConstants.REL_PATH_HDFS_SITE); try { FileSystem fs = FileSystem.get(conf); Path hdfsfilePath = new Path(hdfsPath); logger.info("Filesystem URI : " + fs.getUri());
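One frequent cause of exactly this symptom (offered as a guess, since the excerpt is cut off before any answer): Configuration.addResource(String) looks the name up on the classpath, while addResource(Path) reads a file from the local filesystem, so passing a concatenated file path as a String can silently load nothing and leave the client pointed at the local file system. A Scala sketch of the more explicit variant, with placeholder paths:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
// addResource(Path) reads the file directly from the local filesystem;
// addResource(String) only searches the classpath for a resource of that name.
conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"))   // placeholder path
conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"))   // placeholder path

val fs = FileSystem.get(conf)
println("Filesystem URI: " + fs.getUri)   // should print hdfs://..., not file:///

val hdfsFilePath = new Path("/user/someone/data.txt")          // placeholder path
if (fs.exists(hdfsFilePath)) println("Found: " + hdfsFilePath)
else println("Not found: " + hdfsFilePath)
```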

Wildcard in Hadoop's FileSystem listing API calls

…衆ロ難τιáo~ Submitted on 2019-12-30 03:04:05
Question: tl;dr: To be able to use wildcards (globs) in the listed paths, one simply has to use globStatus(...) instead of listStatus(...). Context: Files on my HDFS cluster are organized in partitions, with the date being the "root" partition. A simplified example of the file structure would look like this:
/schemas_folder
├── date=20140101
│   ├── A-schema.avsc
│   ├── B-schema.avsc
├── date=20140102
│   ├── A-schema.avsc
│   ├── B-schema.avsc
│   ├── C-schema.avsc
└── date=20140103
    ├── B-schema.avsc
    └── C
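To make the tl;dr concrete, a short Scala sketch (reusing the example layout above) that expands a wildcard with globStatus, which listStatus would not do:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())

// Matches A-schema.avsc under every date=... partition
val matches = fs.globStatus(new Path("/schemas_folder/date=*/A-schema.avsc"))
matches.foreach(status => println(status.getPath))
```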

Hadoop Interview Questions

别来无恙 Submitted on 2019-12-30 02:53:48
Hadoop preparation. Which daemons are needed to run a Hadoop cluster? DataNode, NameNode, TaskTracker, and JobTracker are the daemons needed to run a Hadoop cluster. Hadoop and Spark both do parallel computation; what do they have in common and how do they differ? Both use the MapReduce model for parallel computation. In Hadoop a unit of work is called a job; a job is split into map tasks and reduce tasks, each task runs in its own process, and when the task finishes its process exits. A task submitted by a Spark user is called an application; one application corresponds to one SparkContext. An application contains multiple jobs, and every action operation triggers a job. These jobs can run in parallel or serially. Each job has multiple stages; stages are produced when the DAGScheduler splits a job along the dependencies between RDDs at shuffle boundaries. Each stage contains multiple tasks, which form a task set that the TaskScheduler distributes to the executors for execution. An executor's lifecycle matches the application's: it exists even when no job is running, so tasks can start quickly and compute against data in memory. A Hadoop job has only map and reduce operations, so its expressiveness is limited; moreover, during the MR process HDFS is read and written repeatedly, which causes heavy IO, and the relationships among multiple jobs have to be managed by the user.
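A small Scala illustration of the job/stage split described above (word-count style, with made-up data): the reduceByKey introduces a shuffle, so the single job triggered by the collect action is divided by the DAGScheduler into two stages.

```scala
val words  = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
val pairs  = words.map(w => (w, 1))      // narrow transformation, same stage
val counts = pairs.reduceByKey(_ + _)    // shuffle boundary => new stage
val result = counts.collect()            // action: triggers exactly one job
result.foreach(println)
```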

How does Apache Spark know about HDFS data nodes?

爱⌒轻易说出口 Submitted on 2019-12-30 02:45:54
Question: Imagine I do some Spark operations on a file hosted in HDFS. Something like this: var file = sc.textFile("hdfs://...") val items = file.map(_.split('\t')) ... Because in the Hadoop world the code should go where the data is, right? So my question is: how do Spark workers know about the HDFS data nodes? How does Spark know on which DataNodes to execute the code? Answer 1: Spark reuses the Hadoop classes: when you call textFile, it creates a TextInputFormat, which has a getSplits method (a split is roughly a
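The answer is cut off, but the locality information it refers to ultimately comes from the NameNode's block map. A hedged Scala sketch (the file path is hypothetical) that prints the same per-block host list that TextInputFormat's splits expose to Spark's scheduler as preferred locations:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

val fs     = FileSystem.get(sc.hadoopConfiguration)
val status = fs.getFileStatus(new Path("/some/hdfs/file.txt"))   // hypothetical path

// Each block reports the DataNodes holding its replicas; Spark treats these
// hosts as the preferred locations when scheduling tasks for the matching splits.
fs.getFileBlockLocations(status, 0, status.getLen).foreach { block =>
  println(s"offset=${block.getOffset} hosts=${block.getHosts.mkString(", ")}")
}
```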