Hadoop

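Kafka Best Practices

别来无恙 submitted on 2021-02-08 04:07:49
This article covers Kafka in practice. It is based on a talk given at DataWorks Summit/Hadoop Summit that goes into Kafka configuration, monitoring, and tuning. The material is distilled from production experience and is well worth consulting; this article mainly translates the slides and adds some supplementary notes. The Kafka architecture is not introduced here, so we go straight to the topic. Kafka basic configuration and performance tuning: this part covers the basic configuration of a Kafka cluster. Hardware requirements: basic hardware provisioning for a Kafka cluster. OS tuning: OS page cache: should be able to cache all active segments (the most basic unit of data storage in Kafka); fd limit: 100k+; disable swapping: put simply, swap moves data out of memory into swap space once memory usage reaches a threshold, even though plenty of memory may still be free at that point. Swap goes through disk I/O, so for a system that is sensitive to memory read/write performance it is best to disable the swap partition; TCP tuning. JVM configuration: JDK 8 with the G1 garbage collector; allocate at least 6-8 GB of heap. Kafka disk storage: use multiple disks and dedicate them to Kafka; JBOD vs RAID10; JBOD (Just a Bunch of Disks; put simply, a set of disks with no control software coordinating them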
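As a small illustration of the "fd limit: 100k+" item above, here is a minimal Python sketch that compares the file-descriptor limit of the current process with that threshold. The names `REQUIRED_FDS`, `fd_limit_ok`, and `current_fd_limits` are hypothetical helpers introduced for this example; the sketch assumes a Unix-like host (the standard-library `resource` module is not available on Windows) and only inspects the limit, it does not change it.

```python
import resource

# Threshold taken from the "fd limit: 100k+" recommendation above.
REQUIRED_FDS = 100_000


def fd_limit_ok(soft_limit, required=REQUIRED_FDS):
    """Return True if a soft fd limit is unlimited or at least `required`."""
    return soft_limit == resource.RLIM_INFINITY or soft_limit >= required


def current_fd_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = current_fd_limits()
    status = "ok" if fd_limit_ok(soft) else "too low"
    print(f"soft={soft} hard={hard} -> {status}")
```

The check would need to run as the same user and in the same environment as the broker process for the result to be meaningful, since limits are per process.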

HDFS Command Line Append

人盡茶涼 submitted on 2021-02-08 03:41:53
Question: Is there any way to append to a file on HDFS from the command line, like copying a file: hadoop fs -copyFromLocal <localsrc> URI Answer 1: This feature is implemented in Hadoop 2.3.0 as appendToFile, with a syntax like: hdfs dfs -appendToFile localfile /user/hadoop/hadoopfile (it was first suggested in 2009 when the HDFS Append feature was being contemplated: https://issues.apache.org/jira/browse/HADOOP-6239 ) Answer 2: The CLI doesn't support append, but httpfs and fuse both support appending to files.
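The command from Answer 1 can also be driven from a script. The following is a minimal Python sketch, assuming the `hdfs` binary from Hadoop 2.3.0 or later is on the PATH; `build_append_command` and `append_to_hdfs` are hypothetical helper names introduced here, and the paths are the ones used in the answer.

```python
import shlex
import subprocess
from typing import List


def build_append_command(local_path: str, hdfs_path: str) -> List[str]:
    """Build the argv for `hdfs dfs -appendToFile <local> <hdfs>`."""
    return ["hdfs", "dfs", "-appendToFile", local_path, hdfs_path]


def append_to_hdfs(local_path: str, hdfs_path: str) -> None:
    """Run the append; raises CalledProcessError if the command fails."""
    subprocess.run(build_append_command(local_path, hdfs_path), check=True)


if __name__ == "__main__":
    # Only print the command here; call append_to_hdfs() on a real cluster.
    print(shlex.join(build_append_command("localfile", "/user/hadoop/hadoopfile")))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when a path contains spaces.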

HDFS Command Line Append

巧了我就是萌 submitted on 2021-02-08 03:41:15
Question: Is there any way to append to a file on HDFS from the command line, like copying a file: hadoop fs -copyFromLocal <localsrc> URI Answer 1: This feature is implemented in Hadoop 2.3.0 as appendToFile, with a syntax like: hdfs dfs -appendToFile localfile /user/hadoop/hadoopfile (it was first suggested in 2009 when the HDFS Append feature was being contemplated: https://issues.apache.org/jira/browse/HADOOP-6239 ) Answer 2: The CLI doesn't support append, but httpfs and fuse both support appending to files.

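Alibaba Cloud Simple Application Server vs. ECS: Differences and How to Choose

荒凉一梦 submitted on 2021-02-07 21:34:26
Which is better, Alibaba Cloud Simple Application Server or ECS? What is the difference between them? 云服务器吧 shares the differences between Alibaba Cloud Simple Application Server and ECS and how to choose between them. Simple Application Server and ECS in detail: ECS is Alibaba Cloud's flagship product and a must-have for moving to the cloud. ECS can be combined with products such as cloud databases and SLB load balancing to build application architectures with strong disaster recovery and high reliability. Simple Application Server is a lightweight cloud server that cannot be used to build clusters and is suited to single-machine applications, such as a single-server website. Details follow. What is Simple Application Server? Simple Application Server is a new-generation computing service aimed at single-machine scenarios. It provides one-click application deployment, one-stop domain name resolution, website publishing, security, operations, and application management. It greatly improves the experience of setting up simple applications and lowers the barrier for entry-level users of cloud computing products. For more information, see: Simple Application Server - Alibaba Cloud. Simple Application Server is suited to: small websites, personal blogs, forums and communities, knowledge and productivity tools, personal learning environments, small e-commerce sites, and quickly setting up development environments. What is ECS? ECS is an elastic, scalable computing service and Alibaba Cloud's flagship, must-have product for moving to the cloud. ECS can be used to build all kinds of enterprise applications, such as clustered applications, web servers, and live video comment (danmaku) applications. ECS can be combined with instances such as cloud databases, VPC, and SLB to build clustered applications. For more information, see: ECS - Alibaba Cloud. ECS is suited to: corporate websites or lightweight web applications, multimedia and high-concurrency applications or websites, high-I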

AWS EMR S3DistCp: The auxService:mapreduce_shuffle does not exist

久未见 submitted on 2021-02-07 20:40:23
Question: I am connected to an AWS EMR v5.4.0 instance over SSH and I want to call s3distcp. This link demonstrates how to set up an EMR step to call it, but when I run it I get the following error: Container launch failed for container_1492469375740_0001_01_000002 : org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:mapreduce_shuffle does not exist at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance

Could anyone please explain what c000 means in c000.snappy.parquet or c000.snappy.orc?

时间秒杀一切 submitted on 2021-02-07 20:30:26
Question: I have searched all the documentation and still can't find why there is a prefix, or what c000 is, in the file naming convention below: file:/Users/stephen/p/spark/f1/part-00000-445036f9-7a40-4333-8405-8451faa44319-c000.snappy.parquet Answer 1: You should use the "Talk is cheap, show me the code." methodology. Not everything is documented, and one way forward is to read the code. Consider part-1-2_3-4.parquet : Split/Partition number. Random UUID to prevent collision between different (appending)
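To make the pieces of such a file name easier to see, here is a minimal Python sketch that splits the example name from the question into its visible components. The regular expression and the field names (`split`, `uuid`, `suffix`, `extension`) are assumptions introduced for illustration and are not taken from the Spark source; the sketch does not interpret the `c000` part, it only isolates it, and other Spark output names (for example bucketed files) may not match this pattern.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern for names shaped like:
#   part-<split>-<uuid>-c<digits>.<codec>.<format>
PART_FILE = re.compile(
    r"^part-(?P<split>\d+)"
    r"-(?P<uuid>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})"
    r"-(?P<suffix>c\d+)"
    r"\.(?P<extension>.+)$"
)


def parse_part_file(name: str) -> Optional[Dict[str, str]]:
    """Return the named components of a part file, or None if no match."""
    match = PART_FILE.match(name)
    return match.groupdict() if match else None


if __name__ == "__main__":
    example = "part-00000-445036f9-7a40-4333-8405-8451faa44319-c000.snappy.parquet"
    print(parse_part_file(example))
```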

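What Are the Solutions for Distributed Transactions?

↘锁芯ラ submitted on 2021-02-07 20:30:16
Source: http://dwz.date/eaAm What is a distributed transaction: The properties of a database transaction are Atomicity, Consistency, Isolation, and Durability, abbreviated ACID. During database execution, multiple concurrent transactions that read and write the same data can easily leave the data inconsistent. The inconsistency anomalies are as follows. Dirty read: a transaction accesses data that another transaction has not yet committed. For example, a data item modified in transaction T1 is read by another transaction (T2) before T1 commits; if T1 then rolls back, the data T2 just read never actually existed. Non-repeatable read: a transaction reads the same record twice and gets different results. For example, T1 reads some data, then T2 updates or deletes part of it and commits successfully. When T1 reads the data again, it gets the data as modified by T2 and finds that it has changed, so the two reads within the single transaction T1 return inconsistent result sets. Phantom read: a transaction reads twice and gets a different number of records. For example, T1 runs a query and gets a result set, T2 inserts new data and commits successfully, and T1 then runs the same query and gets a result set with a different number of records. Dirty reads, non-repeatable reads, and phantom reads have the following containment relationship: if a dirty read occurs, then phantom reads and non-repeatable reads may both occur as well. Different isolation levels: The SQL standard, based on the three inconsistency anomalies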
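The dirty read and non-repeatable read described above can be reproduced with a toy in-memory store. This is a minimal Python sketch, not how any real database implements isolation: `ToyStore` and its `allow_dirty` flag are hypothetical names introduced for this example, and there is no locking or versioning of any kind.

```python
class ToyStore:
    """In-memory key-value store with per-transaction uncommitted writes."""

    def __init__(self, data):
        self.committed = dict(data)
        self.pending = {}  # transaction id -> {key: uncommitted value}

    def write(self, txn, key, value):
        self.pending.setdefault(txn, {})[key] = value

    def commit(self, txn):
        self.committed.update(self.pending.pop(txn, {}))

    def rollback(self, txn):
        self.pending.pop(txn, None)

    def read(self, key, allow_dirty=False):
        # With allow_dirty=True, uncommitted writes of any transaction are visible.
        if allow_dirty:
            for writes in self.pending.values():
                if key in writes:
                    return writes[key]
        return self.committed.get(key)


def dirty_read_demo():
    store = ToyStore({"balance": 100})
    store.write("T1", "balance", 50)                # T1 modifies, no commit yet
    seen_by_t2 = store.read("balance", allow_dirty=True)
    store.rollback("T1")                            # T1 rolls back
    return seen_by_t2, store.read("balance")        # T2 saw a value that never existed


def non_repeatable_read_demo():
    store = ToyStore({"balance": 100})
    first = store.read("balance")                   # T1 reads
    store.write("T2", "balance", 70)
    store.commit("T2")                              # T2 updates and commits
    second = store.read("balance")                  # T1 reads again
    return first, second


if __name__ == "__main__":
    print(dirty_read_demo())
    print(non_repeatable_read_demo())
```

In the first demo T2 observes 50 although the committed value is 100 throughout; in the second, two reads inside T1 return 100 and then 70.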

Could anyone please explain what c000 means in c000.snappy.parquet or c000.snappy.orc?

倾然丶 夕夏残阳落幕 submitted on 2021-02-07 20:30:05
Question: I have searched all the documentation and still can't find why there is a prefix, or what c000 is, in the file naming convention below: file:/Users/stephen/p/spark/f1/part-00000-445036f9-7a40-4333-8405-8451faa44319-c000.snappy.parquet Answer 1: You should use the "Talk is cheap, show me the code." methodology. Not everything is documented, and one way forward is to read the code. Consider part-1-2_3-4.parquet : Split/Partition number. Random UUID to prevent collision between different (appending)

“Application priority” in Yarn

自闭症网瘾萝莉.ら submitted on 2021-02-07 19:17:57
Question: I am using Hadoop 2.9.0. Is it possible to submit jobs with different priorities in YARN? According to some JIRA tickets, it seems that application priorities have now been implemented. I tried using the YarnClient and setting a priority on the ApplicationSubmissionContext before submitting the job. I also tried using the CLI with updateApplicationPriority . However, nothing seems to change the application priority; it always remains 0. Have I misunderstood the concept of

How to run MapReduce tasks in Parallel with hadoop 2.x?

喜夏-厌秋 submitted on 2021-02-07 19:09:58
Question: I would like my map and reduce tasks to run in parallel. However, despite trying every trick in the book, they still run sequentially. I read in "How to set the precise max number of concurrently running tasks per node in Hadoop 2.4.0 on Elastic MapReduce" that the following formula sets the number of tasks running in parallel: min (yarn.nodemanager.resource.memory-mb / mapreduce.[map|reduce].memory.mb, yarn.nodemanager.resource.cpu-vcores / mapreduce.[map|reduce].cpu
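The formula quoted in the question is cut off in this excerpt. As a rough illustration, the following Python sketch computes a per-node task count under two loudly stated assumptions: that the truncated second term divides the node's vcores by the vcores requested per task, and that each ratio is rounded down to a whole number of containers. The function and parameter names are hypothetical and are not Hadoop configuration keys.

```python
def max_parallel_tasks(node_memory_mb, task_memory_mb, node_vcores, task_vcores):
    """Estimate how many tasks of one kind fit on a node at the same time.

    Assumption: the result is the smaller of the memory-bound and the
    vcore-bound container counts, each rounded down.
    """
    if task_memory_mb <= 0 or task_vcores <= 0:
        raise ValueError("per-task memory and vcores must be positive")
    by_memory = node_memory_mb // task_memory_mb
    by_vcores = node_vcores // task_vcores
    return min(by_memory, by_vcores)


if __name__ == "__main__":
    # Hypothetical node: 8192 MB and 4 vcores; each task asks for 1024 MB and 1 vcore.
    print(max_parallel_tasks(8192, 1024, 4, 1))
```

With these example numbers memory would allow 8 tasks but vcores only 4, so the vcore term is the limiting one.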