Hadoop | 易学教程

Fix corrupt HDFS Files without losing data (files in the datanode still exist)

Question: I am new to HDFS and have run into a problem. We have an HDFS file system with the namenode on one server (named 0002) and datanodes on two other servers (named 0004 and 0005). The original data comes from a Flume application whose sink is HDFS. Flume writes the original data (txt files) to the datanodes on servers 0004 and 0005. So, the original data is replicated twice and saved under

Is it possible to compress json in hive external table?

Question: I want to know how to compress JSON data in a Hive external table. How can it be done? I created the external table like this: CREATE EXTERNAL TABLE tweets ( id BIGINT, created_at STRING, source STRING, favorited BOOLEAN ) ROW FORMAT SERDE "com.cloudera.hive.serde.JSONSerDe" LOCATION "/user/cloudera/tweets"; and I set the compression properties: set mapred.output.compress=true; set hive.exec.compress.output=true; set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec; set

Is zero-code report development possible?

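To answer this question, we first need to define what counts as "zero code". Take Excel as an example: if writing Excel formulas (including fairly complex ones) counts as zero code while writing Excel VBA counts as coding, then report development can be zero-code. There is a precondition, though: zero code is possible only once the data (set) has been prepared. Why? Report development has two main stages. The first prepares data for the report, that is, it turns raw data into a data set through SQL or stored procedures. The second uses the prepared data to write expressions that render the report. You draw the report layout visually in the IDE that the reporting tool provides, then fill in expressions that bind data to cells, and the rendering is done. The expressions may be fairly complex, but they are much simpler than hard coding (compare Excel formulas with VBA), so this stage can indeed be "zero code". What about data preparation? Unfortunately, this stage cannot be zero-code and has always required hard coding; think of the nested SQL, stored procedures, and Java programs we write for reports. Why, after so many years of reporting-tool development, is rendering fully tool-supported while data preparation remains so primitive? Because this stage is too complex: it involves not only implementing the computation logic but also report performance (most report performance problems originate in data preparation). Is there then nothing to be done about data preparation? Zero code is out of reach, but we can work toward simplification by bringing tool support to the data preparation stage as well, using the conveniences a tool provides to simplify that work and thereby further simplify report development.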

how to deploy war file in spark-submit command (spark)

Question: I am using spark-submit --class main.Main --master local[2] /user/sampledata/parser-0.0.1-SNAPSHOT.jar to run Java Spark code. Is it possible to run this code from a WAR file instead of a JAR, since I want to deploy it on Tomcat? I tried a WAR file, but it throws a class-not-found exception. Source: https://stackoverflow.com/questions/40734240/how-to-deploy-war-file-in-spark-submit-command-spark

Install Hive on windows: 'hive' is not recognized as an internal or external command, operable program or batch file

Question: I have installed Hadoop 2.7.3 on Windows and can start the cluster. Now I would like to add Hive, and I went through the steps below: 1. Downloaded db-derby-10.12.1.1-bin.zip, unpacked it, and started the server with startNetworkServer -h 0.0.0.0. 2. Downloaded apache-hive-1.1.1-bin.tar.gz from a mirror site and unpacked it. Created hive-site.xml with the properties below: javax.jdo.option.ConnectionURL javax.jdo.option.ConnectionDriverName hive.server2.enable.impersonation hive.server2.authentication

Simple HDFS operations with the Java API

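Note: if the images are broken, open the article link: https://www.toutiao.com/i6632047118376780295/ Problem when starting Hadoop: the datanode's clusterID does not match the namenode's clusterID. The log shows that the cause is a mismatch between the datanode's clusterID and the namenode's clusterID. Open the datanode and namenode directories configured in hdfs-site.xml and open the VERSION file in each current folder; the clusterID entries do differ, just as the log reports. Change the clusterID in the datanode's VERSION file to match the namenode's, restart DFS (run start-dfs.sh), then run jps to confirm that the datanode has started normally. Cause of the problem: after DFS was formatted for the first time, Hadoop was started and used; the format command (hdfs namenode -format) was later run again, which regenerates the namenode's clusterID while the datanode's clusterID stays unchanged. Verify that the pseudo-distributed environment is complete. Operating HDFS from Java: create a new Maven project, write the pom file, write the test code, and run it. This simple form runs in local mode, so we check whether the file now exists on the local file system.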
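The clusterID comparison described in this excerpt can be sketched in plain Java. This is a minimal sketch, not the article's own code: it assumes the VERSION files are ordinary key=value property files containing a clusterID entry, and the class name ClusterIdCheck, its method names, and the command-line arguments are hypothetical. It only reports whether the two IDs match; editing the datanode's VERSION file is still done by hand as the excerpt describes.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical helper: compares the clusterID entries of two VERSION files.
final class ClusterIdCheck {

    // Returns the clusterID value from the text of a VERSION file,
    // or null if the text has no clusterID entry.
    static String parseClusterId(String versionFileText) {
        Properties props = new Properties();
        try {
            // Assumption: VERSION is a key=value properties file.
            props.load(new StringReader(versionFileText));
        } catch (IOException e) {
            throw new IllegalStateException("cannot parse VERSION text", e);
        }
        return props.getProperty("clusterID");
    }

    // True only if both texts carry a clusterID and the two values are equal.
    static boolean sameCluster(String namenodeVersion, String datanodeVersion) {
        String nn = parseClusterId(namenodeVersion);
        String dn = parseClusterId(datanodeVersion);
        return nn != null && nn.equals(dn);
    }

    // Usage: java ClusterIdCheck <namenode VERSION path> <datanode VERSION path>
    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("expected two VERSION file paths");
            return;
        }
        String nn = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.ISO_8859_1);
        String dn = new String(Files.readAllBytes(Paths.get(args[1])), StandardCharsets.ISO_8859_1);
        System.out.println("namenode clusterID: " + parseClusterId(nn));
        System.out.println("datanode clusterID: " + parseClusterId(dn));
        System.out.println(sameCluster(nn, dn) ? "clusterIDs match" : "clusterIDs differ");
    }
}
```

Pass the two VERSION paths found under the current folders of the namenode and datanode directories from hdfs-site.xml; the actual locations depend on the local configuration.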

HDFS Java API operations

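Code: https://github.com/zengfa1988/study/blob/master/src/main/java/com/study/hadoop/hdfs/HdfsTest.java 1. Import the jar packages. Build the project with Maven and add this to the pom file: <dependency> <groupId>org.apache.hadoop</groupId> <artifactId>hadoop-client</artifactId> <version>2.6.1</version> </dependency> For testing, JUnit can be added: <dependency> <groupId>junit</groupId> <artifactId>junit</artifactId> <version>4.9</version> </dependency> 2. Get the file system. Hadoop's file system classes are almost all in org.apache.hadoop.fs. All operations go through the abstract FileSystem class, so you need a concrete implementation to work with. The figure below shows all the FileSystem implementations; the commonly used ones are DistributedFileSystem (the distributed file system) and LocalFileSystem (the local file system). Two static factory methods of FileSystem return a concrete implementation: public static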

HDFS JAVA API

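HDFS JAVA API. Objectives: 1. Learn to use the HDFS Java API. 2. Understand the execution flow of the Java API. Background: 1. HDFS (Hadoop Distributed File System) is a core subproject of Hadoop and the foundation of data storage management in distributed computing. To transfer files between the local machine and HDFS, we mainly use the Eclipse development environment and implement remote HDFS file creation, upload, download, deletion, and so on in Java. There are two main ways to operate on HDFS files: the command line and the Java API. The command line is simple and direct, but it requires that the local machine run Linux with Hadoop installed. Users accustomed to Windows then have to install a virtual machine and install Linux on it, which is a challenge, and transferring files between Windows and the Linux guest also requires additional tools. To avoid these problems, such as mismatched systems and manually typed commands, we choose the Java API: it offers dedicated API functions, can be used from a non-Hadoop machine, and is independent of the operating system (Windows, Linux, even XP). Hadoop's file operation classes are almost all in the "org.apache.hadoop.fs" package. The interface class that the Hadoop library ultimately exposes to users is FileSystem, which encapsulates almost all file operations, for example CopyToLocalFile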
