HDFS | 易学教程

HDFS : RemoteException Operation category READ is not supported in state standby.

1. Picture 2. Background: running a Kudu command failed with the following error. Query: CREATE TABLE TABLE_SIDE (SEX string PRIMARY KEY, INFO string) PARTITION BY HASH PARTITIONS 2 STORED AS KUDU TBLPROPERTIES ('kudu.master_addresses' = 'xx1:7051,xx2:7051,xx3:7051', 'kudu.num_tablet_replicas' = '1') ERROR: ImpalaRuntimeException: Error making 'createTable' RPC to Hive Metastore: CAUSED BY: MetaException: Got exception: org.apache.hadoop.ipc.RemoteException Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error at org.apache.hadoop.hdfs.server.namenode.ha.

Outputting to a file in HDFS using a subprocess

Question: I have a script that reads in text line by line, modifies each line slightly, and then outputs the line to a file. I can read the text in fine; the problem is that I cannot output it. Here is my code:

    cat = subprocess.Popen(["hadoop", "fs", "-cat", "/user/test/myfile.txt"], stdout=subprocess.PIPE)
    for line in cat.stdout:
        line = line+"Blah";
        subprocess.Popen(["hadoop", "fs", "-put", "/user/test/moddedfile.txt"], stdin=line)

This is the error I am getting: AttributeError: 'str'
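The error in the question comes from passing a string as stdin: Popen expects a file object or a pipe there, not the data itself. The following is a minimal sketch of one way to stream the modified lines instead, assuming that `hadoop fs -put - <dst>` reads its input from standard input. The helper names append_suffix and pipe_modified are hypothetical and only serve the illustration.

```python
import subprocess


def append_suffix(line: bytes, suffix: bytes = b"Blah") -> bytes:
    """Append suffix to a line while keeping its trailing newline, if any."""
    if line.endswith(b"\n"):
        return line[:-1] + suffix + b"\n"
    return line + suffix


def pipe_modified(reader_cmd, writer_cmd, suffix=b"Blah"):
    """Stream reader_cmd's stdout, line by line, into writer_cmd's stdin.

    A single writer process is started once; each modified line is written
    to its stdin pipe rather than passed as the stdin argument.
    Returns the exit codes of the reader and the writer.
    """
    reader = subprocess.Popen(reader_cmd, stdout=subprocess.PIPE)
    writer = subprocess.Popen(writer_cmd, stdin=subprocess.PIPE)
    for line in reader.stdout:
        writer.stdin.write(append_suffix(line, suffix))
    writer.stdin.close()  # signals end of input to the writer
    reader.stdout.close()
    return reader.wait(), writer.wait()
```

With the paths from the question, the call would look like pipe_modified(["hadoop", "fs", "-cat", "/user/test/myfile.txt"], ["hadoop", "fs", "-put", "-", "/user/test/moddedfile.txt"]). Starting one writer for the whole stream also avoids launching a new hadoop process per line.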

General Hadoop Questions

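I. Project coding: HDFS file upload; HDFS file download; positioned file reads; operating HDFS through the API; operating HDFS through I/O streams; the HDFS write data flow; the HDFS read data flow; counting word occurrences across a set of files (the WordCount example); partitioning words by whether their ASCII code is odd or even; totaling the upstream traffic, downstream traffic, and overall traffic consumed per phone number (serialization). II. Flow charts and descriptions: HDFS write data flow; HDFS read data flow; how the NameNode and Secondary NameNode work; viewing the fsimage file; write data flow; read data flow; the NameNode and Secondary NameNode mechanism; viewing the image file; viewing the edit log; how the DataNode works; viewing archive files; (4) unarchiving files: hadoop fs -cp har:///user/my/myhar.har/* /user/hadoop; viewing the edits file; simulating a NameNode failure and recovering the NameNode data by any one method; cluster safe mode operations; 1 how the DataNode works; commissioning new data nodes; decommissioning old data nodes; trash configuration; analysis of the MapReduce program execution flow; safe mode; trash (see the HDFS notes; this must stay consistent with the refreshed contents of hdoop-site.xml). 7.4 Trash 1) Default trash settings: the default is fs.trash.interval=0, where 0 disables the trash; the value can be set to the retention time of deleted files. The default fs.trash.checkpoint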
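The WordCount and odd/even partitioning items in the outline above can be sketched without a cluster. The following is a plain-Python illustration, not MapReduce code, and it assumes that the partition key is the code of the word's first character; the outline does not say which character is meant. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def parity_partition(word: str) -> int:
    """Return 0 if the first character's code is even, 1 if it is odd.

    Assumption: the partition is decided by the word's first character.
    """
    return ord(word[0]) % 2


def word_count_by_partition(lines):
    """Count words, keeping a separate Counter per parity partition."""
    partitions = defaultdict(Counter)
    for line in lines:
        for word in line.split():
            partitions[parity_partition(word)][word] += 1
    return partitions


# Illustrative input only.
sample = ["hadoop hdfs hadoop", "spark hive hdfs"]
result = word_count_by_partition(sample)
```

In a real MapReduce job the same decision would live in a custom partitioner, with each partition handled by its own reducer.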

Hive Data Types, DDL, DML

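Hive basic data types, DDL, DML. Basic data types: Hive's STRING type corresponds to a database VARCHAR. It is a variable-length string, but you cannot declare a maximum number of characters for it; in theory it can hold up to 2 GB of characters.
Collection data types (data type; description; syntax example):
STRUCT: similar to a struct in C; elements are accessed with "dot" notation. For example, if a column's type is STRUCT{first STRING, last STRING}, the first element is referenced as field.first. Syntax example: struct()
MAP: a collection of key-value pairs, accessed with array notation. For example, if a column's type is MAP with the key->value pairs 'first'->'John' and 'last'->'Doe', then fieldname['last'] returns the value stored under the key 'last'. Syntax example: map()
ARRAY: a collection of variables of the same type and name. These variables are the array's elements, and each element has an index, starting from zero. For example, for the array value ['John', 'Doe'], the second element is referenced as arrayname[1]. Syntax example: Array()
Hive has three complex data types: ARRAY, MAP, and STRUCT. ARRAY and MAP are similar to Array and Map in Java, while STRUCT is similar to a struct in C: it encapsulates a set of named fields. Complex data types can be nested to arbitrary depth.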
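The three access patterns described above have close analogues in Python, which may help readers who know Python better than Hive. This is only an analogy built from the John/Doe example in the text; it is not Hive syntax, and the variable names are made up for illustration.

```python
from collections import namedtuple

# STRUCT analogue: named fields, accessed with dot notation.
Name = namedtuple("Name", ["first", "last"])
struct_col = Name(first="John", last="Doe")

# MAP analogue: key-value pairs, accessed by key in brackets.
map_col = {"first": "John", "last": "Doe"}

# ARRAY analogue: ordered elements, indexed from zero.
array_col = ["John", "Doe"]

struct_first = struct_col.first   # like field.first
map_last = map_col["last"]        # like fieldname['last']
array_second = array_col[1]       # like arrayname[1]
```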

Accessing Kerberos-Authenticated HDFS Files from Java

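Kerberos is a computer network authentication protocol used to verify the identity of communicating parties securely over an insecure network. For the role and principles of Kerberos in Hadoop's HDFS access, please consult the relevant documentation. The first time I used Kerberos to access HDFS on a project I did not understand it, spent a long time digging through material, and ran into quite a few pitfalls, so I am sharing this to save others the trouble. The code below has been tested in a real project and works: package zqmKerberos; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.FileStatus; import org.apache.hadoop.fs.FileSystem; import org.apache.hadoop.security.UserGroupInformation; import java.io.BufferedReader; import java.io.IOException; import java.io.InputStream; import java.io.InputStreamReader; import java.text.SimpleDateFormat; import java.util.Date; import java.util.HashMap; import java.util.UUID;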

Apache Spark: batch processing of files

Question: I have directories and subdirectories set up on HDFS, and I'd like to preprocess all the files before loading them all at once into memory. I basically have big files (1MB) that once processed will be more like 1KB, and I then call sc.wholeTextFiles to get started with my analysis. How do I loop over each file (*.xml) in my directories/subdirectories, apply an operation (for the example's sake, say, keep the first line), and then dump the result back to HDFS (as a new file, say .xmlr)? Answer 1: I'd
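Since the answer is cut off here, the following is only a hedged sketch of the preprocessing step the question describes, not the answerer's method. It assumes an existing SparkContext is passed in as sc and relies on sc.wholeTextFiles returning (path, content) pairs; the helper names first_line, to_xmlr, and reduce_files are hypothetical. Writing one output file per input file is deliberately left out.

```python
def first_line(content: str) -> str:
    """Keep only the first line of a file's content (empty if no lines)."""
    lines = content.splitlines()
    return lines[0] if lines else ""


def to_xmlr(path: str) -> str:
    """Map an input path ending in .xml to the matching .xmlr path."""
    if path.endswith(".xml"):
        return path[: -len(".xml")] + ".xmlr"
    return path


def reduce_files(sc, input_glob):
    """Return an RDD of (output_path, reduced_content) pairs.

    sc is assumed to be a live SparkContext; input_glob is whatever
    path pattern matches the .xml files on the cluster.
    """
    return sc.wholeTextFiles(input_glob).map(
        lambda kv: (to_xmlr(kv[0]), first_line(kv[1]))
    )
```

Because each file shrinks from roughly 1MB to 1KB, the resulting pairs are small enough that further analysis can start directly from this RDD.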

Know the disk space of data nodes in hadoop?

Question: Is there a way, or a command, to find out the disk space of each datanode or the total disk space of the cluster? I tried the command dfs -du -h / but it seems that I do not have permission to execute it for many directories and hence cannot get the actual disk space. Answer 1: From the UI: http://namenode:50070/dfshealth.html#tab-datanode ---> which will give you all the details about the datanodes. From the command line: to get the disk space of each datanode: sudo -u hdfs hdfs dfsadmin -report --->
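The report printed by hdfs dfsadmin -report is plain "Key: value" text, so it can be post-processed with a few lines of code. The sketch below assumes the usual layout, a cluster summary followed by one block per datanode that starts with a "Name:" line; the exact fields vary by Hadoop version. The sample text and its numbers are made up for illustration, and the function names are hypothetical.

```python
import re


def parse_dfsadmin_report(text: str):
    """Split a report into a cluster summary dict and one dict per datanode."""
    summary, nodes, current = {}, [], None
    for raw in text.splitlines():
        key, sep, value = raw.partition(":")
        key, value = key.strip(), value.strip()
        if not sep or not value:
            continue  # skip blank lines, separators, and section headers
        if key == "Name":
            current = {}  # a "Name:" line opens a new datanode block
            nodes.append(current)
        target = summary if current is None else current
        target[key] = value
    return summary, nodes


def leading_bytes(value: str) -> int:
    """Extract the leading byte count from a value like '2000 (1.95 KB)'."""
    match = re.match(r"\d+", value)
    return int(match.group()) if match else 0


# Made-up sample in the assumed layout, not real cluster output.
SAMPLE = """\
Configured Capacity: 2000 (1.95 KB)
DFS Used: 500 (500 B)
DFS Remaining: 1500 (1.46 KB)

Live datanodes (2):

Name: 10.0.0.1:50010 (node1)
Configured Capacity: 1000 (1000 B)
DFS Remaining: 700 (700 B)

Name: 10.0.0.2:50010 (node2)
Configured Capacity: 1000 (1000 B)
DFS Remaining: 800 (800 B)
"""

summary, nodes = parse_dfsadmin_report(SAMPLE)
```

On a real cluster the text would come from running the command in the answer, for example through subprocess.run with capture_output=True and text=True.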

Does dataFrameWriter partitionBy shuffle the data?

Question: I have data partitioned in one way, and I just want to partition it in another. So it is basically going to be something like this: sqlContext.read().parquet("...").write().partitionBy("...").parquet("...") I wonder whether this will trigger a shuffle or whether all data will be repartitioned locally, because in this context a partition means just a directory in HDFS, and data from the same partition doesn't have to be on the same node to be written to the same directory in HDFS. Answer 1: Neither partitionBy nor bucketBy
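The intuition in the question can be made concrete with a toy model in plain Python. This is not Spark code and not Spark's implementation: it only simulates write tasks that each handle their own input partition and route rows into per-value directories without exchanging data. The file naming is illustrative, and write_partitioned is a hypothetical helper.

```python
from collections import defaultdict


def write_partitioned(input_partitions, key):
    """Simulate one write task per input partition, with no data exchange.

    Each task places its own rows under a directory named after the value
    of `key`. Returns {output_file_path: rows}.
    """
    files = defaultdict(list)
    for task_id, rows in enumerate(input_partitions):
        for row in rows:
            path = f"{key}={row[key]}/part-{task_id:05d}"
            files[path].append(row)
    return dict(files)


# Two input partitions; the value "US" appears in both.
input_partitions = [
    [{"country": "US", "n": 1}, {"country": "DE", "n": 2}],
    [{"country": "US", "n": 3}],
]
files = write_partitioned(input_partitions, "country")
```

In this model the directory country=US ends up with one file per task that saw a US row, which is why writing this way can produce many small files per directory unless the data is repartitioned by the same column first.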

Hive as Used by fengsong97

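Contents: Hive introduction; Hive internal and external tables; Hive partitioned tables; Hive modeling; Hive JDBC. Hive introduction. Hive internal and external tables. Hive internal table (MANAGED_TABLE): a table fully managed by Hive, which manages both the metadata and the data (this is the default, and creating internal tables is recommended). The data is placed under a specific path, hdfs://nameservice/user/hive/warehouse/default.db/user. That path is set by configuration: the hive.metastore.warehouse.dir property in Hive's ${HIVE_HOME}/conf/hive-site.xml points to where Hive table data is stored. Simple table creation example: hive> create table default.user (id int, name string); Hive external table (EXTERNAL_TABLE): typically the data exists first, and the table is then created to map onto that data. Hive manages only the metadata and does not fully manage the data (insert into/overwrite on the table changes the data accordingly, but dropping the table leaves the data in place at its HDFS path). Simple table creation example: hive> create external table default.user_e (id int, name string) > row format delimited > fields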

Spark Introduction, Installation, and Simple Examples

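Spark introduction. Spark is a fast, general-purpose, scalable engine for big data analytics. The Spark ecosystem has grown into a collection of subprojects, including Spark SQL, Spark Streaming, GraphX, and MLlib. Spark is a parallel computing framework for big data based on in-memory computation. Put simply, Spark computes iteratively in memory: each operator keeps its result in memory, and other operators read that result and continue computing. Spark's four characteristics: 1. Fast: Spark implements an efficient DAG execution engine and processes data flows efficiently in memory. 2. Easy to use: Spark offers APIs in Java, Python, and Scala and supports more than 80 high-level operators. It also supports interactive Python and Scala shells, which makes it easy to use a Spark cluster from these shells to try out approaches to a problem. (It relies on external data sources: HDFS, local files, Kafka, Flume, MySQL, ELK.) 3. General: Spark provides a unified solution. It can be used for batch processing, interactive queries (Spark SQL), real-time stream processing (Spark Streaming), machine learning (Spark MLlib), and graph computation (GraphX). These different kinds of processing can all be used seamlessly within one application. 4. Compatibility: Spark integrates easily with other open-source products. For example, Spark can use Hadoop's YARN and Apache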
