HDFS

JUnit and commonly used HDFS API methods

冷眼眸甩不掉的悲伤 Submitted on 2020-01-19 05:23:47
package com.haohaodata.bigdata;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.io.IOUtils;
import org.junit.After;
import org.junit.Before;
import org.junit.Ignore;
import org.junit.Test;

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.net.URI;

/**
 * Created by hager on 2020/1/12.
 * JUnit and HDFS API programming
 *
 * 1. Any method you want to test must carry the @Test annotation.
 * 2. @Before and @After run before and after each test method, respectively
 *    (i.e. every test method triggers @Before and @After once).
 * 3.
 */
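The excerpt stops at point 3, but the annotations above are enough to sketch how such a test class is usually laid out. The continuation below is a minimal sketch, not the author's original class: the cluster URI hdfs://hadoop000:8020, the hadoop user, and the example paths are assumptions you would replace with your own.

public class HDFSApiTest {

    // Hypothetical cluster address and user; adjust to your environment.
    private static final String HDFS_URI = "hdfs://hadoop000:8020";
    private static final String HDFS_USER = "hadoop";

    private Configuration configuration;
    private FileSystem fileSystem;

    @Before
    public void setUp() throws Exception {
        // Runs before every @Test method: open a FileSystem handle.
        configuration = new Configuration();
        fileSystem = FileSystem.get(new URI(HDFS_URI), configuration, HDFS_USER);
    }

    @Test
    public void mkdir() throws Exception {
        // Create a directory on HDFS.
        fileSystem.mkdirs(new Path("/hdfsapi/test"));
    }

    @Test
    public void copyFromLocalFile() throws Exception {
        // Upload a local file to HDFS (the local path is made up).
        fileSystem.copyFromLocalFile(new Path("/tmp/hello.txt"),
                new Path("/hdfsapi/test/hello.txt"));
    }

    @After
    public void tearDown() throws Exception {
        // Runs after every @Test method: release the resources opened in @Before.
        if (fileSystem != null) {
            fileSystem.close();
        }
        fileSystem = null;
        configuration = null;
    }
}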

HDFS Study Notes (3) – Namenode and Datanode

落花浮王杯 Submitted on 2020-01-19 03:32:32
Personal site, still being organized; you are welcome to visit: http://shitouer.cn The post has been updated with new content; for details see: HDFS Study Notes (3) – Namenode and Datanode

An HDFS cluster runs in master-slave mode and has two kinds of nodes: one Namenode (the master) and multiple Datanodes (the slaves).

HDFS Architecture:

Namenode

The Namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata of all files and directories in that tree. This information is kept in two files, the namespace image and the edit log; it is cached in RAM, and both files are also persisted to local disk. The Namenode also records which Datanodes hold the blocks of each file, but it does not persist this information, because it is rebuilt from the Datanodes when the system starts.

The Namenode structure can be abstracted as in the figure.

The client acts on behalf of the user and interacts with the Namenode and Datanodes to access the whole filesystem. The client exposes a filesystem interface, so when programming we can get what we need with almost no knowledge of the Namenode and Datanodes.

Datanode
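To make the last point concrete (the client only sees the filesystem interface, while the Namenode answers the metadata queries), here is a minimal Java sketch; the namenode URI hdfs://hadoop000:8020, the hadoop user, and the root path / are assumptions. It lists files and asks which Datanodes hold each block, which is exactly the block-location information the Namenode keeps in memory.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;
import java.util.Arrays;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        // Hypothetical namenode address; replace with your cluster's fs.defaultFS.
        FileSystem fs = FileSystem.get(new URI("hdfs://hadoop000:8020"),
                new Configuration(), "hadoop");

        // The client only talks to the FileSystem abstraction; the Namenode answers
        // these metadata queries, while block data would be read from the Datanodes.
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath() + " len=" + status.getLen());
            if (status.isFile()) {
                BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
                for (BlockLocation block : blocks) {
                    // Each block reports the Datanodes holding a replica.
                    System.out.println("  block hosts: " + Arrays.toString(block.getHosts()));
                }
            }
        }
        fs.close();
    }
}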

Basic Hive SQL syntax (DDL)

◇◆丶佛笑我妖孽 Submitted on 2020-01-18 11:20:15
Preface: from the earlier posts we know that Hive can store its metadata in a relational database and that Hive provides fairly complete SQL support. This post introduces Hive's basic SQL syntax. First, Hive's data storage structure, abstracted as follows:

1. Database: Hive contains multiple databases; the default one is default, which maps to the HDFS directory /user/hadoop/hive/warehouse and can be configured through the hive.metastore.warehouse.dir parameter (set in hive-site.xml).
2. Table: Hive tables are divided into internal (managed) tables and external tables. Each Hive table maps to an HDFS directory: /user/hadoop/hive/warehouse/[databasename.db]/table.
3. Partition: each table can have one or more partitions, which make queries more convenient and efficient; HDFS gets a matching partition directory under /user/hadoop/hive/warehouse/[databasename.db]/table.
4. Bucket: not covered for now.

DDL operations (Data Definition Language): refer to the official DDL documentation. HiveQL DDL statements are documented here, including: CREATE DATABASE
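To tie the storage layout above to actual DDL, here is a short HiveQL sketch; the database, table, and column names are invented for illustration and are not from the original post.

-- Creates the directory .../warehouse/demo_db.db on HDFS
CREATE DATABASE IF NOT EXISTS demo_db;

-- An internal (managed) table with one partition column; each loaded partition
-- gets its own directory, e.g. .../demo_db.db/page_views/dt=2020-01-18/
CREATE TABLE IF NOT EXISTS demo_db.page_views (
  user_id STRING,
  url     STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';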

A brief introduction to basic Sqoop syntax

风格不统一 Submitted on 2020-01-18 11:19:56
1. View the command help

[hadoop@hadoop000 ~]$ sqoop help
usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  import-mainframe   Import datasets from a mainframe server to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List
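Of these commands, import is probably the most commonly used. As a hedged illustration only (the MySQL URL, credentials, table name, and target directory below are placeholders, not values from the original post), a basic import looks like:

[hadoop@hadoop000 ~]$ sqoop import \
  --connect jdbc:mysql://localhost:3306/testdb \
  --username root \
  --password 123456 \
  --table emp \
  --target-dir /user/hadoop/emp \
  --delete-target-dir \
  -m 1

Here -m 1 runs the import with a single map task, and --delete-target-dir removes the target directory first so the job can be re-run.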

Hadoop exceptions

无人久伴 Submitted on 2020-01-17 23:03:18
1. org.apache.pig.backend.executionengine.ExecException: ERROR 4010: Cannot find hadoop configurations in classpath (neither hadoop-site.xml nor core-site.xml was found in the classpath). If you plan to use local mode, please put -x local option in command line

The message clearly says that the Hadoop-related configuration files cannot be found, so we need to add the "conf" subdirectory of the Hadoop installation directory to the PATH environment variable:

#set java environment
PIG_HOME=/home/hadoop/pig-0.9.2
HBASE_HOME=/home/hadoop/hbase-0.94.3
HIVE_HOME=/home/hadoop/hive-0.9.0
HADOOP_HOME=/home/hadoop/hadoop-1.1.1
JAVA_HOME=/home/hadoop/jdk1.7.0
PATH=$JAVA_HOME/bin:$PIG_HOME/bin:$HBASE_HOME/bin:$HIVE_HOME/bin:$HADOOP

Pyspark, error:input doesn't have expected number of values required by the schema and extra trailing comma after columns

我只是一个虾纸丫 Submitted on 2020-01-17 18:51:09
Question

First I made two tables (RDDs) using the following commands:

rdd1 = sc.textFile('checkouts').map(lambda line: line.split(',')).map(lambda fields: ((fields[0], fields[3], fields[5]), 1))
rdd2 = sc.textFile('inventory2').map(lambda line: line.split(',')).map(lambda fields: ((fields[0], fields[8], fields[10]), 1))

The keys in the first RDD are BibNum, ItemCollection and CheckoutDateTime. And when I checked the values of the first RDD using rdd1.take(2), it shows

[((u'BibNum', u'ItemCollection', u'CheckoutDateTime'), 1
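The excerpt is cut off here, but the take(2) output already hints at the problem: the CSV header row (BibNum, ItemCollection, CheckoutDateTime) is being turned into a key tuple like any data row. Below is a minimal PySpark sketch of the usual workaround, filtering out the header before building the pairs; this is an illustration under that assumption, not the accepted answer to the question.

# Assumes sc is the active SparkContext, as in the question.
rdd1_raw = sc.textFile('checkouts')
header1 = rdd1_raw.first()                         # the 'BibNum,...' header line
rdd1 = (rdd1_raw
        .filter(lambda line: line != header1)      # drop the header row
        .map(lambda line: line.split(','))
        .map(lambda fields: ((fields[0], fields[3], fields[5]), 1)))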

Hadoop Hive query files from hdfs

拈花ヽ惹草 Submitted on 2020-01-17 08:10:39
Question

If I build Hive on top of HDFS, do I need to put all the files into the hive/warehouse folder before processing them? Can I query any file that is in HDFS with Hive? How?

Answer 1:

You don't have to do anything special in order to run Hive on top of your existing HDFS cluster. This happens by virtue of Hive's architecture. Hive by default runs on HDFS.

"Do I need to put all the files into the hive/warehouse folder before processing them?" You don't have to do this either. When you create a Hive table and
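The answer is truncated at this point. The usual way to query files that already sit somewhere in HDFS without copying them into the warehouse is an external table pointing at their directory; the path and schema below are made-up placeholders, not part of the original answer.

CREATE EXTERNAL TABLE access_logs (line STRING)
LOCATION '/data/existing/logs';

SELECT COUNT(*) FROM access_logs;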

Gzipping Har Files on HDFS using Spark

余生长醉 Submitted on 2020-01-17 06:41:11
Question

I have huge data in Hadoop archive (.har) format. Since HAR doesn't include any compression, I am trying to further gzip it and store it in HDFS. The only thing I can get to work without error is:

harFile.coalesce(1, "true")
  .saveAsTextFile("hdfs://namenode/archive/GzipOutput", classOf[org.apache.hadoop.io.compress.GzipCodec])
// `coalesce` because Gzip isn't splittable.

But this doesn't give me the correct results. A gzipped file is generated, but with invalid output (a single line saying
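The question is cut off here, so no answer survives in this excerpt. The invalid output is most likely because saveAsTextFile treats the archive's binary part files as lines of text. Below is a hedged Scala sketch of an alternative: stream each entry of the archive through a GzipCodec via the har:// filesystem. The archive URI, the output directory, and the assumption of a flat archive are all placeholders, not the asker's actual setup.

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.hadoop.util.ReflectionUtils

val conf = new Configuration()
// Placeholder URIs: "har://hdfs-namenode/archive/input.har" exposes the archive
// contents as a read-only filesystem; the output goes back to plain HDFS.
val harUri = "har://hdfs-namenode/archive/input.har"
val harFs  = FileSystem.get(new URI(harUri), conf)
val outFs  = FileSystem.get(new URI("hdfs://namenode"), conf)
val codec  = ReflectionUtils.newInstance(classOf[GzipCodec], conf)

// Assumes a flat archive (no nested directories inside the .har).
for (status <- harFs.listStatus(new Path(harUri)) if status.isFile) {
  val in  = harFs.open(status.getPath)
  val out = codec.createOutputStream(
    outFs.create(new Path("/archive/GzipOutput/" + status.getPath.getName + ".gz")))
  IOUtils.copyBytes(in, out, conf, true) // copies and closes both streams
}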