HDFS

“Connection refused” Error for Namenode-HDFS (Hadoop Issue)

不问归期 Submitted on 2020-01-01 09:19:45
Question: All my nodes appear to be up and running when I check with the jps command, but I am still unable to connect to the HDFS filesystem. Whenever I click "Browse the filesystem" on the Hadoop Namenode localhost:8020 page, the error I get is Connection Refused. I have also tried formatting and restarting the namenode, but the error persists. Can anyone please help me solve this issue? Answer 1: Check whether all your services (JobTracker, Jps, NameNode, DataNode, TaskTracker) are running by running jps
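A quick way to test the same connection outside the web UI is to hit the namenode RPC address from a small Java client. The sketch below assumes the namenode is at hdfs://localhost:8020 (adjust this to whatever fs.defaultFS says in core-site.xml); a "Connection refused" from this code usually means the namenode process is not actually listening on that host and port.

    // Minimal connectivity check against the namenode RPC port (sketch; the
    // fs.defaultFS value below is an assumption -- use your core-site.xml setting).
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsConnectivityCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://localhost:8020");  // assumed namenode address
            try (FileSystem fs = FileSystem.get(conf)) {
                // If this listing succeeds, the namenode RPC endpoint is reachable.
                for (FileStatus status : fs.listStatus(new Path("/"))) {
                    System.out.println(status.getPath());
                }
            }
        }
    }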

How to subtract months from date in HIVE

不羁岁月 Submitted on 2020-01-01 06:36:30
Question: I am looking for a way to subtract months from a date in Hive. I have the date 2015-02-01 and need to subtract 2 months from it, so that the result is 2014-12-01. Can you guys help me out here? Answer 1:
    select add_months('2015-02-01', -2);
If you need to go back to the first day of the resulting month:
    select add_months(trunc('2015-02-01','MM'), -2);
Answer 2: Please try the add_months date function and pass -2 as the number of months. Internally add_months uses the Java Calendar.add method, which
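As a side illustration of the behaviour Answer 2 mentions (add_months delegating to Java's Calendar.add), here is a minimal plain-Java sketch of the same two-month subtraction, run outside Hive:

    // Plain-Java illustration of the month arithmetic referred to in Answer 2:
    // Calendar.add with a negative month offset.
    import java.text.SimpleDateFormat;
    import java.util.Calendar;

    public class SubtractMonths {
        public static void main(String[] args) throws Exception {
            SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
            Calendar cal = Calendar.getInstance();
            cal.setTime(fmt.parse("2015-02-01"));
            cal.add(Calendar.MONTH, -2);                     // step back two months
            System.out.println(fmt.format(cal.getTime()));   // prints 2014-12-01
        }
    }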

Sqoop -- a tool for importing and exporting data between Hadoop and relational databases

心不动则不痛 Submitted on 2019-12-31 17:05:49
Reposted from: https://blog.csdn.net/qx12306/article/details/67014096
Sqoop is an open-source tool used mainly to move data between Hadoop-related stores (HDFS, Hive, HBase) and traditional relational databases (MySQL, Oracle, etc.). Sqoop started out as a third-party module for Hadoop and was later spun off into its own Apache project. Besides relational databases, Sqoop also provides connectors for some NoSQL databases.
1. Sqoop basics
The Sqoop project began in 2009. It imports and exports data between Hadoop-related stores and traditional relational databases, launching multiple MapReduce tasks to move the data in parallel and thereby improving throughput.
2. Sqoop installation
Version used in this example: sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz.
1) Upload the archive to /usr/local/, extract it, and rename the resulting directory to sqoop.
2) Configure the environment variables: run vi /etc/profile, add export SQOOP_HOME=/usr/local/sqoop, append $SQOOP_HOME/bin to the exported PATH, and then run source /etc/profile so the changes take effect immediately.
3) Copy the JDBC driver of the database you need to connect to into the lib directory
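Once the installation above is in place, imports can be launched either from the sqoop command line or programmatically. The following is a minimal, hypothetical sketch of driving an import from Java via Sqoop 1.4.x's runTool entry point; the JDBC URL, credentials, table name, and target directory are made-up placeholders, and the Sqoop and JDBC driver jars are assumed to be on the classpath:

    // Hypothetical sketch: launching a Sqoop import from Java instead of the CLI.
    // All connection details below are placeholders, not values from the article.
    import org.apache.sqoop.Sqoop;

    public class SqoopImportExample {
        public static void main(String[] args) {
            String[] importArgs = {
                "import",
                "--connect", "jdbc:mysql://dbhost:3306/testdb",  // placeholder JDBC URL
                "--username", "user",
                "--password", "secret",
                "--table", "orders",                             // placeholder table
                "--target-dir", "/user/hadoop/orders",           // placeholder HDFS directory
                "--num-mappers", "4"                             // parallel MapReduce tasks
            };
            int exitCode = Sqoop.runTool(importArgs);
            System.exit(exitCode);
        }
    }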

Read a text file from HDFS line by line in mapper

天大地大妈咪最大 Submitted on 2019-12-31 14:46:15
Question: Is the following code for Mappers, reading a text file from HDFS, right? And if it is: what happens if two mappers on different nodes try to open the file at almost the same time? Isn't there a need to close the InputStreamReader? If so, how do I do that without closing the filesystem? My code is:
    Path pt = new Path("hdfs://pathTofile");
    FileSystem fs = FileSystem.get(context.getConfiguration());
    BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(pt)));
    String line;
    line = br.readLine
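One possible way to address the close-the-reader part of the question (a sketch, not necessarily the poster's final code) is to read the file in setup(), close the BufferedReader with try-with-resources, and leave the FileSystem alone, since FileSystem.get() hands back a shared cached instance:

    // Sketch: reading a side file from HDFS inside a Mapper's setup() method.
    // The path is the placeholder from the question; closing the reader closes the
    // underlying stream, while the cached FileSystem object is deliberately left open.
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class SideFileMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            Path pt = new Path("hdfs://pathTofile");  // placeholder path from the question
            FileSystem fs = FileSystem.get(context.getConfiguration());
            try (BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(pt)))) {
                String line;
                while ((line = br.readLine()) != null) {
                    // use each line, e.g. cache it for later map() calls
                }
            }
            // Do not call fs.close(): FileSystem.get() returns a shared, cached instance.
        }
    }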

PySpark: read, map and reduce from multiline record textfile with newAPIHadoopFile

情到浓时终转凉″ Submitted on 2019-12-31 04:45:09
Question: I'm trying to solve a problem that is similar to this post. My original data is a text file that contains values (observations) from several sensors. Each observation comes with a timestamp, but the sensor name is given only once, not on every line, and there are several sensors in one file.
    Time MHist::852-YF-007
    2016-05-10 00:00:00 0
    2016-05-09 23:59:00 0
    2016-05-09 23:58:00 0
    2016-05-09 23:57:00 0
    2016-05-09 23:56:00 0
    2016-05-09 23:55:00 0
    2016-05-09 23:54:00 0
    2016-05-09 23:53
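One common approach to this kind of multi-line record is to change the Hadoop record delimiter so that each sensor block becomes a single record. The sketch below uses the Java Spark API rather than PySpark (to keep all examples in this digest in one language), and it assumes each sensor section starts with a header of the form "Time<TAB>MHist::...", which is only a guess about the raw format:

    // Hedged sketch: split the file into one record per sensor block by setting a
    // custom record delimiter and reading with newAPIHadoopFile. The delimiter string
    // and the input path are assumptions, not values from the question.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class MultilineSensorRecords {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("sensors"));
            Configuration conf = new Configuration();
            // Every occurrence of the sensor header starts a new record.
            conf.set("textinputformat.record.delimiter", "Time\tMHist::");
            JavaRDD<String> records = sc
                .newAPIHadoopFile("hdfs:///path/to/sensors.txt",   // assumed input path
                                  TextInputFormat.class, LongWritable.class, Text.class, conf)
                .map(pair -> pair._2().toString())                 // one multi-line block per record
                .filter(block -> !block.trim().isEmpty());
            records.take(2).forEach(System.out::println);
            sc.stop();
        }
    }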

Upload data to HDFS with Java API

岁酱吖の Submitted on 2019-12-31 03:44:07
Question: I've been searching for some time now and none of the solutions seem to work for me. It's pretty straightforward: I want to upload data from my local file system to HDFS using the Java API. The Java program will be run on a host that has been configured to talk to a remote Hadoop cluster through the shell (i.e. hdfs dfs -ls, etc.). I have included the following dependencies in my project: hadoop-core:1.2.1, hadoop-common:2.7.1, hadoop-hdfs:2.7.1. I have code that looks like the following: File localDir = ...;
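For reference, a minimal upload sketch against the 2.7.1 client API looks like the following; the namenode URI and both paths are placeholders rather than values from the question. Under these assumptions it is also worth double-checking the dependency list above, since mixing hadoop-core 1.2.1 with the 2.7.1 artifacts can put conflicting Hadoop classes on the classpath.

    // Hedged sketch of a local-to-HDFS upload; namenode URI and paths are placeholders.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsUpload {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode-host:8020");  // assumed remote namenode
            try (FileSystem fs = FileSystem.get(conf)) {
                Path src = new Path("/tmp/local-data.csv");          // local source (placeholder)
                Path dst = new Path("/user/hadoop/uploads/");        // HDFS target (placeholder)
                fs.copyFromLocalFile(src, dst);                      // streams the file into HDFS
                System.out.println("Uploaded to " + fs.makeQualified(dst));
            }
        }
    }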

How to explicitly define the datanodes that store a given file in HDFS?

六眼飞鱼酱① Submitted on 2019-12-31 02:41:06
Question: I want to write a script, or something like an .xml file, that explicitly defines which datanodes in a Hadoop cluster store the blocks of a particular file. For example: suppose there are 4 slave nodes and 1 master node (5 nodes in the Hadoop cluster in total), and two files, file01 (size = 120 MB) and file02 (size = 160 MB), with the default block size of 64 MB. Now I want to store one of the two blocks of file01 at slave node1 and the other at slave node2; similarly, one of the three blocks of file02 at slave node1, the second one at
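Stock HDFS decides block placement itself, so pinning specific blocks to specific datanodes generally means plugging a custom BlockPlacementPolicy into the namenode rather than writing a script or .xml file. What can easily be done from client code is to inspect where the blocks of a file actually landed; the sketch below does only that, and the file path is a placeholder:

    // Sketch: list which datanodes host each block of a file (inspection only --
    // this does not pin blocks to nodes). The file path is a placeholder.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocations {
        public static void main(String[] args) throws Exception {
            try (FileSystem fs = FileSystem.get(new Configuration())) {
                FileStatus status = fs.getFileStatus(new Path("/user/hadoop/file01"));  // placeholder
                BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
                for (int i = 0; i < blocks.length; i++) {
                    System.out.println("block " + i + " offset=" + blocks[i].getOffset()
                            + " hosts=" + String.join(",", blocks[i].getHosts()));
                }
            }
        }
    }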

Hadoop: MapReduce study notes

淺唱寂寞╮ Submitted on 2019-12-31 01:11:01
Why MapReduce is needed
It is used to analyze data; computation is convenient and highly reusable, and it works at the file level.
A job involves three processes:
MRAppMaster: manages the whole job
maptask: handles the entire map phase
reducetask: handles the entire reduce phase
Why does the jar need to be uploaded to the cluster?
Because more than one node needs the jar; if it stays on the local machine the other nodes cannot use it, so it is put on the cluster, and the namenode tells the nodes that need it where the jar is located (see the driver sketch after these notes).
What problems does Hadoop solve?
Mainly the storage of massive data and the analysis/computation of massive data.
The three major Hadoop distributions?
The three distributions are Apache, Cloudera, and Hortonworks. The Apache version is the most original (most basic) and is best for getting started; Cloudera is used more in large Internet companies, mainly as CDH; Hortonworks has better documentation.
Hadoop's strengths
1) high reliability, 2) high scalability, 3) high efficiency, 4) high fault tolerance
Hadoop components
1) Hadoop HDFS: a highly reliable, high-throughput distributed file system.
2) Hadoop MapReduce: a distributed offline parallel computing framework.
3) Hadoop YARN: a framework for job scheduling and cluster resource management.
4) Hadoop Common: utility modules that support the other modules.
YARN architecture
1) ResourceManager (RM), 2) NodeManager (NM), 3) ApplicationMaster, 4
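To make the jar-upload point concrete, here is a standard WordCount-style job (a generic sketch, not code from these notes): setJarByClass() is what causes the job jar to be shipped to the cluster so the maptask and reducetask containers started by the MRAppMaster can load the user classes.

    // Standard WordCount: mapper, reducer, and the driver that submits the job jar.
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);              // runs inside maptask containers
                }
            }
        }

        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));  // runs inside reducetask containers
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);            // this ships the jar to the cluster
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }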

Executing sqoop job shell scripts in parallel in Oozie

喜你入骨 Submitted on 2019-12-30 14:49:44
Question: I have a shell script that executes a sqoop job. The script is below:
    #!/bin/bash
    table=$1
    sqoop job --exec ${table}
Now when I pass the table name in the workflow, the sqoop job executes successfully. The workflow is below:
    <workflow-app name="Shell_script" xmlns="uri:oozie:workflow:0.5">
        <start to="shell"/>
        <kill name="Kill">
            <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
        </kill>
        <action name="shell_script">
            <shell xmlns="uri:oozie:shell

Apache Spark on HDFS: read 10k-100k small files at once

梦想的初衷 Submitted on 2019-12-30 11:53:29
Question: I could have up to 100 thousand small files (each 10-50 KB). They are all stored in HDFS with a block size of 128 MB. I have to read them all at once with Apache Spark, as below:
    // return a list of paths to small files
    List<String> paths = getAllPaths();
    // read up to 100000 small files at once into memory
    sparkSession
        .read()
        .parquet(paths)
        .as(Encoders.kryo(SmallFileWrapper.class))
        .coalesce(numPartitions);
Problem: The number of small files is not a problem from the perspective of memory
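For what it's worth, a compilable version of the snippet above could look like the following; SmallFileWrapper, getAllPaths(), and numPartitions are the question's own placeholders (a stub class stands in for SmallFileWrapper here), and the only real change is converting the path list into the String varargs that DataFrameReader.parquet expects:

    // Sketch: same logic as the question, made compilable. The SmallFileWrapper stub
    // and the method parameters stand in for the question's placeholders.
    import java.io.Serializable;
    import java.util.List;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.SparkSession;

    public class SmallFilesRead {

        // Stub standing in for the question's SmallFileWrapper class.
        public static class SmallFileWrapper implements Serializable { }

        public static Dataset<SmallFileWrapper> readAll(SparkSession spark,
                                                        List<String> paths,
                                                        int numPartitions) {
            return spark.read()
                    .parquet(paths.toArray(new String[0]))      // parquet() takes String varargs
                    .as(Encoders.kryo(SmallFileWrapper.class))  // same Kryo encoder as the question
                    .coalesce(numPartitions);                   // reduce the partition count
        }
    }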