HDFS

hadoop 2.2 - datanode doesn't start up

Submitted by a 夏天 on 2020-01-03 02:58:29
Question: I had Hadoop 2.4 this morning (see my previous two questions). I have now removed it and installed 2.2, since I had issues with 2.4 and I believe 2.2 is the latest stable release. I followed the tutorial here: http://codesfusion.blogspot.com/2013/10/setup-hadoop-2x-220-on-ubuntu.html?m=1 I am fairly sure I did everything right, but I am facing similar issues again: when I run jps it is obvious that the datanode is not starting up. What am I doing wrong this time? Any help would be much much

Flink: possible to delete Queryable state after X time?

Submitted by 廉价感情. on 2020-01-03 02:56:07
Question: In my case, I use Flink's queryable state only; in particular, I do not care about checkpoints. Upon an event, I query the queryable state after at most X minutes. Ideally, I would delete the "old" state to save space. That is why I wonder: can I signal Flink's state to clear itself after some time? Through configuration? Through specific event signals? How? Answer 1: One way to clear state is to explicitly call clear() on the state object (e.g., a ValueState object) when you no longer
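The answer is cut off at this point. As a minimal sketch of the pattern it refers to (not part of the original answer, and assuming a recent Flink version that provides KeyedProcessFunction), one common approach is to register a processing-time timer per key and call clear() when it fires; the state name, queryable-state name, and the 10-minute TTL below are illustrative assumptions:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class ExpiringStateFunction extends KeyedProcessFunction<String, Long, Long> {
    private static final long TTL_MS = 10 * 60 * 1000L; // assumed "X minutes"
    private transient ValueState<Long> lastValue;

    @Override
    public void open(Configuration parameters) {
        ValueStateDescriptor<Long> descriptor =
                new ValueStateDescriptor<>("last-value", Long.class);
        // Make the state reachable through the queryable-state client;
        // "query-name" is a hypothetical external name.
        descriptor.setQueryable("query-name");
        lastValue = getRuntimeContext().getState(descriptor);
    }

    @Override
    public void processElement(Long value, Context ctx, Collector<Long> out) throws Exception {
        lastValue.update(value);
        // When this timer fires, the state for the current key is dropped.
        ctx.timerService().registerProcessingTimeTimer(
                ctx.timerService().currentProcessingTime() + TTL_MS);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Long> out) {
        lastValue.clear();
    }
}

A fuller version would also track and cancel stale timers so a key updated shortly before an old timer fires is not cleared too early; the sketch keeps only the core clear-on-timer idea.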

Compressed file ingestion using Flume

Submitted by 你说的曾经没有我的故事 on 2020-01-03 02:28:07
Question: Can I ingest any type of compressed file (say zip, bzip, lz4, etc.) into HDFS using Flume NG 1.3.0? I am planning to use spoolDir. Any suggestion, please. Answer 1: You can ingest any type of file; you just need to select an appropriate deserializer. The route below works for compressed files, and you can adjust the options as you need:

agent.sources = src-1
agent.channels = c1
agent.sinks = k1
agent.sources.src-1.type = spooldir
agent.sources.src-1.channels = c1
agent.sources.src-1.spoolDir = /tmp/myspooldir
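The configuration is cut off at this point in the source. Purely as a hedged sketch of how such a route is often completed (the deserializer class, channel type, sink, and paths below are assumptions, not the original answer's text), the spooling-directory source can be given Flume's BlobDeserializer so each compressed file is passed through as a single opaque event, and an HDFS sink writes the events out unmodified:

# Illustrative continuation (assumed, not taken from the original answer).
# BlobDeserializer forwards each whole file as one event; note it buffers the file in memory.
agent.sources.src-1.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder

agent.channels.c1.type = file

# Hypothetical HDFS sink and target path
agent.sinks.k1.type = hdfs
agent.sinks.k1.channel = c1
agent.sinks.k1.hdfs.path = /user/myevents/
agent.sinks.k1.hdfs.fileType = DataStream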

3. Hadoop Cluster Testing

Submitted by 自作多情 on 2020-01-02 19:03:12
If you have not configured Hadoop yet, see my previous two articles. Verifying the Hadoop distributed cluster: first, create two directories on the HDFS file system, as follows:

hadoop fs -mkdir /data/wordcount
hadoop fs -mkdir /output

The /data/wordcount directory in HDFS holds the input data for Hadoop's bundled WordCount example, and the job writes its results to the /output/wordcount directory. Through the web UI (http://master:50070) we can confirm that the two folders were created successfully. Next, upload the local data files into the HDFS folder; the web UI shows that the upload succeeded. You can also check from the terminal with Hadoop's hdfs commands:

hadoop fs -ls /data/wordcount

Run Hadoop's bundled WordCount example with the following command (see the sketch below for the upload and result-viewing steps that the original post only showed as screenshots):

hadoop jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /data/wordcount /output/wordcount

(that is, hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-2.2.0
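The upload and result-inspection steps above relied on web UI screenshots in the original post. A hedged sketch of the usual commands, assuming a local input file named words.txt (the file name and the part-r-00000 output name are assumptions, not taken from the post):

# Assumed local file name; upload it into the HDFS input directory
hadoop fs -put words.txt /data/wordcount

# After the WordCount job finishes, inspect the output
hadoop fs -ls /output/wordcount
hadoop fs -cat /output/wordcount/part-r-00000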

HDFS space usage on fresh install

Submitted by 一个人想着一个人 on 2020-01-02 18:57:16
Question: I just installed HDFS and launched the service, and there is already more than 800 MB of used space. What does it represent?

$ hdfs dfs -df -h
Filesystem                       Size    Used     Available  Use%
hdfs://quickstart.cloudera:8020  54.5 G  823.7 M  43.4 G     1%

Source: https://stackoverflow.com/questions/43165646/hdfs-space-usage-on-fresh-install

The Five Core Technologies Every Big Data Developer Must Master

Submitted by 痴心易碎 on 2020-01-02 17:07:53
The big data technology stack is large and complex. Its foundations span data collection, data preprocessing, distributed storage, NoSQL databases, data warehouses, machine learning, parallel computing, visualization, and other areas and layers. To start, here is a general-purpose big data processing framework, broken into the following parts: data collection and preprocessing, data storage, data cleansing, data query and analysis, and data visualization.

1. Data collection and preprocessing. Data from all kinds of sources, including mobile internet data and social network data, is a huge volume of structured and unstructured records scattered across so-called data silos, and in that state it has little value. Data collection means writing this data into a data warehouse, consolidating the scattered pieces so they can be analyzed together. It covers collecting file logs and database logs, connecting to relational databases, and integrating with applications. When data volumes are small, a scheduled script can write logs into the storage system, but as volumes grow such methods can no longer guarantee data safety and become hard to operate, so a more robust solution is needed.

Flume NG is a real-time log collection system that supports pluggable data senders for gathering data, performs simple processing on it, and writes it to various receivers (such as text files, HDFS, or HBase). Flume NG uses a three-tier architecture, with an Agent tier, a Collector tier, and a Store tier, each of which can scale horizontally. An Agent consists of a Source, a Channel, and a Sink; the Source consumes (collects) data from the data source into the Channel component

Spark Creates Fewer Partitions Than minPartitions Argument on WholeTextFiles

Submitted by |▌冷眼眸甩不掉的悲伤 on 2020-01-02 10:00:59
Question: I have a folder that contains 14 files. I run spark-submit with 10 executors on a cluster that uses YARN as its resource manager. I create my first RDD like this:

JavaPairRDD<String,String> files = sc.wholeTextFiles(folderPath.toString(), 10);

However, files.getNumPartitions() gives me 7 or 8, seemingly at random. I do not use coalesce/repartition anywhere, so my DAG finishes with 7-8 partitions. As I understand it, the argument we pass is the "minimum" number of partitions, so why does Spark divide my RDD into

Why is hsync() not flushing my hdfs file?

Submitted by 帅比萌擦擦* on 2020-01-02 08:35:11
Question: Despite all the resources on this subject, I have trouble flushing my HDFS files to disk (Hadoop 2.6). Calling FSDataOutputStream.hsync() should do the trick, but it actually only works once, for reasons unknown... Here is a simple unit test that fails:

@Test
public void test() throws InterruptedException, IOException {
    final FileSystem filesys = HdfsTools.getFileSystem();
    final Path file = new Path("myHdfsFile");
    try (final FSDataOutputStream stream = filesys.create(file)) {
        Assert

Troubles writing temp file on datanode with Hadoop

Submitted by 邮差的信 on 2020-01-02 07:43:06
Question: I would like to create a file during my program. However, I don't want this file to be written to HDFS, but to the local filesystem of the datanode where the map operation is executed. I tried the following approach:

public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
    // do some hadoop stuff, like counting words
    String path = "newFile.txt";
    try {
        File f = new File(path);
        f.createNewFile();
    } catch (IOException e) {
        System.out.println("Message easy to look up

Hadoop chunk size vs split vs block size

Submitted by 本小妞迷上赌 on 2020-01-02 07:03:43
Question: I am a little confused about Hadoop concepts. What is the difference between Hadoop chunk size, split size, and block size? Thanks in advance. Answer 1: Block size and chunk size are the same thing; split size may differ from the block/chunk size. The MapReduce algorithm does not work on the physical blocks of a file; it works on logical input splits. An input split depends on where each record was written, and a record may span two mappers. The way HDFS is set up, it breaks very large files down into large
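To make the block-versus-split distinction concrete, here is a small sketch (not from the original answer) of the split-size rule that Hadoop 2.x's FileInputFormat applies: the split defaults to the block size unless the configured minimum/maximum split sizes override it. The wrapper class and the example sizes are illustrative.

public class SplitSizeSketch {
    // minSize comes from mapreduce.input.fileinputformat.split.minsize,
    // maxSize from mapreduce.input.fileinputformat.split.maxsize,
    // and blockSize from dfs.blocksize.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        // With a 128 MB block, a 1-byte minimum and a 256 MB maximum,
        // the split size equals the block size (134217728 bytes).
        System.out.println(computeSplitSize(128L << 20, 1L, 256L << 20));
    }
}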