HDFS

hadoop 2.2 - datanode doesn't start up

Submitted by a 夏天 on 2020-01-03 02:58:29
Question: I had Hadoop 2.4 this morning (see my previous two questions). I have now removed it and installed 2.2, since I had issues with 2.4 and I believe 2.2 is the latest stable release. I followed the tutorial here: http://codesfusion.blogspot.com/2013/10/setup-hadoop-2x-220-on-ubuntu.html?m=1 I am fairly sure I did everything right, but I am facing similar issues again: when I run jps it is obvious that the datanode is not starting up. What am I doing wrong this time? Any help would be much much

Flink: possible to delete Queryable state after X time?

Submitted by 廉价感情. on 2020-01-03 02:56:07
Question: In my case, I use Flink's queryable state only; in particular, I do not care about checkpoints. Upon an event, I query the queryable state after at most X minutes. Ideally, I would delete the "old" state to save space. That is why I wonder: can I signal Flink's state to clear itself after some time? Through configuration? Through specific event signals? How? Answer 1: One way to clear state is to explicitly call clear() on the state object (e.g., a ValueState object) when you no longer
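The answer is cut off at this point. As a minimal sketch of the pattern it refers to (not part of the original answer, and assuming a recent Flink version that provides KeyedProcessFunction), one common approach is to register a processing-time timer per key and call clear() when it fires; the state name, queryable-state name, and the 10-minute TTL below are illustrative assumptions:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class ExpiringStateFunction extends KeyedProcessFunction<String, Long, Long> {
    private static final long TTL_MS = 10 * 60 * 1000L; // assumed "X minutes"
    private transient ValueState<Long> lastValue;

    @Override
    public void open(Configuration parameters) {
        ValueStateDescriptor<Long> descriptor =
                new ValueStateDescriptor<>("last-value", Long.class);
        // Make the state reachable through the queryable-state client;
        // "query-name" is a hypothetical external name.
        descriptor.setQueryable("query-name");
        lastValue = getRuntimeContext().getState(descriptor);
    }

    @Override
    public void processElement(Long value, Context ctx, Collector<Long> out) throws Exception {
        lastValue.update(value);
        // When this timer fires, the state for the current key is dropped.
        ctx.timerService().registerProcessingTimeTimer(
                ctx.timerService().currentProcessingTime() + TTL_MS);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Long> out) {
        lastValue.clear();
    }
}

A fuller version would also track and cancel stale timers so a key updated shortly before an old timer fires is not cleared too early; the sketch keeps only the core clear-on-timer idea.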

Compressed file ingestion using Flume

Submitted by 你说的曾经没有我的故事 on 2020-01-03 02:28:07
Question: Can I ingest any type of compressed file (say zip, bzip, lz4, etc.) into HDFS using Flume NG 1.3.0? I am planning to use spoolDir. Any suggestion, please. Answer 1: You can ingest any type of file; you just need to select an appropriate deserializer. The route below works for compressed files, and you can adjust the options as you need:

agent.sources = src-1
agent.channels = c1
agent.sinks = k1
agent.sources.src-1.type = spooldir
agent.sources.src-1.channels = c1
agent.sources.src-1.spoolDir = /tmp/myspooldir
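The configuration is cut off at this point in the source. Purely as a hedged sketch of how such a route is often completed (the deserializer class, channel type, sink, and paths below are assumptions, not the original answer's text), the spooling-directory source can be given Flume's BlobDeserializer so each compressed file is passed through as a single opaque event, and an HDFS sink writes the events out unmodified:

# Illustrative continuation (assumed, not taken from the original answer).
# BlobDeserializer forwards each whole file as one event; note it buffers the file in memory.
agent.sources.src-1.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder

agent.channels.c1.type = file

# Hypothetical HDFS sink and target path
agent.sinks.k1.type = hdfs
agent.sinks.k1.channel = c1
agent.sinks.k1.hdfs.path = /user/myevents/
agent.sinks.k1.hdfs.fileType = DataStream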

3. Hadoop Cluster Testing

Submitted by 自作多情 on 2020-01-02 19:03:12
If you have not configured Hadoop yet, see my previous two articles. Verifying the Hadoop distributed cluster: first, create two directories on the HDFS file system, as follows:

hadoop fs -mkdir /data/wordcount
hadoop fs -mkdir /output

The /data/wordcount directory in HDFS holds the input data for Hadoop's bundled WordCount example, and the job writes its results to the /output/wordcount directory. Through the web UI (http://master:50070) we can confirm that the two folders were created successfully. Next, upload the local data files into the HDFS folder; the web UI shows that the upload succeeded. You can also check from the terminal with Hadoop's hdfs commands:

hadoop fs -ls /data/wordcount

Run Hadoop's bundled WordCount example with the following command (see the sketch below for the upload and result-viewing steps that the original post only showed as screenshots):

hadoop jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /data/wordcount /output/wordcount

(that is, hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-2.2.0
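The upload and result-inspection steps above relied on web UI screenshots in the original post. A hedged sketch of the usual commands, assuming a local input file named words.txt (the file name and the part-r-00000 output name are assumptions, not taken from the post):

# Assumed local file name; upload it into the HDFS input directory
hadoop fs -put words.txt /data/wordcount

# After the WordCount job finishes, inspect the output
hadoop fs -ls /output/wordcount
hadoop fs -cat /output/wordcount/part-r-00000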

HDFS space usage on fresh install

Submitted by 一个人想着一个人 on 2020-01-02 18:57:16
Question: I just installed HDFS and launched the service, and there is already more than 800 MB of used space. What does it represent?

$ hdfs dfs -df -h
Filesystem                       Size    Used     Available  Use%
hdfs://quickstart.cloudera:8020  54.5 G  823.7 M  43.4 G     1%

Source: https://stackoverflow.com/questions/43165646/hdfs-space-usage-on-fresh-install

The Five Core Technologies Every Big Data Developer Must Master

Submitted by 痴心易碎 on 2020-01-02 17:07:53
The big data technology stack is large and complex. Its foundations span data collection, data preprocessing, distributed storage, NoSQL databases, data warehouses, machine learning, parallel computing, visualization, and other areas and layers. To start, here is a general-purpose big data processing framework, broken into the following parts: data collection and preprocessing, data storage, data cleansing, data query and analysis, and data visualization.

1. Data collection and preprocessing. Data from all kinds of sources, including mobile internet data and social network data, is a huge volume of structured and unstructured records scattered across so-called data silos, and in that state it has little value. Data collection means writing this data into a data warehouse, consolidating the scattered pieces so they can be analyzed together. It covers collecting file logs and database logs, connecting to relational databases, and integrating with applications. When data volumes are small, a scheduled script can write logs into the storage system, but as volumes grow such methods can no longer guarantee data safety and become hard to operate, so a more robust solution is needed.

Flume NG is a real-time log collection system that supports pluggable data senders for gathering data, performs simple processing on it, and writes it to various receivers (such as text files, HDFS, or HBase). Flume NG uses a three-tier architecture, with an Agent tier, a Collector tier, and a Store tier, each of which can scale horizontally. An Agent consists of a Source, a Channel, and a Sink; the Source consumes (collects) data from the data source into the Channel component

Spark Creates Fewer Partitions Than minPartitions Argument on WholeTextFiles

Submitted by |▌冷眼眸甩不掉的悲伤 on 2020-01-02 10:00:59
Question: I have a folder that contains 14 files. I run spark-submit with 10 executors on a cluster that uses YARN as its resource manager. I create my first RDD like this:

JavaPairRDD<String,String> files = sc.wholeTextFiles(folderPath.toString(), 10);

However, files.getNumPartitions() gives me 7 or 8, seemingly at random. I do not use coalesce/repartition anywhere, so my DAG finishes with 7-8 partitions. As I understand it, the argument we pass is the "minimum" number of partitions, so why does Spark divide my RDD into

Why is hsync() not flushing my hdfs file?

Submitted by 帅比萌擦擦* on 2020-01-02 08:35:11
Question: Despite all the resources on this subject, I have trouble flushing my HDFS files to disk (Hadoop 2.6). Calling FSDataOutputStream.hsync() should do the trick, but it actually only works once, for reasons unknown... Here is a simple unit test that fails:

@Test
public void test() throws InterruptedException, IOException {
    final FileSystem filesys = HdfsTools.getFileSystem();
    final Path file = new Path("myHdfsFile");
    try (final FSDataOutputStream stream = filesys.create(file)) {
        Assert

Troubles writing temp file on datanode with Hadoop

Submitted by 邮差的信 on 2020-01-02 07:43:06
Question: I would like to create a file during my program. However, I don't want this file to be written to HDFS, but to the local filesystem of the datanode where the map operation is executed. I tried the following approach:

public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
    // do some hadoop stuff, like counting words
    String path = "newFile.txt";
    try {
        File f = new File(path);
        f.createNewFile();
    } catch (IOException e) {
        System.out.println("Message easy to look up

Hadoop chunk size vs split vs block size

Submitted by 本小妞迷上赌 on 2020-01-02 07:03:43
Question: I am a little confused about Hadoop concepts. What is the difference between Hadoop chunk size, split size, and block size? Thanks in advance. Answer 1: Block size and chunk size are the same thing; split size may differ from the block/chunk size. The MapReduce algorithm does not work on the physical blocks of a file; it works on logical input splits. An input split depends on where each record was written, and a record may span two mappers. The way HDFS is set up, it breaks very large files down into large
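To make the block-versus-split distinction concrete, here is a small sketch (not from the original answer) of the split-size rule that Hadoop 2.x's FileInputFormat applies: the split defaults to the block size unless the configured minimum/maximum split sizes override it. The wrapper class and the example sizes are illustrative.

public class SplitSizeSketch {
    // minSize comes from mapreduce.input.fileinputformat.split.minsize,
    // maxSize from mapreduce.input.fileinputformat.split.maxsize,
    // and blockSize from dfs.blocksize.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        // With a 128 MB block, a 1-byte minimum and a 256 MB maximum,
        // the split size equals the block size (134217728 bytes).
        System.out.println(computeSplitSize(128L << 20, 1L, 256L << 20));
    }
}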