Question
I am working on a cluster where a dataset is kept in HDFS in a distributed manner. Here is what I have:
[hmi@bdadev-5 ~]$ hadoop fs -ls /bdatest/clm/data/
Found 1840 items
-rw-r--r-- 3 bda supergroup 0 2015-08-11 00:32 /bdatest/clm/data/_SUCCESS
-rw-r--r-- 3 bda supergroup 34404390 2015-08-11 00:32 /bdatest/clm/data/part-00000
-rw-r--r-- 3 bda supergroup 34404062 2015-08-11 00:32 /bdatest/clm/data/part-00001
-rw-r--r-- 3 bda supergroup 34404259 2015-08-11 00:32 /bdatest/clm/data/part-00002
....
....
The data is of the form:
[hmi@bdadev-5 ~]$ hadoop fs -cat /bdatest/clm/data/part-00000|head
V|485715986|1|8ca217a3d75d8236|Y|Y|Y|Y/1X||Trimode|SAMSUNG|1x/Trimode|High|Phone|N|Y|Y|Y|N|Basic|Basic|Basic|Basic|N|N|N|N|Y|N|Basic-Communicator|Y|Basic|N|Y|1X|Basic|1X|||SAM|Other|SCH-A870|SCH-A870|N|N|M2MC|
So, what I want to do is count the total number of lines in the original data file data. My understanding is that the distributed chunks like part-00000, part-00001 etc. have overlaps, so just counting the number of lines in the part-xxxx files and summing them won't work. Also, the original dataset data is ~70 GB in size. How can I efficiently find out the total number of lines?
Answer 1:
More efficiently, you can use Spark to count the number of lines. The following snippet (using the SparkContext, sc) does this:
text_file = sc.textFile("hdfs://...")
count = text_file.count()
print(count)
This prints the total number of lines.
Note: the data in different part files does not overlap.
Using hdfs dfs -cat /bdatest/clm/data/part-* | wc -l will also give you the answer, but it streams all of the data to the local machine and takes longer.
The best solution is to use MapReduce or Spark. MapReduce takes longer to develop and execute. If Spark is installed, it is the best choice.
Answer 2:
If you just need to find the number of lines in the data, you can use the following command:
hdfs dfs -cat /bdatest/clm/data/part-* | wc -l
You can also write a simple MapReduce program with an identity mapper that emits its input as output. Then check the job counters and look at the mapper's input records; that will be the number of lines in your data. A minimal mapper sketch is shown below.
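For example, with Hadoop Streaming the identity mapper can be a few lines of Python. This is only an illustrative sketch; the file name identity_mapper.py is not from the original answer:
identity_mapper.py
#!/usr/bin/env python
# Identity mapper: write every input line back out unchanged.
import sys
for line in sys.stdin:
    sys.stdout.write(line)
After the job finishes, the "Map input records" counter in the job summary equals the total number of lines; the job's output directory itself is not needed.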
Answer 3:
Hadoop one-liner:
hadoop fs -cat /bdatest/clm/data/part-* | wc -l
Source: http://www.sasanalysis.com/2014/04/10-popular-linux-commands-for-hadoop.html
Another approach is to create a MapReduce job where the mapper emits a 1 for each line and the reducer sums the values (a Hadoop Streaming sketch of this idea follows). See the accepted answer of Writing MapReduce code for counting number of records for the solution.
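As an illustration only (this is not the code from the linked answer), a Hadoop Streaming version of that idea could look like the following; the file names line_mapper.py and line_reducer.py are hypothetical:
line_mapper.py
#!/usr/bin/env python
# Emit a constant key and a 1 for every input line.
import sys
for _ in sys.stdin:
    print("lines\t1")
line_reducer.py
#!/usr/bin/env python
# Sum the 1s emitted by the mappers; the total is the line count.
import sys
total = 0
for line in sys.stdin:
    total += int(line.rstrip("\n").split("\t")[1])
print("lines\t%d" % total)
Run it with hadoop-streaming.jar, passing -mapper line_mapper.py -reducer line_reducer.py, as in the other answers.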
Answer 4:
This is such a common task that I wish there were a subcommand in fs to do it (e.g. hadoop fs -wc -l inputdir), to avoid streaming all the content to the one machine that runs the "wc -l" command.
To count lines efficiently, I often use hadoop streaming and unix commands as follows:
hadoop jar ${HADOOP_HOME}/hadoop-streaming.jar \
-Dmapred.reduce.tasks=1 \
-input inputdir \
-output outputdir \
-mapper "bash -c 'paste <(echo "count") <(wc -l)'" \
-reducer "bash -c 'cut -f2 | paste -sd+ | bc'"
Every mapper will run "wc -l" on the parts it has and then a single reducer will sum up the counts from all the mappers.
Answer 5:
If you have a very big file where the lines are roughly the same length (I imagine JSON or log entries), and you don't care about precision, you can estimate the line count by dividing the total file size by the length of one line.
For example, I store raw JSON in a file:
Size of the file: 750 MB. Size of the first line: 752 characters (=> 752 bytes)
Lines => about 1,020,091
Running cat | wc -l gives 1,018,932
Not so bad ^^
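As a sketch of that estimate applied to an HDFS dataset (the path, the part file name, and the script name estimate_lines.py are just examples, not from the original answer), assuming all lines are roughly the same length:
estimate_lines.py
#!/usr/bin/env python
# Rough line-count estimate: total size of the dataset divided by the
# length of one representative line.
import subprocess

hdfs_path = "/bdatest/clm/data"  # adjust to your dataset

# Total size in bytes of everything under the path (first field of -du -s).
du_out = subprocess.check_output(["hdfs", "dfs", "-du", "-s", hdfs_path])
total_bytes = int(du_out.split()[0])

# Length in bytes (including the newline) of the first line of one part file.
cat = subprocess.Popen(["hdfs", "dfs", "-cat", hdfs_path + "/part-00000"],
                       stdout=subprocess.PIPE)
first_line = cat.stdout.readline()
cat.kill()

print("estimated lines: %d" % (total_bytes // len(first_line)))
The result is only approximate, but it avoids reading the whole dataset.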
Answer 6:
You can use hadoop streaming for this problem.
This is how you run it:
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming-2.6.0-cdh5.11.0.jar -input <dir> -output <dir> -mapper counter_mapper.py -reducer counter_reducer.py -file counter_mapper.py -file counter_reducer.py
counter_mapper.py
#!/usr/bin/env python
import sys
count = 0
for line in sys.stdin:
    count = count + 1
print count
counter_reducer.py
#!/usr/bin/env python
import sys
count = 0
for line in sys.stdin:
    count = count + int(line)
print count
Source: https://stackoverflow.com/questions/32079372/finding-total-number-of-lines-in-hdfs-distributed-file-using-command-line