Environment: Linux + Hadoop + Python 3
Note that syntax differs between Python versions (e.g. print is a function in Python 3).
Problem to solve: word-frequency counting over a text file.
Hadoop MapReduce processing flow:
input data -> HDFS -> data split -> map -> (shuffle & sort) -> reduce -> output (HDFS)
Input: any plain-text file.
Scripts to prepare:
map.py reduce.py run.sh
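Before submitting anything to Hadoop, the two Python scripts can be smoke-tested with a local Unix pipe, where sort stands in for the shuffle & sort step (this assumes a local copy of the sample text, here named 1.data):

cat 1.data | python map.py | sort -k1,1 | python reduce.py > local_result.txt
head local_result.txt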
(base) [root@pyspark mapreduce]# cat map.py
import sys

# Mapper: read raw text from stdin and emit one "word<TAB>1" pair per word.
for line in sys.stdin:
    wordlist = line.strip().split(' ')
    for word in wordlist:
        print('\t'.join([word.strip(), '1']))
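A one-line sanity check of the mapper (run locally, outside Hadoop):

echo 'hello world hello' | python map.py
# expected: three lines, each word followed by a tab and the digit 1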
--------------------------------------------------------------------------
(base) [root@pyspark mapreduce]# cat reduce.py
import sys

# Reducer: the shuffle & sort phase delivers the mapper output sorted by key,
# so all counts for one word arrive consecutively and a running total suffices.
cur_word = None
sum = 0
for line in sys.stdin:
    wordlist = line.strip().split('\t')
    if len(wordlist) != 2:
        continue
    word, cnt = wordlist
    if cur_word is None:
        cur_word = word
    if cur_word != word:
        # the key changed: emit the finished word, then start counting the new one
        print('\t'.join([cur_word, str(sum)]))
        cur_word = word
        sum = 0
    sum += int(cnt)
# emit the last word (guard against completely empty input)
if cur_word is not None:
    print('\t'.join([cur_word, str(sum)]))
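A quick way to check reduce.py by hand is to feed it a few pre-sorted key/count lines, which is exactly what it receives after the shuffle & sort phase (a throwaway example, not part of the job):

printf 'a\t1\nb\t1\nb\t1\n' | python reduce.py
# expected: one line per word, with counts 1 for a and 2 for b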
----------------------------------------------------------------
(base) [root@pyspark wordcount]# cat run.sh
#!/bin/bash
HADOOP_CMD="/root/hadoop/hadoop-2.9.2/bin/hadoop"
STREAM_JAR_PATH="/root/hadoop/hadoop-2.9.2/hadoop-streaming-2.9.2.jar"
INPUT_FILE_PATH_1="/1.data"
OUTPUT_PATH="/output"

# clear the previous output directory first; the job fails if it already exists
$HADOOP_CMD fs -rmr -skipTrash $OUTPUT_PATH

# maptest.py / reducetest.py hold the mapper/reducer logic (map.py / reduce.py above)
$HADOOP_CMD jar $STREAM_JAR_PATH \
    -input $INPUT_FILE_PATH_1 \
    -output $OUTPUT_PATH \
    -mapper "python maptest.py" \
    -reducer "python reducetest.py" \
    -file ./maptest.py \
    -file ./reducetest.py
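The job log below warns that -file is deprecated; an equivalent submission using the generic -files option could look like the following sketch (generic options such as -files must appear before the streaming-specific options):

$HADOOP_CMD jar $STREAM_JAR_PATH \
    -files ./maptest.py,./reducetest.py \
    -input $INPUT_FILE_PATH_1 \
    -output $OUTPUT_PATH \
    -mapper "python maptest.py" \
    -reducer "python reducetest.py"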
-----------------------------------------------------------------
(base) [root@pyspark wordcount]# sh run.sh
rmr: DEPRECATED: Please use '-rm -r' instead.
Deleted /output
19/11/10 23:49:42 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [./maptest.py, ./reducetest.py, /tmp/hadoop-unjar5889192628432952141/] [] /tmp/streamjob7228039301823824466.jar tmpDir=null
19/11/10 23:49:45 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/11/10 23:49:45 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/11/10 23:49:47 INFO mapred.FileInputFormat: Total input files to process : 1
19/11/10 23:49:48 INFO mapreduce.JobSubmitter: number of splits:2
19/11/10 23:49:48 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
19/11/10 23:49:48 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1573374330013_0006
19/11/10 23:49:49 INFO impl.YarnClientImpl: Submitted application application_1573374330013_0006
19/11/10 23:49:49 INFO mapreduce.Job: The url to track the job: http://pyspark:8088/proxy/application_1573374330013_0006/
19/11/10 23:49:49 INFO mapreduce.Job: Running job: job_1573374330013_0006
19/11/10 23:50:04 INFO mapreduce.Job: Job job_1573374330013_0006 running in uber mode : false
19/11/10 23:50:04 INFO mapreduce.Job: map 0% reduce 0%
19/11/10 23:50:18 INFO mapreduce.Job: map 50% reduce 0%
19/11/10 23:50:19 INFO mapreduce.Job: map 100% reduce 0%
19/11/10 23:50:30 INFO mapreduce.Job: map 100% reduce 100%
19/11/10 23:50:32 INFO mapreduce.Job: Job job_1573374330013_0006 completed successfully
19/11/10 23:50:33 INFO mapreduce.Job: Counters: 50
    File System Counters
        FILE: Number of bytes read=3229845
        FILE: Number of bytes written=7065383
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=1900885
        HDFS: Number of bytes written=183609
        HDFS: Number of read operations=9
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Killed map tasks=1
        Launched map tasks=2
        Launched reduce tasks=1
        Data-local map tasks=2
        Total time spent by all maps in occupied slots (ms)=24785
        Total time spent by all reduces in occupied slots (ms)=9853
        Total time spent by all map tasks (ms)=24785
        Total time spent by all reduce tasks (ms)=9853
        Total vcore-milliseconds taken by all map tasks=24785
        Total vcore-milliseconds taken by all reduce tasks=9853
        Total megabyte-milliseconds taken by all map tasks=25379840
        Total megabyte-milliseconds taken by all reduce tasks=10089472
    Map-Reduce Framework
        Map input records=8598
        Map output records=335454
        Map output bytes=2558931
        Map output materialized bytes=3229851
        Input split bytes=168
        Combine input records=0
        Combine output records=0
        Reduce input groups=16985
        Reduce shuffle bytes=3229851
        Reduce input records=335454
        Reduce output records=16984
        Spilled Records=670908
        Shuffled Maps =2
        Failed Shuffles=0
        Merged Map outputs=2
        GC time elapsed (ms)=687
        CPU time spent (ms)=10610
        Physical memory (bytes) snapshot=731275264
        Virtual memory (bytes) snapshot=6418624512
        Total committed heap usage (bytes)=504365056
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=1900717
    File Output Format Counters
        Bytes Written=183609
19/11/10 23:50:33 INFO streaming.StreamJob: Output directory: /output
-----------------------------------------------------------------
(base) [root@pyspark wordcount]# hdfs dfs -ls /output
Found 2 items
-rw-r--r-- 1 root supergroup 0 2019-11-10 23:50 /output/_SUCCESS
-rw-r--r-- 1 root supergroup 183609 2019-11-10 23:50 /output/part-00000
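To spot-check the result, the part file can be read straight from HDFS; sorting on the second (count) column, numerically and descending, lists the most frequent words first:

hdfs dfs -cat /output/part-00000 | head
hdfs dfs -cat /output/part-00000 | sort -k2,2nr | head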