CSV processing in Hadoop

Submitted by 非 Y 不嫁゛ on 2019-12-23 03:06:58

Question


I have 6 fields in a CSV file:

  • the first is the student's name (String)
  • the others are the student's marks, e.g. subject 1, subject 2, etc.

I am writing MapReduce in Java: I split all the fields on the comma and emit the student name as the key and the marks as the value of the map.

In the reduce I process them, outputting the student name as the key and their marks plus total, average, etc. as the value.
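The reduce-side logic described above can be sketched without the Hadoop API; a minimal pure-Java illustration, assuming the row layout from the question (the sample line and the tab-separated output format are my own choices):

```java
import java.util.Arrays;

public class StudentMarks {
    // Reduce-side logic: given one student's marks, compute total and average.
    static String summarize(String name, int[] marks) {
        int total = Arrays.stream(marks).sum();
        double average = (double) total / marks.length;
        return name + "\t" + total + "\t" + average;
    }

    public static void main(String[] args) {
        // One CSV row: student name followed by the subject marks (sample data).
        String line = "john,60,70,80,90,50";
        String[] fields = line.split(",");
        int[] marks = new int[fields.length - 1];
        for (int i = 1; i < fields.length; i++) {
            marks[i - 1] = Integer.parseInt(fields[i].trim());
        }
        System.out.println(summarize(fields[0], marks)); // john	350	70.0
    }
}
```

In a real job, `summarize` would be the body of the reducer, receiving all marks grouped under one student-name key.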

I think there may be an alternative, more efficient way to do this.

Does anyone have an idea of a better way to do these operations?

Are there any inbuilt functions of Hadoop which can group by student name and calculate the total marks and the average associated with that student?


Answer 1:


You might want to have a look at Pig (http://pig.apache.org/), which provides a simple language on top of Hadoop that lets you perform many standard tasks with much shorter code.




Answer 2:


Use Hive. It's simpler than writing MapReduce in Java and might be more familiar than Pig, since it has an SQL-like syntax.

https://cwiki.apache.org/confluence/display/Hive/Home

What you have to do is:

1) install the Hive client on your machine (or on one node) and point it to your cluster
2) create the table description for that file
3) load the data
4) write the SQL

Since I think your data looks like student_name, subject_mark1, subject_mark2, etc. you might need to use explode: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode

2) CREATE TABLE students(name STRING, subject1 INT, subject2 INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE; -- LOAD DATA moves the raw CSV file as-is, so the table must be a text table, not SEQUENCEFILE

3) LOAD DATA INPATH '/path/to/data/students.csv' INTO TABLE students;

4) SELECT name, AVG(subject1), AVG(subject2) FROM students GROUP BY name;

output might look like:

NAME | SUBJECT1 | SUBJECT2
john | 6.2      | 7.0
tom  | 3.5      | 5.0




Answer 3:


You can set your reducer to run as a combiner in addition to running as a reducer, so you can perform interim calculations on each node before sending everything to the reducer.
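One caveat worth noting: a reducer can only double as a combiner when its operation is associative, and an average is not. The usual fix is to make the mapper emit (sum, count) pairs so that partial results combine correctly; a pure-Java sketch of that idea (the Hadoop wiring via `job.setCombinerClass(...)` is omitted, and the class and sample numbers are my own):

```java
public class PartialAverage {
    // The value type the mapper and combiner emit: a partial sum and a count.
    static final class SumCount {
        final long sum;
        final long count;
        SumCount(long sum, long count) { this.sum = sum; this.count = count; }
        // Combiner and reducer both use this merge; it is associative,
        // so applying it early on each node does not change the final result.
        SumCount merge(SumCount other) {
            return new SumCount(sum + other.sum, count + other.count);
        }
        double average() { return (double) sum / count; }
    }

    public static void main(String[] args) {
        // One student's marks split across two map tasks.
        SumCount node1 = new SumCount(60 + 70, 2);      // combiner output, node 1
        SumCount node2 = new SumCount(80 + 90 + 50, 3); // combiner output, node 2
        SumCount total = node1.merge(node2);            // reducer merges the partials
        System.out.println(total.average());            // prints 70.0
    }
}
```

Averaging the two partial averages directly (65.0 and ~73.3) would give the wrong answer; merging sums and counts does not.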

As Nicolas78 said, you should consider looking at Pig, which does a pretty good job of building an efficient map/reduce and saves you both code and effort.




Answer 4:


I am writing MapReduce in Java: I split all the fields on the comma and emit the student name as the key and the marks as the value of the map.

In the reduce I process them, outputting the student name as the key and their marks plus total, average, etc. as the value.

This can easily be written as a map-only job; there is no need for a reducer. Once the mapper gets a row from the CSV, it can split the fields and calculate everything required right there, then emit the student name as the key and the average/total etc. as the value.
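A minimal sketch of that map-only idea in plain Java, one input line in and one output line out (in a real job this would live in `Mapper.map()` with `job.setNumReduceTasks(0)`; the class name and output format are my own):

```java
public class MapOnly {
    // What the mapper would emit for one CSV row: "name<TAB>total,average".
    static String mapLine(String line) {
        String[] f = line.split(",");
        int total = 0;
        for (int i = 1; i < f.length; i++) {
            total += Integer.parseInt(f[i].trim());
        }
        double avg = (double) total / (f.length - 1);
        return f[0] + "\t" + total + "," + avg;
    }

    public static void main(String[] args) {
        System.out.println(mapLine("tom,30,40,35")); // tom	105,35.0
    }
}
```

This works because each CSV row already contains everything needed for one student, so no cross-row grouping (and hence no shuffle or reduce phase) is required.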



Source: https://stackoverflow.com/questions/8630837/csv-processing-in-hadoop
