CSV processing in Hadoop

Submitted by 非 Y 不嫁゛ on 2019-12-23 03:06:58

Question


I have 6 fields in a CSV file:

  • the first is the student's name (String)
  • the others are the student's marks, e.g. subject 1, subject 2, etc.

I am writing MapReduce in Java: I split all the fields on the comma and emit the student name as the key and the marks as the value of the map.

In the reduce I process them, outputting the student name as the key and their marks plus total, average, etc. as the value.
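The reduce-side logic described above can be sketched without the Hadoop API; a minimal pure-Java illustration, assuming the row layout from the question (the sample line and the tab-separated output format are my own choices):

```java
import java.util.Arrays;

public class StudentMarks {
    // Reduce-side logic: given one student's marks, compute total and average.
    static String summarize(String name, int[] marks) {
        int total = Arrays.stream(marks).sum();
        double average = (double) total / marks.length;
        return name + "\t" + total + "\t" + average;
    }

    public static void main(String[] args) {
        // One CSV row: student name followed by the subject marks (sample data).
        String line = "john,60,70,80,90,50";
        String[] fields = line.split(",");
        int[] marks = new int[fields.length - 1];
        for (int i = 1; i < fields.length; i++) {
            marks[i - 1] = Integer.parseInt(fields[i].trim());
        }
        System.out.println(summarize(fields[0], marks)); // john	350	70.0
    }
}
```

In a real job, `summarize` would be the body of the reducer, receiving all marks grouped under one student-name key.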

I think there may be an alternative, more efficient way to do this.

Does anyone have an idea of a better way to do these operations?

Are there any inbuilt functions of Hadoop which can group by student name and calculate the total marks and the average associated with that student?


Answer 1:


You might want to have a look at Pig (http://pig.apache.org/), which provides a simple language on top of Hadoop that lets you perform many standard tasks with much shorter code.




Answer 2:


Use Hive. It's simpler than writing MapReduce in Java and might be more familiar than Pig, since it has an SQL-like syntax.

https://cwiki.apache.org/confluence/display/Hive/Home

What you have to do is:

1) install the Hive client on your machine (or on one node) and point it to your cluster
2) create the table description for that file
3) load the data
4) write the SQL

Since I think your data looks like student_name, subject_mark1, subject_mark2, etc. you might need to use explode: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode

2) CREATE TABLE students(name STRING, subject1 INT, subject2 INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE; -- LOAD DATA moves the raw CSV file as-is, so the table must be a text table, not SEQUENCEFILE

3) LOAD DATA INPATH '/path/to/data/students.csv' INTO TABLE students;

4) SELECT name, AVG(subject1), AVG(subject2) FROM students GROUP BY name;

output might look like:

NAME | SUBJECT1 | SUBJECT2
john | 6.2      | 7.0
tom  | 3.5      | 5.0




Answer 3:


You can set your reducer to run as a combiner in addition to running as a reducer, so you can perform interim calculations on each node before sending everything to the reducer.
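One caveat worth noting: a reducer can only double as a combiner when its operation is associative, and an average is not. The usual fix is to make the mapper emit (sum, count) pairs so that partial results combine correctly; a pure-Java sketch of that idea (the Hadoop wiring via `job.setCombinerClass(...)` is omitted, and the class and sample numbers are my own):

```java
public class PartialAverage {
    // The value type the mapper and combiner emit: a partial sum and a count.
    static final class SumCount {
        final long sum;
        final long count;
        SumCount(long sum, long count) { this.sum = sum; this.count = count; }
        // Combiner and reducer both use this merge; it is associative,
        // so applying it early on each node does not change the final result.
        SumCount merge(SumCount other) {
            return new SumCount(sum + other.sum, count + other.count);
        }
        double average() { return (double) sum / count; }
    }

    public static void main(String[] args) {
        // One student's marks split across two map tasks.
        SumCount node1 = new SumCount(60 + 70, 2);      // combiner output, node 1
        SumCount node2 = new SumCount(80 + 90 + 50, 3); // combiner output, node 2
        SumCount total = node1.merge(node2);            // reducer merges the partials
        System.out.println(total.average());            // prints 70.0
    }
}
```

Averaging the two partial averages directly (65.0 and ~73.3) would give the wrong answer; merging sums and counts does not.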

As Nicolas78 said, you should consider looking at Pig, which does a pretty good job of building an efficient map/reduce and saves you both code and effort.




Answer 4:


I am writing MapReduce in Java: I split all the fields on the comma and emit the student name as the key and the marks as the value of the map.

In the reduce I process them, outputting the student name as the key and their marks plus total, average, etc. as the value.

This can easily be written as a map-only job; there is no need for a reducer. Once the mapper gets a row from the CSV, it can split the fields and calculate everything required right there, then emit the student name as the key and the average/total etc. as the value.
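A minimal sketch of that map-only idea in plain Java, one input line in and one output line out (in a real job this would live in `Mapper.map()` with `job.setNumReduceTasks(0)`; the class name and output format are my own):

```java
public class MapOnly {
    // What the mapper would emit for one CSV row: "name<TAB>total,average".
    static String mapLine(String line) {
        String[] f = line.split(",");
        int total = 0;
        for (int i = 1; i < f.length; i++) {
            total += Integer.parseInt(f[i].trim());
        }
        double avg = (double) total / (f.length - 1);
        return f[0] + "\t" + total + "," + avg;
    }

    public static void main(String[] args) {
        System.out.println(mapLine("tom,30,40,35")); // tom	105,35.0
    }
}
```

This works because each CSV row already contains everything needed for one student, so no cross-row grouping (and hence no shuffle or reduce phase) is required.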



Source: https://stackoverflow.com/questions/8630837/csv-processing-in-hadoop
