Question
I have a file in HDFS as
44,UK,{"names":{"name1":"John","name2":"marry","name3":"stuart"},"fruits":{"fruit1":"apple","fruit2":"orange"}},31-07-2016
91,INDIA,{"names":{"name1":"Ram","name2":"Sam"},"fruits":{}},31-07-2016
and want to store this into a CSV file as below using a Pig loader:
44,UK,names,name1,John,31-07-2016
44,UK,names,name2,Marry,31-07-2016
..
44,UK,fruit,fruit1,apple,31-07-2016
..
91,INDIA,names,name1,Ram,31-07-2016
..
91,INDIA,null,null,Ram,31-07-2016
What should the Pig script for this be?
Answer 1:
Since your record is not a proper JSON string, any JSON storer/loader will not help you. Writing a UDF is a simpler approach.
UPDATED APPROACH 1 :-
The below UDF and Pig script will work if you convert your input to a tab-separated file.
UDF :-
package com.test.udf;

import org.apache.commons.lang3.StringUtils;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.codehaus.jackson.map.ObjectMapper;
import org.codehaus.jackson.type.TypeReference;

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Input format :-
 * {"names":{"name1":"John","name2":"marry","name3":"stuart"},"fruits":{"fruit1":"apple","fruit2":"orange"}}
 */
public class jsonToTuples extends EvalFunc<DataBag> {

    ObjectMapper objectMapper = new ObjectMapper();
    TypeReference<HashMap<String, Object>> typeRef =
            new TypeReference<HashMap<String, Object>>() {};

    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        String jsonRecord = (String) input.get(0);
        if (StringUtils.isNotBlank(jsonRecord)) {
            try {
                List<String> recordList = new ArrayList<String>();
                Map<String, Object> jsonDataMap = objectMapper.readValue(jsonRecord, typeRef);
                // Emit one "group,key,value" string per entry under "names".
                if (jsonDataMap.get("names") != null) {
                    Map<String, String> namesDataMap = (Map<String, String>) jsonDataMap.get("names");
                    for (String key : namesDataMap.keySet()) {
                        recordList.add("names" + "," + key + "," + namesDataMap.get(key));
                    }
                }
                // Same for the entries under "fruits" (empty map yields no rows).
                if (jsonDataMap.get("fruits") != null) {
                    Map<String, String> fruitsDataMap = (Map<String, String>) jsonDataMap.get("fruits");
                    for (String key : fruitsDataMap.keySet()) {
                        recordList.add("fruits" + "," + key + "," + fruitsDataMap.get(key));
                    }
                }
                // Wrap each string in a single-field tuple and collect them in a bag.
                DataBag outputBag = BagFactory.getInstance().newDefaultBag();
                for (int i = 0; i < recordList.size(); i++) {
                    Tuple outputTuple = TupleFactory.getInstance().newTuple(1);
                    outputTuple.set(0, recordList.get(i));
                    outputBag.add(outputTuple);
                }
                return outputBag;
            } catch (Exception e) {
                System.out.println("caught exception while parsing record: " + jsonRecord);
                e.printStackTrace();
                return null;
            }
        }
        return null;
    }
}
PIG SCRIPT :-
register 'testUDF.jar';
A = LOAD 'data.txt' USING PigStorage('\t') AS (id:chararray, country:chararray, record:chararray, date:chararray);
B = FOREACH A GENERATE id, country, FLATTEN(com.test.udf.jsonToTuples(record)), date;
DUMP B;
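If you want an actual comma-separated file rather than a console dump, the relation could also be stored with PigStorage(',') — a sketch, where the output path 'output_csv' is just a placeholder:

```pig
STORE B INTO 'output_csv' USING PigStorage(',');
```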
OLD APPROACH :-
Below is the way I would read your record inside the UDF if it is comma-separated.
As mentioned in my comment below, try to use the magic of split in the UDF to separate your fields. I have not tested it, but here is what I would try in my UDF :-
(please note that I am not sure this is the best option - you may want to improve it further.)
// Limit the split to 3 so commas inside the JSON are left untouched.
String[] strSplit = ((String) input.get(0)).split(",", 3);
String id = strSplit[0];
String country = strSplit[1];
String jsonWithDate = strSplit[2];

/**
 * above jsonWithDate should look like -
 * {"names":{"name1":"Ram","name2":"Sam"},"fruits":{}},31-07-2016
 */
// The date is everything after the last comma; the JSON is everything before it.
// (Using lastIndexOf avoids String.replace(",$", ""), which treats ",$" as a
// literal, not a regex, and so would never strip the trailing comma.)
int lastComma = jsonWithDate.lastIndexOf(',');
String date = jsonWithDate.substring(lastComma + 1);
String jsonString = jsonWithDate.substring(0, lastComma);

/**
 * now use some parser or object mapper to convert jsonString to the desired list of values.
 */
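This split logic can be exercised as a standalone sketch; the sample line is taken from the question, and the class and method names (RecordSplitDemo, splitRecord) are made up for illustration:

```java
public class RecordSplitDemo {

    // Split one input line into {id, country, json, date}.
    static String[] splitRecord(String line) {
        // Limit the split to 3 so commas inside the JSON stay intact.
        String[] head = line.split(",", 3);
        String id = head[0];
        String country = head[1];
        String jsonWithDate = head[2];
        // The date is everything after the last comma; the JSON is the rest.
        int lastComma = jsonWithDate.lastIndexOf(',');
        String json = jsonWithDate.substring(0, lastComma);
        String date = jsonWithDate.substring(lastComma + 1);
        return new String[]{id, country, json, date};
    }

    public static void main(String[] args) {
        String line = "44,UK,{\"names\":{\"name1\":\"John\",\"name2\":\"marry\","
                + "\"name3\":\"stuart\"},\"fruits\":{\"fruit1\":\"apple\","
                + "\"fruit2\":\"orange\"}},31-07-2016";
        String[] parts = splitRecord(line);
        System.out.println(parts[0]); // 44
        System.out.println(parts[1]); // UK
        System.out.println(parts[3]); // 31-07-2016
    }
}
```

The extracted json string can then be handed to any JSON parser, as in the UDF above.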
Source: https://stackoverflow.com/questions/38685850/how-to-process-a-flat-file-with-json-string-as-a-part-of-each-line-into-csv-fil