Convert XML file to CSV in java

后端 未结 8 881
时光说笑
时光说笑 2020-12-05 01:17

@Before There will be probably some duplicate questions suggestions, I don\'t think that is the case maybe read this first, I\'ll try to be as brief as possible. Title gives

相关标签:
8条回答
  • 2020-12-05 01:44

    The code provided should be considered a sketch rather than the definitive article. I am not an expert on SAX and the implementation could be improved for better performance, simpler code etc. That said SAX should be able to cope with streaming large XML files.

    I would approach this problem with 2 passes using the SAX parser. (Incidentally, I would also use a CSV generating library to create the output as this would deal with all the fiddly character escaping that CSV involves but I haven't implemented this in my sketch).

    First pass: Establish number of header columns

    Second pass: Output CSV

    I assume that the XML file is well formed. I assume that we don't have a scheme/DTD with a predefined order.

    In the first pass I have assumed that a CSV column will be added for every XML element containing text content or for any attribute (I have assumed attributes will contain something!).

    The second pass, having established the number of target columns, will do the actual CSV output.

    Based on your example XML my code sketch would produce:

    ItemID,StartTime,EndTime,ViewItemURL,AverageTime,category,category,type,type,AveragePrice
    4504216603,10:00:10.000Z,10:00:30.000Z,http://url,,,,,,
    4504216604,10:30:10.000Z,11:00:10.000Z,http://url,value1,9823,9112,TX,TY,value2
    

    Please note I have used the google collections LinkedHashMultimap as this is helpful when associating multiple values with a single key. I hope you find this useful!

    import com.google.common.collect.LinkedHashMultimap;
    import java.io.FileNotFoundException;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.LinkedHashMap;
    import java.util.Map.Entry;
    import org.xml.sax.Attributes;
    import org.xml.sax.InputSource;
    import org.xml.sax.SAXException;
    import org.xml.sax.XMLReader;
    import org.xml.sax.helpers.DefaultHandler;
    import org.xml.sax.helpers.XMLReaderFactory;
    
    public class App {
    
        public static void main(String[] args) throws SAXException, FileNotFoundException, IOException {
            // First pass - to determine headers
            XMLReader xr = XMLReaderFactory.createXMLReader();
            HeaderHandler handler = new HeaderHandler();
            xr.setContentHandler(handler);
            xr.setErrorHandler(handler);
            FileReader r = new FileReader("test1.xml");
            xr.parse(new InputSource(r));
    
            LinkedHashMap<String, Integer> headers = handler.getHeaders();
            int totalnumberofcolumns = 0;
            for (int headercount : headers.values()) {
                totalnumberofcolumns += headercount;
            }
            String[] columnheaders = new String[totalnumberofcolumns];
            int i = 0;
            for (Entry<String, Integer> entry : headers.entrySet()) {
                for (int j = 0; j < entry.getValue(); j++) {
                    columnheaders[i] = entry.getKey();
                    i++;
                }
            }
            StringBuilder sb = new StringBuilder();
            for (String h : columnheaders) {
                sb.append(h);
                sb.append(',');
            }
            System.out.println(sb.substring(0, sb.length() - 1));
    
            // Second pass - collect and output data
    
            xr = XMLReaderFactory.createXMLReader();
    
            DataHandler datahandler = new DataHandler();
            datahandler.setHeaderArray(columnheaders);
    
            xr.setContentHandler(datahandler);
            xr.setErrorHandler(datahandler);
            r = new FileReader("test1.xml");
            xr.parse(new InputSource(r));
        }
    
        public static class HeaderHandler extends DefaultHandler {
    
            private String content;
            private String currentElement;
            private boolean insideElement = false;
            private Attributes attribs;
            private LinkedHashMap<String, Integer> itemHeader;
            private LinkedHashMap<String, Integer> accumulativeHeader = new LinkedHashMap<String, Integer>();
    
            public HeaderHandler() {
                super();
            }
    
            private LinkedHashMap<String, Integer> getHeaders() {
                return accumulativeHeader;
            }
    
            private void addItemHeader(String headerName) {
                if (itemHeader.containsKey(headerName)) {
                    itemHeader.put(headerName, itemHeader.get(headerName) + 1);
                } else {
                    itemHeader.put(headerName, 1);
                }
            }
    
            @Override
            public void startElement(String uri, String name,
                    String qName, Attributes atts) {
                if ("item".equalsIgnoreCase(qName)) {
                    itemHeader = new LinkedHashMap<String, Integer>();
                }
                currentElement = qName;
                content = null;
                insideElement = true;
                attribs = atts;
            }
    
            @Override
            public void endElement(String uri, String name, String qName) {
                if (!"item".equalsIgnoreCase(qName) && !"root".equalsIgnoreCase(qName)) {
                    if (content != null && qName.equals(currentElement) && content.trim().length() > 0) {
                        addItemHeader(qName);
                    }
                    if (attribs != null) {
                        int attsLength = attribs.getLength();
                        if (attsLength > 0) {
                            for (int i = 0; i < attsLength; i++) {
                                String attName = attribs.getLocalName(i);
                                addItemHeader(attName);
                            }
                        }
                    }
                }
                if ("item".equalsIgnoreCase(qName)) {
                    for (Entry<String, Integer> entry : itemHeader.entrySet()) {
                        String headerName = entry.getKey();
                        Integer count = entry.getValue();
                        //System.out.println(entry.getKey() + ":" + entry.getValue());
                        if (accumulativeHeader.containsKey(headerName)) {
                            if (count > accumulativeHeader.get(headerName)) {
                                accumulativeHeader.put(headerName, count);
                            }
                        } else {
                            accumulativeHeader.put(headerName, count);
                        }
                    }
                }
                insideElement = false;
                currentElement = null;
                attribs = null;
            }
    
            @Override
            public void characters(char ch[], int start, int length) {
                if (insideElement) {
                    content = new String(ch, start, length);
                }
            }
        }
    
        public static class DataHandler extends DefaultHandler {
    
            private String content;
            private String currentElement;
            private boolean insideElement = false;
            private Attributes attribs;
            private LinkedHashMultimap dataMap;
            private String[] headerArray;
    
            public DataHandler() {
                super();
            }
    
            @Override
            public void startElement(String uri, String name,
                    String qName, Attributes atts) {
                if ("item".equalsIgnoreCase(qName)) {
                    dataMap = LinkedHashMultimap.create();
                }
                currentElement = qName;
                content = null;
                insideElement = true;
                attribs = atts;
            }
    
            @Override
            public void endElement(String uri, String name, String qName) {
                if (!"item".equalsIgnoreCase(qName) && !"root".equalsIgnoreCase(qName)) {
                    if (content != null && qName.equals(currentElement) && content.trim().length() > 0) {
                        dataMap.put(qName, content);
                    }
                    if (attribs != null) {
                        int attsLength = attribs.getLength();
                        if (attsLength > 0) {
                            for (int i = 0; i < attsLength; i++) {
                                String attName = attribs.getLocalName(i);
                                dataMap.put(attName, attribs.getValue(i));
                            }
                        }
                    }
                }
                if ("item".equalsIgnoreCase(qName)) {
                    String data[] = new String[headerArray.length];
                    int i = 0;
                    for (String h : headerArray) {
                        if (dataMap.containsKey(h)) {
                            Object[] values = dataMap.get(h).toArray();
                            data[i] = (String) values[0];
                            if (values.length > 1) {
                                dataMap.removeAll(h);
                                for (int j = 1; j < values.length; j++) {
                                    dataMap.put(h, values[j]);
                                }
                            } else {
                                dataMap.removeAll(h);
                            }
                        } else {
                            data[i] = "";
                        }
                        i++;
                    }
                    StringBuilder sb = new StringBuilder();
                    for (String d : data) {
                        sb.append(d);
                        sb.append(',');
                    }
                    System.out.println(sb.substring(0, sb.length() - 1));
                }
                insideElement = false;
                currentElement = null;
                attribs = null;
            }
    
            @Override
            public void characters(char ch[], int start, int length) {
                if (insideElement) {
                    content = new String(ch, start, length);
                }
            }
    
            public void setHeaderArray(String[] headerArray) {
                this.headerArray = headerArray;
            }
        }
    }
    
    0 讨论(0)
  • 2020-12-05 01:49

    Here some code that implements the conversion of the XML to CSV using StAX. Although the XML you gave is only an example, I hope that this shows you how to handle the optional elements.

    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamException;
    import javax.xml.stream.XMLStreamReader;
    import java.io.*;
    
    public class App 
    {
        public static void main( String[] args ) throws XMLStreamException, FileNotFoundException
        {
            new App().convertXMLToCSV(new BufferedInputStream(new FileInputStream(args[0])), new BufferedOutputStream(new FileOutputStream(args[1])));
        }
    
        static public final String ROOT = "root";
        static public final String ITEM = "Item";
        static public final String ITEM_ID = "ItemID";
        static public final String ITEM_DETAILS = "ListingDetails";
        static public final String START_TIME = "StartTime";
        static public final String END_TIME = "EndTime";
        static public final String ITEM_URL = "ViewItemURL";
        static public final String AVERAGES = "averages";
        static public final String AVERAGE_TIME = "AverageTime";
        static public final String AVERAGE_PRICE = "AveragePrice";
        static public final String SEPARATOR = ",";
    
        public void convertXMLToCSV(InputStream in, OutputStream out) throws XMLStreamException
        {
            PrintWriter writer = new PrintWriter(out);
            XMLStreamReader xmlStreamReader = XMLInputFactory.newInstance().createXMLStreamReader(in);
            convertXMLToCSV(xmlStreamReader, writer);
        }
    
        public void convertXMLToCSV(XMLStreamReader xmlStreamReader, PrintWriter writer) throws XMLStreamException {
            writer.println("ItemID,StartTime,EndTime,ViewItemURL,AverageTime,AveragePrice");
            xmlStreamReader.nextTag();
            xmlStreamReader.require(XMLStreamConstants.START_ELEMENT, null, ROOT);
    
            while (xmlStreamReader.hasNext()) {
                xmlStreamReader.nextTag();
                if (xmlStreamReader.isEndElement())
                    break;
    
                xmlStreamReader.require(XMLStreamConstants.START_ELEMENT, null, ITEM);
                String itemID = nextValue(xmlStreamReader, ITEM_ID);
                xmlStreamReader.nextTag(); xmlStreamReader.require(XMLStreamConstants.START_ELEMENT, null, ITEM_DETAILS);
                String startTime = nextValue(xmlStreamReader, START_TIME);
                xmlStreamReader.nextTag();
                String averageTime = null;
                String averagePrice = null;
    
                if (xmlStreamReader.getLocalName().equals(AVERAGES))
                {
                    averageTime = nextValue(xmlStreamReader, AVERAGE_TIME);
                    averagePrice = nextValue(xmlStreamReader, AVERAGE_PRICE);
                    xmlStreamReader.nextTag();
                    xmlStreamReader.require(XMLStreamConstants.END_ELEMENT, null, AVERAGES);
                    xmlStreamReader.nextTag();
                }
                String endTime = currentValue(xmlStreamReader, END_TIME);
                String url = nextValue(xmlStreamReader,ITEM_URL);
                xmlStreamReader.nextTag(); xmlStreamReader.require(XMLStreamConstants.END_ELEMENT, null, ITEM_DETAILS);
                xmlStreamReader.nextTag(); xmlStreamReader.require(XMLStreamConstants.END_ELEMENT, null, ITEM);
    
                writer.append(esc(itemID)).append(SEPARATOR)
                        .append(esc(startTime)).append(SEPARATOR)
                        .append(esc(endTime)).append(SEPARATOR)
                        .append(esc(url));
                if (averageTime!=null)
                    writer.append(SEPARATOR).append(esc(averageTime)).append(SEPARATOR)
                            .append(esc(averagePrice));
                writer.println();                        
            }
    
            xmlStreamReader.require(XMLStreamConstants.END_ELEMENT, null, ROOT);
            writer.close();
    
        }
    
        private String esc(String string) {
            if (string.indexOf(',')!=-1)
                string = '"'+string+'"';
            return string;
        }
    
        private String nextValue(XMLStreamReader xmlStreamReader, String name) throws XMLStreamException {
            xmlStreamReader.nextTag();
            return currentValue(xmlStreamReader, name);
        }
    
        private String currentValue(XMLStreamReader xmlStreamReader, String name) throws XMLStreamException {
            xmlStreamReader.require(XMLStreamConstants.START_ELEMENT, null, name);
            String value = "";
            for (;;) {
                int next = xmlStreamReader.next();
                if (next==XMLStreamConstants.CDATA||next==XMLStreamConstants.SPACE||next==XMLStreamConstants.CHARACTERS)
                    value += xmlStreamReader.getText();
                else if (next==XMLStreamConstants.END_ELEMENT)
                    break;
                // ignore comments, PIs, attributes
            }
            xmlStreamReader.require(XMLStreamConstants.END_ELEMENT, null, name);
            return value.trim();
        }    
    }
    
    0 讨论(0)
  • 2020-12-05 01:52

    Note that this would be a prime example of using XSLT except that most XSLT processors read in the whole XML file into memory which is not an option as it is large. Note, however, that the enterprise version of Saxon can do streaming XSLT processing (if the XSLT script adheres to the restrictions).

    You may also want to use an external XSLT processor outside your JVM instead, if applicable. This opens up for several more options.

    Streaming in Saxon-EE: http://www.saxonica.com/documentation/sourcedocs/serial.html

    0 讨论(0)
  • 2020-12-05 01:53

    You could use XStream (http://x-stream.github.io/) or JOX (http://www.wutka.com/jox.html) to recognize xml and then convert it to a Java Bean. I think you can convert the Beans to CSV automatically once you get the bean.

    0 讨论(0)
  • 2020-12-05 01:55

    I'm not sure I understand how generic the solution should be. Do you really want to parse a 1 GB file twice for a generic solution? And if you want something generic, why did you skipped the <category> element in your example? How much different format do you need to handle? Do you really not know what the format can be (even if some element can be ommited)? Can you clarify?

    To my experience, it's generally preferable to parse specific files in a specific way (this doesn't exclude using a generic API though). My answer will go in this direction (and I'll update it after the clarification).


    If you don't feel comfortable with XML, you could consider using some existing (commercial) libraries, for example Ricebridge XML Manager and CSV Manager. See How to convert CSV into XML and XML into CSV using Java for a full example. The approach is pretty straightforward: you define the data fields using XPath expressions (which is perfect in your case since you can have "extra" elements), parse the the file and then pass the result List to the CSV component to generate the CSV file. The API looks simple, the code tested (the source code of their test cases is available under a BSD-style license), they claim supporting gigabyte-sized files.

    You can get a Single Developer license for $170 which is not very expensive compared to developer daily rates.

    They offer 30 days trial versions, have a look.


    Another option would be to use Spring Batch. Spring batch offers everything required to work with XML files as input or output (using StAX and the XML binding framework of your choice) and flat files as input or output. See:

    • the Spring Batch Documentation
    • the Samples (especially the trade sample)
    • A first look at Spring-Batch, part 2

    You could also use Smooks to do XML to CSV transformations. See also:

    • Structured Event Streaming with Smooks

    Another option would be to roll your own solution, using a StAX parser or, why not, using VTD-XML and XPath. Have a look at:

    • Introduction to VTD - XML
    • Schemaless Java-XML Data Binding with VTD-XML
    0 讨论(0)
  • 2020-12-05 02:01

    The best way to code based on your described requirement is to use the easy feature of FreeMarker and XML processing. See the docs.

    In this case you will only need the template that will produce a CSV.

    An alternative to this is XMLGen, but very similar in approach. Just look at that diagram and examples, and instead of SQL statements, you will output CSV.

    These two similar approaches are not "conventional" but do the job very quickly for your situation, and you don't have to learn XSL (quite hard to master I think).

    0 讨论(0)
提交回复
热议问题