Parsing XML file containing HTML entities in Java without changing the XML

一个人的身影 2020-12-05 18:53

I have to parse a bunch of XML files in Java that sometimes -- and invalidly -- contain HTML entities such as &mdash;, &gt; and so forth. I ...

6 answers
  • 2020-12-05 19:13

    I did something similar yesterday: I needed to add a value from an unzipped XML stream to a database.

    // Imports (java.io.FileInputStream was missing from my original list)
    import java.io.FileInputStream;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.xml.sax.InputSource;
    // Library used: commons-lang3.jar
    import org.apache.commons.lang3.StringEscapeUtils;

    public class ParsingHexToChar {

        // I haven't re-checked this code (I'm at work), so it may need small changes.
        public static void main(String[] args) throws Exception {
            InputSource is = new InputSource(new FileInputStream("test.xml"));

            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            DocumentBuilder db = dbf.newDocumentBuilder();
            // Note: parse() fails on entities that are neither declared nor
            // escaped (e.g. a raw &mdash;); it accepts &amp;mdash;.
            Document doc = db.parse(is);
            XPathFactory xpf = XPathFactory.newInstance();
            XPath xpath = xpf.newXPath();
            String words = xpath.evaluate("/foo/bar", doc.getDocumentElement());
            System.out.println(parseToChar(words));
        }

        // Decodes HTML entities that are left in the extracted text.
        public static String parseToChar(String words) {
            return StringEscapeUtils.unescapeHtml4(words);
        }
    }
    
  • 2020-12-05 19:19

    Try this, using Apache Commons (commons-io for IOUtils, commons-lang3 for the translator classes):

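    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.io.StringReader;
    import java.nio.charset.StandardCharsets;
    import javax.xml.parsers.*;
    import org.apache.commons.io.IOUtils;
    import org.apache.commons.lang3.text.translate.*;
    import org.w3c.dom.Document;
    import org.xml.sax.InputSource;

    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    DocumentBuilder parser = dbf.newDocumentBuilder();

    // Read the whole file; xmlfile is the path to your XML file (assumed to be UTF-8 here)
    String xml;
    try (InputStream in = new FileInputStream(xmlfile)) {
        xml = IOUtils.toString(in, StandardCharsets.UTF_8);
    }

    // Replace ISO-8859-1 and HTML 4.0 extended entities with their characters
    CharSequenceTranslator translator = new AggregateTranslator(
            new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE()),
            new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE()));
    String unescaped = translator.translate(xml);

    // Parse the unescaped text
    Document doc = parser.parse(new InputSource(new StringReader(unescaped)));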
    
  • 2020-12-05 19:23

    Issue - 1: I have to parse a bunch of XML files in Java that sometimes -- and invalidly -- contain HTML entities such as &mdash;

    XML has only five predefined entities (&lt;, &gt;, &amp;, &quot; and &apos;). &mdash; and &nbsp; are not among them; they work only in plain HTML or in legacy JSP. So SAX will not help. It can be done with StAX, which has a high-level iterator-based API. (Collected from this link)
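
    To make the point about the predefined entities concrete, here is a minimal sketch. The class name PredefinedEntitiesDemo and the sample strings are made up for illustration. It shows a standard JAXP DocumentBuilder accepting &amp; and rejecting &mdash; when no DTD declares it.

    ```java
    import java.io.IOException;
    import java.io.StringReader;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.parsers.ParserConfigurationException;
    import org.xml.sax.InputSource;
    import org.xml.sax.SAXException;

    public class PredefinedEntitiesDemo {

        // Returns true if the JAXP DOM parser accepts the given XML text.
        public static boolean parses(String xml) {
            try {
                DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
                db.parse(new InputSource(new StringReader(xml)));
                return true;
            } catch (SAXException e) {
                // An undeclared entity is a well-formedness error, so parsing aborts.
                return false;
            } catch (ParserConfigurationException | IOException e) {
                throw new IllegalStateException(e);
            }
        }

        public static void main(String[] args) {
            System.out.println(parses("<bar>fish &amp; chips</bar>"));   // true
            System.out.println(parses("<bar>fish &mdash; chips</bar>")); // false
        }
    }
    ```

    The parser's default error handler may also print a "[Fatal Error]" line to stderr for the second call.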

    Issue - 2: I found that I can override resolveEntity in org.xml.sax.helpers.DefaultHandler, but how do I use this with the higher-level API?

    The Streaming API for XML, called StAX, is an API for reading and writing XML documents.

    StAX is a pull-parsing model: the application takes control of parsing by pulling (taking) events from the parser.

    The core StAX API falls into two categories:

    • Cursor-based API: the low-level API. It lets the application process XML as a stream of tokens (events).

    • Iterator-based API: the higher-level API. It lets the application process XML as a series of event objects, each of which communicates a piece of the XML structure to the application.
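
    The difference between the two styles is easiest to see side by side. The sketch below is only an illustration (StaxApiStyles, namesWithCursor and namesWithIterator are hypothetical names): both methods collect the element names of a document, once with the cursor API (XMLStreamReader) and once with the iterator API (XMLEventReader).

    ```java
    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.List;
    import javax.xml.stream.XMLEventReader;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamException;
    import javax.xml.stream.XMLStreamReader;
    import javax.xml.stream.events.XMLEvent;

    public class StaxApiStyles {

        // Cursor API: the reader is the cursor; next() advances it and
        // returns an int event code, and data is read from the reader itself.
        public static List<String> namesWithCursor(String xml) {
            List<String> names = new ArrayList<>();
            try {
                XMLStreamReader reader = XMLInputFactory.newInstance()
                        .createXMLStreamReader(new StringReader(xml));
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        names.add(reader.getLocalName());
                    }
                }
                reader.close();
            } catch (XMLStreamException e) {
                throw new IllegalStateException(e);
            }
            return names;
        }

        // Iterator API: nextEvent() returns an XMLEvent object per event.
        public static List<String> namesWithIterator(String xml) {
            List<String> names = new ArrayList<>();
            try {
                XMLEventReader reader = XMLInputFactory.newInstance()
                        .createXMLEventReader(new StringReader(xml));
                while (reader.hasNext()) {
                    XMLEvent event = reader.nextEvent();
                    if (event.isStartElement()) {
                        names.add(event.asStartElement().getName().getLocalPart());
                    }
                }
                reader.close();
            } catch (XMLStreamException e) {
                throw new IllegalStateException(e);
            }
            return names;
        }

        public static void main(String[] args) {
            String xml = "<foo><bar>text</bar></foo>";
            System.out.println(namesWithCursor(xml));   // [foo, bar]
            System.out.println(namesWithIterator(xml)); // [foo, bar]
        }
    }
    ```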

    The StAX API supports the notion of not replacing character entity references, by way of the IS_REPLACING_ENTITY_REFERENCES property:

    "Requires the parser to replace internal entity references with their replacement text and report them as characters"

    This can be set on an XMLInputFactory, which is then used to construct an XMLEventReader or XMLStreamReader.

    However, the API is careful to say that this property is only intended to force the implementation to perform the replacement, rather than forcing it not to replace them.

    Give it a try; I hope it solves your issue. For your case:

    Main.java

    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    
    import javax.xml.stream.XMLEventReader;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamException;
    import javax.xml.stream.events.EntityReference;
    import javax.xml.stream.events.XMLEvent;
    
    public class Main {
    
        public static void main(String[] args) {
            XMLInputFactory inputFactory = XMLInputFactory.newInstance();
            inputFactory.setProperty(
                    XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false);
            XMLEventReader reader;
            try {
                reader = inputFactory
                        .createXMLEventReader(new FileInputStream("F://test.xml"));
                while (reader.hasNext()) {
                    XMLEvent event = reader.nextEvent();
                    if (event.isEntityReference()) {
                        EntityReference ref = (EntityReference) event;
                        System.out.println("Entity Reference: " + ref.getName());
                    }
                }
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            } catch (XMLStreamException e) {
                e.printStackTrace();
            }
        }
    }
    

    test.xml:

    <?xml version="1.0" encoding="UTF-8"?>
    <foo>
        <bar>Some&nbsp;text &mdash; invalid!</bar>
    </foo>
    

    Output:

    Entity Reference: nbsp
    Entity Reference: mdash

    Credit goes to @skaffman.

    Related links:

    1. http://www.journaldev.com/1191/how-to-read-xml-file-in-java-using-java-stax-api
    2. http://www.journaldev.com/1226/java-stax-cursor-based-api-read-xml-example
    3. http://www.vogella.com/tutorials/JavaXML/article.html
    4. Is there a Java XML API that can parse a document without resolving character entities?

    UPDATE:

    Issue - 3: Is there a way to use StaX to "filter" the entities (replacing them with something else, for example) and still produce a Document at the end of the process?

    To create a new document using the StAX API, it is required to create an XMLStreamWriter that provides methods to produce XML opening and closing tags, attributes and character content.

    Five XMLStreamWriter methods are used to build the document:

    1. xmlsw.writeStartDocument() - initialises an empty document to which elements can be added
    2. xmlsw.writeStartElement(String s) - creates a new element named s
    3. xmlsw.writeAttribute(String name, String value) - adds the attribute name with the corresponding value to the last element produced by a call to writeStartElement. Attributes can be added as long as no call to writeStartElement, writeCharacters or writeEndElement has been made since.
    4. xmlsw.writeEndElement() - closes the last started element
    5. xmlsw.writeCharacters(String s) - creates a new text node with content s as content of the last started element
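
    As a quick illustration of those five calls in order, here is a small sketch. StreamWriterDemo and buildDocument are hypothetical names, and the element names are arbitrary. It writes to a StringWriter so the result can be inspected.

    ```java
    import java.io.StringWriter;
    import javax.xml.stream.XMLOutputFactory;
    import javax.xml.stream.XMLStreamException;
    import javax.xml.stream.XMLStreamWriter;

    public class StreamWriterDemo {

        public static String buildDocument() {
            StringWriter out = new StringWriter();
            try {
                XMLStreamWriter xmlsw = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
                xmlsw.writeStartDocument();       // XML declaration
                xmlsw.writeStartElement("foo");   // opens <foo
                xmlsw.writeAttribute("id", "1");  // allowed: nothing written since the start tag
                xmlsw.writeStartElement("bar");   // opens <bar>, nested in <foo>
                xmlsw.writeCharacters("a < b");   // text content; the writer escapes '<'
                xmlsw.writeEndElement();          // closes </bar>
                xmlsw.writeEndElement();          // closes </foo>
                xmlsw.writeEndDocument();
                xmlsw.close();
            } catch (XMLStreamException e) {
                throw new IllegalStateException(e);
            }
            return out.toString();
        }

        public static void main(String[] args) {
            System.out.println(buildDocument());
        }
    }
    ```

    The exact form of the XML declaration depends on the StAX implementation; the element part of the output is <foo id="1"><bar>a &lt; b</bar></foo>.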

    A sample that uses these methods to expand a compact bracket notation into XML:

    StAXExpand.java

    import  java.io.BufferedReader;
    import  java.io.FileReader;
    import  java.io.IOException;
    
    import javax.xml.stream.XMLOutputFactory;
    import javax.xml.stream.XMLStreamException;
    import javax.xml.stream.XMLStreamWriter;
    
    import java.util.Arrays;
    
    public class StAXExpand {   
        static XMLStreamWriter xmlsw = null;
        public static void main(String[] argv) {
            try {
                xmlsw = XMLOutputFactory.newInstance()
                              .createXMLStreamWriter(System.out);
                CompactTokenizer tok = new CompactTokenizer(
                              new FileReader(argv[0]));
    
                String rootName = "dummyRoot";
                // ignore everything preceding the word before the first "["
                while(!tok.nextToken().equals("[")){
                    rootName=tok.getToken();
                }
                // start creating new document
                xmlsw.writeStartDocument();
                ignorableSpacing(0);
                xmlsw.writeStartElement(rootName);
                expand(tok,3);
                ignorableSpacing(0);
                xmlsw.writeEndDocument();
    
                xmlsw.flush();
                xmlsw.close();
            } catch (XMLStreamException e){
                System.out.println(e.getMessage());
            } catch (IOException ex) {
                System.out.println("IOException"+ex);
                ex.printStackTrace();
            }
        }
    
        public static void expand(CompactTokenizer tok, int indent) 
            throws IOException,XMLStreamException {
            tok.skip("["); 
            while(tok.getToken().equals("@")) {// add attributes
                String attName = tok.nextToken();
                tok.nextToken();
                xmlsw.writeAttribute(attName,tok.skip("["));
                tok.nextToken();
                tok.skip("]");
            }
            boolean lastWasElement=true; // for controlling the output of newlines 
            while(!tok.getToken().equals("]")){ // process content 
                String s = tok.getToken().trim();
                tok.nextToken();
                if(tok.getToken().equals("[")){
                    if(lastWasElement)ignorableSpacing(indent);
                    xmlsw.writeStartElement(s);
                    expand(tok,indent+3);
                    lastWasElement=true;
                } else {
                    xmlsw.writeCharacters(s);
                    lastWasElement=false;
                }
            }
            tok.skip("]");
            if(lastWasElement)ignorableSpacing(indent-3);
            xmlsw.writeEndElement();
       }
    
        private static char[] blanks = "\n".toCharArray();
        private static void ignorableSpacing(int nb) 
            throws XMLStreamException {
            if(nb>blanks.length){// extend the length of space array 
                blanks = new char[nb+1];
                blanks[0]='\n';
                Arrays.fill(blanks,1,blanks.length,' ');
            }
            xmlsw.writeCharacters(blanks, 0, nb+1);
        }
    
    }
    

    CompactTokenizer.java

    import  java.io.Reader;
    import  java.io.IOException;
    import  java.io.StreamTokenizer;
    
    public class CompactTokenizer {
        private StreamTokenizer st;
    
        CompactTokenizer(Reader r){
            st = new StreamTokenizer(r);
            st.resetSyntax(); // remove parsing of numbers...
            st.wordChars('\u0000','\u00FF'); // everything is part of a word
                                             // except the following...
            st.ordinaryChar('\n');
            st.ordinaryChar('[');
            st.ordinaryChar(']');
            st.ordinaryChar('@');
        }
    
        public String nextToken() throws IOException{
            st.nextToken();
            while(st.ttype=='\n'|| 
                  (st.ttype==StreamTokenizer.TT_WORD && 
                   st.sval.trim().length()==0))
                st.nextToken();
            return getToken();
        }
    
        public String getToken(){
            return (st.ttype == StreamTokenizer.TT_WORD) ? st.sval : (""+(char)st.ttype);
        }
    
        public String skip(String sym) throws IOException {
            if(getToken().equals(sym))
                return nextToken();
            else
                throw new IllegalArgumentException("skip: "+sym+" expected but "+ 
                                                   getToken()+" found");
        }
    }
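
    Coming back to Issue 3, one possible way to filter the entities and still end up with a Document is sketched below. It is not a definitive implementation. EntityFilter and its tiny lookup table are hypothetical, and it relies on the same behaviour as Main.java above: the JDK's built-in StAX parser reports undeclared entities as EntityReference events when IS_REPLACING_ENTITY_REFERENCES is false. Other StAX implementations may throw instead. For simplicity the filtered events are written to a string and then parsed into a DOM.

    ```java
    import java.io.StringReader;
    import java.io.StringWriter;
    import java.util.HashMap;
    import java.util.Map;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.stream.XMLEventFactory;
    import javax.xml.stream.XMLEventReader;
    import javax.xml.stream.XMLEventWriter;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLOutputFactory;
    import javax.xml.stream.events.EntityReference;
    import javax.xml.stream.events.XMLEvent;
    import org.w3c.dom.Document;
    import org.xml.sax.InputSource;

    public class EntityFilter {

        // Hypothetical, deliberately tiny table: entity name -> replacement text.
        private static final Map<String, String> HTML_ENTITIES = new HashMap<>();
        static {
            HTML_ENTITIES.put("nbsp", "\u00A0");
            HTML_ENTITIES.put("mdash", "\u2014");
        }

        public static Document parse(String xml) {
            try {
                XMLInputFactory inputFactory = XMLInputFactory.newInstance();
                // Report entity references as events instead of expanding them.
                inputFactory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false);
                XMLEventReader reader = inputFactory.createXMLEventReader(new StringReader(xml));

                StringWriter filtered = new StringWriter();
                XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(filtered);
                XMLEventFactory events = XMLEventFactory.newInstance();

                while (reader.hasNext()) {
                    XMLEvent event = reader.nextEvent();
                    if (event.isEntityReference()) {
                        String name = ((EntityReference) event).getName();
                        // Names missing from the table are kept as literal text, e.g. "&foo;".
                        String text = HTML_ENTITIES.getOrDefault(name, "&" + name + ";");
                        writer.add(events.createCharacters(text));
                    } else {
                        writer.add(event); // everything else passes through unchanged
                    }
                }
                writer.close();
                reader.close();

                // Second pass: the filtered text is well-formed, so DOM can parse it.
                return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                        .parse(new InputSource(new StringReader(filtered.toString())));
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
        }

        public static void main(String[] args) {
            Document doc = parse("<foo><bar>Some&nbsp;text &mdash; invalid!</bar></foo>");
            System.out.println(doc.getDocumentElement().getTextContent());
        }
    }
    ```

    The original files are never modified; only the in-memory copy handed to the DOM parser has the entities replaced.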
    

    For more, see these tutorials:

    1. https://docs.oracle.com/javase/tutorial/jaxp/stax/example.html
    2. http://www.ibm.com/developerworks/library/x-tipstx2/index.html
    3. http://www.iro.umontreal.ca/~lapalme/ForestInsteadOfTheTrees/HTML/ch09s03.html
    4. http://staf.sourceforge.net/current/STAXDoc.pdf
  • 2020-12-05 19:28

    I would use a library like jsoup for this. I don't know if this helps, but I tested the code below and it works. jsoup can be downloaded from http://jsoup.org/download

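    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.parser.Parser;

    public class JsoupXmlDemo {

        public static void main(String[] args) {
            String html = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><foo>" +
                          "<bar>Some&nbsp;text &mdash; invalid!</bar></foo>";
            // Use jsoup's XML parser rather than its HTML parser
            Document doc = Jsoup.parse(html, "", Parser.xmlParser());

            for (Element e : doc.select("bar")) {
                System.out.println(e);
            }
        }
    }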
    

    Result:

    <bar>
     Some&nbsp;text — invalid!
    </bar>
    

    Loading from a file can be found here:

    http://jsoup.org/cookbook/input/load-document-from-file

  • 2020-12-05 19:32

    Another approach, since you're not using a rigid OXM approach anyway: try a less strict parser such as jsoup. This avoids the immediate problems with invalid XML, schemas etc., but it just pushes the problem into your own code.

  • 2020-12-05 19:34

    Just to throw in a different approach to a solution:

    You could wrap your input stream in a stream implementation that replaces the entities with something legal.

    This is certainly a hack, but it should be a quick and easy solution (or rather, a workaround).
    It is not as elegant and clean as a solution inside the XML framework, though.
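
    A rough sketch of what such a replacement step could look like follows. It is a simplification, not a finished filter: it works on the whole text as a string rather than as a true stream wrapper, and EntityPreFilter with its two-entry table is hypothetical. Known HTML entities are rewritten as numeric character references, which are legal in any XML document, and everything else is left untouched.

    ```java
    import java.io.StringReader;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.xml.sax.InputSource;

    public class EntityPreFilter {

        // Hypothetical, deliberately short table: HTML entity name -> code point.
        private static final Map<String, Integer> CODE_POINTS = new HashMap<>();
        static {
            CODE_POINTS.put("nbsp", 0x00A0);
            CODE_POINTS.put("mdash", 0x2014);
        }

        private static final Pattern NAMED_ENTITY = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

        // Rewrites entities found in the table as numeric character references.
        // Names not in the table (including &amp; and &lt;) are left as they are.
        public static String toNumericReferences(String xml) {
            Matcher m = NAMED_ENTITY.matcher(xml);
            StringBuffer out = new StringBuffer();
            while (m.find()) {
                Integer codePoint = CODE_POINTS.get(m.group(1));
                String replacement = (codePoint == null) ? m.group() : "&#" + codePoint + ";";
                m.appendReplacement(out, Matcher.quoteReplacement(replacement));
            }
            m.appendTail(out);
            return out.toString();
        }

        // Parses the pre-filtered text with the ordinary JAXP DOM parser.
        public static Document parse(String xml) {
            try {
                return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                        .parse(new InputSource(new StringReader(toNumericReferences(xml))));
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
        }

        public static void main(String[] args) {
            // prints <bar>Some&#160;text &#8212; invalid!</bar>
            System.out.println(toNumericReferences("<bar>Some&nbsp;text &mdash; invalid!</bar>"));
        }
    }
    ```

    Because this is plain text substitution, it also rewrites matches inside CDATA sections and comments, which is part of why it counts as a workaround.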
