Discard html tags within custom tags while getting text in XHTML using SAX Parser in Groovy

萌比男神i 2021-01-24 21:56

I am trying to get the text between my custom tags, and so far this has mostly worked. But sometimes, when there are special characters or HTML tags inside my custom tags, I am unable to

2 Answers
  • 2021-01-24 22:19

    I'm not familiar with Groovy, so here is a solution in Java. I believe the translation is straightforward.

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.ArrayList;
    
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;
    
    public class SaxHandler extends DefaultHandler {
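        // Text collected between each Begin/End marker pair
        ArrayList<String> DefinedTermTitles = new ArrayList<>();
        ArrayList<String> ClauseTitles = new ArrayList<>();
        // True while the parser is between a Begin marker and its End marker
        boolean countryFlag = false;
        // Accumulates text across characters() calls, including text inside nested tags such as <u>
        StringBuilder message = new StringBuilder();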
    
        public void startElement(String ns, String localName, String qName, Attributes atts) {
            switch (qName) {
                case "ae_clauseTitleBegin":
                    countryFlag = true;
                    break;
    
                case "ae_definedTermTitleBegin":
                    countryFlag = true; 
                    break;           
             }      
        }   
    
        public void characters(char[] chars, int offset, int length) {
            if (countryFlag) {
                message.append(new String(chars, offset, length));
            }
        }
    
        public void endElement(String ns, String localName, String qName) {
            switch (qName) {        
                case "ae_clauseTitleEnd":
                    ClauseTitles.add(message.toString());
                    countryFlag = false;
                    message.setLength(0);
                    break;
    
                case "ae_definedTermTitleEnd":
                    DefinedTermTitles.add(message.toString());
                    countryFlag = false; 
                    message.setLength(0);
                    break;
             }
        }
    
        public static void main (String argv []) {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            try {
                String path = "INPUT_PATH_HERE";
                InputStream xmlInput = new FileInputStream(path + "test.xml");
                SAXParser saxParser = factory.newSAXParser();
                SaxHandler handler   = new SaxHandler();
                saxParser.parse(xmlInput, handler);
    
                System.out.println(handler.DefinedTermTitles);
                System.out.println(handler.ClauseTitles);
    
            } catch (Exception err) {
                err.printStackTrace ();
            }
        }
    }
    

    Output

    [Australia, Isle of Man, France]
    [1.02 Accounting Terms., Smallest Street-Legal Car at 99cm wide and 59 kg in weight, Most Valuable Car at $15 million]
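
    The question also mentions special characters. As a minimal sketch of the same idea, the variant below parses an in-memory string containing both a nested `<u>` tag and an `&amp;` entity. The class name `MarkerTextDemo`, the method `clauseTitles`, and the sample input are hypothetical and not part of the answer above. The point it illustrates is that a SAX parser may deliver one run of text through several `characters()` calls, so the handler appends to a buffer rather than assigning.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the original answer.
class MarkerTextDemo {

    // Collects the text between <ae_clauseTitleBegin/> and <ae_clauseTitleEnd/>,
    // including text that sits inside nested elements such as <u>.
    static List<String> clauseTitles(String xml) {
        List<String> titles = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        boolean[] inside = {false}; // array so the anonymous class can mutate it

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String ns, String localName, String qName, Attributes atts) {
                if (qName.equals("ae_clauseTitleBegin")) {
                    inside[0] = true;
                }
            }

            @Override
            public void characters(char[] chars, int offset, int length) {
                // May be called several times for one run of text, for example
                // around an entity such as &amp;, so append rather than assign.
                if (inside[0]) {
                    buffer.append(chars, offset, length);
                }
            }

            @Override
            public void endElement(String ns, String localName, String qName) {
                if (qName.equals("ae_clauseTitleEnd")) {
                    titles.add(buffer.toString());
                    buffer.setLength(0);
                    inside[0] = false;
                }
            }
        };

        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return titles;
    }

    public static void main(String[] args) {
        String xml = "<car><ae_clauseTitleBegin />1.02 <u>Accounting &amp; Tax Terms</u>."
                + "<ae_clauseTitleEnd /></car>";
        System.out.println(clauseTitles(xml)); // prints [1.02 Accounting & Tax Terms.]
    }
}
```

    The nested `<u>` element triggers its own `startElement` and `endElement` events, but since neither matches a marker name, only its text reaches the buffer.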
    
  • 2021-01-24 22:19

    Since you have now asked this question for several different libraries, here is a solution with XmlParser. The author of this XML may not have had the best understanding of how XML works. If I were you, I would rather put some filtering in place to make it sane again (e.g. turning <tagBegin/>X<tagEnd/> into <tag>X</tag>).

    def xml = '''\
    <records>
        <car name='HSV Maloo' make='Holden' year='2006'>
            <ae_definedTermTitleBegin />Australia<ae_definedTermTitleEnd />
            <ae_clauseTitleBegin />1.02 <u>Accounting Terms</u>.<ae_clauseTitleEnd />
        </car>
        <car name='P50' make='Peel' year='1962'>
            <ae_definedTermTitleBegin />Isle of Man<ae_definedTermTitleEnd />
            <ae_clauseTitleBegin />Smallest Street-Legal Car at 99cm wide and 59 kg in weight<ae_clauseTitleEnd />
        </car>
        <car name='Royale' make='Bugatti' year='1931'>
            <ae_definedTermTitleBegin />France<ae_definedTermTitleEnd />
            <ae_clauseTitleBegin />Most Valuable Car at $15 million<ae_clauseTitleEnd />
        </car>
    </records>
    '''
    
    def underp = { l ->
        l.inject([texts: [:]]) { r, it ->
            if (it.respondsTo('name') && it.name().endsWith('Begin')) {
                r.texts[(r.last=it.name().replaceFirst(/Begin$/,''))] = ''
            } else if (it.respondsTo('name') && it.name().endsWith('End')) {
                r.last = null
            } else if (r.last) {
                r.texts[r.last] += (it instanceof String) ? it : it.text()
            }
            r
        }.texts
    }
    
    def root = new XmlParser().parseText(xml)
    root.car.each{
        println underp(it.children()).inspect()
    }
    

    prints

    ['ae_definedTermTitle':'Australia', 'ae_clauseTitle':'1.02 Accounting Terms.']
    ['ae_definedTermTitle':'Isle of Man', 'ae_clauseTitle':'Smallest Street-Legal Car at 99cm wide and 59 kg in weight']
    ['ae_definedTermTitle':'France', 'ae_clauseTitle':'Most Valuable Car at $15 million']
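
    The filtering suggested at the top of this answer (turning `<tagBegin/>X<tagEnd/>` into `<tag>X</tag>`) could be sketched roughly as follows, here in Java to match the other answer. The class `MarkerNormalizer` and its `normalize` method are hypothetical names. The sketch assumes every Begin marker has a matching End marker within the same parent element and that no marker-like text appears in comments or CDATA sections; a regex rewrite of XML is fragile outside those assumptions.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
class MarkerNormalizer {
    // Matches self-closing marker tags such as <ae_clauseTitleBegin /> and
    // captures the part of the name before the Begin/End suffix.
    private static final Pattern BEGIN = Pattern.compile("<(\\w+)Begin\\s*/>");
    private static final Pattern END = Pattern.compile("<(\\w+)End\\s*/>");

    // Rewrites <nameBegin/> to <name> and <nameEnd/> to </name>.
    static String normalize(String xml) {
        String opened = BEGIN.matcher(xml).replaceAll("<$1>");
        return END.matcher(opened).replaceAll("</$1>");
    }

    public static void main(String[] args) {
        String in = "<ae_clauseTitleBegin />1.02 <u>Accounting Terms</u>.<ae_clauseTitleEnd />";
        System.out.println(normalize(in));
        // prints <ae_clauseTitle>1.02 <u>Accounting Terms</u>.</ae_clauseTitle>
    }
}
```

    After this rewrite, the marker pairs are ordinary elements, so the title text (including text in nested tags) can be read from a single element instead of tracking parser state between sibling markers.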
    