Wikipedia: Java library to remove Wikipedia text markup

灰色年华 2020-12-19 04:10

I downloaded a Wikipedia dump and now want to remove the Wikipedia markup from the contents of each page. I tried writing regular expressions, but there are too many cases to handle.

5 answers
  •  庸人自扰
    2020-12-19 05:12

    Do it in two steps:

    1. let some existing tool convert the MediaWiki mark-up into plain HTML;
    2. convert the plain HTML into text.

    The following demo does this with Textile-J (step 1) and the Swing HTML parser bundled with the JDK (step 2):

    import net.java.textilej.parser.MarkupParser;
    import net.java.textilej.parser.builder.HtmlDocumentBuilder;
    import net.java.textilej.parser.markup.mediawiki.MediaWikiDialect;
    import javax.swing.text.html.HTMLEditorKit;
    import javax.swing.text.html.parser.ParserDelegator;
    import java.io.StringReader;
    import java.io.StringWriter;
    
    public class Test {
    
        public static void main(String[] args) throws Exception {
    
            String markup = "This is ''italic'' and '''that''' is bold. \n"+
                    "=Header 1=\n"+
                    "a list: \n* item A \n* item B \n* item C";
    
            StringWriter writer = new StringWriter();
    
            HtmlDocumentBuilder builder = new HtmlDocumentBuilder(writer);
            builder.setEmitAsDocument(false);
    
            MarkupParser parser = new MarkupParser(new MediaWikiDialect());
            parser.setBuilder(builder);
            parser.parse(markup);
    
            final String html = writer.toString();
            final StringBuilder cleaned = new StringBuilder();
    
            // Collect only the text nodes; tags are skipped by the parser.
            HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
                @Override
                public void handleText(char[] data, int pos) {
                    cleaned.append(new String(data)).append(' ');
                }
            };
            new ParserDelegator().parse(new StringReader(html), callback, false);
    
            System.out.println(markup);
            System.out.println("---------------------------");
            System.out.println(html);
            System.out.println("---------------------------");
            System.out.println(cleaned);
        }
    }
    

    produces:

    This is ''italic'' and '''that''' is bold. 
    =Header 1=
    a list: 
    * item A 
    * item B 
    * item C
    ---------------------------
    

    This is italic and that is bold.

    Header 1

    a list:

    • item A
    • item B
    • item C
    ---------------------------
    This is italic and that is bold. Header 1 a list: item A item B item C

    Where can I download the Java packages you are importing?

    Here: Web Archive link of download.java.net/maven/2/net/java/textile-j/2.2
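
If you already have HTML from some other converter, step 2 can be tried on its own with nothing but the JDK. The sketch below is only an illustration: the class name `HtmlToText` and the method `toPlainText` are hypothetical names, not part of Textile-J or the demo above. It also collapses runs of whitespace, which the demo does not do.

```java
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import java.io.IOException;
import java.io.StringReader;

public class HtmlToText {

    // Hypothetical helper (an assumption for illustration): strips the tags
    // from an HTML fragment using the Swing HTML parser shipped with the JDK.
    public static String toPlainText(String html) throws IOException {
        final StringBuilder text = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once per text chunk; separate chunks with a space
                // so words from adjacent elements do not run together.
                text.append(data).append(' ');
            }
        };

        // 'true' tells the parser to ignore any charset declared in the HTML,
        // which is fine because the input is already a Java String.
        new ParserDelegator().parse(new StringReader(html), callback, true);

        // Collapse the extra whitespace introduced by the chunk separator.
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws IOException {
        String html = "<p>This is <i>italic</i> and <b>that</b> is bold.</p>"
                    + "<h1>Header 1</h1>";
        System.out.println(toPlainText(html));
    }
}
```

Because a space is inserted between every text chunk, a word that is split by inline markup (for example `<b>bo</b>ld`) comes out as two words. For a Wikipedia dump that is usually an acceptable trade-off, but check it against your own data.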
