Wikipedia: Java library to remove Wikipedia text markup

灰色年华 2020-12-19 04:10

I downloaded a Wikipedia dump and now want to remove the Wikipedia markup from the contents of each page. I tried writing regular expressions, but there are too many cases to handle. I

5 answers
  •  [愿得一人]
    2020-12-19 05:12

    If you need plain text, you should use the WikiClean library: https://github.com/lintool/wikiclean

    I had the same problem, and this was the only efficient solution that worked for me in Java.

    There are two use cases:

    1) When the article text is no longer in XML format, add the XML tags that the library needs for processing. For example, if you parsed the XML file earlier and now hold only the page content without its XML structure, wrap that content in xmlStartTag and xmlEndTag as in the code below, and the cleaner will process it.

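    import org.wikiclean.WikiClean;

    // The tag values were lost when this page was rendered; the dump's <text> element is the likely original (assumption).
    String xmlStartTag = "<text xml:space=\"preserve\">";
    String xmlEndTag = "</text>";
    // Wrap the bare wikitext so the cleaner can locate the article body.
    String articleWithXml = xmlStartTag + article.getText() + xmlEndTag;
    WikiClean cleaner = new WikiClean.Builder().build();
    String plainWikiText = cleaner.clean(articleWithXml);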
    

    2) When you are reading the Wikipedia dump file (an XML file) directly, the content already has the XML structure, so you can pass it to the cleaner as is.

    // XMLFileContents holds the page XML as read from the dump file.
    WikiClean cleaner = new WikiClean.Builder().build();
    String plainWikiText = cleaner.clean(XMLFileContents);
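
For the second use case, a dump file holds many articles, so it can help to cut it into one string per article before cleaning. The following is only a rough sketch using the standard library; the class name PageSplitter and the method splitPages are hypothetical, and it assumes each article sits between `<page>` and `</page>` lines, as in the usual MediaWiki export layout. It also collects all pages in memory, which would not suit a full-size dump.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of WikiClean.
class PageSplitter {

    // Collects each <page>...</page> block from the dump, one string per article.
    // Assumes the opening and closing page tags each appear on their own line.
    static List<String> splitPages(Reader dump) throws IOException {
        List<String> pages = new ArrayList<>();
        StringBuilder current = null; // null while we are outside a <page> element
        BufferedReader reader = new BufferedReader(dump);
        String line;
        while ((line = reader.readLine()) != null) {
            String trimmed = line.trim();
            if (current == null && trimmed.startsWith("<page>")) {
                current = new StringBuilder();
            }
            if (current != null) {
                current.append(line).append('\n');
                if (trimmed.endsWith("</page>")) {
                    pages.add(current.toString());
                    current = null;
                }
            }
        }
        return pages;
    }

    public static void main(String[] args) throws IOException {
        // Tiny made-up sample standing in for a real dump file.
        String sample = "<mediawiki>\n"
                + "  <page>\n"
                + "    <title>Example</title>\n"
                + "    <text xml:space=\"preserve\">'''Example''' text</text>\n"
                + "  </page>\n"
                + "</mediawiki>\n";
        for (String page : splitPages(new StringReader(sample))) {
            // With WikiClean on the classpath, this is where cleaner.clean(page) would go.
            System.out.println(page);
        }
    }
}
```

Each string returned by splitPages could then be passed to cleaner.clean in place of XMLFileContents.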
    
