jsoup - strip all formatting and link tags, keep text only

太阳男子 2020-12-08 09:40

Let's say I have an HTML fragment like this:

<p> <span> foo </span> <em> bar <a> foobar </a> baz </em> </p>

3 answers
  • 2020-12-08 10:13

    With Jsoup:

    final String html = "<p> <span> foo </span> <em> bar <a> foobar </a> baz </em> </p>";
    Document doc = Jsoup.parse(html);
    
    System.out.println(doc.text());
    

    Output:

    foo bar foobar baz
    

    If you want only the text of the p tag, use this instead of doc.text():

    doc.select("p").text();
    

    ... or only the body:

    doc.body().text();
    

    To keep line breaks, print the text of each paragraph separately:

    final String html = "<p><strong>Tarthatatlan biztonsági viszonyok</strong></p>"
            + "<p><strong>Tarthatatlan biztonsági viszonyok</strong></p>";
    Document doc = Jsoup.parse(html);
    
    for (Element element : doc.select("p")) {
        System.out.println(element.text());
        // e.g. you can use a StringBuilder and append lines here ...
    }
    

    Output:

    Tarthatatlan biztonsági viszonyok  
    Tarthatatlan biztonsági viszonyok
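
    The StringBuilder idea mentioned in the loop comment could look roughly like this. It is a minimal sketch, assuming jsoup is on the classpath; the class name ParagraphLines and the method paragraphsToLines are made up for illustration and are not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class ParagraphLines {

    // Hypothetical helper: returns the text of every <p>, one paragraph per line.
    static String paragraphsToLines(String html) {
        Document doc = Jsoup.parse(html);
        StringBuilder sb = new StringBuilder();
        boolean first = true;
        for (Element p : doc.select("p")) {
            if (!first) {
                sb.append('\n'); // separator goes between paragraphs only
            }
            sb.append(p.text()); // text() drops the tags nested inside this <p>
            first = false;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<p><strong>first line</strong></p>"
                + "<p>second <em>line</em></p>";
        System.out.println(paragraphsToLines(html));
    }
}
```

    Putting the separator between paragraphs rather than after each one avoids a trailing newline in the result.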
    
  • 2020-12-08 10:23

    Actually, the correct way to clean with Jsoup is through a Whitelist:

    ...
    final String html = "<p> <span> foo </span> <em> bar <a> foobar </a> baz </em> </p>";
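    // Whitelist.none() is a static factory and Jsoup.clean() takes the HTML string,
    // so neither a Document nor "new" is needed (imports: org.jsoup.Jsoup, org.jsoup.safety.Whitelist).
    // Newer jsoup releases name this class Safelist.
    String cleanText = Jsoup.clean(html, Whitelist.none());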
    

    If you still want to preserve some tags:

    Whitelist wl = Whitelist.relaxed().removeTags("a");
    
  • 2020-12-08 10:31

    Using a regex:

    String str = "<p> <span> foo </span> <em> bar <a> foobar </a> baz </em> </p>";
    str = str.replaceAll("<[^>]*>", "");
    System.out.println(str);
    

    Output:

      foo   bar  foobar  baz 
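
    The regex output above still contains the whitespace that sat between the tags. A minimal sketch of one way to tidy it up with the standard library only is shown here; the class name RegexTagStripper and the method stripTags are made up for illustration. Note the limits of this approach: it does not decode entities such as &amp;, and it can be fooled by a '>' inside an attribute value, so it is only reasonable for simple, trusted fragments.

```java
import java.util.regex.Pattern;

final class RegexTagStripper {

    // Anything that looks like a tag: '<', a run of non-'>' characters, then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    // One or more whitespace characters.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Hypothetical helper: removes tags, then collapses whitespace runs to single spaces.
    static String stripTags(String html) {
        String withoutTags = TAG.matcher(html).replaceAll("");
        return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String str = "<p> <span> foo </span> <em> bar <a> foobar </a> baz </em> </p>";
        System.out.println(stripTags(str)); // prints: foo bar foobar baz
    }
}
```

    For this particular fragment the result matches what doc.text() prints, but that is not guaranteed for arbitrary HTML.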
    

    Using Jsoup:

    Document doc = Jsoup.parse(str); 
    String text = doc.text();
    