Jsoup: How to get all html between 2 header tags

-上瘾入骨i 2020-12-04 00:16

I am trying to get all the HTML between two h1 tags. The actual task is to break the HTML into frames (chapters) based on the h1 (heading 1) tags.

Appreciate any help.

T

3 Answers
  •  无人及你
    2020-12-04 01:09

    Iterating over the elements between consecutive h1 elements works fine, with one exception: text that does not belong to any tag, like the word "this" placed directly between two h1 elements, is not picked up. To work around this, I implemented a splitElemText function to get that text. First, split the whole parent element using this method. Then, besides each element, process the matching entry from the split text. Remove the calls to htmlToText if you want raw HTML.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    /**
     * Splits the text of the element elem by its child tags.
     *
     * @return An array of size c+1, where c is the number of child elements.
     *         Text before the first element is found in [0];
     *         text after the nth element is found in [n+1].
     */
    public static String[] splitElemText(Element elem) {
        int c = elem.children().size();
        String[] as = new String[c + 1];
        String sAll = elem.html();
        int iBeg = 0;
        int iChild = 0;
        for (Element ch : elem.children()) {
            String sChild = ch.outerHtml();
            // Locate this child's markup in the parent's HTML, starting after the previous child.
            int iEnd = sAll.indexOf(sChild, iBeg);
            if (iEnd < 0) {
                throw new RuntimeException("Tag " + sChild + " not found in its parent: " + sAll);
            }
            // Whatever lies between the previous child and this one is loose text.
            as[iChild] = htmlToText(sAll.substring(iBeg, iEnd));
            iBeg = iEnd + sChild.length();
            iChild += 1;
        }
        // Text after the last child.
        as[iChild] = htmlToText(sAll.substring(iBeg));
        assert (iChild == c);
        return as;
    }

    /** Converts an HTML fragment to plain text. */
    public static String htmlToText(String sHtml) {
        Document doc = Jsoup.parse(sHtml);
        return doc.text();
    }
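The loose-text problem can also be sidestepped by walking child nodes instead of child elements, since Jsoup's childNodes() includes text nodes as well as tags. The following is only a minimal sketch of that alternative, not the method used above: the class name ChapterSplitter and the method splitByH1 are hypothetical, and it assumes every h1 is a direct child of body. Anything before the first h1 ends up in a leading chapter of its own.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
public class ChapterSplitter {

    /**
     * Splits the body of an HTML document into chapters at each h1.
     * Each chapter holds the raw HTML of its h1 plus every following node
     * (elements and loose text) up to the next h1.
     * Assumption: the h1 elements are direct children of body.
     */
    public static List<String> splitByH1(String html) {
        Document doc = Jsoup.parse(html);
        // Turn off pretty-printing so no extra whitespace is added to the output.
        doc.outputSettings().prettyPrint(false);

        List<String> chapters = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        // childNodes() returns text nodes too, unlike children().
        for (Node node : doc.body().childNodes()) {
            boolean isH1 = node instanceof Element
                    && ((Element) node).tagName().equals("h1");
            // A new h1 closes the chapter collected so far.
            if (isH1 && current.length() > 0) {
                chapters.add(current.toString());
                current.setLength(0);
            }
            current.append(node.outerHtml());
        }
        if (current.length() > 0) {
            chapters.add(current.toString());
        }
        return chapters;
    }

    public static void main(String[] args) {
        String html = "<h1>One</h1>loose text<p>para</p><h1>Two</h1><p>more</p>";
        for (String chapter : splitByH1(html)) {
            System.out.println(chapter);
            System.out.println("-----");
        }
    }
}
```

If the h1 tags are nested inside other containers, the same loop would have to run on the relevant parent element rather than on body.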
