I am trying to get all the HTML between two h1 tags. The actual task is to break the HTML into frames (chapters) based on the h1 (heading 1) tags.
Appreciate any help.
Iterating over the elements between consecutive h1 elements seems to work fine, except for one thing: text that does not belong to any tag, i.e. text sitting directly inside the parent element. To work around this I implemented a splitElemText function to get this text. First, split the whole parent element using this method. Then, besides each child element, process the matching entry from the split text. Remove the calls to htmlToText if you want raw HTML.
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    /**
     * Splits the text of the element {@code elem} by the children tags.
     *
     * @return An array of size {@code c+1}, where {@code c} is the number
     *         of child elements. Text after the {@code n}th element is
     *         found in {@code [n+1]}.
     */
    public static String[] splitElemText(Element elem) {
        int c = elem.children().size();
        String[] as = new String[c + 1];
        String sAll = elem.html();
        int iBeg = 0;
        int iChild = 0;
        for (Element ch : elem.children()) {
            // Locate this child's markup inside the parent's inner HTML.
            String sChild = ch.outerHtml();
            int iEnd = sAll.indexOf(sChild, iBeg);
            if (iEnd < 0) {
                throw new RuntimeException("Tag " + sChild
                        + " not found in its parent: " + sAll);
            }
            // Everything between the previous child and this one is loose text.
            as[iChild] = htmlToText(sAll.substring(iBeg, iEnd));
            iBeg = iEnd + sChild.length();
            iChild += 1;
        }
        // Text after the last child.
        as[iChild] = htmlToText(sAll.substring(iBeg));
        assert iChild == c;
        return as;
    }

    /** Strips markup from an HTML fragment and returns its plain text. */
    public static String htmlToText(String sHtml) {
        Document doc = Jsoup.parse(sHtml);
        return doc.text();
    }
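Once the child elements and the loose text between them are available in document order, grouping them into chapters is a single pass. Below is a minimal sketch of that step only, not a full solution. To keep it runnable without jsoup, it uses a hypothetical `Piece` class as a stand-in for a node (`tag` is null for text outside any tag); `ChapterSplitter` and `splitIntoChapters` are likewise names made up for illustration. With jsoup, a similar ordered sequence could probably be taken from `Element.childNodes()`, which, unlike `children()`, also includes text nodes.

```java
import java.util.ArrayList;
import java.util.List;

class ChapterSplitter {

    /** Hypothetical stand-in for a DOM node; not a jsoup type. */
    static final class Piece {
        final String tag;   // e.g. "h1" or "p"; null for text outside any tag
        final String html;  // the node's outer HTML, or the raw text

        Piece(String tag, String html) {
            this.tag = tag;
            this.html = html;
        }
    }

    /**
     * Groups pieces into chapters. A new chapter starts at every h1;
     * anything before the first h1 becomes a leading chapter of its own.
     */
    static List<String> splitIntoChapters(List<Piece> pieces) {
        List<String> chapters = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (Piece p : pieces) {
            boolean isHeading = "h1".equalsIgnoreCase(p.tag);
            // Close the running chapter when the next heading begins.
            if (isHeading && current.length() > 0) {
                chapters.add(current.toString());
                current.setLength(0);
            }
            current.append(p.html);
        }
        // Flush whatever follows the last heading.
        if (current.length() > 0) {
            chapters.add(current.toString());
        }
        return chapters;
    }

    public static void main(String[] args) {
        List<Piece> pieces = List.of(
                new Piece("h1", "<h1>One</h1>"),
                new Piece(null, "loose text "),
                new Piece("p", "<p>para</p>"),
                new Piece("h1", "<h1>Two</h1>"),
                new Piece("p", "<p>end</p>"));
        for (String chapter : splitIntoChapters(pieces)) {
            System.out.println(chapter);
        }
    }
}
```

Each chapter string starts with its h1 and keeps the loose text in place, so it can be passed to htmlToText or kept as raw HTML, as described above.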