Extract text from HTML while preserving block-level element newlines

难免孤独 2020-12-13 18:28

Background

Most questions about extracting text from HTML (i.e., stripping the tags) use:

jQuery( htmlString ).text();

While this

5 Answers
  •  情话喂你
    2020-12-13 19:05

    This seems to be (nearly) doing what you want:

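    // Recursively collect the text of a jQuery-wrapped node.
    function getText($node) {
        // .contents() also returns text nodes, unlike .children().
        return $node.contents().map(function () {
            if (this.nodeName === 'BR') {
                // A <br> becomes a newline.
                return '\n';
            } else if (this.nodeType === 3) {
                // nodeType 3 is a text node: keep its text as-is.
                return this.nodeValue;
            } else {
                // Any other node: descend into its children.
                return getText($(this));
            }
        }).get().join('');
    }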
    

    DEMO

    It just recursively concatenates the values of all text nodes and replaces
    <br> elements with line breaks.

    But there are no semantics in this; it relies completely on the original HTML formatting (the leading and trailing white space seems to come from how jsFiddle embeds the HTML, but you can easily trim it). For example, notice how it indents the definition term.

    If you really want to do this on a semantic level, you need a list of block-level elements; you then recursively iterate over the elements and indent them accordingly. You treat different block elements differently with respect to indentation and the line breaks around them. This should not be too difficult.
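
    The semantic approach described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code behind the demo: the names getBlockText, BLOCK_TAGS and INDENTED_TAGS are hypothetical, the tag lists are deliberately incomplete, and the two-space indent is an arbitrary choice. It uses the plain DOM properties nodeType, nodeName and childNodes instead of jQuery, so it works on any DOM-like tree.

    ```javascript
    // Assumption: a partial list of block-level tags; extend it as needed.
    const BLOCK_TAGS = new Set([
      'P', 'DIV', 'UL', 'OL', 'LI', 'DL', 'DT', 'DD',
      'H1', 'H2', 'H3', 'H4', 'H5', 'H6', 'BLOCKQUOTE', 'PRE'
    ]);
    // Assumption: blocks whose contents are indented one level deeper.
    const INDENTED_TAGS = new Set(['LI', 'DD', 'BLOCKQUOTE']);
    const TEXT_NODE = 3;

    function getBlockText(root) {
      const lines = [];  // finished output lines
      let current = '';  // text of the line being built
      let indent = 0;    // current indentation depth

      // Emit the pending line (if non-empty) at the current depth.
      function flush() {
        const text = current.replace(/\s+/g, ' ').trim();
        if (text) lines.push('  '.repeat(indent) + text);
        current = '';
      }

      function walk(node) {
        if (node.nodeType === TEXT_NODE) {
          current += node.nodeValue;
          return;
        }
        const name = String(node.nodeName).toUpperCase();
        if (name === 'BR') {
          flush();  // an explicit line break ends the current line
          return;
        }
        const isBlock = BLOCK_TAGS.has(name);
        const isIndented = INDENTED_TAGS.has(name);
        if (isBlock) flush();     // a block starts on a new line
        if (isIndented) indent++;
        for (const child of Array.from(node.childNodes || [])) walk(child);
        if (isBlock) flush();     // and its last line ends with the block
        if (isIndented) indent--;
      }

      walk(root);
      flush();  // emit any trailing inline text
      return lines.join('\n');
    }
    ```

    In a browser you would call it with a DOM element, for example getBlockText(document.body), or getBlockText($node[0]) when starting from a jQuery object. Whitespace inside each line is collapsed, so consecutive <br> elements do not produce blank lines; adjust flush() if you need them.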
