Extract text from HTML while preserving block-level element newlines

难免孤独 2020-12-13 18:28

Background

Most questions about extracting text from HTML (i.e., stripping the tags) use:

jQuery( htmlString ).text();

While this

5 Answers

情话喂你 (original poster)

2020-12-13 19:05
This seems to be (nearly) doing what you want:
```
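// $node is a jQuery object; returns the concatenated text of its subtree.
function getText($node) {
    // .contents() includes text nodes, unlike .children().
    return $node.contents().map(function () {
        if (this.nodeName === 'BR') {
            // <br> becomes a line break.
            return '\n';
        } else if (this.nodeType === 3) {
            // Text node: keep its value unchanged.
            return this.nodeValue;
        } else {
            // Any other node: recurse into its children.
            return getText($(this));
        }
    }).get().join('');
}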
```

It just recursively concatenates the values of all text nodes and replaces `<br>` elements with line breaks.

But there are no semantics in this; it relies completely on the original HTML formatting (the leading and trailing whitespace seems to come from how jsFiddle embeds the HTML, but you can easily trim it). For example, notice how it indents the definition term.

If you really want to do this on a semantic level, you need a list of block-level elements, then iterate recursively over the elements and indent them accordingly. You would treat different block elements differently with respect to indentation and the line breaks around them. This should not be too difficult.
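
The semantic approach described above could be sketched roughly as follows. This is only an illustration under some assumptions: `BLOCK_TAGS`, `collectText`, and `extractText` are hypothetical names, the tag list is deliberately incomplete, whitespace is collapsed everywhere (including inside `<pre>`), and no indentation is applied. It uses only the standard DOM properties `nodeType`, `nodeName`, `nodeValue`, and `childNodes`, so it does not depend on jQuery.

```javascript
// Hypothetical, deliberately incomplete set of block-level tag names.
// nodeName is upper-case for elements in HTML documents.
const BLOCK_TAGS = new Set([
    'ADDRESS', 'BLOCKQUOTE', 'DD', 'DIV', 'DL', 'DT', 'H1', 'H2', 'H3',
    'H4', 'H5', 'H6', 'HR', 'LI', 'OL', 'P', 'PRE', 'TABLE', 'TR', 'UL'
]);

// Recursively collect text, surrounding block-level elements with newlines.
function collectText(node) {
    if (node.nodeType === 3) {
        // Text node: collapse whitespace runs so source formatting is ignored.
        return node.nodeValue.replace(/\s+/g, ' ');
    }
    if (node.nodeName === 'BR') {
        return '\n';
    }
    let text = '';
    for (const child of Array.from(node.childNodes || [])) {
        text += collectText(child);
    }
    if (BLOCK_TAGS.has(node.nodeName)) {
        text = '\n' + text.trim() + '\n';
    }
    return text;
}

// Collapse the newlines produced by adjacent or nested blocks into one
// and drop spaces around line breaks. Note that this also collapses
// consecutive <br> elements into a single line break.
function extractText(root) {
    return collectText(root).replace(/[ \t]*\n\s*/g, '\n').trim();
}
```

With jQuery loaded, an HTML string could be passed in by wrapping it in a container first, for example `extractText($('<div>').html(htmlString)[0])`. Per-element indentation, as suggested above, would need extra state (such as a depth argument) that this sketch leaves out.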