What is the most convenient way to convert HTML to plain text while preserving line breaks (with JavaScript)?

长发绾君心 2020-12-10 04:24

Basically, I just need the effect of copying that HTML from a browser window and pasting it into a textarea element.

For example I want this:

Som

5 Answers
  • 2020-12-10 04:37

    Three steps.

    First, get the HTML as a string.
    Second, replace every <BR /> and <BR> with \r\n.
    Third, use the regular expression "<(.|\n)*?>" to replace all remaining markup with "".
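
The three steps above can be sketched in JavaScript roughly as follows. This is a minimal sketch, not the answerer's own code: the function name htmlToPlainText is made up for illustration, and matching <br> case-insensitively with optional whitespace before the slash is an assumption about which variants should count.

```javascript
// Hypothetical helper illustrating the three steps; not from the original answer.
function htmlToPlainText(html) {
    // Step 1 is the caller's job: `html` is already a string.
    // Step 2: turn <br>, <br/> and <br /> (in any letter case) into \r\n.
    const withBreaks = html.replace(/<br\s*\/?>/gi, "\r\n");
    // Step 3: strip every remaining tag using the regex quoted in the answer.
    return withBreaks.replace(/<(.|\n)*?>/g, "");
}

const sample = 'Some <b>bold</b> text<br />and a <a href="#">link</a><BR>end';
console.log(JSON.stringify(htmlToPlainText(sample)));
// prints "Some bold text\r\nand a link\r\nend"
```

Note that this leaves character entities such as &amp; undecoded and adds no line breaks for block elements like <p> or <div>; it only covers what the three steps describe.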
    
  • 2020-12-10 04:39

    I made a function based on this answer: https://stackoverflow.com/a/42254787/3626940

    function htmlToText(html){
        // remove line breaks and tabs that come from the source formatting
        html = html.replace(/\n/g, "");
        html = html.replace(/\t/g, "");
    
        // keep the line breaks and tabs implied by the HTML itself
        html = html.replace(/<\/td>/g, "\t");
        html = html.replace(/<\/table>/g, "\n");
        html = html.replace(/<\/tr>/g, "\n");
        html = html.replace(/<\/p>/g, "\n");
        html = html.replace(/<\/div>/g, "\n");
        html = html.replace(/<\/h[1-6]>/g, "\n"); // headings are h1-h6; a bare </h> never occurs
        html = html.replace(/<br>/g, "\n");
        html = html.replace(/<br( )*\/>/g, "\n");
    
        // parse the HTML so that remaining tags are dropped and entities are decoded
        var dom = (new DOMParser()).parseFromString('<!doctype html><body>' + html, 'text/html');
        return dom.body.textContent;
    }
    
  • 2020-12-10 04:47

    Based on chrmcpn's answer, I had to convert a basic HTML email template into a plain-text version as part of a build script in Node.js. I had to use JSDOM to make it work, but here's my code:

    const { JSDOM } = require("jsdom"); // npm package "jsdom"

    const htmlToText = (html) => {
        html = html.replace(/\n/g, "");
        html = html.replace(/\t/g, "");
    
        html = html.replace(/<\/p>/g, "\n\n");
        html = html.replace(/<\/h1>/g, "\n\n");
        html = html.replace(/<br>/g, "\n");
        html = html.replace(/<br( )*\/>/g, "\n");
    
        const dom = new JSDOM(html);
        let text = dom.window.document.body.textContent;
    
        text = text.replace(/  /g, "");
        text = text.replace(/\n /g, "\n");
        text = text.trim();
        return text;
    }
    
  • 2020-12-10 04:50

    I tried to find some code I wrote and used for this a while back. It worked nicely. Let me outline what it did, and hopefully you can duplicate its behavior.

    • Replace images with alt or title text.
    • Replace links with "text [link]".
    • Replace things that generally produce vertical white space (h1-h6, div, p, br, hr, etc.) with a line break. (I know, I know. These could actually be inline elements, but it works out well.)
    • Strip out the rest of the tags and replace with an empty string.

    You could even expand this more to format things like ordered and unordered lists. It really just depends on how far you'll want to go.

    EDIT

    Found the code!

    // Requires: using System.Text.RegularExpressions;
    public static string Convert(string template)
    {
        template = Regex.Replace(template, "<img .*?alt=[\"']?([^\"']*)[\"']?.*?/?>", "$1"); /* Use image alt text. */
        template = Regex.Replace(template, "<a .*?href=[\"']?([^\"']*)[\"']?.*?>(.*?)</a>", "$2 [$1]"); /* Convert links to "text [url]"; the lazy (.*?) keeps two links on one line separate. */
        template = Regex.Replace(template, "<(/p|/div|/h\\d|br)\\s*/?>", "\n"); /* Keep vertical whitespace intact; \s* also matches "<br />". */
        template = Regex.Replace(template, "<[A-Za-z/][^<>]*>", ""); /* Remove the rest of the tags. */
    
        return template;
    }
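
Since the question asks for JavaScript, here is a rough port of the four replacements in the C# method above. It is a sketch, not the answerer's code: the name convert is chosen for illustration, and the patterns are slightly adjusted on my own assumption (attribute matching stops at ">", the link text is matched lazily, and whitespace is allowed before the closing slash so that "<br />" becomes a line break).

```javascript
// Hypothetical JavaScript port of the C# Convert method; patterns adjusted as noted above.
function convert(template) {
    return template
        // Replace each image with its alt text.
        .replace(/<img\s[^>]*?alt=["']?([^"'>]*)["']?[^>]*>/gi, "$1")
        // Turn <a href="url">text</a> into "text [url]".
        .replace(/<a\s[^>]*?href=["']?([^"'>\s]*)["']?[^>]*>(.*?)<\/a>/gi, "$2 [$1]")
        // Closing p/div/h1-h6 tags and <br> produce a line break.
        .replace(/<(\/p|\/div|\/h\d|br)\s*\/?>/gi, "\n")
        // Remove whatever tags are left.
        .replace(/<[A-Za-z\/][^<>]*>/g, "");
}

const html = '<p>See <a href="http://example.com">the docs</a>.</p>' +
    '<img src="x.png" alt="Logo"><br />Done';
console.log(convert(html));
// prints:
// See the docs [http://example.com].
// Logo
// Done
```

Like any regex-based approach, this will misbehave on unusual markup (for example a ">" inside an attribute value), so treat it as a starting point for simple, well-formed input.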
    
  • 2020-12-10 04:59

    If that HTML is visible within your web page, you could do it with the user selection (or just a TextRange in IE). This does preserve line breaks, if not necessarily leading and trailing white space.

    UPDATE 10 December 2012

    However, the toString() method of Selection objects is not yet standardized and works inconsistently between browsers, so this approach is based on shaky ground and I don't recommend using it now. I would delete this answer if it weren't accepted.

    Demo: http://jsfiddle.net/wv49v/

    Code:

    function getInnerText(el) {
        var sel, range, innerText = "";
        if (typeof document.selection != "undefined" && typeof document.body.createTextRange != "undefined") {
            range = document.body.createTextRange();
            range.moveToElementText(el);
            innerText = range.text;
        } else if (typeof window.getSelection != "undefined" && typeof document.createRange != "undefined") {
            sel = window.getSelection();
            sel.selectAllChildren(el);
            innerText = "" + sel;
            sel.removeAllRanges();
        }
        return innerText;
    }
    