Convert HTML to plain text in JS without a browser environment

孤街浪徒 2020-12-29 23:59

I have a CouchDB view map function that generates an abstract of a stored HTML document (first x characters of text). Unfortunately I have no browser environmen

6 answers
  • 2020-12-30 00:10

    Convert HTML to plain text the way Gmail does:

    html = html.replace(/<style([\s\S]*?)<\/style>/gi, '');
    html = html.replace(/<script([\s\S]*?)<\/script>/gi, '');
    html = html.replace(/<\/div>/ig, '\n');
    html = html.replace(/<\/li>/ig, '\n');
    html = html.replace(/<li>/ig, '  *  ');
    html = html.replace(/<\/ul>/ig, '\n');
    html = html.replace(/<\/p>/ig, '\n');
    html = html.replace(/<br\s*[\/]?>/gi, "\n");
    html = html.replace(/<[^>]+>/ig, '');
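
The replacements above strip tags but leave character references such as `&nbsp;` or `&amp;` in the output. A minimal sketch of a follow-up step, assuming only a handful of named entities plus numeric references need decoding. The `decodeEntities` name and the entity table are illustrative and not part of the original answer:

```javascript
// Hypothetical helper: decodes a small, hand-picked set of named entities
// plus decimal/hex numeric references. This is not a complete HTML entity table,
// and code points above 0xFFFF are not handled.
var NAMED_ENTITIES = { nbsp: ' ', lt: '<', gt: '>', quot: '"', apos: "'", amp: '&' };

function decodeEntities(text) {
  // A single pass, so "&amp;lt;" becomes "&lt;" and is not decoded twice.
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z]+);/gi, function (match, body) {
    if (body.charAt(0) === '#') {
      var isHex = body.charAt(1).toLowerCase() === 'x';
      var code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      return String.fromCharCode(code);
    }
    // Unknown names are left untouched.
    return NAMED_ENTITIES.hasOwnProperty(body) ? NAMED_ENTITIES[body] : match;
  });
}
```

It would run after the last `replace` call, for example `html = decodeEntities(html);`. Note that this sketch maps `&nbsp;` to an ordinary space, which is usually what an abstract wants.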
    

    If jQuery (and therefore a DOM) is available:

    html = jQuery('<div>').html(html).text();
    
  • 2020-12-30 00:25

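    It's pretty simple. With jQuery available, you can also add a "toText" method to the String prototype:

    String.prototype.toText = function () {
        // Use `this` (the string itself), not an outer variable;
        // wrapping in a <div> makes jQuery parse the string as markup.
        return $("<div>").html(String(this)).text();
    };
    
    // Let's test it out!
    var html = "<a href=\"http://www.google.com\">link</a>&nbsp;<br /><b>TEXT</b>";
    var text = html.toText();
    console.log("Text: " + text); // "link TEXT" (the &nbsp; becomes a non-breaking space)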
    
  • 2020-12-30 00:29

    An updated version of @EpokK's answer, for generating the plain-text version of an HTML email (TypeScript):

    const htmltoText = (html: string) => {
      let text = html;
      text = text.replace(/\n/gi, "");
      text = text.replace(/<style([\s\S]*?)<\/style>/gi, "");
      text = text.replace(/<script([\s\S]*?)<\/script>/gi, "");
      text = text.replace(/<a.*?href="(.*?)[\?\"].*?>(.*?)<\/a.*?>/gi, " $2 $1 ");
      text = text.replace(/<\/div>/gi, "\n\n");
      text = text.replace(/<\/li>/gi, "\n");
      text = text.replace(/<li.*?>/gi, "  *  ");
      text = text.replace(/<\/ul>/gi, "\n\n");
      text = text.replace(/<\/p>/gi, "\n\n");
      text = text.replace(/<br\s*[\/]?>/gi, "\n");
      text = text.replace(/<[^>]+>/gi, "");
      text = text.replace(/^\s*/gim, "");
      text = text.replace(/ ,/gi, ",");
      text = text.replace(/ +/gi, " ");
      text = text.replace(/\n+/gi, "\n\n");
      return text;
    };
    
    
  • 2020-12-30 00:31

    You can try it this way. Neither textContent nor innerText is supported in every browser, so fall back from one to the other. Note that this needs a DOM (document):

    function htmlToText(html) {
        var temp = document.createElement("div");
        temp.innerHTML = html;
        return temp.textContent || temp.innerText || "";
    }
    
  • 2020-12-30 00:35

    This regular expression works:

    text = text.replace(/<[^>]*>/g, ''); // replace() returns a new string, so assign the result
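
For the abstract described in the question, the tag-stripping regex can be combined with whitespace collapsing and truncation inside the view itself. A sketch only, assuming the HTML is stored in a field named `html` and the abstract is 100 characters long (both are hypothetical; the question does not give them). `emit` is supplied by CouchDB's view server; the stub here exists only so the snippet runs on its own:

```javascript
// Stub so the snippet runs outside CouchDB; inside a view, CouchDB supplies emit().
var emitted = [];
function emit(key, value) { emitted.push([key, value]); }

// Map function as it might appear in a design document.
// The helper lives inside the function so the map function is self-contained.
var map = function (doc) {
  var ABSTRACT_LENGTH = 100; // assumed length, not taken from the question

  function makeAbstract(html) {
    var text = html
      .replace(/<(script|style)\b[\s\S]*?<\/\1\s*>/gi, ' ') // drop script/style blocks with their content
      .replace(/<[^>]*>/g, ' ')  // strip remaining tags (a space keeps adjacent blocks apart)
      .replace(/\s+/g, ' ')      // collapse runs of whitespace
      .replace(/^ | $/g, '');    // trim the single leading/trailing space left over
    return text.length > ABSTRACT_LENGTH ? text.slice(0, ABSTRACT_LENGTH) : text;
  }

  if (typeof doc.html === 'string') { // "html" field name is an assumption
    emit(doc._id, makeAbstract(doc.html));
  }
};
```

Replacing each tag with a space keeps words in neighbouring blocks from running together, at the cost of splitting a word that has inline markup in the middle of it. Character references such as `&amp;` are left as they are in this sketch.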
    
  • 2020-12-30 00:35

    With TextVersionJS (http://textversionjs.com) you can convert your HTML to plain text. It's pure JavaScript (with tons of RegExps), so you can use it in the browser and in Node.js as well.

    In Node.js it looks like this:

    var createTextVersion = require("textversionjs");
    var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";
    
    var textVersion = createTextVersion(yourHtml);
    

    (I copied the example from the page; you will have to npm install the module first.)
