I have a CouchDB view map function that generates an abstract of a stored HTML document (the first x
characters of text). Unfortunately, I have no browser environment there to convert the HTML to plain text.
Convert HTML to plain text the way Gmail does:
html = html.replace(/<style([\s\S]*?)<\/style>/gi, '');
html = html.replace(/<script([\s\S]*?)<\/script>/gi, '');
html = html.replace(/<\/div>/ig, '\n');
html = html.replace(/<\/li>/ig, '\n');
html = html.replace(/<li>/ig, ' * ');
html = html.replace(/<\/ul>/ig, '\n');
html = html.replace(/<\/p>/ig, '\n');
html = html.replace(/<br\s*[\/]?>/gi, "\n");
html = html.replace(/<[^>]+>/ig, '');
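For the CouchDB case in the question, the replacements above could be wrapped into a function and called from the map function. This is only a minimal sketch: `makeAbstract`, the `doc.html` field name, and the 200-character limit are assumptions for illustration, not something from the question. Note that in a real design document the helpers would have to live inside the map function body, since the function is stored as a string.

```javascript
// Regex-only HTML-to-text conversion; no DOM is required.
function htmlToText(html) {
  return html
    .replace(/<style[\s\S]*?<\/style>/gi, '')   // drop embedded CSS
    .replace(/<script[\s\S]*?<\/script>/gi, '') // drop embedded scripts
    .replace(/<\/(div|li|ul|p)>/gi, '\n')       // block-level closers become newlines
    .replace(/<li[^>]*>/gi, ' * ')              // list items become bullets
    .replace(/<br\s*\/?>/gi, '\n')              // line breaks become newlines
    .replace(/<[^>]+>/g, '');                   // strip every remaining tag
}

// Hypothetical helper: collapse whitespace and keep the first maxLength characters.
function makeAbstract(html, maxLength) {
  return htmlToText(html).replace(/\s+/g, ' ').trim().slice(0, maxLength);
}

// CouchDB-style map function; `doc.html` is an assumed field name.
function map(doc) {
  if (typeof doc.html === 'string') {
    emit(doc._id, makeAbstract(doc.html, 200));
  }
}
```

Collapsing whitespace before slicing keeps the abstract from spending its character budget on newlines left over from the block-level tags.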
If you can use jQuery:
var html = jQuery('<div>').html(html).text();
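It's pretty simple. You can also implement a "toText" method on the String prototype:
String.prototype.toText = function () {
  // Convert the string the method was called on, not a global variable
  return $('<div>').html(String(this)).text();
};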
//Let's test it out!
var html = "<a href=\"http://www.google.com\">link</a> <br /><b>TEXT</b>";
var text = html.toText();
console.log("Text: " + text); //Result will be "link TEXT"
An updated version of @EpokK's answer, for the use case of generating the plain-text version of an HTML email:
const htmltoText = (html: string) => {
  let text = html;
  text = text.replace(/\n/gi, "");
  text = text.replace(/<style([\s\S]*?)<\/style>/gi, "");
  text = text.replace(/<script([\s\S]*?)<\/script>/gi, "");
  text = text.replace(/<a.*?href="(.*?)[\?\"].*?>(.*?)<\/a.*?>/gi, " $2 $1 ");
  text = text.replace(/<\/div>/gi, "\n\n");
  text = text.replace(/<\/li>/gi, "\n");
  text = text.replace(/<li.*?>/gi, " * ");
  text = text.replace(/<\/ul>/gi, "\n\n");
  text = text.replace(/<\/p>/gi, "\n\n");
  text = text.replace(/<br\s*[\/]?>/gi, "\n");
  text = text.replace(/<[^>]+>/gi, "");
  text = text.replace(/^\s*/gim, "");
  text = text.replace(/ ,/gi, ",");
  text = text.replace(/ +/gi, " ");
  text = text.replace(/\n+/gi, "\n\n");
  return text;
};
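The link rule is the least obvious step in that function, so here it is in isolation. This is a small sketch using the same pattern on a made-up sample string (the URL and class name are placeholders). It shows that the link text is kept, the URL is appended after it, and anything from the first `?` onward, that is the query string, is dropped.

```javascript
// Same anchor pattern as in the function above.
const linkPattern = /<a.*?href="(.*?)[\?\"].*?>(.*?)<\/a.*?>/gi;

// Made-up sample input for illustration.
const sample = 'See <a href="https://example.com/docs?ref=mail" class="x">the docs</a> now';

const linked = sample
  .replace(linkPattern, ' $2 $1 ') // "link text" followed by the URL
  .replace(/ +/g, ' ');            // collapse the extra spaces, as the function does later

console.log(linked); // "See the docs https://example.com/docs now"
```

If the query string matters for your emails (tracking or signed links, for example), the `[\?\"]` part of the pattern would need to be changed to stop only at the closing quote.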
You can try it this way. Neither textContent nor innerText is supported by every browser on its own,
so fall back from one to the other:
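// Browser only: needs a DOM. The function name is just for illustration.
function extractText(html) {
  var temp = document.createElement("div");
  temp.innerHTML = html;
  return temp.textContent || temp.innerText || "";
}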
This regular expression works. Note that replace returns a new string, so assign the result:
text = text.replace(/<[^>]*>/g, '');
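One thing a tag-stripping regex leaves behind is character references such as `&amp;` or `&#65;`. The sketch below shows one way to decode the most common ones after stripping; `decodeEntities`, `stripTags`, and the short entity table are hypothetical names and a deliberately incomplete list, not a full HTML entity decoder.

```javascript
// Deliberately small table; unknown named entities are left untouched.
const NAMED_ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", nbsp: '\u00a0' };

function decodeEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#\d+|[a-z][a-z0-9]*);/gi, (match, body) => {
    if (body[0] === '#') {
      // Numeric reference, decimal (&#65;) or hexadecimal (&#x41;)
      const isHex = body[1] === 'x' || body[1] === 'X';
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    // Named reference: look it up, keep the original text if unknown
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
      ? NAMED_ENTITIES[body]
      : match;
  });
}

// Strip tags first, then decode, so an escaped "&lt;b&gt;" is not mistaken for a tag.
function stripTags(html) {
  return decodeEntities(html.replace(/<[^>]*>/g, ''));
}

console.log(stripTags('<b>Tom &amp; Jerry</b> &lt;3 &#65;&#x42;')); // "Tom & Jerry <3 AB"
```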
With TextVersionJS (http://textversionjs.com) you can convert your HTML to plain text. It's pure JavaScript (with tons of RegExps), so you can use it in the browser and in Node.js as well.
In Node.js it looks like this:
var createTextVersion = require("textversionjs");
var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";
var textVersion = createTextVersion(yourHtml);
(I copied the example from the page. You will have to npm install the module first.)