Question
I have a CouchDB view map function that generates an abstract of a stored HTML document (first x
characters of text). Unfortunately I have no browser environment to convert HTML to plain text.
Currently I use this multi-stage regexp:
html.replace(/<style([\s\S]*?)<\/style>/gi, ' ')
.replace(/<script([\s\S]*?)<\/script>/gi, ' ')
.replace(/(<(?:.|\n)*?>)/gm, ' ')
.replace(/\s+/gm, ' ');
While it's a very good filter, it's obviously not a perfect one, and some leftovers sometimes slip through. Is there a better way to convert to plain text without a browser environment?
Answer 1:
Convert HTML to plain text the way Gmail does:
html = html.replace(/<style([\s\S]*?)<\/style>/gi, '');
html = html.replace(/<script([\s\S]*?)<\/script>/gi, '');
html = html.replace(/<\/div>/ig, '\n');
html = html.replace(/<\/li>/ig, '\n');
html = html.replace(/<li>/ig, ' * ');
html = html.replace(/<\/ul>/ig, '\n');
html = html.replace(/<\/p>/ig, '\n');
html = html.replace(/<br\s*[\/]?>/gi, "\n");
html = html.replace(/<[^>]+>/ig, '');
If you can use jQuery (which needs a DOM):
var html = jQuery('<div>').html(html).text();
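The regex steps above can be wrapped in a single DOM-free function, which is closer to what a view map function needs. This is only a sketch: the names htmlToAbstract and maxLength are made up for illustration, and the extra steps (comment removal, whitespace cleanup, truncation) are additions, not part of the original recipe.

```javascript
// Sketch only: htmlToAbstract and maxLength are hypothetical names.
// Plain ES5 and no DOM, so it should also run in older JS engines.
function htmlToAbstract(html, maxLength) {
  var text = String(html)
    .replace(/<style[\s\S]*?<\/style>/gi, '')        // drop CSS blocks
    .replace(/<script[\s\S]*?<\/script>/gi, '')      // drop script blocks
    .replace(/<!--[\s\S]*?-->/g, '')                 // drop HTML comments
    .replace(/<li(?:\s[^>]*)?>/gi, ' * ')            // bullet for list items
    .replace(/<\/(div|li|ul|ol|p|h[1-6])>/gi, '\n')  // block ends become newlines
    .replace(/<br\s*\/?>/gi, '\n')                   // explicit line breaks
    .replace(/<[^>]+>/g, '')                         // strip remaining tags
    .replace(/[ \t]+/g, ' ')                         // collapse runs of spaces and tabs
    .replace(/ *\n\s*/g, '\n')                       // collapse whitespace around newlines
    .replace(/^\s+|\s+$/g, '');                      // trim both ends
  // Truncate only when a positive length was requested.
  return maxLength ? text.substring(0, maxLength) : text;
}
```

Inside a map function this could be called as emit(doc._id, htmlToAbstract(doc.html, 200)), assuming the HTML lives in a field named html (an assumption about the document layout).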
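Answer 2:
This regular expression removes every tag (it leaves the contents of script and style elements, and any character entities, in place):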
text.replace(/<[^>]*>/g, '');
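Stripping tags this way still leaves character references such as &amp; or &#169; in the output. A minimal sketch of decoding them afterwards is shown here; decodeEntities, stripTags and the small NAMED_ENTITIES table are hypothetical helpers that cover only a handful of common entities, not the full HTML list.

```javascript
// Hypothetical helpers: only a few named entities are handled.
var NAMED_ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", nbsp: ' ' };

function decodeEntities(text) {
  // One pass, so "&amp;lt;" becomes "&lt;" and is not decoded twice.
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z]+);/gi, function (match, body) {
    if (body.charAt(0) === '#') {
      var isHex = body.charAt(1) === 'x' || body.charAt(1) === 'X';
      var code = parseInt(body.substring(isHex ? 2 : 1), isHex ? 16 : 10);
      return String.fromCharCode(code); // only correct for code points up to 0xFFFF
    }
    var key = body.toLowerCase();
    // Unknown names are left untouched.
    return NAMED_ENTITIES.hasOwnProperty(key) ? NAMED_ENTITIES[key] : match;
  });
}

function stripTags(html) {
  // Remove tags first, then decode, so a decoded "<" is never treated as a tag.
  return decodeEntities(String(html).replace(/<[^>]*>/g, ''));
}
```

Decoding after tag removal is a deliberate choice: text like &lt;b&gt; should end up as the literal characters <b> rather than being stripped.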
Answer 3:
With TextVersionJS (http://textversionjs.com) you can convert your HTML to plain text. It's pure JavaScript (with tons of RegExps), so you can use it in the browser and in node.js as well.
In node.js it looks like:
var createTextVersion = require("textversionjs");
var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";
var textVersion = createTextVersion(yourHtml);
(I copied the example from the page; you will have to npm install the module first.)
Answer 4:
You can try it this way, combining textContent with innerText, since neither of them is supported by all browsers (note that this needs a DOM):
var temp = document.createElement("div");
temp.innerHTML = html;
return temp.textContent || temp.innerText || "";
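Since the question rules out a browser, one option is to use the DOM approach only when a document object exists and fall back to a regex otherwise. The following is a rough sketch under that assumption; toPlainText is a hypothetical name, and the fallback is just the simple tag-stripping regexp, with all of its limitations.

```javascript
// Sketch: toPlainText is a hypothetical helper name.
function toPlainText(html) {
  if (typeof document !== 'undefined') {
    // Browser-like environment: let the DOM do the parsing.
    var temp = document.createElement('div');
    temp.innerHTML = html;
    return temp.textContent || temp.innerText || '';
  }
  // No DOM available (e.g. a server-side JS engine): strip tags with a regexp.
  return String(html).replace(/<[^>]*>/g, '');
}
```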
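Answer 5:
It's pretty simple: you can also add a "toText" method to the String prototype (this needs jQuery, and therefore a DOM):
String.prototype.toText = function(){
// Convert the string the method is called on, not a global variable
return $('<div>').html(String(this)).text();
};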
//Let's test it out!
var html = "<a href=\"http://www.google.com\">link</a> <br /><b>TEXT</b>";
var text = html.toText();
console.log("Text: " + text); //Result will be "link TEXT"
Source: https://stackoverflow.com/questions/15180173/convert-html-to-plain-text-in-js-without-browser-environment